In a bustling open-plan office in Mountain View, California, a wheeled robot has been taking on the role of tour guide and office assistant, thanks to a language model upgrade from Google DeepMind. Equipped with the latest version of Google’s Gemini large language model, the robot can now parse spoken commands and find its way around the office.
Tell the robot “Find me somewhere to write,” for example, and it will obediently roll off to locate a clean whiteboard somewhere in the building. Gemini’s ability to handle video and text, combined with its capacity to take in large amounts of information in the form of previously recorded video tours of the office, lets the robot make sense of its surroundings and respond correctly to commands that require some commonsense reasoning. Gemini is paired with an algorithm that turns the model’s output, the spoken command, and the robot’s camera view into specific actions, such as turning, so the robot can actually move in response to its environment.
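DeepMind has not released the robot’s code, but the two-stage idea described above can be sketched roughly as follows. This is a minimal illustration, not the team’s actual implementation: every function, class, and file name here is a hypothetical placeholder standing in for the real components (the Gemini call that picks a goal from the tour footage, and the low-level planner that drives the wheeled base).

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class TourFrame:
    """One frame from a previously recorded video tour, tagged with its map pose."""
    image_path: str
    pose: Tuple[float, float, float]  # (x, y, heading) in the building's map frame


def query_vlm(command: str, frames: List[TourFrame]) -> int:
    """Placeholder for the multimodal model call: given a spoken command and the
    tour frames, return the index of the frame that best satisfies the request
    (e.g. a clean whiteboard for "Find me somewhere to write")."""
    # A real system would send the command plus the tour footage to Gemini and
    # parse its answer; here we simply pretend the first frame is the match.
    return 0


def plan_actions(current_pose: Tuple[float, float, float],
                 goal_pose: Tuple[float, float, float]) -> List[str]:
    """Placeholder low-level planner: turn the chosen goal pose into simple
    motion commands the wheeled base could execute."""
    return ["turn toward goal", "drive forward", "stop"]


# Hypothetical end-to-end flow: the language model decides *where* to go,
# a conventional planner decides *how* to get there.
tour = [TourFrame("frames/whiteboard_area.jpg", (12.0, 4.5, 90.0))]
goal_index = query_vlm("Find me somewhere to write", tour)
actions = plan_actions(current_pose=(0.0, 0.0, 0.0),
                       goal_pose=tour[goal_index].pose)
print(actions)
```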
When Gemini was first introduced in December, Google DeepMind’s CEO, Demis Hassabis, said that the model’s multimodal capabilities could unlock new possibilities for robots, and that the company’s researchers were already exploring its potential in robotics. In a recent paper describing the project, the researchers report that the robot proved up to 90 percent reliable at navigating, even when given tricky commands such as “Where did I leave my coaster?” The DeepMind team says the system has made interacting with the robot feel significantly more natural and has greatly increased its usability.
The demonstration shows how large language models can reach into the physical world and do useful work. Most chatbots operate within the confines of digital interfaces, but advances like Gemini are extending their capabilities to visual and auditory input. At a recent showcase, Hassabis demonstrated an upgraded version of Gemini that could make sense of an office layout as seen through a smartphone camera.
Academic and industry research labs alike are exploring how language models might enhance what robots can do. The International Conference on Robotics and Automation, a prominent gathering for roboticists, features numerous papers that involve vision language models. Investors see the potential too, with startups like Physical Intelligence and Skild AI securing substantial funding to develop robots with problem-solving abilities.
Large language models like Gemini are changing how robots interact with their environment. Because these models are trained on images and video as well as text, they can answer questions that require perception. Google’s robot, powered by Gemini, can combine visual input with verbal commands, for instance following a route sketched on a whiteboard to reach a specified destination.
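To give a concrete sense of what such a multimodal query looks like in practice, here is a short, illustrative Python snippet that sends a photo of a whiteboard sketch together with a text question to a Gemini model through Google’s public google-generativeai SDK. The file name, prompt wording, and model choice are placeholders, and this is not the code running on DeepMind’s robot, whose navigation stack has not been released.

```python
import google.generativeai as genai  # pip install google-generativeai pillow
from PIL import Image

# Configure the public Gemini API with your own key (placeholder value here).
genai.configure(api_key="YOUR_API_KEY")

# Any multimodal Gemini model can accept a mix of images and text in one request.
model = genai.GenerativeModel("gemini-1.5-flash")

# A photo of directions sketched on a whiteboard (hypothetical file).
sketch = Image.open("whiteboard_sketch.jpg")

# Ask a question that requires interpreting the drawing, not just the words.
response = model.generate_content([
    sketch,
    "This sketch shows a route through the office. "
    "Which way should the robot turn at the first corridor junction?",
])
print(response.text)
```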
Looking ahead, the researchers plan to further test the system on different types of robots and enhance its capability to handle more complex inquiries. With the potential to answer questions like “Do they have my favorite drink today?” based on visual cues, Gemini and similar language models are paving the way for a new era of intelligent and responsive robotics.