Search giant Google has revealed its latest advancement in robotics with Robotics Transformer 2 (RT-2), a vision-language-action (VLA) model trained on text and images from the web.
In its mission to bring people 'closer to a future of helpful robots', Google says RT-2 can directly generate robotic actions. Similar to how language models are trained on text from the internet, the company says, RT-2 transfers knowledge from web data to inform robot behaviour.
“For decades, when people have imagined the distant future, they’ve almost always included a starring role for robots. Robots have been cast as dependable, helpful and even charming. Yet across those same decades, the technology has remained elusive — stuck in the imagined realm of science fiction,” said Google in a post.
Google says RT-2 can ‘speak robot’ as well, adding that, unlike chatbots, robots need ‘grounding’ in the real world and in their own abilities. The company says one of the challenges with robots is that “their training isn’t just about, say, learning everything there is to know about an apple: how it grows, its physical properties, or even that one purportedly landed on Sir Isaac Newton’s head.”
“A robot needs to be able to recognise an apple in context, distinguish it from a red ball, understand what it looks like, and most importantly, know how to pick it up,” says the Bard AI developer. “Learning is a challenging endeavour, and even more so for robots.”
Google’s novel approach to RT-2
According to the Gmail company, recent advancements have enhanced robots’ reasoning capabilities, enabling them to use chain-of-thought prompting to work through multi-step problems. The integration of vision models like PaLM-E has also improved robots’ understanding of their surroundings.
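As a rough illustration, a chain-of-thought prompt in this setting asks the model to spell out intermediate steps before committing to an action. The minimal Python sketch below is hypothetical; the wording and format are assumptions for demonstration, not Google’s actual prompt scheme:

```python
# A hypothetical chain-of-thought prompt for a robot task. The model is
# asked to reason through intermediate steps ("Plan") before producing
# an action. This wording is illustrative, not RT-2's real format.
prompt = (
    "Instruction: I'm hungry. Bring me a snack.\n"
    "Plan step by step:\n"
    "1. The bag of chips on the counter is a snack.\n"
    "2. Move to the counter.\n"
    "3. Pick up the bag of chips.\n"
    "4. Bring it to the user.\n"
    "Action:"
)
print(prompt)
```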
“RT-1 showed that Transformers, known for their ability to generalise information across systems, could even help different types of robots learn from each other.
“But until now, robots ran on complex stacks of systems, with high-level reasoning and low-level manipulation systems playing an imperfect game of telephone to operate the robot. Imagine thinking about what you want to do, and then having to tell those actions to the rest of your body to get it to move,” says Google.
The company adds that RT-2 simplifies this process, allowing a single model to both perform the kind of complex reasoning seen in foundation models and output robot actions. Most importantly, it demonstrates that even with minimal robot training data, the system can transfer knowledge from its language and vision training to perform robotic actions, even for tasks it hasn’t been explicitly trained on.
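The key idea that makes a single model possible is representing robot actions as text tokens, so the same output head that produces words can also produce motor commands. The following Python sketch shows one plausible version of that idea, assuming a simple scheme that discretises continuous action values into 256 bins; the field names and string format are assumptions for illustration, not Google’s actual encoding:

```python
# Illustrative sketch of the actions-as-text-tokens idea behind VLA
# models such as RT-2. Bin count, action fields and string format are
# assumptions for demonstration, not Google's actual scheme.

def discretise(value: float, low: float, high: float, bins: int = 256) -> int:
    """Map a continuous action value into one of `bins` integer buckets."""
    value = max(low, min(high, value))  # clamp to the valid range
    return round((value - low) / (high - low) * (bins - 1))

def action_to_tokens(dx: float, dy: float, dz: float, gripper: float) -> str:
    """Encode an end-effector displacement and gripper command as a token
    string that a language model could emit as ordinary text."""
    fields = [(dx, -1.0, 1.0), (dy, -1.0, 1.0), (dz, -1.0, 1.0), (gripper, 0.0, 1.0)]
    return " ".join(str(discretise(v, lo, hi)) for v, lo, hi in fields)

def tokens_to_action(tokens: str, bins: int = 256) -> list[float]:
    """Decode the token string back into approximate continuous values."""
    ranges = [(-1.0, 1.0), (-1.0, 1.0), (-1.0, 1.0), (0.0, 1.0)]
    return [lo + int(tok) / (bins - 1) * (hi - lo)
            for tok, (lo, hi) in zip(tokens.split(), ranges)]

if __name__ == "__main__":
    encoded = action_to_tokens(0.12, -0.40, 0.05, 1.0)
    print(encoded)                    # e.g. "143 76 134 255"
    print(tokens_to_action(encoded))  # approximately the original action
```

Because actions become just another string in the model’s vocabulary, the web-scale language and vision training can flow directly into action prediction, which is what the transfer to unseen tasks relies on.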
Plans with RT-2
Google says that after more than 6,000 robotic trials, the team found RT-2 performed as well as its predecessor, RT-1, on tasks it was trained for, but fared far better on novel, unseen tasks, reaching 62% accuracy compared to RT-1’s 32%.
“In other words, with RT-2, robots are able to learn more like we do — transferring learned concepts to new situations.”
RT-2 demonstrates the integration of AI advancements into robotics and holds great potential for more versatile and general-purpose robots, says Google. It also adds that while there’s tremendous work ahead to create useful robots for human-centred environments, “RT-2 shows us an exciting future for robotics just within grasp.”
Isa Muhammad is a writer and video game journalist covering many aspects of entertainment media including the film industry. He's steadily writing his way to the sharp end of journalism and enjoys staying informed. If he's not reading, playing video games or catching up on his favourite TV series, then he's probably writing about them.