Language Models Bridge the Gap Between Robots and the Real World

Most robots today work in controlled industrial environments, where they are programmed for narrow tasks. As a result, they adapt poorly to the unpredictability of the real world. Helping robots understand natural language and draw on real-world knowledge could make them more capable and more helpful. Recent work combining large language models with robot learning is bringing us closer to this goal.
Blending computer vision and language models
Researchers from MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) have developed a system called Feature Fields for Robotic Manipulation (F3RM) that combines computer vision with large language models to help robots identify and grasp unfamiliar objects based on open-ended natural language prompts.
F3RM works by taking 50 images from multiple angles and using a neural radiance field (NeRF) approach to convert the 2D images into a detailed 3D scene representation. It then overlays this with semantic features from CLIP, a vision-language model trained on hundreds of millions of image-text pairs to recognize visual concepts. The result is a feature field: a 3D representation in which each location carries both appearance information from the RGB images and a semantic feature vector, so the robot can reason about geometry and meaning together.
With just a few demonstrations, F3RM can apply its spatial and conceptual knowledge to manipulate objects it has never seen before, as requested through natural language prompts. For example, when prompted to “pick up Baymax”, F3RM successfully identified and grasped a toy of the robot character, which it had not been trained on, using the descriptive prompt and its learned knowledge.
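The language-guided lookup described above can be illustrated with a small sketch. It assumes the feature field has already been reduced to a list of 3D points, each paired with a semantic feature vector, and that the text prompt has been embedded into the same space. The function names, the toy 3-dimensional vectors and the use of cosine similarity as the matching score are all illustrative assumptions, not F3RM's actual code; real CLIP features have hundreds of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_matching_point(feature_field, text_embedding):
    """Return the (position, feature) pair whose feature best matches the text embedding.

    feature_field: list of (xyz_position, feature_vector) pairs (hypothetical layout).
    """
    return max(
        feature_field,
        key=lambda point: cosine_similarity(point[1], text_embedding),
    )


# Toy data: three scene points with made-up 3-dimensional features.
toy_field = [
    ((0.1, 0.2, 0.0), [0.9, 0.1, 0.0]),
    ((0.5, 0.4, 0.1), [0.1, 0.9, 0.2]),
    ((0.8, 0.1, 0.3), [0.0, 0.2, 0.9]),
]
# Made-up embedding standing in for an encoded text prompt.
toy_query = [0.2, 1.0, 0.1]

position, _ = best_matching_point(toy_field, toy_query)
print(position)
```

In this toy scene the second point is selected because its feature vector points in nearly the same direction as the query. A full system would go further and use the matched region to propose and rank grasp poses.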
The researchers note this approach could be useful in warehouses, households and other dynamic real-world environments with many unpredictable objects robots need to interact with.
Infusing robotics with world knowledge
Researchers from Google and Everyday Robots have also shown that integrating large language models can enhance robot performance. Their system, PaLM-SayCan, uses Google’s Pathways Language Model (PaLM), which contains 540 billion parameters, to plan for a real Everyday Robots mobile manipulator equipped with an arm and gripper.
PaLM’s world knowledge helps the robot interpret more complex, open-ended instructions. For example, when asked to “bring me a snack and something to wash it down with”, PaLM-SayCan works out that it should bring chips and a drink. Experiments showed improvements in planning and execution success rates compared to a baseline system without PaLM.
PaLM also enables new ways of communicating with robots, such as chain-of-thought prompting. When the prompt includes sample queries whose answers spell out the reasoning behind them, PaLM reasons through new instructions step by step in the same way. The robot's own experience then grounds the language model's suggestions in what is physically possible: PaLM-SayCan weighs each suggestion from PaLM against the robot model's estimate of whether that skill can succeed in the current situation, and selects actions that are both useful and safely achievable.
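The idea of weighing language suggestions against what the robot can actually do can be sketched in a few lines. This is a minimal illustration, assuming each candidate skill has a language-model score (how useful it is for the instruction) and an affordance score (how likely the robot is to succeed right now), and that the two are combined by multiplication. The function name, skill strings and numbers are invented for the example and do not come from the PaLM-SayCan system.

```python
def select_skill(llm_scores, affordances):
    """Pick the skill with the highest combined score.

    llm_scores: skill -> usefulness score from the language model (hypothetical values).
    affordances: skill -> estimated chance the robot can complete the skill right now.
    Skills with no affordance estimate are treated as not currently achievable.
    """
    combined = {
        skill: llm_scores[skill] * affordances.get(skill, 0.0)
        for skill in llm_scores
    }
    best = max(combined, key=combined.get)
    return best, combined


# Made-up scores for the instruction "bring me a snack".
llm_scores = {
    "pick up the chips": 0.6,
    "pick up the sponge": 0.1,
    "go to the counter": 0.3,
}
# The robot is far from the counter, so picking up the chips is unlikely to succeed yet.
affordances = {
    "pick up the chips": 0.2,
    "pick up the sponge": 0.9,
    "go to the counter": 0.9,
}

best, combined = select_skill(llm_scores, affordances)
print(best)
```

Here the language model prefers picking up the chips, but the low affordance score pulls that option down, so the combined ranking favors going to the counter first. Repeating the selection after each completed skill would build up a multi-step plan.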
Conclusion
Pairing advanced computer vision with large language models is a promising way to let robots understand and act in uncontrolled real-world environments in response to natural language prompts. As research in this area continues, it moves us toward more capable, flexible and helpful robots that understand the way humans naturally communicate. Combining language and robotics will be key to creating robots that can make sense of, and operate in, open-ended human environments.
 
				
