NVIDIA has unveiled ReMEmbR, a groundbreaking project that leverages generative AI to enable robots to reason and act based on their extended observations, according to the NVIDIA Technical Blog.
Innovative Vision-Language Models
Vision-language models (VLMs) combine the robust language understanding of foundational large language models (LLMs) with the vision capabilities of vision transformers (ViTs). These models project text and images into the same embedding space, allowing them to handle unstructured multimodal data, reason over it, and return structured outputs. By building on extensive pretraining, VLMs can be adapted for various vision-related tasks with new prompts or parameter-efficient fine-tuning.
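To make the shared embedding idea concrete, the following minimal sketch (not ReMEmbR code; the CLIP checkpoint and file names are placeholders) shows how a CLIP-style model from Hugging Face Transformers projects an image and candidate captions into the same space and compares them by cosine similarity:

```python
# Minimal sketch of a shared text-image embedding space using a CLIP-style model.
# The checkpoint and image path are illustrative placeholders.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("office_hallway.jpg")  # any RGB image
texts = ["a hallway with a fire extinguisher", "a kitchen with a coffee machine"]

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
    text_emb = model.get_text_features(
        input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"]
    )

# Cosine similarity in the shared space indicates which caption matches the image.
image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
print(image_emb @ text_emb.T)
```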
ReMEmbR: Enhancing Robot Perception and Autonomy
ReMEmbR integrates LLMs, VLMs, and retrieval-augmented generation (RAG) to enable robots to reason and act based on what they observe over extended periods, ranging from hours to days. The system is designed to address challenges such as handling large contexts, reasoning over spatial memory, and building prompt-based agents to query additional data until a user’s question is answered.
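Conceptually, that agent behavior can be pictured as a loop that retrieves from memory, asks an LLM whether the question can now be answered, and otherwise refines the search. The sketch below only illustrates that pattern; embed_text, memory.search, and llm_call are hypothetical placeholders rather than ReMEmbR's actual interfaces:

```python
# Hedged sketch of a prompt-based retrieval agent loop.
# `embed_text`, `memory.search`, and `llm_call` are hypothetical placeholders.
def answer_question(question, memory, llm_call, embed_text, max_steps=5):
    context = []
    query = question
    for _ in range(max_steps):
        # Retrieve memory entries (caption, timestamp, pose) relevant to the query.
        context.extend(memory.search(embed_text(query), k=5))

        prompt = (
            "You are a robot reasoning over your memory.\n"
            f"Question: {question}\n"
            f"Retrieved memories: {context}\n"
            "Reply with ANSWER: <answer> if you can answer, "
            "or QUERY: <new search text> to retrieve more."
        )
        reply = llm_call(prompt)
        if reply.startswith("ANSWER:"):
            return reply[len("ANSWER:"):].strip()
        elif reply.startswith("QUERY:"):
            query = reply[len("QUERY:"):].strip()
        else:
            break
    return "I could not find an answer in my memory."
```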
The project’s memory-building phase uses VLMs and vector databases to create a long-horizon semantic memory. During the querying phase, an LLM agent reasons over this memory. ReMEmbR is fully open-source and operates on-device, making it accessible for various applications.
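A rough sketch of the memory-building idea, assuming a generic VLM captioner and text embedder (caption_image and embed_text are hypothetical) and FAISS as a stand-in vector store, might look like this:

```python
# Hedged sketch of memory building: caption frames with a VLM, embed the captions,
# and store each embedding alongside its timestamp and robot pose.
import numpy as np
import faiss

DIM = 512                        # embedding dimension (assumed)
index = faiss.IndexFlatIP(DIM)   # inner product == cosine similarity on unit vectors
metadata = []                    # parallel list of {time, pose, caption}

def add_memory(frame, timestamp, pose, caption_image, embed_text):
    caption = caption_image(frame)               # e.g. "people near the elevator"
    vec = embed_text(caption).astype("float32")
    vec /= np.linalg.norm(vec)
    index.add(vec.reshape(1, -1))
    metadata.append({"time": timestamp, "pose": pose, "caption": caption})
```

During the querying phase, nearest-neighbor lookups against this store recover not only what was seen but also when and where it was seen, which is what allows the agent to reason over both time and space.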
Practical Applications and Demonstrations
To demonstrate ReMEmbR’s capabilities, NVIDIA developed a practical example using Nova Carter and NVIDIA Isaac ROS. The robot, equipped with ReMEmbR, can answer questions and guide individuals within an office environment. The demonstration walks through building an occupancy grid map, running the memory builder, and operating the ReMEmbR agent.
In the demo, the robot uses a monocular camera and global location information to create a vector database. This database stores text embeddings, timestamps, and pose information, allowing the robot to efficiently query and retrieve information to perform tasks such as guiding users to specific locations.
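Continuing the sketch above, such a query can be served by embedding the user's question, finding the closest stored caption, and handing the associated pose to the navigation stack as a goal. Again, the function and variable names are illustrative rather than taken from the ReMEmbR repository:

```python
# Hedged sketch of querying the memory built in the previous snippet.
def find_goal(question, embed_text, k=3):
    q = embed_text(question).astype("float32")
    q /= np.linalg.norm(q)
    scores, ids = index.search(q.reshape(1, -1), k)
    best = metadata[ids[0][0]]
    print(f"Best match: '{best['caption']}' seen at t={best['time']}")
    return best["pose"]  # e.g. (x, y, heading) passed to the navigation stack

# goal_pose = find_goal("Where can I get a snack?", embed_text)
```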
Integration with Speech Recognition
Recognizing the need for intuitive user interaction, NVIDIA integrated speech recognition into the ReMEmbR system. Using the WhisperTRT project, which optimizes OpenAI’s Whisper model with NVIDIA TensorRT, the robot can process spoken queries and generate appropriate responses, enhancing user experience.
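As a minimal illustration of the speech front end, the snippet below uses the reference openai-whisper package; in the demo, WhisperTRT accelerates the same model with NVIDIA TensorRT on-device, but the overall flow of transcribing audio and passing the text to the agent is the same (file and model names here are placeholders):

```python
# Minimal speech-to-text sketch with the reference openai-whisper package.
import whisper

model = whisper.load_model("small.en")        # model size is illustrative
result = model.transcribe("user_query.wav")   # e.g. a 16 kHz mono recording
question = result["text"].strip()
print("Heard:", question)

# The transcribed text can then be passed to the ReMEmbR agent as the user query.
```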
Future Prospects
ReMEmbR’s innovative approach to combining generative AI, VLMs, and RAG opens up new possibilities for robotic applications. By providing robots with the ability to reason and act based on extended observations, this technology has the potential to revolutionize fields such as autonomous navigation, surveillance, and interactive assistance.
For those interested in exploring generative AI in robotics, NVIDIA offers extensive resources and documentation through its Developer Program. This includes tutorials, code samples, and community support to help developers get started with their own generative AI robotics applications.