Hello and welcome to the final episode of our deep dive on NVIDIA's GR00T N1. It's been a fascinating journey so far, and now it's time to look forward. What comes after GR00T N1? How is this model evolving, and what does it mean for the future of AI-powered humanoid robots? In this episode, we'll talk about the immediate next step – the GR00T N1.5 update – and then zoom out to the broader implications for the industry and what might lie ahead in the world of generalist robot intelligence.

Let's start with GR00T N1.5, the successor and first major update to N1. NVIDIA introduced N1.5 a few months after N1, incorporating a host of improvements based on lessons learned and new techniques. Think of GR00T N1.5 as GR00T N1 after a round of intensive training and a few smart tweaks – it's smarter, more precise, and even better at understanding language. Some key enhancements in N1.5 include:

- Better Vision-Language Understanding: NVIDIA upgraded the vision-language module (System 2, "The Thinker"). In N1.5, this module (the Eagle VLM) was further tuned to improve grounding – that is, connecting language to the right objects in the scene. For example, if you say "pick up the small green bottle," N1.5 is much more likely to zero in on the correct item than N1 was. They achieved this by training the vision-language model on more data focused on referential understanding (like distinguishing objects by their descriptions) and by changing how the model is frozen and fine-tuned (a rough sketch of that idea appears after this list). The result: in tests, GR00T N1.5 was significantly better at following language instructions accurately, which is crucial for real-world use.

- Diffusion Action Module Tweaks: The action-generating part (System 1, "The Doer") also got improvements. One big change was a new training objective known as FLARE (Future Latent Representation Alignment). Without getting too technical, FLARE helps the model learn from human videos more effectively by aligning what the model predicts will happen with what actually happens in example future video frames (also sketched after this list). This gave N1.5 a boost in learning from watching humans, something N1 was less efficient at. It means N1.5 can pick up skills or refine its movements by observing videos of humans doing tasks, broadening its learning sources.

- Efficiency and Stability: N1.5's architecture was tuned for stability and generalization. NVIDIA simplified some of the connections (like the adapter that connects visual features to the language model) and applied better normalization techniques. These seemingly small changes led to more reliable performance – a bit like tightening the bolts and oiling the joints of an already good machine so it runs even smoother.

- Training Scale: GR00T N1.5 was trained with even more compute and data (and thanks to tools like GR00T-Dreams, NVIDIA could generate a lot of new synthetic training scenarios quickly). For instance, N1.5's new capabilities were trained up in just 36 hours of synthetic data generation and model updating – something that would have taken months with manual data collection. This showcases how far the infrastructure has come: leveraging cloud simulation and powerful GPUs, an improved model can be spun up extremely fast. The quick turnaround from N1 to N1.5 hints that we might see frequent iterations and rapid improvements in these models.
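To make the "freeze the big vision-language model, tune a small connector" idea a bit more concrete, here's a minimal PyTorch sketch. To be clear, this is not NVIDIA's actual GR00T N1.5 code: the backbone interface, the dimensions, and the single Linear + LayerNorm adapter are all assumptions for illustration, loosely mirroring the "frozen VLM plus simplified, well-normalized adapter" description above.

```python
import torch
import torch.nn as nn

class FrozenVLMWithAdapter(nn.Module):
    """Frozen vision-language backbone + a small trainable adapter (illustrative only)."""

    def __init__(self, vlm: nn.Module, vlm_dim: int = 1024, policy_dim: int = 512):
        super().__init__()
        self.vlm = vlm
        for p in self.vlm.parameters():
            p.requires_grad = False          # System 2 ("The Thinker") stays frozen
        # "Simplified" adapter: a single projection plus LayerNorm, echoing the
        # simpler-connections-and-better-normalization idea described above.
        self.adapter = nn.Sequential(
            nn.Linear(vlm_dim, policy_dim),
            nn.LayerNorm(policy_dim),
        )

    def forward(self, image: torch.Tensor, text_tokens: torch.Tensor) -> torch.Tensor:
        # Hypothetical backbone call: assumed to return (batch, vlm_dim) features.
        with torch.no_grad():
            feats = self.vlm(image, text_tokens)
        # Project into the action module's (System 1's) embedding space.
        return self.adapter(feats)
```

In a setup like this, only `model.adapter.parameters()` would be handed to the optimizer, so whatever language grounding the frozen backbone already has is preserved while the adapter learns to feed the action module.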
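And here's a similarly hedged sketch of what a FLARE-style auxiliary objective could look like: the policy predicts a latent for a future observation, and that prediction is pulled toward the embedding of the frame that actually occurred. Again, the module names, dimensions, and the cosine-style loss are illustrative assumptions, not the published FLARE implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FutureLatentAligner(nn.Module):
    """FLARE-style auxiliary head (illustrative sketch, not the real implementation)."""

    def __init__(self, state_dim: int = 512, latent_dim: int = 256):
        super().__init__()
        # Predicts the latent of the observation some steps in the future
        # from the policy's current internal state.
        self.future_head = nn.Linear(state_dim, latent_dim)

    def alignment_loss(self, policy_state: torch.Tensor,
                       future_frame_embedding: torch.Tensor) -> torch.Tensor:
        # policy_state:           (batch, state_dim) from the action model
        # future_frame_embedding: (batch, latent_dim) from a (frozen) video encoder
        pred = F.normalize(self.future_head(policy_state), dim=-1)
        target = F.normalize(future_frame_embedding.detach(), dim=-1)
        # Cosine-style alignment: approaches 0 as the prediction matches the real future.
        return (1.0 - (pred * target).sum(dim=-1)).mean()
```

A loss like this would typically be added to the main action-generation loss with a small weighting factor, with the future-frame embeddings coming from human or robot video, which is what lets the model "learn by watching."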
What do these improvements translate to in terms of performance? In both simulation and real-world evaluations, GR00T N1.5 outshines N1. We mentioned an example earlier: in a test where a real humanoid robot had to pick up a specific fruit (apple vs. orange) and place it on a plate based on a verbal command, N1 was decent but N1.5 was almost flawless – going from roughly 50% success to over 90% in correctly following the command. In simulated benchmarks, N1.5 achieved success rates on language-conditioned tasks that were nearly double N1's in some cases. It's clear that as these models iterate, we're seeing leaps in capability. It's akin to how early versions of self-driving software struggled and improved incrementally; here we're seeing a similar rapid refinement in the robot brain's smarts.

Looking beyond the immediate N1.5, what does the future hold? If we follow the trajectory, we can expect a GR00T N2 eventually, perhaps with even larger scale, maybe integrating more senses (could haptic feedback or sound be next?), and even more general capabilities. The concept of a World Foundation Model has also been hinted at – models that not only handle the robot's immediate actions but also carry an understanding of the world's physics and context (like predicting how things will change over time). We might see future versions incorporate explicit world modeling, essentially giving the robot a mental-simulation ability. Imagine a robot that can internally simulate "if I do this, what will happen?" before it even acts – that could prevent a lot of mistakes (a rough sketch of that idea appears below). Another aspect is the ...
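To give a flavor of what that kind of mental simulation could look like in code, here's a deliberately tiny, speculative sketch: a learned dynamics model rolls candidate action plans forward "in imagination," and the robot only executes the plan whose predicted outcome scores best. Nothing here is an announced GR00T feature; the world model, the scoring function, and all names are hypothetical.

```python
import torch
import torch.nn as nn

class TinyWorldModel(nn.Module):
    """Predicts the next latent state from the current state and an action (toy example)."""

    def __init__(self, state_dim: int = 128, action_dim: int = 16):
        super().__init__()
        self.dynamics = nn.Sequential(
            nn.Linear(state_dim + action_dim, 256),
            nn.ReLU(),
            nn.Linear(256, state_dim),
        )

    def forward(self, state: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        return self.dynamics(torch.cat([state, action], dim=-1))

def imagine_and_pick(world_model: nn.Module, score_fn, state: torch.Tensor,
                     candidate_plans: torch.Tensor) -> torch.Tensor:
    """Roll each candidate plan forward purely in imagination; return the best plan.

    state:            (state_dim,) current latent state
    candidate_plans:  (num_plans, horizon, action_dim) action sequences to try
    score_fn:         maps a predicted final state to a scalar "goodness" score
    """
    best_plan, best_score = None, float("-inf")
    with torch.no_grad():
        for plan in candidate_plans:
            s = state.unsqueeze(0)
            for action in plan:                  # mental simulation, no real motion
                s = world_model(s, action.unsqueeze(0))
            score = float(score_fn(s))
            if score > best_score:
                best_plan, best_score = plan, score
    return best_plan
```

A real system would use far richer latent states and smarter search than this brute-force loop, but the core cycle – predict, score, then act – is the essence of the "simulate before you move" idea.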