
Episode 3: Training a Generalist – How GR00T N1 Learned to Act




About this content

Welcome back! Now that we’ve uncovered the clever architecture of NVIDIA’s GR00T N1, it’s time to answer a big question: how do you teach a robot brain like this? After all, having a fancy design with Vision-Language and Action modules is one thing, but these modules don’t come out of the box knowing anything. They have to learn from data – and in the case of GR00T N1, a lot of data. In this episode, we’ll explore the training process of GR00T N1, which you can think of as the education of a robotic polymath. We’ll talk about what data was used, how it was used, and the sheer scale of the training effort.

Training a generalist robot model is a massive undertaking because the model needs to gain experience across a huge variety of tasks and scenarios. NVIDIA approached this by feeding GR00T N1 a heterogeneous mix of datasets. Instead of relying on one source, they combined many. Here are the major ingredients in GR00T’s training diet:

Real Robot Demonstrations: Actual trajectories and sensor data from real robots performing tasks. For example, a human operator might remotely control a humanoid robot through hundreds of examples of picking up a box or opening a door. These real-world demonstrations provide ground-truth examples of how tasks should be done in physical reality, including all the quirks and noise that come with real sensors and motors.

Human Videos: Think of videos of people doing everyday tasks – cooking, cleaning, stacking objects, using tools. Such videos (possibly sourced from the internet or recorded in lab settings) show how humans interact with objects and their environment. From these, the model can learn what certain actions look like, how objects are typically grasped or manipulated, and the flow of multi-step activities. It’s like showing the robot “here’s how humans do it.” Even though the robot’s body is different, the high-level ideas can be useful.

Synthetic Data from Simulation: This is a big one.
NVIDIA leveraged simulation platforms (such as their Isaac Sim and Omniverse environments) to create synthetic experiences for the model. In simulation, they can spawn endless variations of environments: different room layouts, different objects, random positions, and so on. They can also simulate multiple types of robots (different embodiments, from robotic arms to full humanoids). GR00T N1 was trained on an enormous amount of simulated robot data – think of virtual robots practicing tasks in virtual worlds. The benefit is scale and diversity: you can generate far more data in simulation than you could ever practically collect with real robots, and you can cover corner cases or dangerous scenarios safely. One particular simulation tool mentioned is Isaac GR00T-Dreams, a system that can quickly generate synthetic “neural motion data” by imagining a robot doing tasks in new environments. This kind of tool allowed NVIDIA’s team to produce thousands of unique training scenarios on the fly, dramatically reducing the need for months of manual data collection.

All these sources were blended together to train GR00T N1 end to end. In practice, during training the model would be given a scenario (say, an initial state of a robot and environment, plus an instruction like “move the cube to the shelf”) and would attempt to generate the correct sequence of actions. When it was wrong, the training algorithm adjusted the model’s billions of parameters slightly to improve. Repeat this millions of times with varied tasks and data, and the model gradually learns.

Now, training a model of this complexity is not just about data variety, but also about scale. NVIDIA trained GR00T N1 on their powerful GPU infrastructure. To give you an idea, later versions (like GR00T N1.5) were trained on 1,000 high-end GPUs for on the order of hundreds of thousands of iterations.
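To make the idea of blending heterogeneous data sources concrete, here is a minimal, purely illustrative sketch of mixture-weighted sampling: each training batch draws from one source according to a fixed weight. The source names and weights below are hypothetical stand-ins, not NVIDIA’s actual data recipe.

```python
import random

# Hypothetical mixture weights over the three kinds of data the episode
# describes; the actual proportions used for GR00T N1 are not public.
SOURCES = {
    "real_robot_demos": 0.2,   # teleoperated real-robot trajectories
    "human_videos": 0.3,       # videos of people doing everyday tasks
    "simulation": 0.5,         # synthetic rollouts from simulated worlds
}

def sample_source(rng: random.Random) -> str:
    """Pick the data source for the next training batch by mixture weight."""
    names = list(SOURCES)
    weights = [SOURCES[n] for n in names]
    return rng.choices(names, weights=weights, k=1)[0]

rng = random.Random(0)
counts = {name: 0 for name in SOURCES}
for _ in range(10_000):
    counts[sample_source(rng)] += 1
# Over many draws, the per-source counts roughly track the mixture weights,
# so no single source dominates the model's experience.
```

The design point is simply that the blend is controlled: rebalancing the weights changes how much of the model’s “diet” comes from real demonstrations versus video versus simulation.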
That is an astronomical amount of compute – something only a few organizations in the world can throw at a single AI model. GR00T N1’s training likely involved similar industrial-scale compute. This heavy lifting is necessary because the model is so large (billions of neural network weights) and the task is so complex (learning vision, language, and control all at once). The upside of putting in that much compute is that the single resulting model encapsulates knowledge that would otherwise take many separate, smaller projects to replicate.

Let’s talk a bit more about the training techniques. We mentioned last episode that GR00T N1’s action module uses a diffusion-based approach. During training, the team likely used something called a flow matching or diffusion loss. Without diving too deep into the math, this means they added noise to correct action sequences and trained the model to reverse that noise – effectively teaching it to refine a rough guess of a movement into the precise movement needed. This method helps the model learn ...
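The noise-then-denoise idea described above can be sketched in a few lines. This is a toy illustration of a single flow-matching training step on an action sequence, assuming the common formulation where the network regresses the velocity carrying noise toward clean data; the linear “model,” the dimensions, and the learning rate are all illustrative, not GR00T N1’s real network.

```python
import numpy as np

rng = np.random.default_rng(0)

horizon, action_dim = 16, 7          # e.g. 16 future steps of 7-DoF commands
clean = rng.normal(size=(horizon, action_dim))  # a "correct" demo action chunk
noise = rng.normal(size=(horizon, action_dim))  # pure Gaussian noise

t = rng.uniform()                      # random interpolation time in [0, 1]
noisy = (1.0 - t) * noise + t * clean  # partially noised action sequence

# Flow matching regresses the velocity that moves noise toward clean data.
target_velocity = clean - noise

def toy_model(x: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Stand-in for the diffusion/action module: a single linear map."""
    return x @ w

w = np.zeros((action_dim, action_dim))
pred = toy_model(noisy, w)
loss = np.mean((pred - target_velocity) ** 2)  # mean-squared "flow" loss

# One plain gradient-descent step on the toy weights; in the real system
# this update is what nudges billions of parameters, millions of times.
grad = 2.0 * noisy.T @ (pred - target_velocity) / pred.size
w -= 0.1 * grad
```

At inference time the same learned velocity field is applied in the other direction: start from noise and refine it, step by step, into a precise action sequence.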

