Episode 2: Inside GR00T N1’s Dual-System Brain

About this content

Welcome back to our deep dive on NVIDIA’s GR00T N1. In the last episode, we talked about how this model is ushering in a new era of generalist robots. Now it’s time to get technical and explore what’s inside GR00T N1’s “brain.” Don’t worry – we’ll keep it conversational. The architecture of GR00T N1 is one of its most fascinating aspects because it draws inspiration from the way we humans think and act. NVIDIA describes it as a dual-system architecture, and a great way to think about it is by comparing it to two modes of human cognition: a fast, intuitive side and a slow, reasoning side. Let’s break it down.

GR00T N1 is a VLA model – that stands for Vision-Language-Action. In essence, it combines three key abilities:

Vision: It can see the world through cameras or visual sensors, interpreting what’s around it. For example, it can look at a scene and identify objects, like recognizing a coffee mug on a table or a door in front of it.

Language: It can understand instructions or descriptions in human language. You could tell the robot, “Open the door and fetch the coffee mug,” and GR00T N1’s language understanding kicks in to parse that request.

Action: Based on the vision and language inputs, it can generate the appropriate motor actions to carry out the task. In our example, it would figure out the motions needed to walk to the door, turn the knob, then approach the table and grasp the mug.

What makes GR00T N1 truly special is how it processes these three aspects in a coordinated way. This is where the dual-system architecture comes into play. The model has two major components working hand in hand, which NVIDIA has playfully nicknamed System 2 and System 1 – a nod to psychological theories of human thinking, where System 2 is the slow, deliberate mode and System 1 the fast, intuitive one.

System 2 – “The Thinker”: This is the vision-language module, the part of GR00T N1 that does the understanding and planning. It’s like the robot’s deliberative mind. When the robot sees its environment (through cameras) and hears or reads an instruction, System 2 processes all of that information. Under the hood, System 2 is powered by a large Vision-Language Model (VLM); NVIDIA uses a model codenamed Eagle as part of this. You can think of Eagle as a sophisticated neural network that has learned to connect images with text. System 2 takes in the camera images and any textual command, then reasons about: what’s the scene? What are the objects? What did the human ask me to do? What’s a sensible plan to achieve that? It’s the slower, more analytical part – analogous to how you consciously figure out how to solve a problem or plan a task step by step.

System 1 – “The Doer”: This is the action module, responsible for the robot’s movements. Once System 2 has formed a plan or an intent (“I need to go over there, pick up that mug and bring it back”), System 1 takes over to execute it. But executing a high-level plan involves a lot of continuous decisions: move each joint, maintain balance, adjust grip, and so on. System 1 is designed to handle this fast, reflex-like control. Technically, NVIDIA built System 1 as a Diffusion Transformer (sometimes abbreviated as DiT). If that term sounds complex, think of it this way: System 1 uses a cutting-edge AI technique, inspired by diffusion models (the kind used to generate images from noise), to create smooth and realistic motion sequences for the robot.
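To make that two-part structure concrete, here is a minimal, hypothetical PyTorch sketch of how a dual-system VLA model can be wired together. To be clear, none of this is NVIDIA’s actual code or API: the class names (System2Reasoner, System1ActionHead), the toy encoders, and sizes like action_dim=24 are illustrative assumptions. Only the overall shape follows the description above – a VLM-style reasoner produces an embedding, and that embedding conditions a diffusion-transformer action head.

import torch
import torch.nn as nn

class System2Reasoner(nn.Module):
    # Stand-in for the Eagle-style vision-language model ("The Thinker").
    # The real System 2 is a large pretrained VLM; this toy version just
    # fuses an image feature and a token sequence into one plan embedding.
    def __init__(self, embed_dim=256, vocab_size=1000):
        super().__init__()
        self.image_encoder = nn.Linear(3 * 64 * 64, embed_dim)   # toy camera encoder
        self.text_encoder = nn.Embedding(vocab_size, embed_dim)  # toy token embedding
        layer = nn.TransformerEncoderLayer(embed_dim, nhead=4, batch_first=True)
        self.fuse = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, image, tokens):
        img = self.image_encoder(image.flatten(1)).unsqueeze(1)  # (B, 1, D)
        txt = self.text_encoder(tokens)                          # (B, T, D)
        fused = self.fuse(torch.cat([img, txt], dim=1))          # (B, 1+T, D)
        return fused.mean(dim=1)                                 # (B, D) plan embedding

class System1ActionHead(nn.Module):
    # Stand-in for the diffusion-transformer action module ("The Doer").
    # Given a noisy chunk of future actions, a diffusion timestep, and the
    # System 2 embedding, it predicts the noise to strip away (DDPM-style).
    def __init__(self, action_dim=24, embed_dim=256, num_steps=1000):
        super().__init__()
        self.action_proj = nn.Linear(action_dim, embed_dim)
        self.time_embed = nn.Embedding(num_steps, embed_dim)
        layer = nn.TransformerEncoderLayer(embed_dim, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=4)
        self.out = nn.Linear(embed_dim, action_dim)

    def forward(self, noisy_actions, t, plan_embedding):
        h = self.action_proj(noisy_actions)                        # (B, H, D)
        cond = (plan_embedding + self.time_embed(t)).unsqueeze(1)  # (B, 1, D)
        h = self.backbone(h + cond)                                # condition every timestep
        return self.out(h)                                         # predicted noise (B, H, A)

The detail worth noticing is the interface between the halves: System 1 is conditioned on a single plan embedding, not on raw images or words.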
In effect, System 1 is continuously generating the next split-second of motor commands that gradually turn an idea into action, much like how your subconscious coordinates your muscles without you actively thinking about every muscle twitch. This diffusion-based approach helps produce fluid movements rather than jerky, unnatural ones.

Crucially, these two systems aren’t separate AI minds. They are trained together, as one unified model, so that they complement each other. System 2 gives context and guidance to System 1, and System 1’s capabilities feed back into what System 2 can expect. For example, System 2 might output a plan like “approach the table, then extend the right arm to grab the mug.” System 1 receives that in the form of an embedding or set of parameters and then handles the nitty-gritty of executing those steps in real time. If something is off – say the mug isn’t exactly where expected, or starts to slip – System 1 can adjust on the fly, and System 2 can re-evaluate if needed. It’s a tight coupling, much like how your intuitive actions and conscious thoughts work together seamlessly when you perform a task.

This architecture – having a reasoner and a doer – is novel in robotics at this scale. In the past, a robot might have had separate modules (vision module, planning module, control module) coded separately and interacting in rigid ways. GR00T N1 instead learns a holistic policy: from camera pixels and ...
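To round out the picture, here is an equally hypothetical control-loop sketch built on the two classes from the previous example, illustrating the tight coupling just described. The denoising update is a deliberately crude DDPM-style stand-in (a real sampler would use a proper noise schedule), but the division of labor comes through: on each tick, a System 2 call refreshes the plan embedding from the current observation, and System 1 turns pure noise into a short chunk of future actions conditioned on that plan.

import torch

system2 = System2Reasoner()
system1 = System1ActionHead()

def sample_action_chunk(plan_embedding, steps=10, horizon=16, action_dim=24):
    # Start from pure noise and iteratively denoise it into an action chunk.
    actions = torch.randn(1, horizon, action_dim)
    for step in reversed(range(steps)):
        t = torch.full((1,), step, dtype=torch.long)
        predicted_noise = system1(actions, t, plan_embedding)
        actions = actions - predicted_noise / steps  # crude denoising update
    return actions

image = torch.rand(1, 3, 64, 64)         # fake camera frame
tokens = torch.randint(0, 1000, (1, 8))  # fake tokenized instruction

with torch.no_grad():
    for tick in range(3):                  # a few control ticks
        plan = system2(image, tokens)      # slow path: re-read scene + instruction
        chunk = sample_action_chunk(plan)  # fast path: generate a motion chunk
        # A real loop would stream `chunk` to the joint controllers and grab a
        # fresh camera frame, so a slipping mug shows up in the next plan.
        print(tick, chunk.shape)           # torch.Size([1, 16, 24])

Notice that System 1 never sees the instruction text directly – everything it knows about the task arrives through the plan embedding, which is exactly the “embedding or set of parameters” hand-off described above.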
