Episodes

  • Bedtime Diplomacy: Out-Trolling the Tiny Negotiator
    2024/11/26

    Parenting is a wild ride, and today was no exception. It’s 9:30 PM, and I’m trying to convince my toddler son that sleep is a wonderful idea. We’re snuggled in bed, and I pull out my classic argument: "All your friends are already asleep, buddy. If you want to have fun playing with them tomorrow at school, you need to rest now so you’re not sleepy during playtime."

    To my relief, he starts panning, tilting, and rolling—like a little camera drone settling into its dock. These are the telltale signs he’s beginning to wind down, his body finding its way to dreamland. Optimism fills the room; I imagine a smooth ride into his dreams and, for me, some quiet overtime work.

    Then comes his first demand: more milk. I call his bluff immediately.
    "Hey, we just had two bottles! No way you need more milk. Let’s go to sleep."
    He insists, and just to spice things up, threatens to cry. Classic toddler. I cave.

    I get up, tell him to hold onto his pillow, stay still, and wait for me. I return promptly with a bottle of 50% diluted milk—because, you know, compromise—and dangle it like the prize it is.
    "Let’s go to sleep now," I say, feeling victorious.

    It almost works. Almost.

    Twenty seconds later, his sleepy little brain realizes the milk is still in my hand. He asks for it. Fine. I hand it over, thinking, Now we’re good. Surely this is the endgame.

    But no. He clings to the last sliver of energy in his tiny body and declares he needs… water.

    Now, this is absolute BS. He never drinks water in the bedroom, and the untouched milk bottle in his hands would serve the same purpose. I try reasoning.
    "Hey, the milk will fix your thirst too! Drink that!"

    He threatens to cry again.

    And then it hits me: I’m the toy. I’m the shiny object keeping him from falling asleep.

    So, I try a new strategy. I get up, ask him (once again) to hold onto his pillow, and make intense eye contact to double-confirm he understands: stay still while I get your water.

    Here’s the twist—I don’t come back.

    Fifteen minutes later, he’s sound asleep, and I’m laughing with my wife about the sheer absurdity of the bedtime shenanigans.

    Lessons from the Bedtime Battlefield

    The Toddler:
    Our little guy has mastered the art of trolling adults. He knows exactly when he has the upper hand—bedtime. Crying equals delays, and delays mean the grown-ups lose. It’s mutual destruction, and he knows we fear it. Does he enjoy this? Probably not. But he’ll use it to his advantage anyway.

    The Adult:
    I played the long game. By predicting his energy levels were critically low, I pulled a next-level troll move: walking away from the negotiation table. He ran out of steam without my distraction. Game, set, match.

    Final Takeaways

    For the Toddler:
    Maybe trolling adults isn’t the best long-term strategy. It’s rude, inefficient, and ultimately a waste of everyone’s time—including his.

    For the Adult (me):
    Telling the toddler to wait and then not returning might not be the most trust-building move. Lesson learned.

    For Both of Us:
    Perhaps we’re better off skipping the trolling altogether and embracing some good, old-fashioned bedtime crying.

    And that’s parenting in a nutshell: a chaotic mix of humor, tactics, and lessons learned (mostly by me).

    3 min
  • Sylvia Plath's Ariel: A Cry for Escape
    2024/12/13

    Ariel
    By Sylvia Plath in her own voice

    Stasis in darkness.
    Then the substanceless blue
    Pour of tor and distances.

    God’s lioness,
    How one we grow,
    Pivot of heels and knees!—The furrow

    Splits and passes, sister to
    The brown arc
    Of the neck I cannot catch,

    Nigger-eye
    Berries cast dark
    Hooks—

    Black sweet blood mouthfuls,
    Shadows.
    Something else

    Hauls me through air—
    Thighs, hair;
    Flakes from my heels.

    White
    Godiva, I unpeel—
    Dead hands, dead stringencies.

    And now I
    Foam to wheat, a glitter of seas.
    The child’s cry

    Melts in the wall.
    And I
    Am the arrow,

    The dew that flies
    Suicidal, at one with the drive
    Into the red

    Eye, the cauldron of morning.

    Since my twenties, Sylvia Plath’s story has captivated me. The Bell Jar (1963), often regarded as semi-autobiographical, deeply resonated with me. The protagonist’s journey into a glamorous yet alienating world mirrored my own struggles with identity as a twenty-year-old. Back then, I read Ariel, and its haunting intensity felt like a fantastical, R-rated dream, much like the rest of her work.

    Life has since tempered me, pulling me through challenges that have deepened my empathy, particularly for those less privileged. Today, I can grasp nuances I previously missed. To borrow from South Park: I don’t get it, but I get it. Adding to this, the past two years of raising a child have brought firsthand insight into the societal expectation for mothers to provide unconditional care. When I stumbled upon Ariel yesterday while reading to my son at bedtime on my Kindle, its words struck a new chord. The poem vividly captures the physical and metaphorical experience of waking to a child’s cry—a moment of unease and entrapment.

    Stasis in darkness.

    She is asleep.

    Then the substanceless blue
    Pour of tor and distances.

    Plath’s imagery describes the gradual awakening. Consciousness emerges, shapeless at first, like a tenuous thread extending into the distance.

    God’s lioness,
    How one we grow,
    Pivot of heels and knees! — The furrow

    Her body begins to stir, and she regains control, as if piecing herself back together.

    Splits and passes, sister to
    The brown arc
    Of the neck I cannot catch,

    Fully awakening, she senses a disturbance—perhaps the cry of her child?

    Nigger-eye
    Berries cast dark
    Hooks—
    Black sweet blood mouthfuls,
    Shadows.

    Thirst seizes her, intensifying the disorientation. She hallucinates berries, their taste and texture vivid.

    Something else
    Hauls me through air—
    Thighs, hair;
    Flakes from my heels.

    At last, she is fully awake, tending to her baby—soothing, cooing, and comforting with all the exhaustion of countless nights before.

    White
    Godiva, I unpeel—
    Dead hands, dead stringencies.
    And now I
    Foam to wheat, a glitter of seas.

    Resigned, she lays the crying baby down, drained yet serene.

    The child’s cry
    Melts in the wall.

    Relief washes over her as she collects herself, savoring a brief moment of peace.

    And I
    Am the arrow,
    The dew that flies
    Suicidal, at one with the drive

    She reflects on her life, caught in a suffocating present, with no escape from a past that burdens her or a future that feels equally oppressive.

    Into the red
    Eye, the cauldron of morning.

    And so, she rises, ready to face the morning’s demands.

    Reading Ariel anew, I felt a profound connection to the relentless cycles of care it portrays, though Sylvia Plath’s perspective is far from celebratory. Her words reflect a sense of confusion and oppression, capturing the emotional toll of motherhood as a stifling, inescapable burden rather than a heroic endeavor.

    4 min
  • Christmas Tree Meets Cosmic Shadows
    2025/01/08

    I’ve always wished I had more time to dive into the intricacies of observational astronomy. I want to understand mechanisms that seem too absurd to reason about. What are the margins of error in our knowledge? What does it mean for an acoustic wave to ripple through the early baryonic universe (baryon acoustic oscillations)? What does it mean for space itself to inflate everywhere at once (cosmic inflation)?

    A common version of the Big Bang theory goes like this: In the beginning, the universe was incredibly small and unimaginably hot—so hot that matter was a dense plasma of charged particles in which photons (light) scattered constantly and couldn’t travel freely. Then came rapid inflation, followed by a long stretch of expansion and cooling. Hundreds of thousands of years later, the universe had cooled enough for electrons and nuclei to combine into neutral atoms, and for the first time light could travel freely through space. Because the baryonic matter (ordinary matter) wasn’t evenly distributed, this first light wasn’t evenly dispersed either. We observe it today as the Cosmic Microwave Background (CMB). Over time, the cooling baryonic universe formed galaxies, stars, planets—and eventually dinosaurs.

    Over Christmas, as I sat in my living room with a toddler asleep on my lap, I found myself staring at the shadows cast by our mini Christmas tree in the corner. The warped shadows on the back wall baffled me—it was nearly impossible to deduce the structure of the object creating them. The light source was within the tree itself, small bulbs nestled among the leaves. The overlapping leaves caused the light to pass through layers of complete and partial occlusion. By the time it reached the edge of the tree, the light had lost much of its original shape.

    To the human eye, the resulting patterns appeared fractal. Trying to capture the intricacies was overwhelming—an exercise in frustration that could easily induce a headache. Setting aside the view of the leaves themselves, and focusing solely on the shadows, the tree’s mystery deepened. I could discern the shapes of one or two branches, but describing the overall structure, size, or density of the tree felt impossible.

    I believe astronomers face a similar challenge when studying the CMB. The patterns in the CMB are like shadows cast by the early baryonic universe, which was just translucent enough for light to travel through. As light passed unevenly through this dense "baryonic forest," much like the branches and leaves of the Christmas tree, it left clues about the universe’s early structure.

    But the universe didn’t pause. It continued inflating everywhere at once, carrying the baryons with it. A vast baryonic sphere of unknown shape and dimension emerged, giving rise to the observable universe. We’re made of those baryons and live within that sphere. Looking around, much like a squirrel inside my Christmas tree, we only see the distorted shadows on the wall—the light warped and twisted by countless occlusions.

    I marvel at how astronomers make sense of the fascinating chaos of the CMB. How difficult it must be to draw conclusions from the shadowy fragments of the universe’s earliest light!

    3 min
  • A Refresh of Values: From Execution to Intention
    2025/01/12

    I recently stumbled upon a page from my LogSeq journal, written this past May after a night of deep introspection and conversation (with a bit of wine involved). At the time, I realized that my longtime mode of operation—my Old MO—was no longer serving me. In that journal entry, I explored what that Old MO looked like, why I needed to change, and the shape of the New MO I’m working toward. While I’ve made significant progress since then, revisiting that process reaffirmed my insights, and I felt compelled to share them.

    Old MO: Attachment to Execution, Gullibility, and Resistance

    In my Old MO, I assumed that having an idea meant I needed to execute it immediately. Just as some might jump headfirst into coding a new project without verifying its real-world relevance, I threw myself into life plans or personal ventures simply because they were “my ideas.” I poured effort into their execution—spending hours, days, or even months—without ever pausing to question whether they were genuinely worthwhile.

    This obsession with “getting things done” blinded me to the possibility that many of these ideas were, in fact, not worthy of deep emotional attachment. I’ve come to see that each idea is a product of a moment in my thinking process, and therefore a temporary construct. By treating every fleeting inspiration like a personal cause, I neglected to develop robust criteria for deciding which ideas truly deserved my time and passion. In a sense, I valued the outcome of my mind more than the quality of my mind’s process.

    I also fell prey to societal pressures in a way that made me gullible. I wanted to be a “model citizen”—the type who works diligently, pays taxes, and never questions how to benefit from the very system I was fueling. My reluctance to question norms intersected with a resistance to advice from others. If someone offered pointers about financial success, career paths, or investing, I often dismissed them as “too mainstream” or “too materialistic.” Ironically, that closed attitude prevented me from discovering valuable insights that could have broadened my perspective.

    To make matters worse, I prided myself on deferring to data whenever bigger life decisions arose. But by always waiting for external signs, I surrendered the power to define my own direction. It was as if my personal stance never evolved beyond checking off everyday tasks or chasing short-term wins. Without a deeper guiding principle, my path forward was reactive, shallow, and frequently unfulfilling.

    New MO: Clarity, Openness, Refined Metrics—and Money as a Tool

    In shaping my New MO, I’ve started by focusing on the human element. I respect myself as a person, strive for comfort and peace, and remind myself that life is an exercise of the soul—not merely a race to execute more tasks.

    No Emotional Attachment to Ideas, But to the Process

    A key part of this New MO is recognizing that ideas themselves are not sacred. They are mirages spun by the mind. The real treasure lies in how I generate and evaluate those ideas—my thinking process. By focusing on the process, I can detach from the outcomes and maintain a sense of balance. If an idea fails, that doesn’t mean I fail; it just means I need to refine my process. This perspective shift is huge because it keeps me from spiraling into frustration every time a venture doesn’t pan out as I’d hoped. I also recognize that other people often have better ideas or valuable insights into how to create good ideas, so I’m committed to learning from them.

    Confronting the Fear of Opportunity

    Another shift involves a willingness to confront my fear of opportunity. The scariest moves—like applying for a demanding role, delving into a field like psychology that makes me feel vulnerable, or reaching out to someone influential—won’t kill me. They threaten my ego, not my life. Now, I start by recognizing that I’m genuinely drawn to the potential benefits—be it more money, deeper insights, or powerful connections—then I research, prepare, and finally decide if the opportunity aligns with my broader plan. If the tough, intimidating route ultimately offers greater rewards, I commit to taking it.

    Spend More Time Examining Ideas

    Whether it’s in tech or in life, the mental muscle for examining ideas differs from the muscle for simply executing them. In personal life, this means dedicating real time to introspection. What do I actually desire? Why am I drawn to certain pursuits? Am I focusing too heavily on checking off mundane tasks and ignoring my deeper self? Just like refining my approach in a development cycle, I want to refine the ways I explore, sample, and test new life plans. Creating a long-term strategy—even if it’s just for the next two years—demands that I keep asking fundamental questions. What is fair? What is just? Where am I steering my future? By strengthening this “idea reasoning” muscle, I become more discerning about ...
    9 min
  • Episode 1: Introducing GR00T N1 – A New Era of Generalist Robots
    2025/07/22

    Hello and welcome! In this series, we’re diving deep into NVIDIA’s GR00T N1 model – a groundbreaking development that signals a new era for robotics. GR00T N1 is being hailed as the world’s first open, general-purpose foundation model for humanoid robots. If that sounds like a mouthful, don’t worry – we’ll break it all down over the coming episodes. But first, let’s set the stage for why this is such a big deal.

    For decades, teaching robots new skills has been a slow, painstaking process. Each new task – whether it’s picking up a specific object, folding laundry, or helping in a factory – often required crafting a specialized algorithm or training a model from scratch. Imagine having to retrain a child from zero for every single chore or job – not very efficient, right? Meanwhile, in other areas of AI like language and images, we’ve seen foundation models that learn from vast amounts of data and can generalize to many tasks. Think of large language models that can answer all sorts of questions after one big training journey. Robotics, however, lagged behind; robots were usually specialists, not generalists.

    That’s why NVIDIA’s announcement of GR00T N1 in early 2025 created such a buzz. Unveiled at a major industry conference, GR00T N1 is a humanoid robot foundation model. In simpler terms, it’s like an “AI brain” pre-trained on an enormous breadth of data so that it already has a wealth of general skills and knowledge about the physical world. And here’s the kicker – it’s open and customizable. NVIDIA made it available to the global robotics community, meaning researchers and companies can use this model as a starting point and fine-tune it to their own robots and tasks. This collaborative and open approach is poised to accelerate innovation in a field that’s been traditionally siloed.

    Why does this matter now? One reason is the growing demand for capable robots in many sectors – from warehouses and factories to hospitals and homes – especially in the face of labor shortages around the world. Industries are looking for robots that aren’t one-trick ponies, but can adapt to different jobs and environments. GR00T N1 aims to fill that gap by providing a kind of general-purpose intelligence for robots, analogous to human common sense and learning ability, that can be adapted quickly to new situations. NVIDIA’s CEO even proclaimed that “the age of generalist robotics is here,” highlighting the significance of this milestone.

    Over the next episodes, we will explore what exactly GR00T N1 is capable of, how it was built and trained, and what it means for the future of robotics. We’ll discuss the clever architecture that gives this model both “fast reflexes” and “thoughtful planning” abilities, the massive and diverse training process that taught it about the world, and the early real-world tests that show its potential. By the end, you’ll understand why GR00T N1 is not just another robot algorithm, but possibly the beginning of a new chapter in AI for machines that move and interact with us.

    So, whether you’re a robotics enthusiast, an AI researcher, or just curious about how close we are to having helpful humanoid assistants, stay tuned. In this first episode, we’ve set the scene and the motivation behind GR00T N1 – a push towards robots with a broader understanding. In the next episode, we’ll lift the hood and look at the brain of this robot AI: how does GR00T N1 actually work? Get ready to learn about its unique two-part architecture that’s inspired by human cognition.

    Thank you for joining us for this introduction. See you in Episode 2, where we delve into the inner workings of GR00T N1’s “mind”!

    4 min
  • Episode 2: Inside GR00T N1’s Dual-System Brain
    2025/07/22

    Welcome back to our deep dive on NVIDIA’s GR00T N1. In the last episode, we talked about how this model is ushering in a new era of generalist robots. Now it’s time to get technical and explore what’s inside GR00T N1’s “brain.” Don’t worry – we’ll keep it conversational.

    The architecture of GR00T N1 is one of the most fascinating aspects because it draws inspiration from the way we humans think and act. NVIDIA describes it as a dual-system architecture, and a great way to think about it is by comparing it to two modes of human cognition: a fast, intuitive side and a slow, reasoning side. Let’s break it down.

    GR00T N1 is a VLA model – that stands for Vision-Language-Action. In essence, it combines three key abilities:

    Vision: It can see the world through cameras or visual sensors, interpreting what’s around it. For example, it can look at a scene and identify objects – like recognizing a coffee mug on a table or a door in front of it.

    Language: It can understand instructions or descriptions in human language. You could tell the robot, “Open the door and fetch the coffee mug,” and GR00T N1’s language understanding kicks in to parse that request.

    Action: Based on the vision and language inputs, it can generate the appropriate motor actions to carry out the task. In our example, it would figure out the motions needed to walk to the door, turn the knob, then approach the table and grasp the mug.

    What makes GR00T N1 truly special is how it processes these three aspects in a coordinated way. This is where the dual-system architecture comes into play. The model essentially has two major components working hand in hand, which NVIDIA has playfully nicknamed System 2 and System 1 – a nod to psychological theories of human thinking (often called System 2 for slow thinking and System 1 for fast thinking).

    System 2 – “The Thinker”: This is the vision-language module, the part of GR00T N1 that does the understanding and planning. It’s like the robot’s deliberative mind. When the robot sees its environment (through cameras) and hears or reads an instruction, System 2 processes all that information. Under the hood, System 2 is powered by a large Vision-Language Model (VLM). NVIDIA uses a model codenamed Eagle as part of this – you can think of Eagle as a sophisticated neural network that has learned to connect images with text. System 2 takes in the camera images and any textual command, and then reasons about “What’s the scene? What are the objects? What did the human ask me to do? What’s a sensible plan to achieve that?” It’s the slower, more analytical part – analogous to how you consciously figure out how to solve a problem or plan a task step by step.

    System 1 – “The Doer”: This is the action module, responsible for the robot’s movements. Once System 2 has formed a plan or an intent (“I need to go over there, pick up that mug and bring it back”), System 1 takes over to execute it. But executing a high-level plan involves a lot of continuous decisions – move each joint, maintain balance, adjust grip, and so on. System 1 is designed to handle this fast, reflex-like control. Technically, NVIDIA built System 1 as a Diffusion Transformer (sometimes abbreviated as DiT). If that term sounds complex, think of it this way: System 1 uses a cutting-edge AI technique, inspired by diffusion models (the kind used to generate images from noise), to create smooth and realistic motion sequences for the robot. It’s as if System 1 is continuously generating the next split-second of motor commands that gradually turn an idea into action, much like how your subconscious can coordinate your muscles without you actively thinking about every muscle twitch. This diffusion-based approach helps in producing fluid movements rather than jerky, unnatural ones.

    Crucially, these two systems aren’t separate AI minds – they’re trained together, as one unified model, so that they complement each other. System 2 gives context and guidance to System 1, and System 1’s capabilities feed back into what System 2 can expect. For example, System 2 might output a plan like “approach the table, then extend the right arm to grab the mug.” System 1 receives that in the form of an embedding or set of parameters and then handles the nitty-gritty of executing those steps in real time. If something is off – say the mug isn’t exactly where expected or starts to slip – System 1 can adjust on the fly, and System 2 can also re-evaluate if needed. It’s a tight coupling, much like how your intuitive actions and conscious thoughts work together seamlessly when you perform tasks.

    This architecture – having a reasoner and a doer – is novel in robotics at this scale. In the past, a robot might have had separate modules (vision module, planning module, control module) coded separately and interacting in rigid ways. GR00T N1 instead learns a holistic policy: from camera pixels and ...
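
    To make the division of labor above a bit more concrete, here is a minimal, purely illustrative sketch of a dual-system control loop in Python. It is not NVIDIA’s GR00T N1 code or API: the class names, feature extraction, dimensions, and the toy “denoising” loop are all hypothetical stand-ins, and the only dependency assumed is NumPy.

    # Illustrative sketch only - NOT GR00T N1's implementation. All names and
    # numbers are hypothetical; the point is the System 2 / System 1 split:
    # a slow planner produces a latent plan, a fast controller turns it into
    # a short chunk of motor commands, diffusion-style.
    import numpy as np

    class System2Planner:
        """Slow 'thinker': turns an image plus an instruction into a latent plan vector."""
        def __init__(self, plan_dim: int = 32, seed: int = 0):
            self.rng = np.random.default_rng(seed)
            self.plan_dim = plan_dim

        def plan(self, image: np.ndarray, instruction: str) -> np.ndarray:
            # Stand-in for a vision-language model: squeeze the inputs into a few features.
            feat = np.array([image.mean(), image.std(), float(len(instruction))])
            w = self.rng.standard_normal((self.plan_dim, feat.size))
            return np.tanh(w @ feat)  # latent "plan" embedding

    class System1Controller:
        """Fast 'doer': refines a noisy action chunk toward the plan, diffusion-style."""
        def __init__(self, action_dim: int = 7, horizon: int = 16, steps: int = 4, seed: int = 1):
            self.rng = np.random.default_rng(seed)
            self.action_dim, self.horizon, self.steps = action_dim, horizon, steps

        def act(self, plan: np.ndarray) -> np.ndarray:
            # Start from noise and iteratively pull it toward a plan-conditioned target,
            # loosely mimicking how a diffusion policy produces a chunk of actions.
            target = np.outer(np.linspace(0.0, 1.0, self.horizon), plan[: self.action_dim])
            actions = self.rng.standard_normal((self.horizon, self.action_dim))
            for _ in range(self.steps):
                actions = actions + 0.5 * (target - actions)  # one "denoising" step
            return actions  # horizon x action_dim joint-space commands

    if __name__ == "__main__":
        camera_frame = np.zeros((64, 64, 3))                  # placeholder image
        planner, controller = System2Planner(), System1Controller()
        plan = planner.plan(camera_frame, "pick up the mug")  # slow, re-run every so often
        chunk = controller.act(plan)                          # fast, refreshed far more frequently
        print(chunk.shape)                                    # (16, 7): a short burst of motor commands

    The real model replaces both stand-ins with large neural networks trained end to end, but the control pattern – plan at low frequency, act at high frequency – is the part this sketch is meant to convey.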
    9 min
  • Episode 3: Training a Generalist – How GR00T N1 Learned to Act
    2025/07/22

    Welcome back! Now that we’ve uncovered the clever architecture of NVIDIA’s GR00T N1, it’s time to answer a big question: How do you teach a robot’s brain like this? After all, having a fancy design with Vision-Language and Action modules is one thing, but these modules don’t come out of the box knowing anything. They have to learn from data – and in the case of GR00T N1, a lot of data. In this episode, we’ll explore the training process of GR00T N1, which you can think of as the education of a robotic polymath. We’ll talk about what data was used, how it was used, and the sheer scale of the training effort.

    Training a generalist robot model is a massive undertaking because the model needs to gain experience in a huge variety of tasks and scenarios. NVIDIA approached this by feeding GR00T N1 with a heterogeneous mix of datasets. Instead of relying on one source, they combined many. Here are the major ingredients in GR00T’s training diet:

    Real Robot Demonstrations: This includes actual trajectories and sensor data from real robots performing tasks. For example, a human operator might remotely control a humanoid robot to perform hundreds of examples of picking up a box or opening a door. These real-world demonstrations provide ground truth examples of how tasks should be done in physical reality, including all the quirks and noise that come with real sensors and motors.

    Human Videos: Think of videos of people doing everyday tasks – cooking, cleaning, stacking objects, using tools. Such videos (possibly sourced from the internet or recorded in lab settings) show how humans interact with objects and their environment. From these, the model can learn concepts like what certain actions look like, how objects are typically grasped or manipulated, and the flow of multi-step activities. It’s like showing the robot “here’s how humans do it.” Even though the robot’s body is different, the high-level ideas can be useful.

    Synthetic Data from Simulation: This is a big one. NVIDIA leveraged simulation platforms (like their Isaac Sim and Omniverse environments) to create synthetic experiences for the model. In simulation, they can spawn endless variations of environments: different room layouts, different objects, random positions, and so on. They can also simulate multiple types of robots (different embodiments, from robotic arms to full humanoids). GR00T N1 was trained on an enormous amount of simulated robot data – think of virtual robots practicing tasks in virtual worlds. The benefit here is scale and diversity: you can generate more data in simulation than you could ever practically collect with real robots, and you can cover corner cases or dangerous scenarios safely. One particular simulation tool mentioned is Isaac GR00T-Dreams, a system that can generate synthetic “neural motion data” quickly by imagining a robot doing tasks in new environments. This kind of tool allowed NVIDIA’s team to produce thousands of unique training scenarios on the fly, dramatically reducing the need for months of manual data collection.

    All these sources were blended together to train GR00T N1 in an end-to-end fashion. Practically, during training, the model would be given a scenario (say, an initial state of a robot and environment, plus an instruction like “move the cube to the shelf”) and it would attempt to generate the correct sequence of actions. When it was wrong, the training algorithm adjusted the model’s billions of parameters slightly to improve. Repeat this millions of times with varied tasks and data, and the model gradually learns.

    Now, training a model of this complexity is not just about data variety, but also about scale. NVIDIA trained GR00T N1 on their powerful GPU infrastructure. To give you an idea, later versions (like GR00T N1.5) were trained on 1,000 high-end GPUs for on the order of hundreds of thousands of iterations. That is an astronomical amount of compute, something only a few organizations in the world can throw at a single AI model. GR00T N1’s training likely involved similar industrial-scale compute. This heavy lifting is necessary because the model is so large (billions of neural network weights) and the task is so complex (learning vision, language, and control all at once). The upside of putting in that much compute effort is that once trained, the single resulting model encapsulates knowledge that would otherwise take many separate smaller projects to replicate.

    Let’s talk a bit more about the training techniques. We mentioned last episode that GR00T N1’s action module uses a diffusion-based approach. During training, they likely used something called a flow matching or diffusion loss. Without diving too deep into math, this means they added noise to correct action sequences and trained the model to reverse that noise – effectively teaching it to refine a rough guess of a movement into the precise movement needed. This method helps the model learn ...
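
    To make the idea of a flow-matching / diffusion-style loss slightly more concrete, here is a toy sketch in Python. It is not the actual GR00T N1 training code: the linear “model,” the shapes, and the learning rate are all hypothetical, and the only dependency assumed is NumPy. The sketch just shows the core move the episode describes – blend a demonstrated action chunk with noise and regress the model’s output onto the direction leading from the noise back to the demonstration.

    # Illustrative sketch only - NOT GR00T N1's training code.
    import numpy as np

    rng = np.random.default_rng(0)
    horizon, action_dim = 16, 7          # length of an action chunk and joints per step

    def flow_matching_pair(clean_actions: np.ndarray):
        """Blend a clean action chunk with noise; the regression target is the
        'velocity' pointing from the noise back toward the clean actions."""
        noise = rng.standard_normal(clean_actions.shape)
        t = rng.uniform()                              # random blend time in [0, 1]
        noisy = (1.0 - t) * noise + t * clean_actions  # point on the noise-to-data path
        velocity = clean_actions - noise               # direction the model should predict
        return noisy, velocity, t

    # One toy "training step": a linear stand-in model predicts velocity from the noisy chunk.
    W = np.zeros((action_dim, action_dim))
    clean = rng.standard_normal((horizon, action_dim))     # pretend demonstration chunk
    noisy, velocity, t = flow_matching_pair(clean)
    pred = noisy @ W
    loss = np.mean((pred - velocity) ** 2)
    grad = 2.0 * noisy.T @ (pred - velocity) / noisy.size  # gradient of the MSE w.r.t. W
    W -= 0.1 * grad                                        # one gradient-descent update
    print(f"t={t:.2f}  loss={loss:.3f}")

    In a real training run, this single step would be repeated over millions of sampled chunks, noise levels, tasks, and embodiments, with the linear stand-in replaced by the large diffusion transformer described in the previous episode.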
    10 min
  • Episode 4: Skills and Smarts – What GR00T N1 Can Do
    2025/07/22

    Welcome back to our GR00T N1 deep dive. So far, we’ve covered the “what” and the “how” – what GR00T N1 is made of and how it was trained. Now it’s time for the exciting part: What can GR00T N1 actually do? In this episode, we’ll explore the capabilities of this model and why they’re a big leap forward for robotics. We’ll talk about the tasks it can perform, how well it performs them, and how it stacks up against previous methods.

    The ultimate goal of GR00T N1 is to give robots a broad set of generalized skills. Straight out of training, without heavy specialization, GR00T N1 can tackle a range of common manipulation tasks. These include things like:

    Grasping objects: The model can control a robot to reach out and grasp items, whether it’s picking up a tool, a toy, or a package. Importantly, it can handle single-handed grasps as well as using two hands together for larger objects.

    Moving and placing objects: Once it picks something up, it can move it to a desired location. This could be as simple as moving a box from the floor to a shelf, or as involved as rearranging objects on a table following instructions.

    Hand-to-hand transfers: GR00T N1 even learned behaviors like passing an object from one hand to the other. Imagine a humanoid robot that picks up a can with its left hand, then transfers it to the right hand to place it on a higher shelf – that kind of coordinated bimanual action is within the model’s repertoire.

    Multi-step tasks: Because of the “thinker” part of the model, it can plan multiple steps in sequence. So if you tell the robot, “open the cabinet, then take out the bowl and put it on the counter,” GR00T N1 can break that down: open door (one action sequence), reach for bowl (next sequence), place bowl (next). It keeps the context of the overall goal so it can chain these skills together in the right order.

    What’s truly impressive is that GR00T N1 can generalize these skills to new combinations and contexts. It wasn’t explicitly pre-programmed for each exact scenario. Instead, because it has seen so many variations during training, it can adapt on the fly. For example, it learned the concept of “grasping” in general, so it can apply it to objects it hasn’t seen before, within reason. Or it understands the notion of left vs. right hand, so it can decide to switch hands if a task would be easier that way.

    Now, how well does it do these things? NVIDIA and researchers put GR00T N1 through a battery of tests, both in simulations and in some real-world trials. In simulation benchmarks, GR00T N1 outperformed previous state-of-the-art models that were trained for each specific task (like imitation learning models specialized to certain environments). For instance, on standard robotic test tasks (such as stacking blocks or navigating a simple obstacle course to reach an object), GR00T N1 achieved higher success rates than models that didn’t have its benefit of broad training. This is remarkable because those specialized models had an advantage: they were tuned just for that task, whereas GR00T N1 was more of a generalist. Yet the foundation model’s massive training gave it an edge, demonstrating the power of breadth of knowledge.

    One key area of evaluation was multiple robot embodiments. GR00T N1 was tested on controlling different kinds of robots – not just one specific humanoid. In simulations, they tried its brain on, say, a dexterous two-armed system and also on a different bipedal robot, and possibly even wheeled robots. The results showed that GR00T N1 could adapt with minimal adjustment, whereas traditionally you’d need to train a new model from scratch for each robot. This cross-embodiment skill is a game changer: it hints that we could have a single intelligent model that powers many kinds of robots in the future, much like one operating system can run on different hardware.

    Beyond simulation, let’s talk about real-world demonstrations, because that’s the true test. One milestone was deploying GR00T N1 on a real humanoid robot known as the Fourier GR-1. The GR-1 is a human-sized, bipedal robot developed by a company called Fourier Intelligence. Using GR00T N1 as its brain, the GR-1 was tasked with some language-conditioned bimanual manipulation tasks – essentially, the robot was given verbal instructions to use both its hands to do something, like “pick up the two objects and put them together” kinds of tasks. The outcome? The robot performed impressively well, completing the tasks with high success rates. Even more striking was the data efficiency observed – the team didn’t need to collect months of new data on the real robot to make this work. They did a light fine-tuning with a small amount of robot-specific data, and GR00T N1 was able to generalize its learned skills to this actual machine’s body. Achieving fluent bimanual action on a physical humanoid is a big step forward, since coordinating two arms, vision, and ...
    11 min