If you follow AI, it is hard to miss the constant stream of clips showing humanoids walking around warehouses, robot arms folding laundry, and headlines about “embodied intelligence.” At the same time, if you walk through most factories, hospitals, or homes, you will not see many of these robots in day‑to‑day use. That disconnect is what pushed this essay into existence.
Over the past few months, a lot of my attention has been on robotics: where the funding is flowing, what the most serious labs are publishing, and what is actually getting deployed in the wild rather than in tightly controlled demos. What emerges is an industry sitting on a fault line. One side is piled high with capital and genuinely new AI models. The other is stuck in the very physical problem of moving atoms in messy environments, at the right speed, for the right price. The contrast shows up clearly in the numbers. In 2024, robotics investment hit roughly 21 billion dollars, bolstered by defense spending and the belief that “robot foundation models” are the next platform. Cross‑embodiment AI companies alone reportedly raised around 5 billion, about 150 percent more than the previous year. Yet in that same period the world installed only about 622,000 industrial robots, barely more than the number of cars Tesla sells in a quarter.
This piece is an attempt to explain that gap in a way that still leaves space for optimism. My view is that four bottlenecks matter the most right now: how robot data is distributed, how quickly systems can react, how hardware form factors are evolving, and how the current business structure channels all of this into real deployments.
The Distribution Problem
Large language models had the luxury of training on something like “all public text on the internet,” which is on the order of trillions of tokens. Robotics is nowhere near that. The biggest public numbers I have seen for manipulation data are measured in hundreds of thousands of trajectories or a few hundred thousand hours of control, and even the most ambitious private datasets are still at that scale.
The issue is not only how much data exists, but what kind of data it needs to be. Language models can learn a lot from passive observation. They do not need to see the exact keyboard motions that produced a sentence. They only need the sentence itself. Robots do not have this luxury. A robot needs data that pairs sensory streams with specific actions, and ideally with what should have been done when something goes wrong. That means timestamps, joint angles, Cartesian poses, forces, tactile readings, gripper state, and success or failure labels, all aligned at thirty Hertz or more. It is what Sergey Levine called “the internet of robot data,” and it simply does not exist yet.
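To make that concrete, here is a minimal sketch of what a single time-aligned record in such a dataset might look like. The field names, shapes, and rates are illustrative rather than any particular lab's format:

```python
from dataclasses import dataclass, field
from typing import Optional
import numpy as np

@dataclass
class RobotStep:
    """One time-aligned sample in a manipulation trajectory (illustrative schema)."""
    timestamp: float               # seconds since episode start
    rgb: np.ndarray                # (H, W, 3) camera frame
    joint_angles: np.ndarray       # (7,) radians for a 7-DoF arm
    ee_pose: np.ndarray            # (7,) Cartesian position + quaternion
    wrench: np.ndarray             # (6,) force/torque at the wrist
    tactile: Optional[np.ndarray]  # fingertip sensor readings, if available
    gripper_open: float            # 0.0 (closed) .. 1.0 (open)
    action: np.ndarray             # commanded joint deltas for this step
    language_instruction: str      # e.g. "pick up the red mug"

@dataclass
class Episode:
    steps: list[RobotStep] = field(default_factory=list)
    success: bool = False          # outcome label for the whole attempt

# At 30 Hz, a single 60-second demonstration is roughly 1,800 of these records.
```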
Take Generalist AI’s GEN‑0 as a concrete example. It is reportedly trained on roughly 270,000 hours of real‑world manipulation and grows by around ten thousand hours per week. In historical robotics terms this is huge. Compared with the data that powered GPT‑4 and its peers, it is still several orders of magnitude smaller. And even within those hours, a lot of the raw streams are not equally useful. A single teleoperation session can produce gigabytes of logs, but most of that is routine motion. The truly instructive moments are sparse: the tiny adjustments before a grasp, the way contact forces behave as an object starts to slip, the recovery strategy when a plan fails.
Physical Intelligence’s π0 model is, to me, the clearest proof that there is a path forward despite this. It is trained on demonstrations from about thirty different robot embodiments and roughly nine hundred thousand trajectories. One policy can drive arms, quadrupeds, and other bodies. That is remarkable and suggests that if you can pool data across many platforms, you can compensate for the fact that no single robot will ever see everything. The catch is that creating this level of diversity requires either heavy bespoke infrastructure, or very clever ways of multiplying sparse real data with simulation and human video.
Two Paths Forward
With that data backdrop in mind, it is easier to understand how the industry is splitting into two camps.
The first camp is made up of companies that focus on narrow, well‑defined tasks: warehouse picking, parcel sortation, pallet handling, and a handful of similar jobs. Their constraint is not so much intelligence as reliability. They use teleoperation and large fleets to gather enormous amounts of data on a small set of patterns in a controlled environment. The result is task‑specific models that hit eighty to ninety‑five percent of human performance on those tasks and can run more or less unattended. Covariant is a good example. Its system learns from “warehouse‑scale” data collected from hundreds of robots that all see the same families of boxes, totes, and shelves, which makes continuous fleet learning tractable.
The second camp is chasing generality. These are the groups that want a single policy that can walk into an unseen kitchen and cook breakfast, or into a new factory and adapt to whatever assembly procedure is running that day. They cannot rely on brute‑forcing data for every environment. Instead they lean on three ingredients. First, large corpora of human videos to capture the richness of real manipulation. Second, internet‑scale vision–language pretraining to give their models a deep grounding in semantics. Third, simulation to generate many variations on physical tasks that would be prohibitively expensive to collect purely in the real world.
Architecturally, the most interesting change in the past year has been the emergence of what I think of as “two‑brain systems.” One “brain” is built on a vision–language model that reasons in a slow, symbolic way about goals and plans. The other is a fast controller that runs at tens of Hertz and cares about torques, trajectories, and contacts. Physical Intelligence’s π0.5, DeepMind’s Gemini Robotics stack, and NVIDIA’s GR00T all follow this rough shape. A high‑level planner decides what to do, and a low‑level policy decides how to move.
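To make the shape concrete, here is a toy version of that division of labor. The `vlm_planner`, `low_level_policy`, and `robot` objects are stand-ins rather than any real stack's API, and a production system would run the two loops in separate processes rather than one thread:

```python
import time

PLANNER_PERIOD_S = 1.0   # slow "what to do" brain, replans at roughly 1 Hz
CONTROL_PERIOD_S = 0.02  # fast "how to move" brain, acts at roughly 50 Hz

def run_two_brain_loop(vlm_planner, low_level_policy, robot, instruction):
    """Illustrative loop: the planner emits a list of subgoal strings, the
    low-level policy turns the current subgoal into motor commands."""
    subgoals = vlm_planner(instruction, robot.observe())  # e.g. ["grasp mug", "place in sink"]
    last_replan = time.monotonic()

    while subgoals:
        obs = robot.observe()

        # Slow brain: revisit the plan occasionally, or after an interruption
        if time.monotonic() - last_replan > PLANNER_PERIOD_S:
            subgoals = vlm_planner(instruction, obs)
            last_replan = time.monotonic()
            if not subgoals:          # planner decided the task is complete
                break

        # Fast brain: one low-level action per tick toward the current subgoal
        action, subgoal_done = low_level_policy(obs, subgoals[0])
        robot.act(action)
        if subgoal_done:
            subgoals.pop(0)

        time.sleep(CONTROL_PERIOD_S)
```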
The payoff is that these systems can now handle instructions that are closer to how humans talk. You can tell a robot “clear the table” in a home it has never seen before. It can search for relevant objects, avoid knocking things over, and do something reasonable. More impressively, if you interrupt halfway through and say “actually, leave the glass and the phone,” it can change its plan without starting from scratch. That level of online adaptation would have been considered science fiction not long ago.
The Reaction Time Problem
Even with better architectures, one hard constraint has not budged much: reaction time.
Most frontier stacks today still sit around two hundred to three hundred milliseconds from sensing to action. That is perfectly adequate if the robot is lifting a stationary box off a shelf or placing an item into a bin. It is not adequate if the robot is supposed to catch a falling object, hand you a tool without fumbling, or operate safely right next to you in a tight space.
Humans adjust their movements at roughly ten Hertz. To feel natural and safe, a robot that is interacting with people has to approach that rhythm. Eric Jang popularized the phrase “ultra instinct” for reactions in the sub‑fifty‑millisecond range, which is about where fluid human–robot collaboration starts to feel possible. Getting there is not as simple as “run the same model faster.”
There are at least three intertwined problems. First, you need a way to compress smooth motion into a representation that is both expressive and efficient, so that you can predict and update actions quickly. This is the action tokenization problem that work like FAST and BEAST tries to address. Second, you need more of the model to run on‑device, which means you cannot just push everything through a giant cloud‑hosted model at interactive rates without facing brutal cost curves. Third, you need an end‑to‑end design where perception, planning, and control are all optimized together for latency, not treated as separate afterthoughts.
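To give a flavor of the first problem, here is the naive baseline that work like FAST and BEAST improves on: binning each action dimension independently into discrete tokens. The real methods compress whole action chunks (frequency-domain or spline coefficients) rather than single timesteps, which is where their efficiency comes from; this sketch only shows the baseline idea:

```python
import numpy as np

def tokenize_actions(actions: np.ndarray, low: float = -1.0, high: float = 1.0,
                     n_bins: int = 256) -> np.ndarray:
    """actions: (T, D) normalized continuous actions -> (T, D) integer tokens."""
    clipped = np.clip(actions, low, high)
    scaled = (clipped - low) / (high - low)                 # map to [0, 1]
    return np.minimum((scaled * n_bins).astype(int), n_bins - 1)

def detokenize_actions(tokens: np.ndarray, low: float = -1.0, high: float = 1.0,
                       n_bins: int = 256) -> np.ndarray:
    """Invert the binning by returning each bin's center value."""
    return low + (tokens + 0.5) / n_bins * (high - low)

chunk = np.random.uniform(-1, 1, size=(50, 7))   # one second of 7-DoF actions at 50 Hz
tokens = tokenize_actions(chunk)
recovered = detokenize_actions(tokens)
print(np.abs(chunk - recovered).max())           # max error is half a bin width, ~0.004 here
```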
The economics of that cloud path are particularly brutal. Querying a large VLM in the cloud costs on the order of $0.01-0.10 per decision. At one decision per second, that is 86,400 calls a day, or roughly $860 to $8,600 for continuous operation; even the low end is a monthly rent payment every single day. Edge deployment on $200 hardware? Forget about it. You’re forced to choose: either powerful models at prohibitive cost, or stripped-down versions with limited capabilities.
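The arithmetic behind that claim fits in a few lines; the per-call prices are rough assumptions rather than any provider's published rates:

```python
# Back-of-the-envelope cloud inference cost for a robot querying a large VLM
# once per second, around the clock.
calls_per_day = 24 * 60 * 60                  # 86,400 decisions at 1 Hz
for price_per_call in (0.01, 0.05, 0.10):     # dollars per call, assumed range
    daily = price_per_call * calls_per_day
    print(f"${price_per_call:.2f}/call -> ${daily:,.0f}/day")
# $0.01/call -> $864/day
# $0.05/call -> $4,320/day
# $0.10/call -> $8,640/day
```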
The direction that makes sense to me is a hybrid one. For tasks that genuinely require deep reasoning—understanding a new instruction, planning a long sequence, recovering from a novel failure—the robot can afford to query a large model that lives in the cloud or on a local server. For everything else, especially fast reflexes and routine motions, it should rely on smaller, distilled models running locally at high frequency. Several of the leading groups are converging on something like this division of labor.
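In code, the split is less about model architecture than about deciding when to pay for the slow path. A minimal sketch, with hypothetical `local_policy` and `remote_planner` callables and made-up trigger conditions:

```python
def control_step(obs, state, local_policy, remote_planner):
    """One tick of a hybrid controller: the big model is consulted only on rare triggers."""
    needs_deep_reasoning = (
        state.new_instruction is not None      # operator said something new
        or state.consecutive_failures >= 3     # local recovery is not working
    )
    if needs_deep_reasoning:
        # Slow path: may take hundreds of milliseconds and cost real money,
        # so it should fire a handful of times per task, not 50 times a second.
        state.subgoals = remote_planner(obs, state.new_instruction or state.goal)
        state.new_instruction = None
        state.consecutive_failures = 0

    # Fast path: runs locally on every tick, well inside the latency budget.
    current_subgoal = state.subgoals[0] if state.subgoals else None
    return local_policy(obs, current_subgoal)
```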
The Form Factor Lock-In
💡 Key Insight: The cost of building custom hardware is what’s really holding back robotics innovation. You can iterate on software daily, but hardware iterations take months and millions of dollars.
Here’s the hardware dilemma nobody wants to talk about: developing a new robot form factor costs $50-100 million minimum. That’s before you’ve sold a single unit. This creates a vicious cycle where companies stick to proven designs (arms for factories, quadrupeds for inspection) even when better forms might exist for specific applications.
The economics break down like this:
- Initial design and engineering: $20-30M
- Tooling and manufacturing setup: $20-40M
- Safety certification and testing: $10-20M
- Minimum viable production run: $10-20M
Physical Intelligence chose to be hardware-agnostic for precisely this reason. Rather than betting on any single form factor, they’re building generalist policies that work across different robot bodies. It’s a hedge against hardware lock-in and acknowledges an uncomfortable truth: we still don’t know what the “right” robot design looks like for most tasks.
⚠️ Reality Check: Tesla’s Optimus and Figure’s humanoid bets are $2+ billion gambles that the human form factor is optimal for human environments. The jury is still very much out.
What we desperately need is hardware that can be reconfigured as easily as software. Modular robotics companies like Halodi are exploring this path, but the complexity compounds quickly. Every joint, actuator, and sensor adds not just cost but integration challenges that ripple through the entire system.
Meanwhile, Chinese manufacturers are taking a different approach: flood the market with cheap, “good enough” hardware and let the software developers figure out what to do with it. Unitree’s $16,000 humanoid and $1,600 quadruped are loss leaders designed to establish market dominance. It’s the smartphone playbook applied to robotics, and it might actually work.
The Simulation Question
Because real robot data is expensive and slow to collect, every serious lab is looking for ways to manufacture experience. The main approaches fall into three buckets, each with very different trade‑offs.
The first bucket is high‑throughput physics simulation. Systems like ManiSkill3 keep an entire world on the GPU and can generate tens of thousands of frames per second, complete with contacts, collisions, and rigid‑body dynamics. This is incredibly useful for exploring large numbers of variations or for training policies that need millions of steps. The weakness is that it is hard to model all the small annoyances of the real world: plastic bags that stick to grippers, elastic objects that stretch in strange ways, cables that tangle, and combinations of friction and compliance that do not match the simulator’s assumptions. Policies trained purely in sim tend to break on those details.
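To illustrate why this style of simulation is so attractive for coverage, here is a schematic of a batched environment where thousands of worlds step in lockstep and each gets its own randomized physics. The `BatchedSimEnv` class is a stand-in for the idea, not ManiSkill3's actual API:

```python
import numpy as np

class BatchedSimEnv:
    """Placeholder for a GPU-batched simulator: all environments advance together."""
    def __init__(self, num_envs: int, obs_dim: int = 32, act_dim: int = 7):
        self.num_envs, self.obs_dim, self.act_dim = num_envs, obs_dim, act_dim

    def reset(self) -> np.ndarray:
        # Per-environment domain randomization of physics parameters
        self.friction = np.random.uniform(0.3, 1.2, self.num_envs)
        self.object_mass = np.random.uniform(0.05, 2.0, self.num_envs)
        return np.zeros((self.num_envs, self.obs_dim), dtype=np.float32)

    def step(self, actions: np.ndarray):
        # A real simulator integrates contacts and rigid-body dynamics on the GPU;
        # this placeholder only returns dummy tensors of the right shape.
        obs = np.random.randn(self.num_envs, self.obs_dim).astype(np.float32)
        reward = np.zeros(self.num_envs, dtype=np.float32)
        done = np.zeros(self.num_envs, dtype=bool)
        return obs, reward, done

env = BatchedSimEnv(num_envs=4096)
obs = env.reset()
for _ in range(100):                          # 100 steps x 4,096 envs = 409,600 frames
    actions = np.random.uniform(-1, 1, (env.num_envs, env.act_dim))
    obs, reward, done = env.step(actions)
```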
The second bucket is “world models.” Instead of enforcing hard physics, they learn to predict future observations given actions. Wayve’s GAIA‑2 and DeepMind’s Genie 3 are examples. GAIA‑2 can roll out multi‑camera driving scenes and systematically vary rare events. Genie 3 can generate coherent, navigable 3D environments from text prompts that persist over time. The advantage is realism and diversity. The downside is control. It is still very hard to tell these models “apply exactly this amount of force at exactly this angle” and trust that they will respect that constraint.
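The interface these models expose is worth spelling out, because it explains both the appeal and the control problem: predict the next observation from the current one plus an action, with no explicit physics anywhere in the loop. A hypothetical sketch:

```python
import numpy as np

class WorldModel:
    """Stand-in for a learned dynamics model (think GAIA-2 or Genie 3, which
    expose far richer conditioning than this toy interface)."""
    def predict(self, obs: np.ndarray, action: np.ndarray) -> np.ndarray:
        # In a real system this is a large video / latent-dynamics network.
        return obs  # placeholder

def imagine_rollout(model: WorldModel, obs: np.ndarray, policy, horizon: int = 16):
    """Roll a policy forward entirely inside the learned model, in 'imagination'."""
    trajectory = [obs]
    for _ in range(horizon):
        action = policy(trajectory[-1])
        trajectory.append(model.predict(trajectory[-1], action))
    return trajectory
```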
The third line of attack tries to sidestep both issues by learning more from the real world itself. One angle is to treat simulation as a strong prior, and then train separate correction models that learn how to adjust policies when they hit the messy edge cases of reality. Work like ASAP falls into this category. Another is to use human video directly. Labs such as Physical Intelligence and others are training models on huge amounts of internet video, trying to infer the latent actions behind what people do with their hands and bodies.
I find this last strategy particularly compelling. It uses the real distribution of human behavior rather than a synthetic approximation. You still need some robot data to “ground” the model in the specifics of your hardware, but the ratio changes. Meta’s V‑JEPA is a good illustration: pretrain on roughly a million hours of video, then fine‑tune with something like sixty hours of targeted robot interaction, and you can get surprisingly strong manipulation performance.
What’s Actually Deployed
All of this would be abstract if not for the deployment numbers, which are sobering but useful as context.
There are roughly 4.3 million industrial robots in operation worldwide. Most of them are in automotive and electronics plants, doing the standard jobs: welding, painting, pick‑and‑place, assembly. By 2025, something like 16 million service robots will have been deployed, with a little over half of them using some form of AI for autonomy.
Humanoids have crossed an important psychological threshold by getting their first paying deployments. Agility’s Digit has been working at a Spanx facility. Figure’s robot is being piloted at BMW for parts handling. On the mobility side, Waymo is now providing on the order of hundreds of thousands of robotaxi rides per week across several US cities.
Distribution by country is extremely uneven. South Korea has more than a thousand robots per ten thousand manufacturing workers, roughly one robot for every ten people in that sector. The United States sits around three hundred per ten thousand. The global average hovers around one hundred and fifty.
The key point is that most of these robots are still stuck in narrow, repetitive roles. The companies that are clearly profitable today are the ones automating very specific functions such as warehouse picking, industrial sortation, or inventory movement. Truly general‑purpose manipulation is still mostly confined to pilots and staged demonstrations.
Industry Structure Taking Shape
Underneath all of this, the industry is starting to settle into a three‑layer structure.
At the top are the labs and companies training broad foundation models for robot control. Physical Intelligence, Skild, DeepMind, NVIDIA and a few others sit here. They need enormous amounts of capital, diverse fleets of robots, significant compute infrastructure, and teams that are comfortable at the intersection of machine learning, controls, and systems engineering. Only a small number of players can operate at this level.
The middle layer consists of vertical deployment companies. These are the ones that take general models and adapt them to specific industries or environments. Their advantages are not only in modeling but also in relationships, integration expertise, safety and regulatory knowledge, and the proprietary data that comes from being embedded at customer sites. There is an interesting feedback loop here. Foundation labs need diverse real‑world data, which deployment companies produce. Deployment companies need stronger base models, which foundation labs provide.
The bottom layer is hardware. Here the fragmentation is enormous. Humanoids like Figure, 1X, and Tesla Optimus target general tasks in human environments. Collaborative arms aim at safe, close‑proximity industrial work; Universal Robots alone has over a hundred thousand cobots deployed. Specialized platforms—quadrupeds for inspection, snake robots for pipes and tunnels, micro‑drones for confined spaces—focus on tasks where mobility constraints dominate.
The open business question is where most of the value will accumulate. If you believe the software platform analogy, then the general model providers capture the lion’s share. If you look at the history of industrial robotics, integration and service companies often own the deepest customer relationships. The reality may be somewhere in between, and we do not yet know which mix of platform and vertical integration will win.
The Timeline Ahead
Any attempt to sketch a timeline is speculative, but it is still useful to put a few stakes in the ground.
Between 2025 and 2026, it is reasonable to expect the first wave of AI‑controlled robots to replace humans in certain narrow verticals at noticeable scale. Task‑specific robotics companies will move from dozens or hundreds of deployed units into the thousands, and some will reach meaningful revenue numbers.
From 2026 through 2030, general models should improve markedly as more heterogeneous data and fleet‑level reinforcement learning come online. Robots will spread further into core sectors like manufacturing, farming, and construction. At the same time, the balance of data collection will likely start to shift away from pure teleoperation toward approaches that use human video and more efficient on‑policy learning.
In the 2030 to 2035 window, it is not crazy to imagine humanoids or other general bodies handling tasks that require longer‑horizon planning and more adaptive behavior. Manufacturing may start to see robots building key components of other robots, accelerating the growth loop. Governments will almost certainly pay more attention to actuator supply chains and rare‑earth minerals as strategic assets.
Somewhere between 2035 and 2045, if all of the constraints keep getting pushed, we may see something close to embodied general intelligence in the economic sense. That would mean robots matching human‑level performance across a broad range of tasks that contribute materially to GDP, supported by huge global fleets and continuous online learning. At that point, the automation wave will reach not only factories and fields but also service work, healthcare, education, and domestic life.
All of this depends on continued progress in at least four areas at once: scaling data, narrowing the sim‑to‑real gap, reaching sub‑fifty‑millisecond dexterity where it matters, and giving robots enough memory and reasoning to handle multi‑step tasks without unravelling. Any one of those could easily turn out to be harder than optimistic roadmaps suggest.
What to Watch
Rather than trying to read the future from individual demo videos, I find it more helpful to track a few simple indicators.
📊 Key Metrics: Deployment numbers, task diversity, data throughput, reaction time benchmarks, and sim-to-real transfer rates tell you more about progress than flashy demos.
One is deployment: how many robots are actually working in production environments for paying customers, not sitting in pilot programs. Another is task diversity: how many distinct, clearly different tasks a typical deployed robot can handle well, rather than how many variations of one task. A third is data throughput: how many hours of diverse, multi‑embodiment robot experience the leading labs can collect or synthesize each week. That number will dictate how quickly their models can improve.
Reaction time is another obvious one. Until systems reliably move from a couple of hundred milliseconds toward that sub‑fifty‑millisecond regime, truly dexterous, human‑safe manipulation will stay out of reach. Finally, there is sim‑to‑real transfer: what fraction of capabilities that look good in simulation work the first time they are tried on real hardware. That single statistic is a decent proxy for how valuable synthetic data really is.
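For anyone who wants to track these indicators over time, a toy snapshot object might look like the following; every field and number here is illustrative rather than an established benchmark:

```python
from dataclasses import dataclass

@dataclass
class ProgressSnapshot:
    robots_in_paid_production: int          # deployments, not pilots
    distinct_tasks_per_robot: float         # task diversity
    fleet_hours_collected_per_week: float   # data throughput
    median_reaction_ms: float               # sensing-to-action latency
    sim_capabilities_tried: int
    sim_capabilities_worked_first_try: int

    @property
    def sim_to_real_transfer_rate(self) -> float:
        if self.sim_capabilities_tried == 0:
            return 0.0
        return self.sim_capabilities_worked_first_try / self.sim_capabilities_tried

snap = ProgressSnapshot(
    robots_in_paid_production=1200,
    distinct_tasks_per_robot=3.5,
    fleet_hours_collected_per_week=10_000,
    median_reaction_ms=220.0,
    sim_capabilities_tried=40,
    sim_capabilities_worked_first_try=22,
)
print(f"sim-to-real transfer: {snap.sim_to_real_transfer_rate:.0%}")  # 55%
```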
Why This Matters
The stakes here are not just technical.
Manufacturing, logistics, and agriculture together employ billions of people and generate tens of trillions of dollars in output. Even partial automation in those sectors will produce strong winner‑take‑most dynamics. The geopolitical layer is also hard to ignore. Asia, and China in particular, dominates robot deployment, hardware manufacturing, and key parts of the supply chain. China controls most rare‑earth refining and a very large fraction of magnet production. The United States and parts of Europe still have an advantage in AI software and compute. The question is whether that software lead can be converted into hardware and manufacturing strength fast enough, or whether existing industrial capacity will compound into an unshakeable edge elsewhere.
Socially, we are just at the beginning of the conversation. As automation spreads, whole categories of work will be reshaped or disappear. Debates over basic income, new social safety nets, and how to distribute gains from automation are already surfacing. Companion robots and androids will intersect with trends in loneliness and declining birth rates in ways we are not fully prepared for. Concentrated control over robot fleets and infrastructure raises questions about power and democratic oversight.
On the technical side, wrestling with robotics pushes AI in directions that pure language work could sometimes avoid: long‑term memory, online adaptation, safety guarantees, strict latency constraints, and real uncertainty about the world. Progress there will feed back into other areas of AI whether we intend it or not.
The Open Questions
A few questions remain genuinely open in my mind.
Which form factors will dominate by volume. Humanoids are attractive because our environments are built for human bodies, but they are complex and expensive. Cobots have better near‑term economics but narrower scope. Specialized platforms can be extremely capable in a niche and useless outside it. It is not obvious where the mass will land.
What the right training mix looks like. Nobody knows the optimal balance between real robot data, simulation, and human video for a given task. That answer will almost certainly vary by domain. We are only starting to get empirical hints as labs iterate.
How far simulation can really go. The synthetic‑to‑real ratio is improving, but it is not clear where it will plateau. There may always be corners of reality that resist clean modeling and force us to collect data the hard way.
When the “robots building robots” loop turns on in a serious way. Once robots can reliably assemble the parts of other robots, costs should fall and deployment should accelerate. Estimates for when that happens in a meaningful sense range from early 2030s to something closer to 2040.
And finally, how quickly labor markets and policy can adjust. Previous automation waves took decades to absorb. The pace of change in AI and robotics is much faster, and the institutional plumbing needed to manage that shift is not yet in place.
Resources for Going Deeper
If you want to explore this space more, there are a few categories of material worth looking at.
On the technical side, work like π0 and π0.5 from Physical Intelligence, Gemini Robotics from DeepMind, GEN‑0 from Generalist AI, and NVIDIA’s GR00T gives a sense of how cross‑embodiment and hierarchical control are being approached. Research on action tokenization and low‑latency control, such as FAST, BEAST, and Dex1B, dives into the reaction‑time side of the problem.
For industry and investment context, F‑Prime Capital’s “State of Robotics 2025,” Coatue’s analysis of why robotics will not have a single “ChatGPT moment,” and the International Federation of Robotics’ reports on global deployment and government programs are useful reality checks.
Finally, for commentary and synthesis, blogs and talks from people like Sergey Levine, Rohit Bandaru, and others who live close to both the research and the deployments are particularly helpful in cutting through hype.
🚀 The Bottom Line: The field has moved past the question of whether useful, broadly capable robots are possible. The path is visible even if it is steep and uneven. What we are doing now is the long, unglamorous part: engineering our way there, one dataset, one policy update, and one deployment at a time.