Over the past year, the embodied AI sector has kept heating up. From unmanned machine tools humming on factory floors and logistics parks running round-the-clock autonomous sorting, to surgical robotic arms moving inside hospital operating rooms, physical-world productivity has been lifted by more than an order of magnitude.
In this trillion-yuan arena, embodied data is turning into the “water, electricity and coal” of the embodied AI industry – a strategic, infrastructure-level resource.
Imagine a scene: a robot arm steadily grips a cup of milk, adjusts its force and smoothly moves the cup toward a person, all while keeping joint vibrations extremely low. The cup barely wobbles. The arm delivers the milk securely into the person’s open hand and only releases its gripper after confirming a firm hold. Not a drop is spilled.
What looks simple actually involves layered machine perception and behavioral logic. Just as the mobile internet depends on signal towers lining streets and alleys, robots need a data infrastructure platform to learn physical common sense and handle real-world scenarios.
A consensus has been forming around this point. At CES 2026, Jensen Huang drove it home with a blunt statement: “Without real-world data, embodied intelligence is just a hallucination.” The line jolted the industry.
Capital is now pouring in from multiple directions, raising expectations for what the embodied data industry can become.
Part 1: The race for embodied data intensifies
In China’s embodied data race, several deep-tech specialists have broken through.
Lightwheel AI, founded in 2023, focuses on turning the complexity of the real physical world into data that robots can learn from. It started with only a handful of backers. By the first quarter of this year, it had closed a fresh 1 billion yuan funding round, drawing in three manufacturing heavyweights – New Hope Group, AUX and Sanan Optoelectronics – along with institutional investors such as JIC Huake and DHLT Investment. The round’s lineup of nearly 10 investors has pushed the company to unicorn status.
Another player, ArcheBase, has been in business for less than five months but has already secured nearly 100 million yuan in signed orders. It builds data-compilation infrastructure connecting the physical world to robot model training and recently completed an angel round worth tens of millions of yuan. Notably, its shareholders are four embodied AI companies – a sign that embodied AI players and data platforms are binding themselves together more tightly.
Big internet names are also jumping in, each with a distinct playbook. Alibaba uses Alibaba Cloud as its technical base and runs an embodied data loop through Cainiao and Amap. JD.com leans on its retail and logistics scenarios to assemble data infrastructure and a compliant data trading platform. Baidu’s AI Cloud has teamed up with industry partners to build an embodied intelligence data supermarket.
In terms of positioning, Lightwheel AI and ArcheBase have staked out two core segments of embodied data production, front-end synthetic generation and mid-stage standardized processing, and are building deep vertical infrastructure in each. Alibaba, JD.com and Baidu, by contrast, take an ecosystem approach, drawing on their existing strengths in industry scenarios, cloud services and open-source communities. Their businesses complement the vertical specialists, and together they strengthen the data foundation for robot training.
Zooming out to a global view, Google DeepMind already runs Open X-Embodiment, the world’s largest open-source dataset of real-robot data, serving lab, home, industrial and outdoor scenarios. NVIDIA’s Isaac Sim provides a simulation platform where developers can handle robot modeling, algorithm validation and synthetic data generation inside photorealistic virtual environments.
Whether it’s the rise of focused tech firms or the entry of global giants, one thing is clear: embodied data platforms are becoming the infrastructure that connects data training to real-world deployment for the embodied AI industry.
As infrastructure improves, embodied AI companies are showing up to buy. Demand is currently concentrated among large-model teams, big tech firms at home and abroad, and robotics startups. They are grabbing data as fast as it is produced – “buying whatever is available” – which is sending platform revenues soaring.
Take Lightwheel AI. Its customer list is a who’s-who: NVIDIA, DeepMind, ByteDance, Alibaba and other top global large-model companies, plus leading robotics players such as GalBot and Agibot. Its 2025 revenue grew tenfold, and Q1 2026 revenue is already expected to surpass the whole of last year.
In Silicon Valley’s embodied AI circles, the name Lightwheel is coming up more and more often. Several investors believe the company combines a technology moat, ecosystem reach and commercial growth potential, making it a candidate to become the “common denominator” of embodied data.
Yet for all the momentum, the embodied data industry is still in its early days, and a set of pain points is holding it back.
Part 2: No good data, no embodied AI
Three main approaches to data are competing inside the industry: synthetic simulation data, real-world robot data and human demonstration data. The camps are now openly clashing.
The synthetic camp argues that real data is merely a “patch” for robot learning and that eventually only massive volumes of simulated data will truly evolve a robot’s brain. The real-data camp treats raw real-world data as the gold standard, the only way to move robots from the lab into real jobs. The third camp records multimodal information – visuals, motion, force and haptics – as humans perform tasks in real environments, letting robots “learn on the shoulders of humans.” Its proponents say this approach offers low cost and strong generalization in tackling the messiness of the physical world.
Beneath this debate over approaches, the whole embodied data sector faces a severe data drought.
First, the gap in high-quality embodied interaction data is enormous. Teaching a robot to pour milk requires at least 100 data samples capturing different angles and forces. The entire industry can currently supply one sample. To generalize across different cups, cartons and tilt angles, about a million such samples are needed – a gap of a hundred thousand times.
Second, robot data collection is painfully expensive. Training a robot to “hand a scalpel to a surgeon in a hospital” involves visual, tactile and motion-trajectory data. Acquiring such multimodal data can cost more than 1,000 times that of plain text, a burden small and medium-sized players simply cannot afford.
Third, data quality is often poor. Most existing data comes from controlled lab setups and lacks generalization. A warehouse robot that achieves a 98 percent grasp success rate in the lab can see that number drop below 50 percent in a real warehouse, thwarted by shadows and shelf clutter. It hardly makes for a reliable helper.
On top of that, a disconnect along the industry chain is squandering resources. Many embodied data service providers still follow the traditional big-data order-fulfillment logic – deliver the contracted volume, finish the labeling, job done – without understanding what feature data an embodied AI algorithm actually needs. An algorithm team wants force-distribution data for grasping different objects, but the data supplier hands over piles of repetitive object visuals. The order appears fulfilled, but nearly half the data is useless, burning cash and slowing down R&D across the entire industry.
Finally, data silos and compliance shackles are capping the industry’s overall efficiency. In a recent podcast with tech journalist Zhang Xiaojun, Lightwheel AI CEO Xie Chen noted that “most robotics companies keep their data locked inside their own systems. Different firms label and collect the same type of data repeatedly, wasting resources and making it difficult to assemble standardized, high-quality datasets that cover a full range of scenarios.”
In home services, medical care and other consumer-facing applications, the data itself is privacy-sensitive, yet rules on data rights and compliant circulation are still missing. Companies fear both regulatory trouble and leaking their competitive edge, so nobody opens up proprietary data. The silos only get taller.
It is time to fix these chronic issues and take the industry up a level.

Part 3: Three steps to fix embodied data
Players are already trying different remedies.
Among the internet giants, JD.com is a full-stack player building end-to-end data infrastructure and a trading platform, while Tencent leans more toward providing the underlying computing backbone as an ecosystem connector. Each plays to its strengths in constructing a computing base for robotics.
Vertical specialist Lightwheel AI uses virtual simulation to ease the high cost and limited scale of real-world robot data collection. ArcheBase has built its own data-compilation engine, aiming to fill data gaps, raise precision and support large-scale deployment.
On the robot-maker side, Agibot has tackled the scarcity of high-quality industrial-grade real-robot data by releasing a fully open-source dataset. More than 80 percent of the training data for a mainstream embodied model now comes from that dataset, giving industry-wide data sharing a real push.
Separately, guided by the Ministry of Industry and Information Technology and initiated by the OpenAtom Foundation, Leju Robot has joined forces with universities and enterprises to build a community for an open-source embodied intelligence dataset, pooling resources to tackle data silos and high collection costs.
Out of these moves, a clear three-step path is emerging. In the short term, adopt a hybrid model – synthetic data as the foundation, real-world data to fine-tune performance – and popularize lightweight, easy-to-use collection tools to rapidly bring down data costs. In the medium term, tear down data silos by jointly establishing unified data standards for different scenarios, cutting out useless duplicate effort at the source. Over the long run, build compliant data-sharing platforms with privacy-preserving computation and clear data rights, turning the money companies waste on duplicate infrastructure into shared public goods for the whole industry.
Embodied AI is now at a tipping point between lab demos and at-scale deployment. For a long time, the spotlight shone mostly on large models and robot hardware. Now it is time to strengthen the upstream infrastructure. This data-driven marathon hasn’t reached the finish line yet, but the day when general-purpose robots truly walk into daily life is worth waiting for.

