Move over large language models — the new frontier in AI is world models that can understand and simulate reality.
Why it matters: Such models are key to creating useful AI for everything from robotics to video games.
- For all the book smarts of LLMs, they currently have little sense of how the real world works.
Driving the news: Some of the biggest names in AI are working on world models, including Fei-Fei Li, whose World Labs announced Marble, its first commercial release.
- Machine learning veteran Yann LeCun reportedly plans to launch a world model startup when he leaves Meta in the coming months.
- Google and Meta are also developing world models, both for robotics and to make their video models more realistic.
- Meanwhile, OpenAI has posited that building better video models could also be a pathway toward a world model.
As with the broader AI race, it's also a global battle.
- Chinese tech companies, including Tencent, are developing world models that include an understanding of both physics and three-dimensional data.
- Last week, the United Arab Emirates-based Mohamed bin Zayed University of Artificial Intelligence, a growing player in AI, announced PAN, its first world model.
What they're saying: "I've been not making friends in various corners of Silicon Valley, including at Meta, saying that within three to five years, this [world models, not LLMs] will be the dominant model for AI architectures, and nobody in their right mind would use LLMs of the type that we have today," LeCun said last month at a symposium at the Massachusetts Institute of Technology, as noted in a Wall Street Journal profile.
How they work: World models learn by watching video or digesting simulation data and other spatial inputs, building internal representations of objects, scenes and physical dynamics.
- Instead of predicting the next word, as a language model does, they predict what will happen next in the world, modeling how things move, collide, fall, interact and persist over time.
- The goal is to create models that understand concepts like gravity, occlusion, object permanence and cause-and-effect without having been explicitly programmed on those topics.
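The contrast above can be sketched in a toy example. This is purely illustrative (not any company's actual system): the "world state" is just the height and velocity of a falling ball, and a hand-coded gravity step stands in for the learned next-state predictor a real world model would train from video and simulation data.

```python
# A language model predicts the next token; a world model predicts the
# next state of the world. Here the "model" is a simple physics step.

G = 9.81  # gravitational acceleration, m/s^2

def predict_next_state(height, velocity, dt=0.1):
    """Roll the world forward one time step of dt seconds.

    A trained world model would learn this dynamics function from data;
    here we hand-code it to show what 'predicting what happens next'
    means for a physical state instead of a word sequence.
    """
    new_velocity = velocity - G * dt
    new_height = max(0.0, height + new_velocity * dt)  # ground at h = 0
    return new_height, new_velocity

# Start with a ball at rest, 10 meters up, and simulate one second.
state = (10.0, 0.0)
for _ in range(10):
    state = predict_next_state(*state)
print(f"height after 1 s: {state[0]:.2f} m, velocity: {state[1]:.2f} m/s")
```

The loop is the key idea: a world model's output (the next state) feeds back in as its next input, letting it simulate forward in time, which is what makes such models useful for planning in robotics and for generating physically consistent video.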
Context: A related concept is the "digital twin," where companies create a digital version of a specific place or environment, often fed real-time data from sensors to allow remote monitoring or maintenance predictions.
Between the lines: Data is one of the key challenges. Those building large language models have been able to get most of what they need by scraping the breadth of the internet.
- World models also need a massive amount of information, but the data they require isn't as consolidated or readily available.
- "One of the biggest hurdles to developing world models has been the fact that they require high-quality multimodal data at massive scale in order to capture how agents perceive and interact with physical environments," Encord president and co-founder Ulrik Stig Hansen said in an email interview.
- Encord offers one of the largest open-source datasets for world models, with 1 billion data pairs across images, videos, text, audio and 3D point clouds, as well as a million human annotations assembled over months.
- But even that is just a baseline, Hansen said. "Production systems will likely need significantly more."
What we're watching: While world models are clearly needed for a variety of uses, whether they can advance as rapidly as language models remains uncertain.
- Clearly, though, they're benefiting from a fresh wave of interest and investment.