Date: Monday, May 11
Start Time: 9:35 am
End Time: 10:25 am
LLMs such as OpenAI’s GPT models have fascinated the public with their astounding capabilities on a wide range of tasks, including genius-level standardized test performance, Olympiad-level math reasoning, encyclopedic Q&A and human-like conversation. However, LLMs are only “book intelligent”: they suffer from inherent limitations in embodied, physical and social reasoning, and in strategic planning, in the real world.
In this talk, we’ll explain how to build a true world model, rather than a video generator, that simulates all actionable possibilities of the real world, supporting purposeful reasoning and planning via thought experiment rather than mere pixel realism. We will also explain how to build a true agent model, rather than an LLM wrapper or software pipeline, that learns and acts with the flexibility, adaptability and autonomy associated with natural agents such as humans, and that has intrinsic abilities of self-regulation, reflection, collaboration and socialization rather than merely reacting to exogenous stimuli. We propose a Generative Latent Prediction architecture for world modeling that builds on a stateful latent space; long-horizon, closed-loop, action-conditioned latent reasoning; and learning and inference grounded in realizable world states. In addition, we propose a Goal-Identity-Configurator architecture for agent modeling that can regulate its reasoning between unconsciously reactive and consciously deliberative modes, generate real-world actions based on its goals and identity, and self-learn offline from the world model via reinforcement learning.
Finally, we will present PAN, a physical, agentive and nested framework built on the proposed architectures that brings together perception, state, action and causality within one system to support open-domain, interactable world simulation and agentive intelligence. Extensive experiments show that PAN achieves strong performance in action-conditioned world simulation, long-horizon forecasting and simulative reasoning compared with other video generators, world models and agentic systems.

