Traditional computer vision excels at answering “what is in the image,” but many real systems need to know “what is happening” and “what will happen next.” World models address this by maintaining an internal, temporally consistent representation of the environment and using it to predict future states. This talk introduces world models at a conceptual level (state, memory, dynamics and rollout), then grounds the discussion in practical computer vision use cases, including autonomous driving, robotic manipulation, activity understanding and simulation-based planning. We’ll cover common architecture patterns (perception backbones, latent state representations, temporal modeling with RNNs, transformers and state-space models) and explain why world models are typically systems rather than single monolithic networks. We’ll also show how multi-sensor fusion (camera, LiDAR, radar, IMU) enables more robust world representations and why current deployment realities force hybrid edge/cloud designs. Attendees will leave with a clear conceptual picture of what a world model is and a sense of emerging directions for building and using them.
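
To make the state / memory / dynamics / rollout vocabulary concrete, here is a minimal sketch in PyTorch of the pattern most latent world models share: a perception backbone encodes each frame into a latent, a recurrent cell carries the state forward in time, and a rollout loop predicts future latents from actions alone. The module names, dimensions, and the GRU-based, action-conditioned dynamics are illustrative assumptions for this sketch, not the architecture of any specific system covered in the talk.

```python
# Minimal sketch of the state / memory / dynamics / rollout pattern.
# All sizes and design choices here are assumptions for illustration only.
import torch
import torch.nn as nn


class TinyWorldModel(nn.Module):
    def __init__(self, latent_dim=128, action_dim=4):
        super().__init__()
        # Perception backbone: encodes each RGB frame into a latent observation.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, latent_dim),
        )
        # Recurrent dynamics: the hidden state is the model's memory of the scene,
        # updated from the current latent observation and the action taken.
        self.dynamics = nn.GRUCell(latent_dim + action_dim, latent_dim)
        # Prediction head: decodes the state into the next expected latent.
        self.predict_next = nn.Linear(latent_dim, latent_dim)

    def observe(self, frames, actions, state=None):
        """Update the internal state from real observations."""
        B, T = frames.shape[:2]
        if state is None:
            state = torch.zeros(B, self.dynamics.hidden_size, device=frames.device)
        for t in range(T):
            z = self.encoder(frames[:, t])          # "what is in the image"
            state = self.dynamics(torch.cat([z, actions[:, t]], dim=-1), state)
        return state

    def rollout(self, state, future_actions):
        """Predict future latents from actions alone (no new frames)."""
        predictions = []
        for t in range(future_actions.shape[1]):
            z_hat = self.predict_next(state)        # "what will happen next"
            state = self.dynamics(torch.cat([z_hat, future_actions[:, t]], dim=-1), state)
            predictions.append(z_hat)
        return torch.stack(predictions, dim=1)


if __name__ == "__main__":
    model = TinyWorldModel()
    frames = torch.randn(2, 5, 3, 64, 64)           # batch of 5-frame clips
    actions = torch.randn(2, 5, 4)                  # e.g. steering / velocity commands
    state = model.observe(frames, actions)          # build a temporally consistent state
    future = model.rollout(state, torch.randn(2, 10, 4))
    print(future.shape)                             # torch.Size([2, 10, 128])
```

Swapping the GRU cell for a transformer or state-space block changes the memory mechanism but not the overall loop, which is one reason world models in practice are systems of interchangeable components rather than a single monolithic network.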

