The AI industry spent the last decade scaling up. The next decade will be about scaling down: AI that runs on your device, reasons over what you see and hear, and understands context that never leaves your pocket.
Drawing on our recent work, I’ll show how to make this shift algorithmically. Extreme quantization down to sub-2-bit precision doesn’t just shrink models; it forces them into entirely new representations. KV cache compression and speculative decoding make interactive latency possible within mobile power budgets. Lightweight vision models bring real-time understanding to cameras. And temporal compression lets models process hours of video on constrained hardware.
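To make "sub-2-bit" concrete, here is a minimal sketch of ternary weight quantization: weights snapped to {-1, 0, +1} with one scale per output channel, roughly 1.58 bits per weight. This is an illustrative NumPy toy, not the method from our work; the `ternary_quantize` helper and its 0.7 threshold heuristic are assumptions chosen for demonstration.

```python
# Illustrative sub-2-bit (ternary) weight quantization sketch.
# Weights map to {-1, 0, +1} with a per-row scale (~1.58 bits per weight).
import numpy as np

def ternary_quantize(w: np.ndarray, sparsity_threshold: float = 0.7):
    """Quantize a (out_features, in_features) weight matrix to ternary values.

    Returns (q, scale): q is int8 in {-1, 0, +1}, scale has shape (out_features, 1).
    """
    # Weights with magnitude below this per-row threshold are snapped to zero
    # (a common ternarization heuristic; the 0.7 factor is an assumption here).
    delta = sparsity_threshold * np.mean(np.abs(w), axis=1, keepdims=True)
    q = np.where(np.abs(w) > delta, np.sign(w), 0.0).astype(np.int8)

    # Per-row scale: mean magnitude of the weights that survived, which
    # minimizes the L2 reconstruction error for fixed ternary codes.
    nonzero = np.abs(q) > 0
    counts = np.maximum(nonzero.sum(axis=1, keepdims=True), 1)
    scale = np.sum(np.abs(w) * nonzero, axis=1, keepdims=True) / counts
    return q, scale

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) * scale

if __name__ == "__main__":
    w = np.random.randn(4, 16).astype(np.float32)
    q, s = ternary_quantize(w)
    err = np.linalg.norm(w - dequantize(q, s)) / np.linalg.norm(w)
    print(f"relative reconstruction error: {err:.3f}")
```

A production pipeline would quantize with the model in the loop (quantization-aware training or calibration data), which is where the "entirely new representations" emerge; this sketch only shows the storage format that makes the memory savings possible.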
These techniques unlock new experiences. Sub-billion-parameter reasoning models will power private contextual assistants that understand your routines and daily trade-offs without anything leaving your device. Efficient vision encoders will enable real-time segmentation and object tracking for live video effects, and metric depth estimation on glasses for AR and spatial understanding. Temporal compression will make on-device video search over personal recordings practical.
The AI of choice won’t be the biggest model. It will be the one that’s always there.

