Date: Monday, May 11
Start Time: 4:50 pm
End Time: 5:20 pm
Today’s edge AI hardware was built for CNNs, but vision-language models (VLMs) have completely different bottlenecks, especially in safety-critical, latency-sensitive applications like in-cabin automotive intelligence. While CNN inference is stateless, parallel, and compute-bound, VLMs introduce a large, growing KV cache and a sequential, memory-bound decode phase that quickly overwhelms traditional NPU memory subsystems. In this talk, we explain why conventional TOPS-focused designs fail for edge VLM workloads and outline a new approach that combines model optimization, attention-aware cache hierarchies, and disaggregated architectures tuned separately for prefill and decode. Attendees will learn how hardware-software co-design and memory-centric architectures can unlock order-of-magnitude improvements in latency and efficiency for next-generation embedded vision systems.
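The memory-bound nature of decode can be seen with simple arithmetic. The Python sketch below estimates the KV-cache footprint of a hypothetical edge VLM and compares memory-transfer time to compute time for a single decode step; all model dimensions, bandwidth, and TOPS figures are illustrative assumptions, not numbers from the talk.

    # Back-of-the-envelope sketch of why VLM decode is memory-bound on edge NPUs.
    # All figures below (layers, heads, head_dim, context length, bandwidth, TOPS)
    # are illustrative assumptions for a hypothetical ~3B-parameter FP16 VLM.

    def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_elem=2):
        """Total KV cache size: K and V stored per layer, per head, per token."""
        return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem

    def decode_step_times(param_bytes, kv_bytes, flops_per_token, mem_bw_gbs, peak_tops):
        """Compare memory-streaming time vs. compute time for generating one token."""
        bytes_moved = param_bytes + kv_bytes              # weights + KV cache re-read each step
        t_mem = bytes_moved / (mem_bw_gbs * 1e9)          # seconds spent moving data
        t_compute = flops_per_token / (peak_tops * 1e12)  # seconds of pure compute
        return t_mem, t_compute

    if __name__ == "__main__":
        layers, kv_heads, head_dim = 32, 8, 128           # assumed model shape (GQA)
        seq_len = 4096                                    # image tokens + prompt + generated text
        params = 3e9
        param_bytes = params * 2                          # FP16 weights
        kv = kv_cache_bytes(layers, kv_heads, head_dim, seq_len)
        flops_per_token = 2 * params                      # ~2 FLOPs per parameter per token

        t_mem, t_comp = decode_step_times(param_bytes, kv, flops_per_token,
                                          mem_bw_gbs=50, peak_tops=40)
        print(f"KV cache at {seq_len} tokens: {kv / 1e6:.0f} MB")
        print(f"Per-token decode: memory {t_mem*1e3:.1f} ms vs compute {t_comp*1e3:.3f} ms")

Under these assumed numbers the KV cache alone grows to roughly half a gigabyte at a 4K context, and each decode step spends hundreds of times longer streaming weights and cache than computing, which is why raw TOPS is a poor predictor of VLM latency and why prefill and decode benefit from separately tuned memory paths.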

