Vision-focused edge AI, where CNNs ruled, was compute-bound. Today's vision-language models (VLMs) have flipped the script: with up to 70B parameters and 200K+ token contexts, they are memory-bound, shifting the bottleneck from compute to memory. KV caches overwhelm on-chip SRAM, data movement saturates memory bandwidth, and traditional NPUs fall short. In this talk, we dissect the architectural differences between CNNs and VLMs and the memory bottlenecks that hinder edge deployment. We then examine solutions that make production VLMs feasible at the edge: advanced cache hierarchies (rolling state, tiered SRAM→DRAM) and disaggregated memory (NPU-to-system-RAM offload).
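
As a rough illustration of the rolling-state idea (a sketch only; one common realization is a sliding window over recent tokens, and names such as `RollingKVCache` and `window` are hypothetical, not tied to any particular NPU runtime), the snippet below bounds the per-layer KV footprint to a fixed number of recent tokens, so the resident cache scales with the window size rather than with the full context length:

```python
# Minimal sketch of a rolling (sliding-window) KV cache.
# Hypothetical names; not a specific framework's API.
import numpy as np

class RollingKVCache:
    """Keep only the most recent `window` tokens of K/V per layer, so the
    working set stays within a fixed budget instead of growing with a
    200K+ token context."""

    def __init__(self, num_layers: int, window: int, num_heads: int, head_dim: int):
        self.window = window
        # Pre-allocated ring buffers: [layers, window, heads, head_dim]
        self.k = np.zeros((num_layers, window, num_heads, head_dim), dtype=np.float16)
        self.v = np.zeros_like(self.k)
        self.pos = 0  # total tokens processed so far

    def append(self, layer: int, k_tok: np.ndarray, v_tok: np.ndarray) -> None:
        """Write the K/V vectors for the current token, overwriting the oldest slot."""
        slot = self.pos % self.window
        self.k[layer, slot] = k_tok
        self.v[layer, slot] = v_tok

    def step(self) -> None:
        """Advance after every layer has written the current token."""
        self.pos += 1

    def view(self, layer: int):
        """Return the resident K/V entries for a layer, in temporal order."""
        n = min(self.pos, self.window)
        idx = np.arange(self.pos - n, self.pos) % self.window
        return self.k[layer, idx], self.v[layer, idx]
```

In a tiered design, entries evicted from this window could be demoted to DRAM (or system RAM via offload) rather than discarded; the sketch shows only the fixed-footprint core.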

