We’re all trying to move LLM-class intelligence onto edge cameras and sensors so video can be understood—and acted on—in real time. In this talk, we’ll explain why this is hard by contrasting the compute and memory behavior of CNNs versus LLMs/VLMs. CNNs exploit locality, weight sharing and predictable memory reuse, so compute scales roughly linearly with pixel count and maps cleanly onto today’s NPUs. Transformers shift the bottleneck to quadratic attention (O(N²) compute and memory), high-bandwidth KV-cache traffic and large, dense matmuls—turning memory pipelines, not TOPS, into the limiting factor for latency, power and cost. We’ll then survey practical paths forward: quantized small models, hybrid CNN–transformer pipelines, hardware-aware attention (e.g., FlashAttention and sparsity) and constrained task-specific models that reduce sequence length or operate in compressed domains. We’ll close with survey results on real-world use cases, chipset readiness and deployment pain points.
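The linear-versus-quadratic contrast above can be made concrete with a back-of-envelope FLOP sketch. All dimensions below (channel counts, sequence lengths, head size) are illustrative assumptions chosen for round numbers, not measurements of any particular model:

```python
# Back-of-envelope compute estimates: CNN layers scale linearly with
# pixel count, while self-attention scales quadratically with sequence
# length. Dimensions here are illustrative assumptions only.

def cnn_layer_flops(h, w, c_in, c_out, k=3):
    """Conv layer: one k x k MAC stencil per output pixel, so cost is
    linear in the number of pixels (h * w)."""
    return 2 * h * w * c_in * c_out * k * k  # 2 FLOPs per MAC

def attention_flops(n, d):
    """Self-attention: the QK^T and (softmax)V matmuls are each
    O(n^2 * d), so cost is quadratic in sequence length n."""
    return 2 * 2 * n * n * d  # two n x n x d matmuls, 2 FLOPs per MAC

# Doubling image height doubles conv cost (linear scaling)...
print(cnn_layer_flops(512, 512, 64, 64) / cnn_layer_flops(256, 512, 64, 64))
# ...but doubling sequence length quadruples attention cost.
print(attention_flops(2048, 128) / attention_flops(1024, 128))
```

The same quadratic term applies to attention's memory footprint: the n × n score matrix is what FlashAttention avoids materializing, which is why it appears in the talk's list of hardware-aware mitigations.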

