Vision-language models (VLMs) enable “world understanding” tasks such as natural-language queries, zero-shot reasoning, and richer scene context, going beyond traditional CNN pipelines. But bringing VLMs to the edge is primarily a systems problem: selecting a model you can govern (license and data provenance), adapting it efficiently (prompting, quantization, LoRA-style fine-tuning, preference optimization), and fitting it into real device constraints, where bandwidth and memory, not TOPS, determine feasibility. This talk frames VLM deployment as an accuracy-latency-cost trade-off, then presents a practical decision framework for matching VLM architectures (vision encoder + projector + language model) to edge hardware. Attendees will leave with a selection matrix for NPUs vs. GPUs vs. hybrid pipelines, plus concrete rules of thumb for memory budgeting and identifying throughput bottlenecks in real deployments.
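
To illustrate the kind of memory-budgeting and bandwidth arithmetic the abstract refers to, here is a minimal sketch; the model sizes, quantization bits, and bandwidth figures are illustrative assumptions, not numbers from the talk.

```python
# Rough edge-VLM memory and throughput budget (illustrative numbers only).
# Rule of thumb: autoregressive decode is memory-bandwidth-bound, so the
# tokens/s ceiling is roughly bandwidth / bytes streamed per token
# (weights + KV cache), largely independent of the accelerator's TOPS.

def weight_bytes(params_billion: float, bits_per_weight: int) -> float:
    """Memory to hold the weights, e.g. a 7B LM at 4-bit is ~3.5 GB."""
    return params_billion * 1e9 * bits_per_weight / 8

def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_elem: int = 2) -> float:
    """KV cache for one sequence: 2 (K and V) * layers * kv_heads * head_dim * seq_len."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem

def decode_tokens_per_sec(bandwidth_gb_s: float, bytes_per_token: float) -> float:
    """Upper bound on decode speed when each token must stream weights + KV cache."""
    return bandwidth_gb_s * 1e9 / bytes_per_token

# Hypothetical 7B language model + small vision encoder, 4-bit weights, 2K context.
lm = weight_bytes(7, 4)                # ~3.5 GB
vision = weight_bytes(0.4, 8)          # ~0.4 GB encoder kept at 8-bit
kv = kv_cache_bytes(layers=32, kv_heads=8, head_dim=128, seq_len=2048)  # ~0.27 GB
print(f"Resident memory: {(lm + vision + kv) / 1e9:.1f} GB")

# Assumed ~60 GB/s LPDDR on an embedded SoC vs ~900 GB/s on a discrete GPU:
for bw in (60, 900):
    print(f"{bw} GB/s -> ~{decode_tokens_per_sec(bw, lm + kv):.0f} tokens/s ceiling")
```

Under these assumptions the model fits in roughly 4 GB but decodes at only a few tens of tokens per second on embedded memory bandwidth, which is why the abstract stresses bandwidth and memory over raw compute.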

