Vision-language models (VLMs) are moving into practical computer vision applications, but we’ve found that strong benchmark scores rarely translate into production success, especially in real-world and edge deployments. In this talk, we’ll walk through the failure modes we see most often: sensor noise, lighting variation and environmental drift; domain shift between training and deployment; hallucinations and weak visual grounding; and rare, out-of-distribution events that break seemingly robust systems. We’ll then cover the system-level choices that determine outcomes, including camera selection; data collection, curation and annotation practices; multimodal fusion with auxiliary sensors; and when to combine classical vision and signal processing with learned models. Next, we’ll share the deployment techniques we use in practice, including quantization, pruning and distillation, and how we navigate size/speed/power trade-offs. Finally, we’ll show how we evaluate deployed VLMs with task metrics, stress tests and production monitoring to detect data drift, concept shift and sensor degradation, and how we close the loop with targeted retraining and model updates.

