Natural language video search—“find the man in the green hoodie riding a bike”—can turn a camera network from a passive recorder into a proactive investigation tool. But brute-force approaches fail: cloud processing of streaming video is cost- and bandwidth-prohibitive at scale, while running a vision-language model (VLM) on every frame on-device blows through power and latency budgets. This talk presents an implementation case study of a hierarchical pipeline that makes natural language video search practical across multiple edge platforms. We combine a 30-fps detector (for gatekeeping and attribute filtering) with distributed semantic scoring that promotes only relevant regions, then selectively run a VLM “reasoner” on cropped regions of interest to produce high-precision descriptions. We’ll cover the overall solution architecture, model choices and trade-offs, and implementation pitfalls and resource management techniques (model residency, zero-copy, ring buffers). Attendees will leave with a reference architecture and concrete optimization tactics for deployable, real-time VLM-based edge systems.
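
To make the gating concrete, here is a minimal sketch of the hierarchy described above, assuming three stages: a cheap per-frame detector, a lightweight semantic scorer on promoted detections, and a VLM reasoner invoked only on crops that pass both gates. All names, stage implementations, and thresholds (`run_detector`, `semantic_score`, `vlm_describe`, `det_thresh`, `sem_thresh`) are hypothetical placeholders for illustration, not the talk’s actual models.

```python
# Sketch of a three-stage hierarchical gating pipeline (hypothetical stubs).
import random
from dataclasses import dataclass


@dataclass
class Detection:
    box: tuple[int, int, int, int]  # (x, y, w, h) in pixels
    label: str                      # coarse detector class, e.g. "person"
    score: float                    # detector confidence


def run_detector(frame) -> list[Detection]:
    # Stage 1 (every frame, ~30 fps): cheap detector acts as gatekeeper
    # and attribute filter. Placeholder returning a dummy detection.
    return [Detection((10, 20, 64, 128), "person", random.random())]


def semantic_score(frame, det: Detection, query: str) -> float:
    # Stage 2 (promoted detections only): lightweight similarity between
    # the query and the cropped region. Placeholder value here.
    return random.random()


def vlm_describe(crop, query: str) -> str:
    # Stage 3 (rare): full VLM reasoner on a cropped region of interest.
    return "a man in a green hoodie riding a bike"  # placeholder output


def process_frame(frame, query: str,
                  det_thresh: float = 0.5,
                  sem_thresh: float = 0.7) -> list[str]:
    """Most detections exit at gate 1 or 2; only the few crops that pass
    both gates pay the VLM's latency and power cost."""
    results = []
    for det in run_detector(frame):
        if det.score < det_thresh:                          # gate 1
            continue
        if semantic_score(frame, det, query) < sem_thresh:  # gate 2
            continue
        crop = frame  # stand-in for cropping frame to det.box
        results.append(vlm_describe(crop, query))
    return results


if __name__ == "__main__":
    print(process_frame(frame=None, query="man in a green hoodie on a bike"))
```

The point of the structure is amortization: the expensive VLM call runs on a small fraction of regions, so the per-camera budget is dominated by the cheap detector.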
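
As one illustration of the resource-management techniques named above (not the talk’s code), a pre-allocated frame ring buffer avoids per-frame heap allocation and hands consumers NumPy views without copying; `FrameRingBuffer` and its dimensions are assumptions for the sketch.

```python
# Fixed-size frame ring buffer with zero-copy reads (illustrative sketch).
import numpy as np


class FrameRingBuffer:
    def __init__(self, capacity: int, height: int, width: int):
        # One contiguous pre-allocated block; slots are reused in place.
        self.frames = np.zeros((capacity, height, width, 3), dtype=np.uint8)
        self.capacity = capacity
        self.head = 0   # next slot to write
        self.count = 0  # frames currently held (<= capacity)

    def push(self, frame: np.ndarray) -> None:
        # Copy into the pre-allocated slot; the oldest frame is overwritten.
        self.frames[self.head][...] = frame
        self.head = (self.head + 1) % self.capacity
        self.count = min(self.count + 1, self.capacity)

    def latest(self) -> np.ndarray:
        # Return a view (no copy) of the most recently written frame.
        assert self.count > 0, "buffer is empty"
        return self.frames[(self.head - 1) % self.capacity]


# Usage: the capture loop pushes; the detector stage reads the latest view.
buf = FrameRingBuffer(capacity=8, height=720, width=1280)
buf.push(np.zeros((720, 1280, 3), dtype=np.uint8))
frame_view = buf.latest()  # zero-copy view into the ring
```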

