Most vision AI systems that work in the lab fail when deployed across hundreds of real-world locations. At Intuitivo, we operate 2,000+ autonomous points of purchase across enterprise locations, including Tesla, Microsoft, Walmart, and Bank of America facilities, and process over 50 million images daily using cameras alone, with no high-cost sensors. In this talk, we share the engineering journey behind HERMES (Hierarchical Encoding and Reasoning Model for Event-Based Shopcarts), the end-to-end vision model that replaced our modular inference pipeline (object detection, tracking, and heuristics) with a single trainable stage that fuses multi-camera video into transaction understanding.

