The next evolution of intelligent document processing lies in combining traditional discriminative vision systems with the contextual understanding of vision‑language models (VLMs) fine‑tuned on domain‑specific document data. As general‑purpose models, VLMs offer strong zero‑shot capabilities and semantic understanding, but their limited precision, high resource demands, and the data‑privacy concerns of relying on external model services make them unsuitable as stand‑alone replacements for production pipelines. A hybrid approach instead integrates VLMs alongside an optimized document pipeline comprising pre‑processing, layout analysis, OCR and handwriting recognition, document classification, and field‑level semantic extraction. In this talk, we explore the design of an intelligent document processing system built on this hybrid approach, balancing the determinism and efficiency of classical pipelines with the multimodal understanding of VLMs. We show how it enables document systems that are accurate, configurable by document family, resource‑optimized, and adaptable to enterprise and on‑premise environments.
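One way the hybrid routing described above can be realized is a confidence‑gated fallback: classical per‑field extractors run first, and only fields below a confidence threshold are escalated to the VLM. The sketch below is a minimal, hypothetical illustration of that pattern; the field names, the `FieldResult` type, the 0.85 threshold, and the stubbed `vlm_extract` callable are all assumptions, not part of any specific system.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict

@dataclass
class FieldResult:
    value: str
    confidence: float
    source: str  # "classical" or "vlm"

def hybrid_extract(
    page_text: str,
    extractors: Dict[str, Callable[[str], FieldResult]],  # classical per-field extractors
    vlm_extract: Callable[[str, str], str],               # fallback: (page_text, field_name) -> value
    threshold: float = 0.85,                              # assumed confidence gate
) -> Dict[str, FieldResult]:
    """Run classical extractors first; route only low-confidence fields to the VLM."""
    results: Dict[str, FieldResult] = {}
    for name, extractor in extractors.items():
        result = extractor(page_text)
        if result.confidence < threshold:
            # Escalate this field alone to the slower, costlier VLM path.
            result = FieldResult(vlm_extract(page_text, name), 0.9, "vlm")
        results[name] = result
    return results

# Toy demo with stubbed extractors (all values hypothetical)
def invoice_no(text: str) -> FieldResult:
    m = re.search(r"Invoice\s+No\.\s*(\S+)", text)
    return FieldResult(m.group(1) if m else "", 0.95 if m else 0.0, "classical")

def total(text: str) -> FieldResult:
    return FieldResult("", 0.2, "classical")  # simulate a weak classical extractor

out = hybrid_extract(
    "Invoice No. INV-042  Total: 99.00 EUR",
    {"invoice_no": invoice_no, "total": total},
    vlm_extract=lambda text, field: "99.00 EUR",  # stand-in for a fine-tuned VLM call
)
print(out["invoice_no"].source, out["total"].source)  # → classical vlm
```

Gating per field, rather than per document, keeps VLM calls to the minimum needed, which matters for the resource and on‑premise constraints the talk addresses.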

