In this session, we explore the use of hybrid CNN-transformer architectures and CLIP models for highlight detection and semantic search in fish-eye video captures. We focus on leveraging a vision transformer (ViT) for video indexing, moment retrieval, and photogenic frame identification on edge devices. Using a small vision-language model, the system enables efficient video captioning and text-based semantic search in real time at the edge. We will delve into the architectural evolution of vision transformers, particularly the hybrid CNN encoder and transformer decoder, and contrast this approach with classical computer vision methods for frame understanding. Additionally, we will present a novel agentic system for predicting the aesthetic appeal of frames and creating shareable content. Finally, we will provide insights into model performance on edge compute platforms such as the NVIDIA Jetson and Qualcomm Snapdragon 8255.
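As a rough illustration of the text-based semantic search mentioned above, the sketch below embeds video frames with an off-the-shelf CLIP checkpoint and ranks them against a free-text query by cosine similarity. The checkpoint name, frame paths, and helper functions are illustrative assumptions rather than the session's actual pipeline, which targets a smaller vision-language model tuned for edge deployment.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Assumption: a generic CLIP checkpoint stands in for the smaller edge-oriented model.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def embed_frames(frames):
    """Embed a list of PIL frames into CLIP image space (L2-normalized)."""
    inputs = processor(images=frames, return_tensors="pt")
    feats = model.get_image_features(**inputs)
    return feats / feats.norm(dim=-1, keepdim=True)

@torch.no_grad()
def search(query, frame_feats, top_k=5):
    """Rank pre-computed frame embeddings against a free-text query."""
    inputs = processor(text=[query], return_tensors="pt", padding=True)
    text_feat = model.get_text_features(**inputs)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
    scores = (frame_feats @ text_feat.T).squeeze(-1)  # cosine similarity per frame
    return scores.topk(min(top_k, scores.numel()))

# Example usage (paths are placeholders):
# frames = [Image.open(p) for p in ["frame_000.jpg", "frame_030.jpg"]]
# feats = embed_frames(frames)
# values, indices = search("a dog catching a frisbee", feats)
```

In practice the frame embeddings would be computed once during indexing and cached, so that a query at runtime costs only a single text-encoder pass plus a dot product over the stored vectors.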