In this session, we explore the use of hybrid CNN-transformer architectures and CLIP models for highlight detection and semantic search in fish-eye video captures. We focus on leveraging a vision transformer (ViT) for video indexing, moment retrieval, and photogenic frame identification on edge devices. Using a small vision-language model, the system enables efficient video captioning and text-based semantic search in real time at the edge. We will delve into the architectural evolution of vision transformers, particularly the hybrid CNN encoder and transformer decoder, and contrast this approach with classical computer vision methods for frame understanding. Additionally, we will present a novel agentic system for predicting the aesthetic appeal of frames and creating shareable content. Finally, we will provide insights into model performance on edge compute platforms such as the NVIDIA Jetson and Qualcomm Snapdragon 8255.
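As a rough illustration of the text-based semantic search mentioned above, the sketch below embeds video frames with an off-the-shelf CLIP checkpoint and ranks them against a free-text query by cosine similarity. The checkpoint name, frame paths, and helper functions are illustrative assumptions rather than the session's actual pipeline, which targets a smaller vision-language model tuned for edge deployment.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Assumption: a generic CLIP checkpoint stands in for the smaller edge-oriented model.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def embed_frames(frames):
    """Embed a list of PIL frames into CLIP image space (L2-normalized)."""
    inputs = processor(images=frames, return_tensors="pt")
    feats = model.get_image_features(**inputs)
    return feats / feats.norm(dim=-1, keepdim=True)

@torch.no_grad()
def search(query, frame_feats, top_k=5):
    """Rank pre-computed frame embeddings against a free-text query."""
    inputs = processor(text=[query], return_tensors="pt", padding=True)
    text_feat = model.get_text_features(**inputs)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
    scores = (frame_feats @ text_feat.T).squeeze(-1)  # cosine similarity per frame
    return scores.topk(min(top_k, scores.numel()))

# Example usage (paths are placeholders):
# frames = [Image.open(p) for p in ["frame_000.jpg", "frame_030.jpg"]]
# feats = embed_frames(frames)
# values, indices = search("a dog catching a frisbee", feats)
```

In practice the frame embeddings would be computed once during indexing and cached, so that a query at runtime costs only a single text-encoder pass plus a dot product over the stored vectors.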