In this session, we explore hybrid CNN-transformer architectures and CLIP models for highlight detection and semantic search over fish-eye video captures. We focus on using a vision transformer (ViT) for video indexing, moment retrieval, and identifying photogenic […]