Where: Mission City M1-M3
Date: Day 2
Start Time: 1:35 pm
End Time: 2:05 pm
Convolutional neural networks have made tremendous strides in object detection and recognition in recent years. However, extending the CNN approach to the understanding of video or volumetric data poses tough challenges, including trade-offs between representation quality and computational complexity, which are of particular concern on embedded platforms with tight computational budgets. This presentation explores the use of CNNs for video understanding. We review the evolution of deep representation learning methods involving spatio-temporal fusion, from C3D to Conv-LSTMs, for vision-based human activity detection. We then propose a decoupled alternative to this fusion: an approach that combines a low-complexity predictive temporal segment proposal model with a fine-grained (and potentially high-complexity) inference model. We find that this hybrid approach not only reduces computational load with minimal loss of accuracy, but also enables effective solutions to these high-complexity inference tasks.
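The decoupled pipeline described above can be illustrated with a minimal sketch. This is not the presenters' implementation: the motion-energy proposal heuristic, the thresholds, and the stand-in classifier are all hypothetical placeholders for the low-complexity proposal stage and the high-complexity inference stage, chosen only to show how the two stages hand off candidate temporal segments.

```python
import numpy as np

def propose_segments(frames, threshold=0.5, min_len=3):
    """Low-complexity proposal stage (hypothetical sketch):
    score each frame transition by mean absolute frame difference
    (a crude motion-energy proxy), then group consecutive
    above-threshold transitions into candidate temporal segments."""
    diffs = np.abs(np.diff(frames.astype(np.float32), axis=0)).mean(axis=(1, 2))
    active = diffs > threshold
    segments, start = [], None
    for i, is_active in enumerate(active):
        if is_active and start is None:
            start = i
        elif not is_active and start is not None:
            if i - start >= min_len:
                segments.append((start, i))
            start = None
    if start is not None and len(active) - start >= min_len:
        segments.append((start, len(active)))
    return segments

def classify_segment(frames, seg):
    """Stand-in for the fine-grained, high-complexity inference model;
    it runs only on the proposed segments, not on every frame."""
    start, end = seg
    return "activity" if frames[start:end].std() > 1.0 else "background"

# Toy video: 100 static 8x8 frames with a burst of motion in frames 40-60.
rng = np.random.default_rng(0)
video = np.zeros((100, 8, 8))
video[40:60] = rng.normal(0, 5, size=(20, 8, 8))

props = propose_segments(video, threshold=0.5)
results = [(seg, classify_segment(video, seg)) for seg in props]
print(results)
```

The computational saving comes from the asymmetry: the cheap proposal pass touches every frame once, while the expensive model is invoked only on the (typically short) proposed segments.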