Visual AI is now central to many real-time and safety-critical embedded systems. In this introductory talk, we explain the foundations of how deep learning enables modern computer vision, starting with what images and video look like to a machine (tensors across space, channels and time; see the sketch below). We’ll contrast classical feature engineering with learned representations, explain core neural network concepts and training at an intuitive level, and then survey the major model families (fully connected networks, CNNs, RNNs/LSTMs, transformers and hybrids), highlighting the assumptions each makes about its data. We’ll then dive deeper into CNNs: convolutions, receptive fields, pooling, normalization and the feature hierarchies they learn, plus their evolution from VGG to ResNet to efficient mobile architectures. Attendees will leave with a practical mental model of the trade-offs among vision neural networks and a road map of where visual AI is heading.
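
As a small taste of the first topic, here is a minimal sketch (not part of the talk materials) of how an image, a video clip, and a convolution look in code. It assumes PyTorch, and the resolution, clip length, and layer sizes are illustrative choices, not values from the talk.

```python
# Illustrative sketch: images and video as tensors, and a convolution
# sliding learned filters over the spatial axes. Assumes PyTorch; all
# sizes below are hypothetical examples.
import torch
import torch.nn as nn

# A single RGB image: channels x height x width.
image = torch.rand(3, 224, 224)

# A video clip adds a time axis: frames x channels x height x width.
clip = torch.rand(16, 3, 224, 224)

# A 2D convolution: 3 input channels -> 8 learned feature maps,
# each produced by a 3x3 filter swept across the image.
conv = nn.Conv2d(in_channels=3, out_channels=8, kernel_size=3, padding=1)

features = conv(image.unsqueeze(0))  # add a batch axis: 1 x 3 x 224 x 224
print(features.shape)                # torch.Size([1, 8, 224, 224])
```

Stacking layers like this one is what builds the feature hierarchies and growing receptive fields discussed in the CNN portion of the talk.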

