Date: Wednesday, May 22
Start Time: 2:05 pm
End Time: 2:35 pm
Transformers are a class of neural network models originally designed for natural language processing. They have since proven powerful for visual perception as well, thanks to their exceptional ability to model long-range dependencies in images and to process multimodal data.

Resource constraints are the central challenge when deploying transformers on embedded platforms. The self-attention mechanism demands substantial memory for parameters and intermediate activations, and its intricate computations impose heavy compute requirements. Energy efficiency adds yet another layer of complexity.

Mitigating these challenges requires a multifaceted approach. Quantization eases memory constraints by representing weights and activations at lower precision. Pruning and sparsity techniques remove less critical connections to reduce computation. Knowledge distillation and related techniques transfer knowledge from larger models into compact yet accurate ones. In this talk, Shang-Hung will also discuss hardware accelerators such as NPUs customized for transformer workloads, and software techniques for efficiently mapping transformer models onto such accelerators.
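As a rough illustration of two of the techniques mentioned above, the toy sketch below (not taken from the talk; all names and parameters are hypothetical) applies symmetric int8 quantization and 50% magnitude pruning to a random weight tensor:

```python
import random

random.seed(0)
# Toy stand-in for a transformer weight tensor, flattened to a list.
w = [random.gauss(0.0, 1.0) for _ in range(4096)]

# --- Quantization: map float weights to int8, shrinking storage ~4x ---
max_abs = max(abs(x) for x in w)
scale = max_abs / 127.0                       # one scale for the whole tensor
w_q = [max(-127, min(127, round(x / scale))) for x in w]
w_deq = [q * scale for q in w_q]              # dequantize to check the error
max_err = max(abs(a - b) for a, b in zip(w, w_deq))

# --- Pruning: zero out the smallest-magnitude half of the weights ---
threshold = sorted(abs(x) for x in w)[len(w) // 2]
w_pruned = [x if abs(x) >= threshold else 0.0 for x in w]
sparsity = w_pruned.count(0.0) / len(w)

print(f"max quantization error: {max_err:.4f} (bound: {scale / 2:.4f})")
print(f"sparsity after pruning: {sparsity:.2f}")
```

Per-tensor symmetric quantization bounds the round-trip error by half the scale step; real deployments typically use per-channel scales and calibration data, and pruning is usually followed by fine-tuning to recover accuracy.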