Date: Wednesday, May 18 (Main Conference Day 2)
Start Time: 10:50 am
End Time: 11:20 am
In computer vision, hierarchical structures are popular in vision transformers (ViTs). In this talk, we present a novel idea of nesting canonical local transformers on non-overlapping image blocks and aggregating them hierarchically. This new design, named NesT, leads to a simplified architecture compared with existing hierarchically structured designs, and requires only minor code changes relative to the original ViT. The benefits of the proposed judiciously selected design are threefold: (1) NesT converges faster and requires much less training data to achieve good generalization on both ImageNet and small datasets; (2) when extending our key ideas to image generation, NesT leads to a strong decoder that is 8x faster than previous transformer-based generators; and (3) decoupling the feature learning and abstraction processes via the nested hierarchy in our design enables constructing a novel method (named GradCAT) for visually interpreting the learned model.
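The nesting idea above can be sketched in a few lines of plain Python. This is a hypothetical illustration, not the paper's implementation: the local transformer and the block-aggregation step are stand-in functions, and all names (`local_transformer`, `aggregate`, `nest`) are invented for this sketch. It shows only the control flow: the image is split into a grid of non-overlapping blocks, each block is processed independently, and 2x2 neighborhoods of blocks are merged into a coarser grid, repeating until one block remains.

```python
# Hypothetical sketch of a NesT-style nested hierarchy (illustrative names,
# not from the paper's code). Each "block" is a list of tokens.

def local_transformer(block):
    # Stand-in for a canonical transformer applied within one block;
    # here it is the identity, since only the hierarchy is being shown.
    return block

def aggregate(blocks_2x2):
    # Stand-in for block aggregation: merge four neighboring blocks'
    # tokens, then downsample (keep every 4th token) so the per-block
    # token count stays constant across levels.
    merged = []
    for b in blocks_2x2:
        merged.extend(b)
    return merged[::4]

def nest(blocks):
    """blocks: an n x n grid (list of lists) of token lists, n a power of two."""
    n = len(blocks)
    # Process every block independently with the local transformer.
    blocks = [[local_transformer(b) for b in row] for row in blocks]
    if n == 1:
        return blocks[0][0]
    # Aggregate each 2x2 neighborhood into one block of a coarser grid.
    coarser = [
        [aggregate([blocks[2*i][2*j], blocks[2*i][2*j+1],
                    blocks[2*i+1][2*j], blocks[2*i+1][2*j+1]])
         for j in range(n // 2)]
        for i in range(n // 2)
    ]
    return nest(coarser)

# A 4x4 grid of blocks, each holding 4 dummy tokens.
grid = [[[f"t{i}{j}{k}" for k in range(4)] for j in range(4)] for i in range(4)]
out = nest(grid)
print(len(out))  # one block of 4 tokens remains
```

Because each level only applies standard transformer layers inside fixed-size blocks and then merges neighbors, this structure needs little beyond the original ViT code, which is the simplification the talk highlights.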