Are Patches All You Need? New study suggests patches are behind strong performance of vision transformers


Vision transformer (ViT) architectures have achieved impressive results on many computer vision tasks, often outperforming conventional convolutional architectures. This raises a question: is the strong performance of ViTs due to the powerful transformer architecture and its attention mechanism, or does some other factor give ViTs their advantage?

In the paper Are Patches All You Need?, currently under double-blind review for the International Conference on Learning Representations (ICLR 2022), a research team proposes ConvMixer, an extremely simple model (implementable in roughly six lines of dense PyTorch code) designed to support the hypothesis that ViT performance is primarily attributable to the use of patches as the input representation. The study shows that ConvMixer can outperform ViT, MLP-Mixer, and classical vision models.

Tesla Senior Director of AI Andrej Karpathy tweeted: “I am blown away by the new ConvMixer architecture.”

ConvMixer comprises a patch embedding layer followed by repeated applications of a simple fully convolutional block. The ConvMixer block itself consists of a depthwise convolution followed by a pointwise convolution, each followed by an activation and post-activation BatchNorm.
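The structure described above can be sketched in PyTorch. The sketch below is an illustrative reconstruction of the architecture from the description (layer names and default hyperparameters here are our assumptions, not pulled verbatim from the paper), but it follows the same pattern: patch embedding, then repeated depthwise-plus-pointwise blocks with GELU activations and BatchNorm:

```python
import torch
import torch.nn as nn


class Residual(nn.Module):
    """Wraps a module with a skip connection: fn(x) + x."""
    def __init__(self, fn):
        super().__init__()
        self.fn = fn

    def forward(self, x):
        return self.fn(x) + x


def ConvMixer(dim, depth, kernel_size=9, patch_size=7, n_classes=1000):
    return nn.Sequential(
        # Patch embedding: a strided convolution splits the image into patches
        nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size),
        nn.GELU(),
        nn.BatchNorm2d(dim),
        *[nn.Sequential(
            # Depthwise conv (large kernel) mixes spatial locations,
            # wrapped in a residual connection
            Residual(nn.Sequential(
                nn.Conv2d(dim, dim, kernel_size, groups=dim, padding="same"),
                nn.GELU(),
                nn.BatchNorm2d(dim),
            )),
            # Pointwise (1x1) conv mixes channels
            nn.Conv2d(dim, dim, kernel_size=1),
            nn.GELU(),
            nn.BatchNorm2d(dim),
        ) for _ in range(depth)],
        nn.AdaptiveAvgPool2d((1, 1)),
        nn.Flatten(),
        nn.Linear(dim, n_classes),
    )
```

The two numbers in a model name like ConvMixer-1536/20 correspond to `dim` (hidden width) and `depth` (number of blocks).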

As the name suggests, the general idea behind ConvMixer is mixing. The researchers use depthwise convolution to mix spatial locations and pointwise convolution to mix channels. They also use convolutions with an unusually large kernel size to mix distant spatial information; unlike the classic pyramid-shaped design of convolutional networks, this lets them isolate the effect of the patch representation itself.
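The division of labor between the two convolutions can be verified directly. In the small sketch below (channel count and sizes are chosen for illustration, not from the paper), a depthwise convolution with `groups=dim` leaves every channel independent, while a 1×1 pointwise convolution spreads one channel's signal across all channels:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
dim, k = 8, 9  # small width for illustration; the paper uses far larger dims

# Depthwise conv: groups=dim convolves each channel independently,
# so it mixes spatial locations but never mixes channels.
depthwise = nn.Conv2d(dim, dim, k, groups=dim, padding="same", bias=False)
# Pointwise (1x1) conv: mixes channels at each spatial location, no spatial mixing.
pointwise = nn.Conv2d(dim, dim, 1, bias=False)

x = torch.zeros(1, dim, 16, 16)
x[0, 3] = torch.randn(16, 16)  # put signal in channel 3 only

dw = depthwise(x)
pw = pointwise(x)

other = [i for i in range(dim) if i != 3]
# Depthwise: every channel except 3 stays exactly zero.
print(dw[0, other].abs().max().item())  # 0.0
# Pointwise: with random nonzero weights, the signal reaches the other channels.
print((pw[0, other].abs().amax(dim=(1, 2)) > 0).all().item())
```

This is why ConvMixer needs both operations in every block: neither one alone can propagate information across both the spatial and the channel dimensions.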

In their empirical study, the team evaluated ConvMixer on ImageNet-1k classification without any pretraining or additional data. They added ConvMixer to the timm framework and trained it with RandAugment, mixup, CutMix, random erasing, and gradient norm clipping in addition to the default timm augmentation.

In the experiments, a ConvMixer-1536/20 with 52M parameters achieved 81.4% top-1 accuracy on ImageNet, and a ConvMixer-768/32 with 21M parameters achieved 80.2%. Moreover, despite its extreme simplicity, ConvMixer outperformed both “standard” computer vision models such as ResNet and the corresponding variants of the Vision Transformer and MLP-Mixer.
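The reported parameter count can be sanity-checked by hand. The arithmetic below assumes kernel size 9, patch size 7, and 1,000 output classes for ConvMixer-1536/20 (our assumed settings, consistent with the paper's defaults) and counts standard Conv2d, BatchNorm, and Linear parameters:

```python
# Rough parameter count for ConvMixer-1536/20, assuming kernel size 9,
# patch size 7, and 1000 classes; affine BatchNorm params included.
dim, depth, k, p, n_classes = 1536, 20, 9, 7, 1000

patch_embed = 3 * dim * p * p + dim        # strided patch-embedding conv + bias
bn = 2 * dim                               # BatchNorm weight + bias
depthwise = dim * k * k + dim              # groups=dim conv + bias
pointwise = dim * dim + dim                # 1x1 conv + bias
block = depthwise + bn + pointwise + bn    # one ConvMixer block
classifier = dim * n_classes + n_classes   # final linear layer

total = patch_embed + bn + depth * block + classifier
print(f"{total / 1e6:.1f}M parameters")  # 51.6M, in line with the reported ~52M
```

Notably, the pointwise (channel-mixing) convolutions dominate the count; the large-kernel depthwise convolutions are comparatively cheap because each filter touches only one channel.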

Overall, the results suggest that the patch representation itself may be the component most responsible for the outstanding performance of ViTs. The team believes their work can provide a solid “convolutional but patch-based” baseline against which to compare future advanced architectures.

The ConvMixer code is available on the project's GitHub. The paper Are Patches All You Need? is on OpenReview.

Author: Hecate Il | Editor: Michael Sarazen

We know you don’t want to miss any news or research breakthroughs. Subscribe to our popular newsletter Synced Global AI Weekly to get weekly AI updates.
