An Image is Worth More Than 16×16 Patches: Exploring Transformers on Individual Pixels, by Duy-Kien Nguyen and 5 other authors
Abstract: This work does not introduce a new method. Instead, we present an interesting finding that questions the necessity of the inductive bias of locality in modern computer vision architectures. Concretely, we find that vanilla Transformers can operate by directly treating each individual pixel as a token and achieve highly performant results. This is substantially different from the popular design in Vision Transformer, which maintains the inductive bias from ConvNets towards local neighborhoods (e.g. by treating each 16×16 patch as a token). We showcase the effectiveness of pixels-as-tokens across three well-studied computer vision tasks: supervised learning for classification and regression, self-supervised learning via masked autoencoding, and image generation with diffusion models. Although it’s computationally less practical to directly operate on individual pixels, we believe the community must be made aware of this surprising piece of knowledge when devising the next generation of neural network architectures for computer vision.
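To make the contrast in the abstract concrete, the sketch below tokenizes one image two ways: as 16×16 patches and as individual pixels. It is an illustration only, not the authors' code. The helper names `patch_tokens` and `pixel_tokens` and the 32×32 RGB input size are assumptions chosen for the example, and the learned projection and position embeddings a real model would apply are omitted.

```python
import numpy as np


def patch_tokens(image: np.ndarray, patch: int) -> np.ndarray:
    """Split an (H, W, C) image into non-overlapping patch x patch tokens.

    Returns an array of shape (num_tokens, patch * patch * C).
    Hypothetical helper for illustration; not taken from the paper.
    """
    h, w, c = image.shape
    if h % patch or w % patch:
        raise ValueError("image height and width must be divisible by patch")
    # (H/p, p, W/p, p, C) -> (H/p, W/p, p, p, C) -> (N, p*p*C)
    grid = image.reshape(h // patch, patch, w // patch, patch, c)
    grid = grid.transpose(0, 2, 1, 3, 4)
    return grid.reshape(-1, patch * patch * c)


def pixel_tokens(image: np.ndarray) -> np.ndarray:
    """Treat every pixel as its own token: a patch size of 1."""
    return patch_tokens(image, 1)


# Assumed toy input: a 32x32 RGB image.
image = np.arange(32 * 32 * 3, dtype=np.float32).reshape(32, 32, 3)

vit = patch_tokens(image, 16)  # shape (4, 768): 4 tokens of 16*16*3 values
pix = pixel_tokens(image)      # shape (1024, 3): 1024 tokens of 3 values

# Self-attention compares every token with every other token, so the
# number of token pairs grows with the square of the sequence length.
vit_pairs = vit.shape[0] ** 2  # 16
pix_pairs = pix.shape[0] ** 2  # 1048576
```

For this assumed input, moving from 16×16 patches to pixels multiplies the sequence length by 256 and the number of attention pairs by 65,536, which is the cost the abstract refers to when it calls pixel-level operation computationally less practical.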
Submission history
From: Duy-Kien Nguyen
[v1] Thu, 13 Jun 2024 17:59:58 UTC (4,157 KB)
[v2] Thu, 13 Mar 2025 19:12:25 UTC (10,900 KB)