March 17, 2025

Archetypal SAE: Adaptive and Stable Dictionary Learning for Concept Extraction in Large Vision Models


Artificial Neural Networks (ANNs) have transformed computer vision with remarkable performance, but their “black-box” nature creates significant challenges in domains requiring transparency, accountability, and regulatory compliance. The opacity of these systems hampers their adoption in critical applications where understanding decision-making processes is essential. Researchers also want to understand these models’ internal mechanisms, both to debug and improve them and to explore potential parallels with neuroscience. These factors have catalyzed the rapid development of explainable artificial intelligence (XAI) as a dedicated field focused on the interpretability of ANNs, bridging the gap between machine intelligence and human understanding.

Among XAI approaches, concept-based methods are powerful frameworks for revealing intelligible visual concepts within the complex activation patterns of ANNs. Recent research casts concept extraction as a dictionary learning problem, in which activations are mapped to a higher-dimensional, sparse “concept space” that is more interpretable. Techniques such as Non-negative Matrix Factorization (NMF) and K-Means can reconstruct the original activations accurately, while Sparse Autoencoders (SAEs) have recently gained prominence as a powerful alternative. SAEs achieve an impressive balance between sparsity and reconstruction quality but suffer from instability: training identical SAEs on the same data can produce different concept dictionaries, limiting their reliability and usefulness for meaningful interpretability analysis.
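To make the dictionary-learning view concrete, here is a minimal PyTorch sketch of a TopK sparse autoencoder, the architecture family the paper builds on. This is an illustration rather than the authors’ implementation; the class and parameter names (TopKSAE, d_model, n_concepts, k) are ours.

```python
import torch
import torch.nn as nn

class TopKSAE(nn.Module):
    """Minimal TopK sparse autoencoder: activations -> k-sparse codes -> reconstruction."""
    def __init__(self, d_model: int, n_concepts: int, k: int):
        super().__init__()
        self.k = k
        self.encoder = nn.Linear(d_model, n_concepts)  # map activations into concept space
        self.dictionary = nn.Parameter(torch.randn(n_concepts, d_model) * 0.01)  # concept atoms

    def forward(self, x):
        z = self.encoder(x)
        vals, idx = torch.topk(z, self.k, dim=-1)             # keep only the k largest codes
        codes = torch.zeros_like(z).scatter_(-1, idx, vals)   # sparse concept coefficients
        x_hat = codes @ self.dictionary                       # reconstruct the original activations
        return x_hat, codes

# Usage: reconstruct 768-d activations with a 5x-overcomplete dictionary, 16 active concepts.
sae = TopKSAE(d_model=768, n_concepts=5 * 768, k=16)
x = torch.randn(32, 768)
x_hat, codes = sae(x)
loss = (x - x_hat).pow(2).mean()  # reconstruction objective
```

The instability problem arises because nothing in this objective pins down the dictionary: many different sets of atoms reconstruct the data equally well.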

Researchers from Harvard University, York University, CNRS, and Google DeepMind have proposed two novel variants of Sparse Autoencoders to address this instability: the Archetypal SAE (A-SAE) and its relaxed counterpart (RA-SAE). Both build on archetypal analysis to make concept extraction more stable and consistent. A-SAE constrains each dictionary atom to lie strictly within the convex hull of the training data, a geometric constraint that improves stability across different training runs. RA-SAE extends this framework by adding a small relaxation term that allows slight deviations from the convex hull, gaining modeling flexibility while preserving stability.
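One way to realize this geometric constraint is to parameterize each atom as a convex combination of candidate data points, with non-negative weights that sum to one via a softmax. The sketch below is our reading of the idea, not necessarily the paper’s exact parameterization; the names candidates, relax, and Lambda are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ArchetypalDictionary(nn.Module):
    """Dictionary whose atoms are convex combinations of candidate points (A-SAE),
    optionally with a small norm-bounded offset away from the hull (RA-SAE)."""
    def __init__(self, candidates: torch.Tensor, n_concepts: int, relax: float = 0.0):
        super().__init__()
        self.register_buffer("C", candidates)  # (n_candidates, d_model) data points
        self.logits = nn.Parameter(torch.randn(n_concepts, candidates.shape[0]))
        self.relax = relax                     # relax = 0.0 recovers strict A-SAE
        self.Lambda = nn.Parameter(torch.zeros(n_concepts, candidates.shape[1]))

    def atoms(self) -> torch.Tensor:
        A = F.softmax(self.logits, dim=-1)  # rows are non-negative and sum to 1
        D = A @ self.C                      # each atom lies inside the convex hull of C
        if self.relax > 0:                  # RA-SAE: allow a small, bounded deviation
            D = D + self.relax * F.normalize(self.Lambda, dim=-1)
        return D

# Usage: 3,840 atoms anchored to 32,000 candidate points, with slight relaxation.
# D = ArchetypalDictionary(candidates, n_concepts=3840, relax=0.1).atoms()
```

Because every atom is tied to the data’s convex hull rather than free-floating in activation space, repeated training runs are pulled toward similar dictionaries, which is the source of the stability gain.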

The researchers evaluate their approach on five vision models, DINOv2, SigLIP, ViT, ConvNeXt, and ResNet50, all obtained from the timm library. They construct overcomplete dictionaries five times the size of the feature dimension (e.g., 768×5 for DINOv2 and 2048×5 for ConvNeXt), providing ample capacity for concept representation. The SAEs are trained on the entire ImageNet dataset: roughly 1.28 million images yielding over 60 million tokens per epoch for ConvNeXt and more than 250 million tokens per epoch for DINOv2, for 50 epochs. RA-SAE builds on a TopK SAE architecture to maintain consistent sparsity levels across experiments, and the candidate-point matrix used by the archetypal constraint is obtained by K-Means clustering of the entire dataset into 32,000 centroids.
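Putting the pieces together, the setup might look like the following sketch. Only the 5×-overcomplete dictionary size and the 32,000 K-Means centroids come from the paper; the timm model identifier, the activation dump file, and the batch size are our assumptions for illustration.

```python
import timm
import torch
import numpy as np
from sklearn.cluster import MiniBatchKMeans

# Backbone from timm; the identifier below is one plausible DINOv2 checkpoint.
backbone = timm.create_model("vit_base_patch14_dinov2.lvd142m",
                             pretrained=True, num_classes=0)
backbone.eval()

d_model = 768             # DINOv2 ViT-B feature dimension
n_concepts = 5 * d_model  # 5x-overcomplete dictionary (768 x 5 = 3,840 atoms)

# Candidate points for the archetypal constraint: K-Means centroids of the
# token activations (hypothetical pre-extracted dump of ImageNet activations).
acts = np.load("imagenet_token_activations.npy")
kmeans = MiniBatchKMeans(n_clusters=32_000, batch_size=8192, n_init="auto").fit(acts)
candidates = torch.from_numpy(kmeans.cluster_centers_).float()
```

Clustering first compresses hundreds of millions of tokens into a manageable candidate set, so the convex-hull constraint never has to touch the full dataset during SAE training.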

The results demonstrate significant performance differences between traditional approaches and the proposed methods. Classical dictionary learning algorithms and standard SAEs show comparable performance but struggle to accurately recover the true generative factors of the tested datasets. In contrast, RA-SAE achieves higher accuracy in recovering the underlying object classes across all synthetic datasets used in the evaluation. Qualitatively, RA-SAE uncovers meaningful concepts, including shadow-based features linked to depth reasoning, context-dependent concepts such as “barber”, and fine-grained edge detection in flower petals. It also learns more structured within-class distinctions than TopK-SAEs, separating features such as rabbit ears, faces, and paws into distinct concepts rather than mixing them.

In conclusion, the researchers have introduced two variants of Sparse Autoencoders, A-SAE and its relaxed counterpart RA-SAE. A-SAE constrains dictionary atoms to the convex hull of the training data, enhancing stability while preserving expressive power, and RA-SAE balances reconstruction quality with meaningful concept discovery in large-scale vision models. To evaluate these approaches, the team developed novel metrics and benchmarks inspired by identifiability theory, providing a systematic framework for measuring dictionary quality and concept disentanglement. Beyond computer vision, A-SAE lays a foundation for more reliable concept discovery across broader modalities, including LLMs and other structured data domains.


Check out the Paper. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 80k+ ML SubReddit.


Sajjad Ansari is a final-year undergraduate at IIT Kharagpur. As a tech enthusiast, he delves into the practical applications of AI with a focus on understanding the impact of AI technologies and their real-world implications. He aims to articulate complex AI concepts in a clear and accessible manner.


