March 5, 2025
Partially rewriting an LLM in natural language
Our most recent work on using sparse autoencoders (SAEs) focused on automatically generating natural language interpretations for their latents and evaluating how good they are. If all the latents were interpretable, we could use the interpretations to simulate the latent activations, replacing the SAE encoder with an LLM and a natural language prompt. We should then be able to patch the activations generated by this natural language simulation back into the model and get nearly identical behavior to the original. In the limit, we could effectively “rewrite” the entire model in terms of interpretable features and interpretable operations on those