Tracing the thoughts of a large language model. In a follow-up to the research that brought us the delightful Golden Gate Claude last year, Anthropic have published two new papers about LLM interpretability:
To my own personal delight, neither of these papers is published as a PDF. They’re both presented as glorious mobile-friendly HTML pages with linkable sections and even some inline interactive diagrams. More of this please!