ICML 2024 Workshop on Mechanistic Interpretability

ICML 2024 MI Workshop

Submission deadline: May 30, 2024, 11:59 UTC (imported from OpenReview; check the website for extensions)
Submission portal: OpenReview

Accepted papers (93)

Fetched from OpenReview (v2) on 2026-06-10.

Adversarial Circuit Evaluation
Niels uit de Bos, Adrià Garriga-Alonso · PDF
An Adversarial Example for Direct Logit Attribution: Memory Management in GELU-4L
Jett Janiak, Can Rager, James Dao, Yeu-Tong Lau · PDF
Analyzing the Generalization and Reliability of Steering Vectors
Daniel Chee Hian Tan, David Chanin, Aengus Lynch, Adrià Garriga-Alonso, Dimitrios Kanoulas, Brooks Paige, Robert Kirk · PDF
Attention with Markov: A Curious Case of Single-layer Transformers
Ashok Vardhan Makkuva, Marco Bondaschi, Alliot Nagle, Adway Girish, Hyeji Kim, Martin Jaggi, Michael Gastpar · PDF
Automatically Identifying Local and Global Circuits with Linear Computation Graphs
Xuyang Ge, Fukang Zhu, Wentao Shu, Junxuan Wang, Zhengfu He, Xipeng Qiu · PDF
Benchmarking Mental State Representations in Language Models
Matteo Bortoletto, Constantin Ruhdorfer, Lei Shi, Andreas Bulling · PDF
Challenges in Mechanistically Interpreting Model Representations
Satvik Golechha, James Dao · PDF
Cluster-Norm for Unsupervised Probing of Knowledge
Walter Laurito, Sharan Maiya, Grégoire DHIMOÏLA, Owen Ho Wan Yeung, Kaarel Hänni · PDF
Comgra: A Tool for Analyzing and Debugging Neural Networks
Florian Dietz, Sophie Fellenz, Dietrich Klakow, Marius Kloft · PDF
Compact Proofs of Model Performance via Mechanistic Interpretability
Jason Gross, Rajashree Agrawal, Thomas Kwa, Euan Ong, Chun Hei Yip, Alex Gibson, Soufiane Noubir, Lawrence Chan · PDF
Confidence Regulation Neurons in Language Models
Alessandro Stolfo, Ben Peng Wu, Wes Gurnee, Yonatan Belinkov, Xingyi Song, Mrinmaya Sachan, Neel Nanda · PDF
Contrastive Sparse Autoencoders for Interpreting Planning of Chess-Playing Agents
Yoann Poupart · PDF
Controlling Large Language Model Agents with Entropic Activation Steering
Nate Rahn, Pierluca D'Oro, Marc G Bellemare · PDF
CoSy: Evaluating Textual Explanations of Neurons
Laura Kopf, Philine Lou Bommer, Anna Hedström, Sebastian Lapuschkin, Marina MC Höhne, Kirill Bykov · PDF
Crafting Large Language Models for Enhanced Interpretability
Chung-En Sun, Tuomas Oikarinen, Tsui-Wei Weng · PDF
Decomposing and Interpreting Image Representations via Text in ViTs Beyond CLIP
Sriram Balasubramanian, Samyadeep Basu, Soheil Feizi · PDF
Delay Embedding Theory of Neural Sequence Models
Mitchell Ostrow, Adam Joseph Eisen, Ila R Fiete · PDF
Describe-and-Dissect: Interpreting Neurons in Vision Networks with Language Models
Nicholas Bai, Rahul Ajay Iyer, Tuomas Oikarinen, Tsui-Wei Weng · PDF
Dissecting Query-Key Interaction in Vision Transformers
Xu Pan, Aaron Philip, Ziqian Xie, Odelia Schwartz · PDF
Do LLMs dream of elephants (when told not to)? Latent concept association and associative memory in transformers
Yibo Jiang, Goutham Rajendran, Pradeep Kumar Ravikumar, Bryon Aragam · PDF
Does Editing Provide Evidence for Localization?
Zihao Wang, Victor Veitch · PDF
Exploring the Internal Mechanisms of Music LLMs: A Study of Root and Quality via Probing and Intervention Techniques
Wenye Ma, Gus Xia · PDF
Extracting Finite State Machines from Transformers
Rik Adriaensen, Jaron Maene · PDF
Faithful and Fast Influence Function via Advanced Sampling
Jungyeon Koh, Hyeonsu Lyu, Jonggyu Jang, Hyun Jong Yang · PDF
Finding Visual Task Vectors
Alberto Hojel, Yutong Bai, Trevor Darrell, Amir Globerson, Amir Bar · PDF
From Alexnet to Transformers: Measuring the Non-linearity of Deep Neural Networks with Affine Optimal Transport
Quentin Bouniot, Ievgen Redko, Anton Mallasto, Charlotte Laclau, Oliver Struckmeier, Karol Arndt, Markus Heinonen, Ville Kyrki, Samuel Kaski · PDF
Grokked Transformers are Implicit Reasoners: A Mechanistic Journey to the Edge of Generalization
Boshi Wang, Xiang Yue, Yu Su, Huan Sun · PDF
Grokking and the Geometry of Circuit Formation
Ahmed Imtiaz Humayun, Randall Balestriero, Richard Baraniuk · PDF
Grokking, Rank Minimization and Generalization in Deep Learning
David Yunis, Kumar Kshitij Patel, Samuel Wheeler, Pedro Henrique Pamplona Savarese, Gal Vardi, Karen Livescu, Michael Maire, Matthew Walter · PDF
Have Faith in Faithfulness: Going Beyond Circuit Overlap When Finding Model Mechanisms
Michael Hanna, Sandro Pezzelle, Yonatan Belinkov · PDF
How do Llamas process multilingual text? A latent exploration through activation patching
Clément Dumas, Veniamin Veselovsky, Giovanni Monea, Robert West, Chris Wendler · PDF
How Do Transformers "Do" Physics? Investigating the Simple Harmonic Oscillator
Subhash Kantamneni, Ziming Liu, Max Tegmark · PDF
How Do Transformers Fill in the Blanks? A Case Study on Matrix Completion
Pulkit Gopalani, Ekdeep Singh Lubana, Wei Hu · PDF
How Truncating Weights Improves Reasoning in Language Models
Lei Chen, Joan Bruna, Alberto Bietti · PDF
Hypothesis Testing the Circuit Hypothesis in LLMs
Claudia Shi, Nicolas Beltran-Velez, Achille Nazaret, Carolina Zheng, Adrià Garriga-Alonso, Andrew Jesson, Maggie Makar, David Blei · PDF
Identifying Functionally Important Features with End-to-End Sparse Dictionary Learning
Dan Braun, Jordan Taylor, Nicholas Goldowsky-Dill, Lee Sharkey · PDF
Improving Sparse Decomposition of Language Model Activations with Gated Sparse Autoencoders
Senthooran Rajamanoharan, Arthur Conmy, Lewis Smith, Tom Lieberum, Vikrant Varma, Janos Kramar, Rohin Shah, Neel Nanda · PDF
Information-Theoretic Progress Measures reveal Grokking is an Emergent Phase Transition
Kenzo Clauw, Daniele Marinazzo, Sebastiano Stramaglia · PDF
InterpBench: Semi-Synthetic Transformers for Evaluating Mechanistic Interpretability Techniques
Rohan Gupta, Iván Arcuschin, Thomas Kwa, Adrià Garriga-Alonso · PDF
Interpretability analysis on a pathology foundation model reveals biologically relevant embeddings across modalities
Nhat Le, Ciyue Shen, Chintan Shah, Blake Martin, Daniel Shenker, Harshith Padigela, Jennifer A. Hipp, Sean Grullon, John Abel, Harsha Vardhan pokkalla, Dinkar Juyal · PDF
Interpretability in Action: Exploratory Analysis of VPT, a Minecraft Agent
Karolis Jucys, George Adamopoulos, Mehrab Hamidi, Stephanie Milani, Mohammad Reza Samsami, Artem Zholus, Sonia Joseph, Blake Aaron Richards, Irina Rish, Özgür Şimşek · PDF
Interpreting Attention Layer Outputs with Sparse Autoencoders
Connor Kissane, Robert Krzyzanowski, Joseph Isaac Bloom, Arthur Conmy, Neel Nanda · PDF
InversionView: A General-Purpose Method for Reading Information from Neural Activations
Xinting Huang, Madhur Panwar, Navin Goyal, Michael Hahn · PDF
Investigating the Indirect Object Identification circuit in Mamba
Danielle Ensign, Adrià Garriga-Alonso · PDF
Investigating the Interpretability of Biometric Face Templates Using Gated Sparse Autoencoders and Differentiable Image Parametrizations
Peter Rot, Klemen Grm · PDF
Is Transformer a Stochastic Parrot? A Case Study in Simple Arithmetic Task
Peixu Wang, Chen Yu, Yu Ming · PDF
Iteration Head: A Mechanistic Study of Chain-of-Thought
Vivien Cabannes, Charles Arnal, Wassim Bouaziz, Xingyu Alice Yang, Francois Charton, Julia Kempe · PDF
Language Models Linearly Represent Sentiment
Curt Tigges, Oskar John Hollinsworth, Atticus Geiger, Neel Nanda · PDF
Learning and Unlearning of Fabricated Knowledge in Language Models
Chen Sun, Nolan Andrew Miller, Andrey Zhmoginov, Max Vladymyrov, Mark Sandler · PDF
Learning Syntax Without Planting Trees: Understanding When and Why Transformers Generalize Hierarchically
Kabir Ahuja, Vidhisha Balachandran, Madhur Panwar, Tianxing He, Noah A. Smith, Navin Goyal, Yulia Tsvetkov · PDF
Learning to grok: Emergence of in-context learning and skill composition in modular arithmetic tasks
Tianyu He, Darshil Doshi, Aritra Das, Andrey Gromov · PDF
LLM Circuit Analyses Are Consistent Across Training and Scale
Curt Tigges, Michael Hanna, Qinan Yu, Stella Biderman · PDF
Localizing Auditory Concepts in CNNs
Pratyaksh Gautam, Makarand Tapaswi, Vinoo Alluri · PDF
Logical Distillation of Graph Neural Networks
Alexander Pluska, Pascal Welke, Thomas Gärtner, SAGAR MALHOTRA · PDF
Look Before You Leap: A Universal Emergent Decomposition of Retrieval Tasks in Language Models
Alexandre Variengien, Eric Winsor · PDF
Loss in the Crowd: Hidden Breakthroughs in Language Model Training
Sara Kangaslahti, Elan Rosenfeld, Naomi Saphra · PDF
Manipulating Feature Visualizations with Gradient Slingshots
Dilyara Bareeva, Marina MC Höhne, Alexander Warnecke, Lukas Pirch, Klaus Robert Muller, Konrad Rieck, Kirill Bykov · PDF
Mathematical Models of Computation in Superposition
Kaarel Hänni, Jake Mendel, Dmitry Vaintrob, Lawrence Chan · PDF
Measuring Progress in Dictionary Learning for Language Model Interpretability with Board Game Models
Adam Karvonen, Benjamin Wright, Can Rager, Rico Angell, Jannik Brinkmann, Logan Riggs Smith, Claudio Mayrink Verdun, David Bau, Samuel Marks · PDF
Mechanistic Interpretability of Binary and Ternary Transformer Networks
Jason Li · PDF
Missed Causes and Ambiguous Effects: Counterfactuals Pose Challenges for Interpreting Neural Networks
Aaron Mueller · PDF
Modularity in Biologically Inspired Representations Depends on Task Variable Range Independence
Will Dorrell, Kyle Hsu, Luke Hollingsworth, Jin Hwa Lee, Jiajun Wu, Chelsea Finn, Peter E. Latham, Timothy Edward John Behrens, James C. R. Whittington · PDF
Neuroplasticity and Corruption in Model Mechanisms: A case study of Indirect Object Identification
Vishnu Kabir Chhabra, Ding Zhu, Mohammad Mahdi Khalili · PDF
On the Similarity of Circuits across Languages: a Case Study on the Subject-verb Agreement Task
Javier Ferrando, Marta R. Costa-jussà · PDF
Penzai + Treescope: A Toolkit for Interpreting, Visualizing, and Editing Models As Data
Daniel D. Johnson · PDF
Planning behavior in a recurrent neural network that plays Sokoban
Adrià Garriga-Alonso, Mohammad Taufeeque, Adam Gleave · PDF
Progressive distillation improves feature learning via implicit curriculum
Abhishek Panigrahi, Bingbin Liu, Sadhika Malladi, Andrej Risteski, Surbhi Goel · PDF
Refusal in Language Models Is Mediated by a Single Direction
Andy Arditi, Oscar Balcells Obeso, Aaquib Syed, Daniel Paleka, Nina Panickssery, Wes Gurnee, Neel Nanda · PDF
Relational Composition in Neural Networks: A Survey and Call to Action
Martin Wattenberg, Fernanda Viégas · PDF
ReLU MLPs Can Compute Numerical Integration: Mechanistic Interpretation of a Non-linear Activation
Chun Hei Yip, Rajashree Agrawal, Jason Gross · PDF
Representing Rule-based Chatbots with Transformers
Dan Friedman, Abhishek Panigrahi, Danqi Chen · PDF
Robust Unlearning via Mechanistic Localizations
Phillip Huang Guo, Aaquib Syed, Abhay Sheshadri, Aidan Ewart, Gintare Karolina Dziugaite · PDF
Segmentation CNNs are denoising models
Luis A. Zavala-Mondragón, Ruud Van Sloun, Peter H.N. de With, Fons van der Sommen · PDF
Self-Control of LLM Behaviors by Compressing Suffix Gradient into Prefix Controller
Min Cai, Yuchen Zhang, Shichang Zhang, Fan Yin, Difan Zou, Yisong Yue, Ziniu Hu · PDF
Sparse Autoencoders Match Supervised Features for Model Steering on the IOI Task
Aleksandar Makelov · PDF
Survival of the Fittest Representation: A Case Study with Modular Addition
Xiaoman Delores Ding, Zifan Carl Guo, Eric J Michaud, Ziming Liu, Max Tegmark · PDF
Tackling Polysemanticity with Neuron Embeddings
Alex Foote · PDF
The Concept Percolation Hypothesis: Analyzing the Emergence of Capabilities in Neural Networks Trained on Formal Grammars
Ekdeep Singh Lubana, Kyogo Kawaguchi, Robert P. Dick, Hidenori Tanaka · PDF
The Geometry of Categorical and Hierarchical Concepts in Large Language Models
Kiho Park, Yo Joong Choe, Yibo Jiang, Victor Veitch · PDF
The Missing Curve Detectors of InceptionV1: Applying Sparse Autoencoders to InceptionV1 Early Vision
Liv Gorton · PDF
The Remarkable Robustness of LLMs: Stages of Inference?
Vedang Lad, Wes Gurnee, Max Tegmark · PDF
Tokenized SAEs: Disentangling SAE Reconstructions
Thomas Dooms, Daniel Wilhelm · PDF
TracrBench: Generating Interpretability Testbeds with Large Language Models
Hannes Thurnherr, Jérémy Scheurer · PDF
Transcoders find interpretable LLM feature circuits
Jacob Dunefsky, Philippe Chlenski, Neel Nanda · PDF
Transformers on Markov data: Constant depth suffices
Nived Rajaraman, Marco Bondaschi, Ashok Vardhan Makkuva, Kannan Ramchandran, Michael Gastpar · PDF
Uncovering Latent Memories: Assessing Data Leakage and Memorization Patterns in Large Language Models
Sunny Duan, Mikail Khona, Abhiram Iyer, Rylan Schaeffer, Ila R Fiete · PDF
Understanding Counting in Small Transformers: The Interplay between Attention and Feed-Forward Layers
Freya Behrens, Luca Biggio, Lenka Zdeborova · PDF
Understanding Inhibition through Maximally Tense Images
Christopher J Hamblin, Srijani Saha, Talia Konkle, George A. Alvarez · PDF
Using Degeneracy in the Loss Landscape for Mechanistic Interpretability
Lucius Bushnaq, Jake Mendel, Stefan Heimersheim, Dan Braun, Nicholas Goldowsky-Dill, Kaarel Hänni, Cindy Wu, Marius Hobbhahn · PDF
Visualizing Neural Network Imagination
Nevan Wichers, Victor Tao, Riccardo Volpato, Fazl Barez · PDF
Weight-based Decomposition: A Case for Bilinear MLPs
Michael T Pearce, Thomas Dooms, Alice Rigg · PDF
What Makes and Breaks Safety Fine-tuning? A Mechanistic Study
Samyak Jain, Ekdeep Singh Lubana, Kemal Oksuz, Tom Joy, Philip Torr, Amartya Sanyal, Puneet K. Dokania · PDF
Why do recurrent neural networks suddenly learn? Bifurcation mechanisms in neuro-inspired short-term memory tasks
Udith Haputhanthri, Liam Storan, Yiqi Jiang, Adam Shai, Hakki Orhun Akengin, Mark Schnitzer, Fatih Dinc, Hidenori Tanaka · PDF
