ICML 2024 Workshop on Mechanistic Interpretability
ICML 2024 MI Workshop
- Submission deadline: May 30, 2024, 11:59 UTC (imported from OpenReview; check the website for extensions)
- Submission portal: OpenReview
- Notes: Auto-imported from the OpenReview venue record on 2026-06-10; please verify and enrich (topics are keyword-guessed).
Accepted papers (93)
Fetched from OpenReview (v2) on 2026-06-10.
- Adversarial Circuit Evaluation
- An Adversarial Example for Direct Logit Attribution: Memory Management in GELU-4L
- Analyzing the Generalization and Reliability of Steering Vectors
- Attention with Markov: A Curious Case of Single-layer Transformers
- Automatically Identifying Local and Global Circuits with Linear Computation Graphs
- Benchmarking Mental State Representations in Language Models
- Challenges in Mechanistically Interpreting Model Representations
- Cluster-Norm for Unsupervised Probing of Knowledge
- Comgra: A Tool for Analyzing and Debugging Neural Networks
- Compact Proofs of Model Performance via Mechanistic Interpretability
- Confidence Regulation Neurons in Language Models
- Contrastive Sparse Autoencoders for Interpreting Planning of Chess-Playing Agents
- Controlling Large Language Model Agents with Entropic Activation Steering
- CoSy: Evaluating Textual Explanations of Neurons
- Crafting Large Language Models for Enhanced Interpretability
- Decomposing and Interpreting Image Representations via Text in ViTs Beyond CLIP
- Delay Embedding Theory of Neural Sequence Models
- Describe-and-Dissect: Interpreting Neurons in Vision Networks with Language Models
- Dissecting Query-Key Interaction in Vision Transformers
- Do LLMs dream of elephants (when told not to)? Latent concept association and associative memory in transformers
- Does Editing Provide Evidence for Localization?
- Exploring the Internal Mechanisms of Music LLMs: A Study of Root and Quality via Probing and Intervention Techniques
- Extracting Finite State Machines from Transformers
- Faithful and Fast Influence Function via Advanced Sampling
- Finding Visual Task Vectors
- From Alexnet to Transformers: Measuring the Non-linearity of Deep Neural Networks with Affine Optimal Transport
- Grokked Transformers are Implicit Reasoners: A Mechanistic Journey to the Edge of Generalization
- Grokking and the Geometry of Circuit Formation
- Grokking, Rank Minimization and Generalization in Deep Learning
- Have Faith in Faithfulness: Going Beyond Circuit Overlap When Finding Model Mechanisms
- How do Llamas process multilingual text? A latent exploration through activation patching
- How Do Transformers "Do" Physics? Investigating the Simple Harmonic Oscillator
- How Do Transformers Fill in the Blanks? A Case Study on Matrix Completion
- How Truncating Weights Improves Reasoning in Language Models
- Hypothesis Testing the Circuit Hypothesis in LLMs
- Identifying Functionally Important Features with End-to-End Sparse Dictionary Learning
- Improving Sparse Decomposition of Language Model Activations with Gated Sparse Autoencoders
- Information-Theoretic Progress Measures reveal Grokking is an Emergent Phase Transition
- InterpBench: Semi-Synthetic Transformers for Evaluating Mechanistic Interpretability Techniques
- Interpretability analysis on a pathology foundation model reveals biologically relevant embeddings across modalities
- Interpretability in Action: Exploratory Analysis of VPT, a Minecraft Agent
- Interpreting Attention Layer Outputs with Sparse Autoencoders
- InversionView: A General-Purpose Method for Reading Information from Neural Activations
- Investigating the Indirect Object Identification circuit in Mamba
- Investigating the Interpretability of Biometric Face Templates Using Gated Sparse Autoencoders and Differentiable Image Parametrizations
- Is Transformer a Stochastic Parrot? A Case Study in Simple Arithmetic Task
- Iteration Head: A Mechanistic Study of Chain-of-Thought
- Language Models Linearly Represent Sentiment
- Learning and Unlearning of Fabricated Knowledge in Language Models
- Learning Syntax Without Planting Trees: Understanding When and Why Transformers Generalize Hierarchically
- Learning to grok: Emergence of in-context learning and skill composition in modular arithmetic tasks
- LLM Circuit Analyses Are Consistent Across Training and Scale
- Localizing Auditory Concepts in CNNs
- Logical Distillation of Graph Neural Networks
- Look Before You Leap: A Universal Emergent Decomposition of Retrieval Tasks in Language Models
- Loss in the Crowd: Hidden Breakthroughs in Language Model Training
- Manipulating Feature Visualizations with Gradient Slingshots
- Mathematical Models of Computation in Superposition
- Measuring Progress in Dictionary Learning for Language Model Interpretability with Board Game Models
- Mechanistic Interpretability of Binary and Ternary Transformer Networks
- Missed Causes and Ambiguous Effects: Counterfactuals Pose Challenges for Interpreting Neural Networks
- Modularity in Biologically Inspired Representations Depends on Task Variable Range Independence
- Neuroplasticity and Corruption in Model Mechanisms: A case study of Indirect Object Identification
- On the Similarity of Circuits across Languages: a Case Study on the Subject-verb Agreement Task
- Penzai + Treescope: A Toolkit for Interpreting, Visualizing, and Editing Models As Data
- Planning behavior in a recurrent neural network that plays Sokoban
- Progressive distillation improves feature learning via implicit curriculum
- Refusal in Language Models Is Mediated by a Single Direction
- Relational Composition in Neural Networks: A Survey and Call to Action
- ReLU MLPs Can Compute Numerical Integration: Mechanistic Interpretation of a Non-linear Activation
- Representing Rule-based Chatbots with Transformers
- Robust Unlearning via Mechanistic Localizations
- Segmentation CNNs are denoising models
- Self-Control of LLM Behaviors by Compressing Suffix Gradient into Prefix Controller
- Sparse Autoencoders Match Supervised Features for Model Steering on the IOI Task
- Survival of the Fittest Representation: A Case Study with Modular Addition
- Tackling Polysemanticity with Neuron Embeddings
- The Concept Percolation Hypothesis: Analyzing the Emergence of Capabilities in Neural Networks Trained on Formal Grammars
- The Geometry of Categorical and Hierarchical Concepts in Large Language Models
- The Missing Curve Detectors of InceptionV1: Applying Sparse Autoencoders to InceptionV1 Early Vision
- The Remarkable Robustness of LLMs: Stages of Inference?
- Tokenized SAEs: Disentangling SAE Reconstructions
- TracrBench: Generating Interpretability Testbeds with Large Language Models
- Transcoders find interpretable LLM feature circuits
- Transformers on Markov data: Constant depth suffices
- Uncovering Latent Memories: Assessing Data Leakage and Memorization Patterns in Large Language Models
- Understanding Counting in Small Transformers: The Interplay between Attention and Feed-Forward Layers
- Understanding Inhibition through Maximally Tense Images
- Using Degeneracy in the Loss Landscape for Mechanistic Interpretability
- Visualizing Neural Network Imagination
- Weight-based Decomposition: A Case for Bilinear MLPs
- What Makes and Breaks Safety Fine-tuning? A Mechanistic Study
- Why do recurrent neural networks suddenly learn? Bifurcation mechanisms in neuro-inspired short-term memory tasks