ICML 2024 Workshop on Mechanistic Interpretability

Submission deadline: May 30, 2024, 11:59 UTC (imported from OpenReview; check the website for extensions)
Submission portal: OpenReview
Notes: Auto-imported from the OpenReview venue record on 2026-06-10; please verify and enrich (topics are keyword-guessed).

Accepted papers (93)

Fetched from OpenReview (v2) on 2026-06-10.

  1. Adversarial Circuit Evaluation

    Niels uit de Bos, Adrià Garriga-Alonso · PDF
  2. An Adversarial Example for Direct Logit Attribution: Memory Management in GELU-4L

    Jett Janiak, Can Rager, James Dao, Yeu-Tong Lau · PDF
  3. Analyzing the Generalization and Reliability of Steering Vectors

    Daniel Chee Hian Tan, David Chanin, Aengus Lynch, Adrià Garriga-Alonso, Dimitrios Kanoulas, Brooks Paige, Robert Kirk · PDF
  4. Attention with Markov: A Curious Case of Single-layer Transformers

    Ashok Vardhan Makkuva, Marco Bondaschi, Alliot Nagle, Adway Girish, Hyeji Kim, Martin Jaggi, Michael Gastpar · PDF
  5. Automatically Identifying Local and Global Circuits with Linear Computation Graphs

    Xuyang Ge, Fukang Zhu, Wentao Shu, Junxuan Wang, Zhengfu He, Xipeng Qiu · PDF
  6. Benchmarking Mental State Representations in Language Models

    Matteo Bortoletto, Constantin Ruhdorfer, Lei Shi, Andreas Bulling · PDF
  7. Challenges in Mechanistically Interpreting Model Representations

    Satvik Golechha, James Dao · PDF
  8. Cluster-Norm for Unsupervised Probing of Knowledge

    Walter Laurito, Sharan Maiya, Grégoire Dhimoïla, Owen Ho Wan Yeung, Kaarel Hänni · PDF
  9. Comgra: A Tool for Analyzing and Debugging Neural Networks

    Florian Dietz, Sophie Fellenz, Dietrich Klakow, Marius Kloft · PDF
  10. Compact Proofs of Model Performance via Mechanistic Interpretability

    Jason Gross, Rajashree Agrawal, Thomas Kwa, Euan Ong, Chun Hei Yip, Alex Gibson, Soufiane Noubir, Lawrence Chan · PDF
  11. Confidence Regulation Neurons in Language Models

    Alessandro Stolfo, Ben Peng Wu, Wes Gurnee, Yonatan Belinkov, Xingyi Song, Mrinmaya Sachan, Neel Nanda · PDF
  12. Contrastive Sparse Autoencoders for Interpreting Planning of Chess-Playing Agents

    Yoann Poupart · PDF
  13. Controlling Large Language Model Agents with Entropic Activation Steering

    Nate Rahn, Pierluca D'Oro, Marc G Bellemare · PDF
  14. CoSy: Evaluating Textual Explanations of Neurons

    Laura Kopf, Philine Lou Bommer, Anna Hedström, Sebastian Lapuschkin, Marina MC Höhne, Kirill Bykov · PDF
  15. Crafting Large Language Models for Enhanced Interpretability

    Chung-En Sun, Tuomas Oikarinen, Tsui-Wei Weng · PDF
  16. Decomposing and Interpreting Image Representations via Text in ViTs Beyond CLIP

    Sriram Balasubramanian, Samyadeep Basu, Soheil Feizi · PDF
  17. Delay Embedding Theory of Neural Sequence Models

    Mitchell Ostrow, Adam Joseph Eisen, Ila R Fiete · PDF
  18. Describe-and-Dissect: Interpreting Neurons in Vision Networks with Language Models

    Nicholas Bai, Rahul Ajay Iyer, Tuomas Oikarinen, Tsui-Wei Weng · PDF
  19. Dissecting Query-Key Interaction in Vision Transformers

    Xu Pan, Aaron Philip, Ziqian Xie, Odelia Schwartz · PDF
  20. Do LLMs dream of elephants (when told not to)? Latent concept association and associative memory in transformers

    Yibo Jiang, Goutham Rajendran, Pradeep Kumar Ravikumar, Bryon Aragam · PDF
  21. Does Editing Provide Evidence for Localization?

    Zihao Wang, Victor Veitch · PDF
  22. Exploring the Internal Mechanisms of Music LLMs: A Study of Root and Quality via Probing and Intervention Techniques

    Wenye Ma, Gus Xia · PDF
  23. Extracting Finite State Machines from Transformers

    Rik Adriaensen, Jaron Maene · PDF
  24. Faithful and Fast Influence Function via Advanced Sampling

    Jungyeon Koh, Hyeonsu Lyu, Jonggyu Jang, Hyun Jong Yang · PDF
  25. Finding Visual Task Vectors

    Alberto Hojel, Yutong Bai, Trevor Darrell, Amir Globerson, Amir Bar · PDF
  26. From Alexnet to Transformers: Measuring the Non-linearity of Deep Neural Networks with Affine Optimal Transport

    Quentin Bouniot, Ievgen Redko, Anton Mallasto, Charlotte Laclau, Oliver Struckmeier, Karol Arndt, Markus Heinonen, Ville Kyrki, Samuel Kaski · PDF
  27. Grokked Transformers are Implicit Reasoners: A Mechanistic Journey to the Edge of Generalization

    Boshi Wang, Xiang Yue, Yu Su, Huan Sun · PDF
  28. Grokking and the Geometry of Circuit Formation

    Ahmed Imtiaz Humayun, Randall Balestriero, Richard Baraniuk · PDF
  29. Grokking, Rank Minimization and Generalization in Deep Learning

    David Yunis, Kumar Kshitij Patel, Samuel Wheeler, Pedro Henrique Pamplona Savarese, Gal Vardi, Karen Livescu, Michael Maire, Matthew Walter · PDF
  30. Have Faith in Faithfulness: Going Beyond Circuit Overlap When Finding Model Mechanisms

    Michael Hanna, Sandro Pezzelle, Yonatan Belinkov · PDF
  31. How do Llamas process multilingual text? A latent exploration through activation patching

    Clément Dumas, Veniamin Veselovsky, Giovanni Monea, Robert West, Chris Wendler · PDF
  32. How Do Transformers "Do" Physics? Investigating the Simple Harmonic Oscillator

    Subhash Kantamneni, Ziming Liu, Max Tegmark · PDF
  33. How Do Transformers Fill in the Blanks? A Case Study on Matrix Completion

    Pulkit Gopalani, Ekdeep Singh Lubana, Wei Hu · PDF
  34. How Truncating Weights Improves Reasoning in Language Models

    Lei Chen, Joan Bruna, Alberto Bietti · PDF
  35. Hypothesis Testing the Circuit Hypothesis in LLMs

    Claudia Shi, Nicolas Beltran-Velez, Achille Nazaret, Carolina Zheng, Adrià Garriga-Alonso, Andrew Jesson, Maggie Makar, David Blei · PDF
  36. Identifying Functionally Important Features with End-to-End Sparse Dictionary Learning

    Dan Braun, Jordan Taylor, Nicholas Goldowsky-Dill, Lee Sharkey · PDF
  37. Improving Sparse Decomposition of Language Model Activations with Gated Sparse Autoencoders

    Senthooran Rajamanoharan, Arthur Conmy, Lewis Smith, Tom Lieberum, Vikrant Varma, Janos Kramar, Rohin Shah, Neel Nanda · PDF
  38. Information-Theoretic Progress Measures reveal Grokking is an Emergent Phase Transition

    Kenzo Clauw, Daniele Marinazzo, Sebastiano Stramaglia · PDF
  39. InterpBench: Semi-Synthetic Transformers for Evaluating Mechanistic Interpretability Techniques

    Rohan Gupta, Iván Arcuschin, Thomas Kwa, Adrià Garriga-Alonso · PDF
  40. Interpretability analysis on a pathology foundation model reveals biologically relevant embeddings across modalities

    Nhat Le, Ciyue Shen, Chintan Shah, Blake Martin, Daniel Shenker, Harshith Padigela, Jennifer A. Hipp, Sean Grullon, John Abel, Harsha Vardhan Pokkalla, Dinkar Juyal · PDF
  41. Interpretability in Action: Exploratory Analysis of VPT, a Minecraft Agent

    Karolis Jucys, George Adamopoulos, Mehrab Hamidi, Stephanie Milani, Mohammad Reza Samsami, Artem Zholus, Sonia Joseph, Blake Aaron Richards, Irina Rish, Özgür Şimşek · PDF
  42. Interpreting Attention Layer Outputs with Sparse Autoencoders

    Connor Kissane, Robert Krzyzanowski, Joseph Isaac Bloom, Arthur Conmy, Neel Nanda · PDF
  43. InversionView: A General-Purpose Method for Reading Information from Neural Activations

    Xinting Huang, Madhur Panwar, Navin Goyal, Michael Hahn · PDF
  44. Investigating the Indirect Object Identification circuit in Mamba

    Danielle Ensign, Adrià Garriga-Alonso · PDF
  45. Investigating the Interpretability of Biometric Face Templates Using Gated Sparse Autoencoders and Differentiable Image Parametrizations

    Peter Rot, Klemen Grm · PDF
  46. Is Transformer a Stochastic Parrot? A Case Study in Simple Arithmetic Task

    Peixu Wang, Chen Yu, Yu Ming · PDF
  47. Iteration Head: A Mechanistic Study of Chain-of-Thought

    Vivien Cabannes, Charles Arnal, Wassim Bouaziz, Xingyu Alice Yang, Francois Charton, Julia Kempe · PDF
  48. Language Models Linearly Represent Sentiment

    Curt Tigges, Oskar John Hollinsworth, Atticus Geiger, Neel Nanda · PDF
  49. Learning and Unlearning of Fabricated Knowledge in Language Models

    Chen Sun, Nolan Andrew Miller, Andrey Zhmoginov, Max Vladymyrov, Mark Sandler · PDF
  50. Learning Syntax Without Planting Trees: Understanding When and Why Transformers Generalize Hierarchically

    Kabir Ahuja, Vidhisha Balachandran, Madhur Panwar, Tianxing He, Noah A. Smith, Navin Goyal, Yulia Tsvetkov · PDF
  51. Learning to grok: Emergence of in-context learning and skill composition in modular arithmetic tasks

    Tianyu He, Darshil Doshi, Aritra Das, Andrey Gromov · PDF
  52. LLM Circuit Analyses Are Consistent Across Training and Scale

    Curt Tigges, Michael Hanna, Qinan Yu, Stella Biderman · PDF
  53. Localizing Auditory Concepts in CNNs

    Pratyaksh Gautam, Makarand Tapaswi, Vinoo Alluri · PDF
  54. Logical Distillation of Graph Neural Networks

    Alexander Pluska, Pascal Welke, Thomas Gärtner, Sagar Malhotra · PDF
  55. Look Before You Leap: A Universal Emergent Decomposition of Retrieval Tasks in Language Models

    Alexandre Variengien, Eric Winsor · PDF
  56. Loss in the Crowd: Hidden Breakthroughs in Language Model Training

    Sara Kangaslahti, Elan Rosenfeld, Naomi Saphra · PDF
  57. Manipulating Feature Visualizations with Gradient Slingshots

    Dilyara Bareeva, Marina MC Höhne, Alexander Warnecke, Lukas Pirch, Klaus Robert Muller, Konrad Rieck, Kirill Bykov · PDF
  58. Mathematical Models of Computation in Superposition

    Kaarel Hänni, Jake Mendel, Dmitry Vaintrob, Lawrence Chan · PDF
  59. Measuring Progress in Dictionary Learning for Language Model Interpretability with Board Game Models

    Adam Karvonen, Benjamin Wright, Can Rager, Rico Angell, Jannik Brinkmann, Logan Riggs Smith, Claudio Mayrink Verdun, David Bau, Samuel Marks · PDF
  60. Mechanistic Interpretability of Binary and Ternary Transformer Networks

    Jason Li · PDF
  61. Missed Causes and Ambiguous Effects: Counterfactuals Pose Challenges for Interpreting Neural Networks

    Aaron Mueller · PDF
  62. Modularity in Biologically Inspired Representations Depends on Task Variable Range Independence

    Will Dorrell, Kyle Hsu, Luke Hollingsworth, Jin Hwa Lee, Jiajun Wu, Chelsea Finn, Peter E. Latham, Timothy Edward John Behrens, James C. R. Whittington · PDF
  63. Neuroplasticity and Corruption in Model Mechanisms: A case study of Indirect Object Identification

    Vishnu Kabir Chhabra, Ding Zhu, Mohammad Mahdi Khalili · PDF
  64. On the Similarity of Circuits across Languages: a Case Study on the Subject-verb Agreement Task

    Javier Ferrando, Marta R. Costa-jussà · PDF
  65. Penzai + Treescope: A Toolkit for Interpreting, Visualizing, and Editing Models As Data

    Daniel D. Johnson · PDF
  66. Planning behavior in a recurrent neural network that plays Sokoban

    Adrià Garriga-Alonso, Mohammad Taufeeque, Adam Gleave · PDF
  67. Progressive distillation improves feature learning via implicit curriculum

    Abhishek Panigrahi, Bingbin Liu, Sadhika Malladi, Andrej Risteski, Surbhi Goel · PDF
  68. Refusal in Language Models Is Mediated by a Single Direction

    Andy Arditi, Oscar Balcells Obeso, Aaquib Syed, Daniel Paleka, Nina Panickssery, Wes Gurnee, Neel Nanda · PDF
  69. Relational Composition in Neural Networks: A Survey and Call to Action

    Martin Wattenberg, Fernanda Viégas · PDF
  70. ReLU MLPs Can Compute Numerical Integration: Mechanistic Interpretation of a Non-linear Activation

    Chun Hei Yip, Rajashree Agrawal, Jason Gross · PDF
  71. Representing Rule-based Chatbots with Transformers

    Dan Friedman, Abhishek Panigrahi, Danqi Chen · PDF
  72. Robust Unlearning via Mechanistic Localizations

    Phillip Huang Guo, Aaquib Syed, Abhay Sheshadri, Aidan Ewart, Gintare Karolina Dziugaite · PDF
  73. Segmentation CNNs are denoising models

    Luis A. Zavala-Mondragón, Ruud Van Sloun, Peter H.N. de With, Fons van der Sommen · PDF
  74. Self-Control of LLM Behaviors by Compressing Suffix Gradient into Prefix Controller

    Min Cai, Yuchen Zhang, Shichang Zhang, Fan Yin, Difan Zou, Yisong Yue, Ziniu Hu · PDF
  75. Sparse Autoencoders Match Supervised Features for Model Steering on the IOI Task

    Aleksandar Makelov · PDF
  76. Survival of the Fittest Representation: A Case Study with Modular Addition

    Xiaoman Delores Ding, Zifan Carl Guo, Eric J Michaud, Ziming Liu, Max Tegmark · PDF
  77. Tackling Polysemanticity with Neuron Embeddings

    Alex Foote · PDF
  78. The Concept Percolation Hypothesis: Analyzing the Emergence of Capabilities in Neural Networks Trained on Formal Grammars

    Ekdeep Singh Lubana, Kyogo Kawaguchi, Robert P. Dick, Hidenori Tanaka · PDF
  79. The Geometry of Categorical and Hierarchical Concepts in Large Language Models

    Kiho Park, Yo Joong Choe, Yibo Jiang, Victor Veitch · PDF
  80. The Missing Curve Detectors of InceptionV1: Applying Sparse Autoencoders to InceptionV1 Early Vision

    Liv Gorton · PDF
  81. The Remarkable Robustness of LLMs: Stages of Inference?

    Vedang Lad, Wes Gurnee, Max Tegmark · PDF
  82. Tokenized SAEs: Disentangling SAE Reconstructions

    Thomas Dooms, Daniel Wilhelm · PDF
  83. TracrBench: Generating Interpretability Testbeds with Large Language Models

    Hannes Thurnherr, Jérémy Scheurer · PDF
  84. Transcoders find interpretable LLM feature circuits

    Jacob Dunefsky, Philippe Chlenski, Neel Nanda · PDF
  85. Transformers on Markov data: Constant depth suffices

    Nived Rajaraman, Marco Bondaschi, Ashok Vardhan Makkuva, Kannan Ramchandran, Michael Gastpar · PDF
  86. Uncovering Latent Memories: Assessing Data Leakage and Memorization Patterns in Large Language Models

    Sunny Duan, Mikail Khona, Abhiram Iyer, Rylan Schaeffer, Ila R Fiete · PDF
  87. Understanding Counting in Small Transformers: The Interplay between Attention and Feed-Forward Layers

    Freya Behrens, Luca Biggio, Lenka Zdeborova · PDF
  88. Understanding Inhibition through Maximally Tense Images

    Christopher J Hamblin, Srijani Saha, Talia Konkle, George A. Alvarez · PDF
  89. Using Degeneracy in the Loss Landscape for Mechanistic Interpretability

    Lucius Bushnaq, Jake Mendel, Stefan Heimersheim, Dan Braun, Nicholas Goldowsky-Dill, Kaarel Hänni, Cindy Wu, Marius Hobbhahn · PDF
  90. Visualizing Neural Network Imagination

    Nevan Wichers, Victor Tao, Riccardo Volpato, Fazl Barez · PDF
  91. Weight-based Decomposition: A Case for Bilinear MLPs

    Michael T Pearce, Thomas Dooms, Alice Rigg · PDF
  92. What Makes and Breaks Safety Fine-tuning? A Mechanistic Study

    Samyak Jain, Ekdeep Singh Lubana, Kemal Oksuz, Tom Joy, Philip Torr, Amartya Sanyal, Puneet K. Dokania · PDF
  93. Why do recurrent neural networks suddenly learn? Bifurcation mechanisms in neuro-inspired short-term memory tasks

    Udith Haputhanthri, Liam Storan, Yiqi Jiang, Adam Shai, Hakki Orhun Akengin, Mark Schnitzer, Fatih Dinc, Hidenori Tanaka · PDF