Mechanistic Interpretability Workshop at NeurIPS 2025

Mech Interp Workshop (NeurIPS 2025)

Submission deadline: Aug 23, 2025, 11:59 UTC (imported from OpenReview; check the website for extensions)
Submission portal: OpenReview

Accepted papers (187)

Fetched from OpenReview (v2) on 2026-06-10.

Activation Steering in Generative Settings via Contrastive Causal Mediation Analysis
Aruna Sankaranarayanan, Amir Zur, Atticus Geiger, Dylan Hadfield-Menell · PDF
Activation Transport Operators
Andrzej Szablewski, Marek Masiak · PDF
Adaptive Task Vectors for Large Language Models
Joonseong Kang, Soojeong Lee, Sumin Park, Subeen Park, Taero Kim, Jihee Kim, Ryunyi LEE, Kyungwoo Song · PDF
Adversarial Attacks Leverage Interference Between Features in Superposition
Edward Stevinson, Lucas Prieto, Melih Barsbey, Tolga Birdal · PDF
Adversarial Examples Are Not Bugs, They Are Superposition
Liv Gorton, Owen Lewis · PDF
Analyzing and Improving Chain-of-Thought Monitorability Through Information Theory
Usman Anwar, Tim Bakker, Dana Kianfar, Cristina Pinneri, Christos Louizos · PDF
Angular Steering: Behavior Control via Rotation in Activation Space
Hieu M. Vu, Tan Minh Nguyen · PDF
Attention Layers Add Into Low-Dimensional Residual Subspaces
Junxuan Wang, Xuyang Ge, Wentao Shu, Zhengfu He, Xipeng Qiu · PDF
Attention Pattern Discovery at Scale
Jonathan Katzy, Razvan Mihai Popescu, Erik Mekkes, Arie van Deursen, Maliheh Izadi · PDF
Attributing Response to Context: A Jensen–Shannon Divergence Driven Mechanistic Study of Context Attribution in Retrieval-Augmented Generation
Ruizhe Li, Chen Chen, Yuchen Hu, Yanjun Gao, Xi Wang, Emine Yilmaz · PDF
Automated Detection of Visual Attribute Reliance with a Self-Reflective Agent
Christy Li, Josep Lopez Camuñas, Jake Thomas Touchet, Jacob Andreas, Agata Lapedriza, Antonio Torralba, Tamar Rott Shaham · PDF
Automatically Finding Rule-Based Neurons in OthelloGPT
Aditya Singh, Zihang Wen, Srujananjali Medicherla, Adam Karvonen, Can Rager · PDF
Base Models Know How to Reason, Thinking Models Learn When
Constantin Venhoff, Iván Arcuschin, Philip Torr, Arthur Conmy, Neel Nanda · PDF
Better Hessians Matter: Studying the Impact of Curvature Approximations in Influence Functions
Dat Minh Hong, Bruno Kacper Mlodozeniec, Runa Eschenhagen, Richard E. Turner · PDF
Better World Models Can Lead to Better Post-Training Performance
Prakhar Gupta, Henry Conklin, Sarah-Jane Leslie, Andrew Lee · PDF
Beyond the Black Box: Identifiable Interpretation and Control in Generative Models via Causal Minimality
Lingjing Kong, Shaoan Xie, Guangyi Chen, Yuewen Sun, Xiangchen Song, Eric P. Xing, Kun Zhang · PDF
Beyond the Final Layer: Intermediate Representations for Better Multilingual Calibration in Large Language Models
Ej Zhou, Caiqi Zhang, Tiancheng Hu, Chengzu Li, Nigel Collier, Ivan Vulić, Anna Korhonen · PDF
Bilinear Convolution Decomposition for Causal RL Interpretability
Sinem Erisken, Alice Rigg, Narmeen Fatimah Oozeer · PDF
Bimodality of Sparse Autoencoder Features is Still There and Can Be Fixed
Michał Brzozowski · PDF
Breaking the Mirror: Activation-Based Mitigation of Self-Preference in LLM Evaluators
Dani Roytburg, Matthew Nguyen, Matthew Bozoukov, Jou Barzdukas, Hongyu Fu, Narmeen Fatimah Oozeer · PDF
Can Interpretation Predict Behavior on Unseen Data?
Victoria R Li, Jenny Kaufmann, Martin Wattenberg, David Alvarez-Melis, Naomi Saphra · PDF
Capturing Polysemanticity with PRISM: A Multi-Concept Feature Description Framework
Laura Kopf, Nils Feldhus, Kirill Bykov, Philine Lou Bommer, Anna Hedström, Marina MC Höhne, Oliver Eberle · PDF
Causal Discovery and Inference through Next-Token Prediction
Eivinas Butkus, Nikolaus Kriegeskorte · PDF
Centroid Affinity: How Deep Networks Represent Features
Thomas Walker, Ahmed Imtiaz Humayun, Randall Balestriero, Richard Baraniuk · PDF
Circuit-Tracer: A New Library for Finding Feature Circuits
Michael Hanna, Mateusz Piotrowski, Jack Lindsey, Emmanuel Ameisen · PDF
Comparing Clinical and General LLMs on Knowledge Boundaries and Robustness
Xingmeng Zhao, Ke Yang, Anthony Rios · PDF
Composable Sparse Subnetworks via Maximum-Entropy Principle
Francesco Caso, Samuele Fonio, Nicola Saccomanno, Simone Monaco, Fabrizio Silvestri · PDF
Compressed Computation is (probably) not Computation in Superposition
Jai Bhagat, Sara Molas-Medina, Giorgi Giglemiani, Stefan Heimersheim · PDF
Compressed Computation: Dense Circuits in a Toy Model of the Universal-AND Problem
Adam Newgas · PDF
Context Matters: Analyzing the Generalizability of Linear Probing and Steering Across Diverse Scenarios
Isha Agarwal, Saharsha Navani, Fazl Barez · PDF
ContextBench: Modifying Contexts for Targeted Latent Activation and Behaviour Elicitation
Robert Graham, Edward Stevinson, Leo Richter, Alexander Chia, Joseph Miller, Joseph Isaac Bloom · PDF
Control and Predictivity in Neural Interpretability
Satchel Grant, Alexa R. Tartaglini · PDF
Controlling Vision–Language–Action Policies through Sparse Latent Directions
Momin Ahmad Khan, Novak Boskov, Fatima M. Anwar, Manzoor A. Khan · PDF
Convergent Linear Representations of Emergent Misalignment
Anna Soligo, Edward Turner, Senthooran Rajamanoharan, Neel Nanda · PDF
Correlations in the Data Lead to Semantically Rich Feature Geometry Under Superposition
Lucas Prieto, Edward Stevinson, Melih Barsbey, Tolga Birdal, Pedro A. M. Mediano · PDF
Cross-Architecture Model Diffing with Crosscoders: Unsupervised Discovery of Differences Between LLMs
Thomas Jiralerspong, Trenton Bricken · PDF
Death by a Thousand Directions: Exploring the Geometry of Harmfulness in LLMs through Subconcept Probing
McNair Shah, Saleena Angeline Sartawita, Adhitya Rajendra Kumar, Naitik Chheda, Will Cai, Kevin Zhu, Sean O'Brien, Vasu Sharma · PDF
Decomposing Attention To Find Context-Sensitive Neurons
Alex Gibson · PDF
Decomposing Representation Space into Interpretable Subspaces with Unsupervised Learning
Xinting Huang, Michael Hahn · PDF
Decomposition of Small Transformer Models
Casper L. Christensen, Logan Riggs Smith · PDF
Demystifying Cipher-Following in Large Language Models via Activation Analysis
Megan Gross, Yigitcan Kaya, Christopher Kruegel, Giovanni Vigna · PDF
Dense SAE Latents Are Features, Not Bugs
Xiaoqing Sun, Alessandro Stolfo, Joshua Engels, Ben Peng Wu, Senthooran Rajamanoharan, Mrinmaya Sachan, Max Tegmark · PDF
Detecting and Characterizing Planning in Language Models
Jatin Nainani, Sankaran Vaidyanathan, Connor Watts, Andre N. Assis, Alice Rigg · PDF
Detecting Motivated Reasoning in the Internal Representations of Language Models
Parsa Mirtaheri, Mikhail Belkin · PDF
Discovering Transformer Circuits via a Hybrid Attribution and Pruning Framework
Hao Gu, Vibhas Nair, Amrithaa Ashok Kumar, Ryan Lagasse, Kevin Zhu, Sean O'Brien, Ashwinee Panda · PDF
Do Natural Language Descriptions of Model Activations Convey Privileged Information?
Millicent Li, Alberto Mario Ceballos Arroyo, Giordano Rogers, Naomi Saphra, Byron C Wallace · PDF
Do We Always Need Sampling? Eliciting Numerical Predictive Distributions of LLMs Without Auto-Regression
Julianna Piskorz, Kasia Kobalczyk, Mihaela van der Schaar · PDF
Does FLUX Know What It’s Writing?
Adrian Chang, Sheridan Feucht, Byron C Wallace, David Bau · PDF
Don't Believe the Belief Hype!
Alessandro Corona Mendozza · PDF
Dual Mechanisms of Value Expression: Decomposing Intrinsic and Prompted Values in Language Models
Jongwook Han, Jongwon Lim, Injin Kong, Yohan Jo · PDF
Dyslexify: A Mechanistic Defense Against Typographic Attacks in CLIP
Lorenz Hufe, Constantin Venhoff, Maximilian Dreyer, Erblina Purelku, Sebastian Lapuschkin, Wojciech Samek · PDF
Eliciting Secret Knowledge from Language Models
Bartosz Cywiński, Emil Ryd, Rowan Wang, Senthooran Rajamanoharan, Neel Nanda, Arthur Conmy, Samuel Marks · PDF
Emergence of Linear Truth Encodings in Language Models
Shauli Ravfogel, Gilad Yehudai, Tal Linzen, Joan Bruna, Alberto Bietti · PDF
Emergent Specialization: Rare Token Neurons in Language Models
Jing Liu, Yueheng Li, Haozheng Wang · PDF
Emergent World Beliefs: Exploring Transformers in Stochastic Games
Adam Kamel, Tanish Rastogi, Michael Ma, Kailash Ranganathan, Kevin Zhu · PDF
Emerging Human-like Strategies for Semantic Memory Foraging in Large Language Models
Eric Lacosse, Mariana Duarte, Peter Todd, Daniel C McNamee · PDF
Enforcing Orderedness in SAEs to Improve Feature Consistency
Sophie L. Wang, Alex Quach, Nithin Parsan, John Jingxuan Yang · PDF
Entity Multiplexing Through Activation Strength: Understanding goals in A Maze Solving Agent
Benjamin Sturgeon, Jonathan P. Shock · PDF
Equivalent Linear Mappings of Large Language Models
James Robert Golden · PDF
Evaluating Explanatory Evaluations: An Explanatory Virtues Framework for Mechanistic Interpretability
Kola Ayonrinde, Louis Jaburi · PDF
Evaluating SAE interpretability without explanations
Gonçalo Paulo, Nora Belrose · PDF
Faithful and Stable Neuron Explanations for Trustworthy Mechanistic Interpretability
Ge Yan, Tuomas Oikarinen, Tsui-Wei Weng · PDF
False Sense of Security: Why Probing-based Malicious Input Detection Fails to Generalize
Cheng Wang, Zeming Wei, Qin Liu, Wenxuan Zhou, Muhao Chen · PDF
Feature Hedging: Correlated Features Break Narrow Sparse Autoencoders
David Chanin, Tomáš Dulka, Adrià Garriga-Alonso · PDF
Feature interactions in sparse crosscoders from compact proofs
Dmitry Manning-Coe, Thomas Read, Anna Soligo, Oliver Clive-Griffin, Chun Hei Yip, Alex Gibson, Rajashree Agrawal, Jason Gross · PDF
Feature-Guided SAE Steering for Refusal-Rate Control using Contrasting Prompts
Samaksh Bhargav, Zining Zhu · PDF
Finding Manifolds with Bilinear Autoencoders
Thomas Dooms, Ward Gauderis · PDF
Fluid Reasoning Representations
Dmitrii Kharlapenko, Alessandro Stolfo, Arthur Conmy, Mrinmaya Sachan, Zhijing Jin · PDF
From Black-box to Causal-box: Towards Building More Interpretable Models
Inwoo Hwang, Yushu Pan, Elias Bareinboim · PDF
From Indirect Object Identification to Syllogisms: Exploring Binary Mechanisms in Transformer Circuits
Karim Saraipour, Shichang Zhang · PDF
From Local to Contextually-Enriched Local Representations: A Mechanism for Holistic Processing in DINOv2 ViTs
Fenil R. Doshi, Thomas Fel, Talia Konkle, George A. Alvarez · PDF
From Narrow Unlearning to Emergent Misalignment: Causes, Consequences, and Containment in LLMs
Erum Mushtaq, Anil Ramakrishna, Satyapriya Krishna, Sattvik Sahai, Prasoon Goyal, Kai-Wei Chang, Tao Zhang, Rahul Gupta · PDF
From Tokens to Semantics: The Emergence and Stabilization of Polysemanticity in Language Models
Sharvil Limaye, Aniruddhan Ramesh, Aiden Zhou, Akshay Bhaskar, Jonas Rohweder, Ashwinee Panda, Vasu Sharma · PDF
Function Induction and Task Generalization: An Interpretability Study with Off-by-One Addition
Qinyuan Ye, Robin Jia, Xiang Ren · PDF
Group Equivariance Meets Mechanistic Interpretability: Equivariant Sparse Autoencoders
Ege Erdogan, Ana Lucic · PDF
Hallucination Reduction with CASAL: Contrastive Activation Steering for Amortized Learning
Wannan Yang, Xinchi Qiu, Lei Yu, Yuchen Zhang, Aobo Yang, Narine Kokhlikyan, Nicola Cancedda, Diego Garcia-Olano · PDF
Head Pursuit: Probing Attention Specialization in Multimodal Transformers
Lorenzo Basile, Valentino Maiorca, Diego Doimo, Francesco Locatello, Alberto Cazzaniga · PDF
Higher Embedding Dimension Creates a Stronger World Model for a Simple Sorting Task
Brady Bhalla, Honglu Fan, Nancy Chen, Tony Yue YU · PDF
Higher-Order Component Attribution via Kolmogorov–Arnold Networks
Samy Mammeri, Christian Gagné · PDF
How does Mamba Perform Associative Recall? A Mechanistic Study
Grégoire LE CORRE, Ningyuan Huang, Alberto Bietti · PDF
Instruction Following by Boosting Attention of Large Language Models
Vitoria Guardieiro, Avishree Khare, Adam Stein, Eric Wong · PDF
InterpDetect: Interpretable Signals for Detecting Hallucinations in Retrieval-Augmented Generation
Likun Tan, Kuan-Wei Huang, Joy Shi, Kevin Wu · PDF
Interpretability at the Network Level: Prior-Guided Drift Diffusion for Neural Circuit Analysis
Tahereh Toosi · PDF
Interpretability for Time Series Transformers using A Concept Bottleneck Framework
Angela van Sprang, Erman Acar, Willem Zuidema · PDF
Interpretable Embeddings with Sparse Autoencoders: A Data Analysis Toolkit
Nicholas Jiang, Xiaoqing Sun, Lisa Dunlap, Lewis Smith, Neel Nanda · PDF
Interpreting Attention Heads for Image-to-Text Information Flow in Large Vision–Language Models
Jinyeong Kim, Seil Kang, Jiwoo Park, Junhyeok Kim, Seong Jae Hwang · PDF
Interpreting ResNet-based CLIP via Neuron-Attention Decomposition
Edmund Bu, Yossi Gandelsman · PDF
Interpreting Vision Grounding in Vision-Language Models: A Case Study in Coordinate Prediction
Clement Neo, Yongsen Zheng, Kwok-Yan Lam, Luke Ong · PDF
Iterative Inference in a Chess-Playing Neural Network
Elias Sandmann, Sebastian Lapuschkin, Wojciech Samek · PDF
Just-in-time and distributed task representations in language models
Yuxuan Li, Declan Iain Campbell, Stephanie C.Y. Chan, Andrew Kyle Lampinen · PDF
Language Models use Lookbacks to Track Beliefs
Nikhil Prakash, Natalie Shapira, Arnab Sen Sharma, Christoph Riedl, Yonatan Belinkov, Tamar Rott Shaham, David Bau, Atticus Geiger · PDF
Latent Crystallographic Microscope: Probing the Emergent Crystallographic Knowledge in Large Language Models
Jingru Gan, Yanqiao Zhu, Wei Wang · PDF
Learned Structure in Cartridges: Keys as Shareable Routers in Self-Studied Representations
Mauri Diaz · PDF
Learning Interpretable Features in Audio Latent Spaces via Sparse Autoencoders
Nathan Paek, Yongyi Zang, Qihui Yang, Randal Leistikow · PDF
Learning to Steer: Input-dependent Steering for Multimodal LLMs
Jayneel Parekh, Pegah KHAYATAN, Mustafa Shukor, Arnaud Dapogny, Alasdair Newson, Matthieu Cord · PDF
LLM Pretraining with Continuous Concepts
Jihoon Tack, Jack Lanchantin, Jane Yu, Andrew Cohen, Ilia Kulikov, Janice Lan, Shibo Hao, Yuandong Tian, Jason E Weston, Xian Li · PDF
LLM Probing with Contrastive Eigenproblems: Improving Understanding and Applicability of CCS
Stefan F. Schouten, Peter Bloem · PDF
Localizing Reasoning Training-Induced Changes in Large Language Models
Max Klabunde, Florian Lemmerich · PDF
Looking into Black Box Code Language Models
Muhammad Umair Haider, Umar Farooq, A.B. Siddique, Mark Marron · PDF
Mapping Faithful Reasoning in Language Models
Jiazheng Li, Andreas Damianou, J Rosser, Jose Luis Redondo Garcia, Konstantina Palla · PDF
Measuring Sparse Autoencoder Feature Sensitivity
Claire Tian, Katherine Tian, Nathan Zixia Hu · PDF
Mechanistic Evaluation of Transformers and State-Space Models
Aryaman Arora, Neil Rathi, Nikil Roashan Selvam, Róbert Csordás, Dan Jurafsky, Christopher Potts · PDF
Mechanistic evidence that motif-gated domain recognition drives contact prediction in protein language models
Jatin Nainani, Bryn Marie Reimer, Connor Watts, David Jensen, Anna G. Green · PDF
Mechanistic Interpretability as Statistical Estimation: A Variance Analysis of EAP-IG
Maxime Méloux, François Portet, Maxime Peyrard · PDF
Mitigating Emergent Misalignment with Data Attribution
Louis Jaburi, Gonçalo Paulo, Stepan Shabalin, Lucia Quirke, Nora Belrose · PDF
Mitigating Sycophancy in Language Models via Sparse Activation Fusion and Multi-Layer Activation Steering
Pyae Phoo Min, Avigya Paudel, Naufal Adityo, Arthur Zhu, Andrew Rufail, Cole Blondin, Kevin Zhu, Sunishchal Dev, Sean O'Brien · PDF
Model Organisms for Emergent Misalignment
Edward Turner, Anna Soligo, Mia Taylor, Senthooran Rajamanoharan, Neel Nanda · PDF
Motifs in Attention Patterns of Large Language Models
Michael Ivanitskiy, Cecilia Diniz Behn, Samy Wu Fung · PDF
Multimodal Concept Bottleneck Models
Tongqing Shi, Ge Yan, Tuomas Oikarinen, Tsui-Wei Weng · PDF
Multiple Streams of Knowledge Retrieval: Enriching and Recalling in Transformers
Todd Nief, David Reber, Sean M. Richardson, Ari Holtzman · PDF
Narrow Finetuning Leaves Clearly Readable Traces in the Activation Differences
Julian Minder, Clément Dumas, Stewart Slocum, Helena Casademunt, Cameron Holmes, Robert West, Neel Nanda · PDF
Neurons Speak in Ranges: Breaking Free from Discrete Neuronal Attribution
Muhammad Umair Haider, Hammad Rizwan, Hassan Sajjad, Peizhong Ju, A.B. Siddique · PDF
nnterp: A Standardized Interface for Mechanistic Interpretability of Transformers
Clément Dumas · PDF
On the Geometry and Topology of Neural Circuits for Modular Addition
Gabriela Moisescu-Pareja, Gavin McCracken, Harley Wiltzer, Colin Daniels, Vincent Létourneau, Jonathan Love · PDF
On the Limits of Linear Representation Hypotheses in Large Language Models: A Dynamical Systems Analysis
Abhinav Muraleedharan · PDF
Open-Vocabulary Natural-Language Explanations of LLM Activations via Soft Prompts
Bart Bussmann · PDF
OpenMAIA: a Multimodal Automated Interpretability Agent based on open-source models
Josep Lopez Camuñas, Christy Li, Tamar Rott Shaham, Antonio Torralba, Agata Lapedriza · PDF
Overcoming Sparsity Artifacts in Crosscoders to Interpret Chat-Tuning
Clément Dumas, Julian Minder, Caden Juang, Bilal Chughtai, Neel Nanda · PDF
Path Channels and Plan Extension Kernels: a Mechanistic Description of Planning in a Sokoban RNN
Mohammad Taufeeque, Aaron David Tucker, Adam Gleave, Adrià Garriga-Alonso · PDF
Physics Steering: Causal Control of Cross-Domain Concepts in a Physics Foundation Model
Rio Alexa Fear, Payel Mukhopadhyay, Michael McCabe, Alberto Bietti, Miles Cranmer · PDF
PII Jailbreaking in LLMs via Activation Steering Reveals Personal Information Leakage
Krishna Kanth Nakka, Dmitrii Usynin, Xue Jiang, Xuebing Zhou · PDF
Pinpointing Attention-Causal Communication in Language Models
Gabriel Franco, Mark Crovella · PDF
Position: Mechanistic Interpretability Should Prioritize Feature Consistency in SAEs
Xiangchen Song, Aashiq Muhamed, Yujia Zheng, Lingjing Kong, Zeyu Tang, Mona T. Diab, Virginia Smith, Kun Zhang · PDF
Predicting Weak-to-Strong Generalization from Latent Representations
Ben Wilop, Christian Schroeder de Witt, Yarin Gal, Philip Torr, Constantin Venhoff · PDF
Probing by Analogy: Decomposing Probes into Activations for Better Interpretability and Inter-Model Generalization
Patrick Leask, Noura Al Moubayed · PDF
Quiet Feature Learning in Algorithmic Tasks
Prudhviraj Naidu, Zixian Wang, Leon Bergen, Ramamohan Paturi · PDF
Rank-1 LoRAs Encode Interpretable Reasoning Signals
Jake Ward, Paul M. Riechers, Adam Shai · PDF
ReFIne: A Framework for Trustworthy Large Reasoning Models with Reliability, Faithfulness, and Interpretability
Chung-En Sun, Ge Yan, Akshay R. Kulkarni, Tsui-Wei Weng · PDF
ReflCtrl: Controlling LLM Reflection via Representation Engineering
Ge Yan, Chung-En Sun, Tsui-Wei Weng · PDF
RelP: Faithful and Efficient Circuit Discovery in Language Models via Relevance Patching
Farnoush Rezaei Jafari, Oliver Eberle, Ashkan Khakzar, Neel Nanda · PDF
Representation Similarity Reveals Implicit Layer Grouping in Neural Networks
Tian Gao, Amit Dhurandhar, Karthikeyan Natesan Ramamurthy, Dennis Wei · PDF
Rethinking Crowd-Sourced Evaluation of Neuron Explanations
Tuomas Oikarinen, Ge Yan, Akshay R. Kulkarni, Tsui-Wei Weng · PDF
Rethinking Sparse Autoencoders: Select-and-Project for Fairness and Control from Encoder Features Alone
Antonio Barbalau, Cristian Daniel Paduraru, Teodor Poncu, Alexandru Tifrea, Elena Burceanu · PDF
Reverse Engineering a Stateful Reasoning Circuit
Akshit Kumar, Dipti Sharma, Parameswari Krishnamurthy · PDF
Reverse-Engineering Memory in DreamerV3: From Sparse Representations to Functional Circuits
Jan Sobotka, Auke Ijspeert, Guillaume Bellegarda · PDF
RippleBench: Capturing Ripple Effects by Leveraging Existing Knowledge Repositories
Roy Rinberg, Usha Bhalla, Igor Shilov, Rohit Gandikota · PDF
RL-Obfuscation: Can Language Models Learn to Evade Latent-Space Monitors?
Rohan Gupta, Erik Jenner · PDF
Robustly Improving LLM Fairness in Realistic Settings via Interpretability
Adam Karvonen, Samuel Marks · PDF
SAE-ception: Iteratively Using Sparse Autoencoders as a Training Signal
Alex Bishka · PDF
Same Task, Different Circuits: Disentangling Modality-Specific Mechanisms in VLMs
Yaniv Nikankin, Dana Arad, Yossi Gandelsman, Yonatan Belinkov · PDF
Scratchpad Thinking: Alternation Between Storage and Computation in Latent Reasoning Models
Sayam Goyal, Brad Peters, María Emilia Granda, Akshath Vijayakumar Narmadha, Dharunish Yugeswardeenoo, Callum Stuart McDougall, Sean O'Brien, Ashwinee Panda, Kevin Zhu, Cole Blondin · PDF
Shared Memorization Structures in Transformers Revealed by Loss Curvature
Jack Merullo, Srihita Vatsavaya, Owen Lewis · PDF
Shared Parameter Subspaces and Cross-Task Linearity in Emergently Misaligned Behaviour
Daniel Aarao Reis Arturi, Eric Zhang, Andrew Adrian Ansah, Kevin Zhu, Ashwinee Panda, Aishwarya Balwani · PDF
Signal in the Noise: Polysemantic Interference Transfers and Predicts Cross-Model Influence
Bofan Gong, Shiyang Lai, Dawn Song · PDF
Small Vectors, Big Effects: A Mechanistic Study of RL-Induced Reasoning via Steering Vectors
Viacheslav Sinii, Nikita Balagansky, Yaroslav Aksenov, Vadim Kurochkin, Daniil Laptev, Alexey Gorbatovski, Boris Shaposhnikov, Daniil Gavrilov · PDF
Some Attention is All You Need for Retrieval
Felix Michalak, Steven Abreu · PDF
Sparse Autoencoders Trained on the Same Data Learn Different Features
Gonçalo Paulo, Nora Belrose · PDF
Sparse but Wrong: Incorrect L0 Leads to Incorrect Features in Sparse Autoencoders
David Chanin, Adrià Garriga-Alonso · PDF
Spectral Dynamics in Neural Network Training: Mathematical Foundations for Understanding Representational Development
Brian Richard Olsen, Sam Fatehmanesh, Frank Xiao, Adarsh Kumarappan, Anirudh Gajula · PDF
Steering Evaluation-Aware Language Models to Act Like They Are Deployed
Tim Tian Hua, Andrew Qin, Samuel Marks, Neel Nanda · PDF
Steering Out-of-Distribution Generalization with Concept Ablation Fine-Tuning
Helena Casademunt, Caden Juang, Adam Karvonen, Samuel Marks, Senthooran Rajamanoharan, Neel Nanda · PDF
Stream: Scaling up Mechanistic Interpretability to Long Context in LLMs via Sparse Attention
J Rosser, Jose Luis Redondo Garcia, Gustavo Penha, Konstantina Palla, Hugues Bouchard · PDF
SuperActivators: Only the Tail of the Distribution Contains Reliable Concept Signals
Cassandra Goldberg, Chaehyeon Kim, Adam Stein, Eric Wong · PDF
Superposition in Mixture of Experts
Marmik Chaudhari, Jeremi Nuer, Rome Thorstenson · PDF
Symbolic Policy Distillation for Interpretable Reinforcement Learning
Peilang Li, Umer Siddique, Yongcan Cao · PDF
Symbolic vs. Continuous Features in Transformers: A Digital Communication System's Explanation
Kan Deng · PDF
The Anatomy of Alignment: Decomposing Preference Optimization by Steering Sparse Features
Jeremias Lino Ferrao, Matthijs van der Lende, Ilija Lichkovski, Clement Neo · PDF
The Dual Power of Interpretable Token Embeddings: Jailbreaking Attacks and Defenses for Diffusion Model Unlearning
Siyi Chen, Yimeng Zhang, Sijia Liu, Qing Qu · PDF
The Geometry of Self-Verification in a Task-Specific Reasoning Model
Andrew Lee, Lihao Sun, Chris Wendler, Fernanda Viégas, Martin Wattenberg · PDF
The Impossibility of Inverse Permutation Learning in Transformer Models
Rohan Alur, Chris Hays, Manish Raghavan, Devavrat Shah · PDF
Thinking Hard, Going Misaligned: Emergent Misalignment in LLMs
Hanqi Yan, Hainiu Xu, Yulan He · PDF
Thought Anchors: Which LLM Reasoning Steps Matter?
Paul C. Bogdan, Uzay Macar, Neel Nanda, Arthur Conmy · PDF
Thought Branches: Interpreting LLM Reasoning Requires Resampling
Uzay Macar, Paul C. Bogdan, Senthooran Rajamanoharan, Neel Nanda · PDF
Three Desiderata for Faithfulness in Machine Learning Explanations: The Case for Causal Abstraction
Mette Friis Andersen, Maria Heuss, Ana Lucic · PDF
Token Entanglement in Subliminal Learning
Amir Zur, Zhuofan Ying, Alexander Russell Loftus, Kerem Şahin, Steven Yu, Lucia Quirke, Tamar Rott Shaham, Natalie Shapira, Hadas Orgad, David Bau · PDF
TopKLoRA
Marek Masiak, Lukas Vierling, Christian Schroeder de Witt, Nicola Cancedda, Constantin Venhoff · PDF
Toward a Theory of Generalizability in LLM Mechanistic Interpretability Research
Sean Trott · PDF
Towards a Mechanistic Understanding of Robustness in Finetuned Reasoning Models
Aashiq Muhamed, Xuandong Zhao, Mona T. Diab, Virginia Smith, Dawn Song · PDF
Towards Understanding Multimodal Fine-Tuning: A Case Study into Spatial Features
Lachin Naghashyar, Hunar Batra, Ashkan Khakzar, Philip Torr, Ronald Clark, Christian Schroeder de Witt, Constantin Venhoff · PDF
Towards Understanding the Nature of Attention with Low-Rank Sparse Decomposition
Zhengfu He, Junxuan Wang, Rui Lin, Xuyang Ge, Wentao Shu, Qiong Tang, Junping Zhang, Xipeng Qiu · PDF
Training Reliable Activation Probes With a Handful of Positive Examples
Riya Tyagi, Stefan Heimersheim · PDF
Transformers Don’t Need LayerNorm at Inference Time: Scaling LayerNorm Removal to GPT-2 XL and Implications for Mechanistic Interpretability
Luca Baroni, Galvin Khara, Joachim Schaeffer, Marat Subkhankulov, Stefan Heimersheim · PDF
Trilemma of Truth in Large Language Models
Germans Savcisens, Tina Eliassi-Rad · PDF
Uncovering Object Localization Mechanisms in VLMs
Timothy Schaumlöffel, Martina G. Vilas, Gemma Roig · PDF
Understanding sparse autoencoder scaling in the presence of feature manifolds
Eric J Michaud, Liv Gorton, Tom McGrath · PDF
Universal Neurons in GPT-2: Emergence, Persistence, and Functional Impact
Advey Nandan, Cheng-Ting Chou, Amrit Kurakula, Cole Blondin, Kevin Zhu, Vasu Sharma, Sean O'Brien · PDF
Unsupervised decoding of encoded reasoning using language model interpretability
Ching Fang, Samuel Marks · PDF
Unveiling the Latent Directions of Reflection in Large Language Models
Fu-Chieh Chang, Yu-Ting Lee, Pei-Yuan Wu · PDF
Vector Arithmetic in Concept and Token Subspaces
Sheridan Feucht, Byron C Wallace, David Bau · PDF
Visual Exploration of Feature Relationships in Sparse Autoencoders with Curated Concepts
Xinyuan Yan, Shusen Liu, Kowshik Thopalli, Bei Wang · PDF
WASP: A Weight-Space Approach to Detecting Learned Spuriousness
Cristian Daniel Paduraru, Antonio Barbalau, Radu Filipescu, Andrei Liviu Nicolicioiu, Elena Burceanu · PDF
Watch the Weights: Unsupervised monitoring and control of fine-tuned LLMs
Ziqian Zhong, Aditi Raghunathan · PDF
What Affects the Effective Depth of Large Language Models?
Yi Hu, Cai Zhou, Muhan Zhang · PDF
What Do Refusal Tokens Learn? Fine-Grained Representations and Evidence for Downstream Steering
Rishab Alagharu, Ishneet Sukhvinder Singh, Anjali Batta, Jaelyn S. Liang, Shaibi Shamsudeen, Arnav Sheth, Kevin Zhu, Ashwinee Panda, Zhen Wu · PDF
When seeing Overrides Knowing: Disentangling Knowledge Conflicts in Vision-Language Models
Francesco Ortu, Zhijing Jin, Diego Doimo, Alberto Cazzaniga · PDF
Where's the Bug? Attention Probing for Scalable Fault Localization
Adam Stein, Arthur Wayne, Aaditya Naik, Mayur Naik, Eric Wong · PDF
Who is In Charge? Dissecting Role Conflicts in LLM Instruction Following
Siqi Zeng · PDF

Accepted papers (187)

☆Activation Steering in Generative Settings via Contrastive Causal Mediation Analysis

☆Activation Transport Operators

☆Adaptive Task Vectors for Large Language Models

☆Adversarial Attacks Leverage Interference Between Features in Superposition

☆Adversarial Examples Are Not Bugs, They Are Superposition

☆Analyzing and Improving Chain-of-Thought Monitorability Through Information Theory

☆Angular Steering: Behavior Control via Rotation in Activation Space

☆Attention Layers Add Into Low-Dimensional Residual Subspaces

☆Attention Pattern Discovery at Scale

☆Attributing Response to Context: A Jensen–Shannon Divergence Driven Mechanistic Study of Context Attribution in Retrieval-Augmented Generation

☆Automated Detection of Visual Attribute Reliance with a Self-Reflective Agent

☆Automatically Finding Rule-Based Neurons in OthelloGPT

☆Base Models Know How to Reason, Thinking Models Learn When

☆Better Hessians Matter: Studying the Impact of Curvature Approximations in Influence Functions

☆Better World Models Can Lead to Better Post-Training Performance

☆Beyond the Black Box: Identifiable Interpretation and Control in Generative Models via Causal Minimality

☆Beyond the Final Layer: Intermediate Representations for Better Multilingual Calibration in Large Language Models

☆Bilinear Convolution Decomposition for Causal RL Interpretability

☆Bimodality of Sparse Autoencoder Features is Still There and Can Be Fixed

☆Breaking the Mirror: Activation-Based Mitigation of Self-Preference in LLM Evaluators

☆Can Interpretation Predict Behavior on Unseen Data?

☆Capturing Polysemanticity with PRISM: A Multi-Concept Feature Description Framework

☆Causal Discovery and Inference through Next-Token Prediction

☆Centroid Affinity: How Deep Networks Represent Features

☆Circuit-Tracer: A New Library for Finding Feature Circuits

☆Comparing Clinical and General LLMs on Knowledge Boundaries and Robustness

☆Composable Sparse Subnetworks via Maximum-Entropy Principle

☆Compressed Computation is (probably) not Computation in Superposition

☆Compressed Computation: Dense Circuits in a Toy Model of the Universal-AND Problem

☆Context Matters: Analyzing the Generalizability of Linear Probing and Steering Across Diverse Scenarios

☆ContextBench: Modifying Contexts for Targeted Latent Activation and Behaviour Elicitation

☆Control and Predictivity in Neural Interpretability

☆Controlling Vision–Language–Action Policies through Sparse Latent Directions

☆Convergent Linear Representations of Emergent Misalignment

☆Correlations in the Data Lead to Semantically Rich Feature Geometry Under Superposition

☆Cross-Architecture Model Diffing with Crosscoders: Unsupervised Discovery of Differences Between LLMs

Death by a Thousand Directions: Exploring the Geometry of Harmfulness in LLMs through Subconcept Probing

Decomposing Attention To Find Context-Sensitive Neurons

Decomposing Representation Space into Interpretable Subspaces with Unsupervised Learning

Decomposition of Small Transformer Models

Demystifying Cipher-Following in Large Language Models via Activation Analysis

Dense SAE Latents Are Features, Not Bugs

Detecting and Characterizing Planning in Language Models

Detecting Motivated Reasoning in the Internal Representations of Language Models

Discovering Transformer Circuits via a Hybrid Attribution and Pruning Framework

Do Natural Language Descriptions of Model Activations Convey Privileged Information?

Do We Always Need Sampling? Eliciting Numerical Predictive Distributions of LLMs Without Auto-Regression

Does FLUX Know What It’s Writing?

Don't Believe the Belief Hype!

Dual Mechanisms of Value Expression: Decomposing Intrinsic and Prompted Values in Language Models

Dyslexify: A Mechanistic Defense Against Typographic Attacks in CLIP

Eliciting Secret Knowledge from Language Models

Emergence of Linear Truth Encodings in Language Models

Emergent Specialization: Rare Token Neurons in Language Models

Emergent World Beliefs: Exploring Transformers in Stochastic Games

Emerging Human-like Strategies for Semantic Memory Foraging in Large Language Models

Enforcing Orderedness in SAEs to Improve Feature Consistency

Entity Multiplexing Through Activation Strength: Understanding goals in A Maze Solving Agent

Equivalent Linear Mappings of Large Language Models

Evaluating Explanatory Evaluations: An Explanatory Virtues Framework for Mechanistic Interpretability

Evaluating SAE interpretability without explanations

Faithful and Stable Neuron Explanations for Trustworthy Mechanistic Interpretability

False Sense of Security: Why Probing-based Malicious Input Detection Fails to Generalize

Feature Hedging: Correlated Features Break Narrow Sparse Autoencoders

Feature interactions in sparse crosscoders from compact proofs

Feature-Guided SAE Steering for Refusal-Rate Control using Contrasting Prompts

Finding Manifolds with Bilinear Autoencoders

Fluid Reasoning Representations

From Black-box to Causal-box: Towards Building More Interpretable Models

From Indirect Object Identification to Syllogisms: Exploring Binary Mechanisms in Transformer Circuits

From Local to Contextually-Enriched Local Representations: A Mechanism for Holistic Processing in DINOv2 ViTs

From Narrow Unlearning to Emergent Misalignment: Causes, Consequences, and Containment in LLMs

From Tokens to Semantics: The Emergence and Stabilization of Polysemanticity in Language Models

Function Induction and Task Generalization: An Interpretability Study with Off-by-One Addition

Group Equivariance Meets Mechanistic Interpretability: Equivariant Sparse Autoencoders

Hallucination Reduction with CASAL: Contrastive Activation Steering for Amortized Learning

Head Pursuit: Probing Attention Specialization in Multimodal Transformers

Higher Embedding Dimension Creates a Stronger World Model for a Simple Sorting Task

Higher-Order Component Attribution via Kolmogorov–Arnold Networks

How does Mamba Perform Associative Recall? A Mechanistic Study