Mechanistic Interpretability Workshop at NeurIPS 2025

Submission deadline: Aug 23, 2025, 11:59 UTC (imported from OpenReview; check the website for extensions)
Submission portal: OpenReview
Notes: Auto-imported from the OpenReview venue record on 2026-06-10; please verify and enrich (topics are keyword-guessed).

Accepted papers (187)

Fetched from OpenReview (v2) on 2026-06-10.

  1. Activation Steering in Generative Settings via Contrastive Causal Mediation Analysis

    Aruna Sankaranarayanan, Amir Zur, Atticus Geiger, Dylan Hadfield-Menell · PDF
  2. Activation Transport Operators

    Andrzej Szablewski, Marek Masiak · PDF
  3. Adaptive Task Vectors for Large Language Models

    Joonseong Kang, Soojeong Lee, Sumin Park, Subeen Park, Taero Kim, Jihee Kim, Ryunyi LEE, Kyungwoo Song · PDF
  4. Adversarial Attacks Leverage Interference Between Features in Superposition

    Edward Stevinson, Lucas Prieto, Melih Barsbey, Tolga Birdal · PDF
  5. Adversarial Examples Are Not Bugs, They Are Superposition

    Liv Gorton, Owen Lewis · PDF
  6. Analyzing and Improving Chain-of-Thought Monitorability Through Information Theory

    Usman Anwar, Tim Bakker, Dana Kianfar, Cristina Pinneri, Christos Louizos · PDF
  7. Angular Steering: Behavior Control via Rotation in Activation Space

    Hieu M. Vu, Tan Minh Nguyen · PDF
  8. Attention Layers Add Into Low-Dimensional Residual Subspaces

    Junxuan Wang, Xuyang Ge, Wentao Shu, Zhengfu He, Xipeng Qiu · PDF
  9. Attention Pattern Discovery at Scale

    Jonathan Katzy, Razvan Mihai Popescu, Erik Mekkes, Arie van Deursen, Maliheh Izadi · PDF
  10. Attributing Response to Context: A Jensen–Shannon Divergence Driven Mechanistic Study of Context Attribution in Retrieval-Augmented Generation

    Ruizhe Li, Chen Chen, Yuchen Hu, Yanjun Gao, Xi Wang, Emine Yilmaz · PDF
  11. Automated Detection of Visual Attribute Reliance with a Self-Reflective Agent

    Christy Li, Josep Lopez Camuñas, Jake Thomas Touchet, Jacob Andreas, Agata Lapedriza, Antonio Torralba, Tamar Rott Shaham · PDF
  12. Automatically Finding Rule-Based Neurons in OthelloGPT

    Aditya Singh, Zihang Wen, Srujananjali Medicherla, Adam Karvonen, Can Rager · PDF
  13. Base Models Know How to Reason, Thinking Models Learn When

    Constantin Venhoff, Iván Arcuschin, Philip Torr, Arthur Conmy, Neel Nanda · PDF
  14. Better Hessians Matter: Studying the Impact of Curvature Approximations in Influence Functions

    Dat Minh Hong, Bruno Kacper Mlodozeniec, Runa Eschenhagen, Richard E. Turner · PDF
  15. Better World Models Can Lead to Better Post-Training Performance

    Prakhar Gupta, Henry Conklin, Sarah-Jane Leslie, Andrew Lee · PDF
  16. Beyond the Black Box: Identifiable Interpretation and Control in Generative Models via Causal Minimality

    Lingjing Kong, Shaoan Xie, Guangyi Chen, Yuewen Sun, Xiangchen Song, Eric P. Xing, Kun Zhang · PDF
  17. Beyond the Final Layer: Intermediate Representations for Better Multilingual Calibration in Large Language Models

    Ej Zhou, Caiqi Zhang, Tiancheng Hu, Chengzu Li, Nigel Collier, Ivan Vulić, Anna Korhonen · PDF
  18. Bilinear Convolution Decomposition for Causal RL Interpretability

    Sinem Erisken, Alice Rigg, Narmeen Fatimah Oozeer · PDF
  19. Bimodality of Sparse Autoencoder Features is Still There and Can Be Fixed

    Michał Brzozowski · PDF
  20. Breaking the Mirror: Activation-Based Mitigation of Self-Preference in LLM Evaluators

    Dani Roytburg, Matthew Nguyen, Matthew Bozoukov, Jou Barzdukas, Hongyu Fu, Narmeen Fatimah Oozeer · PDF
  21. Can Interpretation Predict Behavior on Unseen Data?

    Victoria R Li, Jenny Kaufmann, Martin Wattenberg, David Alvarez-Melis, Naomi Saphra · PDF
  22. Capturing Polysemanticity with PRISM: A Multi-Concept Feature Description Framework

    Laura Kopf, Nils Feldhus, Kirill Bykov, Philine Lou Bommer, Anna Hedström, Marina MC Höhne, Oliver Eberle · PDF
  23. Causal Discovery and Inference through Next-Token Prediction

    Eivinas Butkus, Nikolaus Kriegeskorte · PDF
  24. Centroid Affinity: How Deep Networks Represent Features

    Thomas Walker, Ahmed Imtiaz Humayun, Randall Balestriero, Richard Baraniuk · PDF
  25. Circuit-Tracer: A New Library for Finding Feature Circuits

    Michael Hanna, Mateusz Piotrowski, Jack Lindsey, Emmanuel Ameisen · PDF
  26. Comparing Clinical and General LLMs on Knowledge Boundaries and Robustness

    Xingmeng Zhao, Ke Yang, Anthony Rios · PDF
  27. Composable Sparse Subnetworks via Maximum-Entropy Principle

    Francesco Caso, Samuele Fonio, Nicola Saccomanno, Simone Monaco, Fabrizio Silvestri · PDF
  28. Compressed Computation is (probably) not Computation in Superposition

    Jai Bhagat, Sara Molas-Medina, Giorgi Giglemiani, Stefan Heimersheim · PDF
  29. Compressed Computation: Dense Circuits in a Toy Model of the Universal-AND Problem

    Adam Newgas · PDF
  30. Context Matters: Analyzing the Generalizability of Linear Probing and Steering Across Diverse Scenarios

    Isha Agarwal, Saharsha Navani, Fazl Barez · PDF
  31. ContextBench: Modifying Contexts for Targeted Latent Activation and Behaviour Elicitation

    Robert Graham, Edward Stevinson, Leo Richter, Alexander Chia, Joseph Miller, Joseph Isaac Bloom · PDF
  32. Control and Predictivity in Neural Interpretability

    Satchel Grant, Alexa R. Tartaglini · PDF
  33. Controlling Vision–Language–Action Policies through Sparse Latent Directions

    Momin Ahmad Khan, Novak Boskov, Fatima M. Anwar, Manzoor A. Khan · PDF
  34. Convergent Linear Representations of Emergent Misalignment

    Anna Soligo, Edward Turner, Senthooran Rajamanoharan, Neel Nanda · PDF
  35. Correlations in the Data Lead to Semantically Rich Feature Geometry Under Superposition

    Lucas Prieto, Edward Stevinson, Melih Barsbey, Tolga Birdal, Pedro A. M. Mediano · PDF
  36. Cross-Architecture Model Diffing with Crosscoders: Unsupervised Discovery of Differences Between LLMs

    Thomas Jiralerspong, Trenton Bricken · PDF
  37. Death by a Thousand Directions: Exploring the Geometry of Harmfulness in LLMs through Subconcept Probing

    McNair Shah, Saleena Angeline Sartawita, Adhitya Rajendra Kumar, Naitik Chheda, Will Cai, Kevin Zhu, Sean O'Brien, Vasu Sharma · PDF
  38. Decomposing Attention To Find Context-Sensitive Neurons

    Alex Gibson · PDF
  39. Decomposing Representation Space into Interpretable Subspaces with Unsupervised Learning

    Xinting Huang, Michael Hahn · PDF
  40. Decomposition of Small Transformer Models

    Casper L. Christensen, Logan Riggs Smith · PDF
  41. Demystifying Cipher-Following in Large Language Models via Activation Analysis

    Megan Gross, Yigitcan Kaya, Christopher Kruegel, Giovanni Vigna · PDF
  42. Dense SAE Latents Are Features, Not Bugs

    Xiaoqing Sun, Alessandro Stolfo, Joshua Engels, Ben Peng Wu, Senthooran Rajamanoharan, Mrinmaya Sachan, Max Tegmark · PDF
  43. Detecting and Characterizing Planning in Language Models

    Jatin Nainani, Sankaran Vaidyanathan, Connor Watts, Andre N. Assis, Alice Rigg · PDF
  44. Detecting Motivated Reasoning in the Internal Representations of Language Models

    Parsa Mirtaheri, Mikhail Belkin · PDF
  45. Discovering Transformer Circuits via a Hybrid Attribution and Pruning Framework

    Hao Gu, Vibhas Nair, Amrithaa Ashok Kumar, Ryan Lagasse, Kevin Zhu, Sean O'Brien, Ashwinee Panda · PDF
  46. Do Natural Language Descriptions of Model Activations Convey Privileged Information?

    Millicent Li, Alberto Mario Ceballos Arroyo, Giordano Rogers, Naomi Saphra, Byron C Wallace · PDF
  47. Do We Always Need Sampling? Eliciting Numerical Predictive Distributions of LLMs Without Auto-Regression

    Julianna Piskorz, Kasia Kobalczyk, Mihaela van der Schaar · PDF
  48. Does FLUX Know What It’s Writing?

    Adrian Chang, Sheridan Feucht, Byron C Wallace, David Bau · PDF
  49. Don't Believe the Belief Hype!

    Alessandro Corona Mendozza · PDF
  50. Dual Mechanisms of Value Expression: Decomposing Intrinsic and Prompted Values in Language Models

    Jongwook Han, Jongwon Lim, Injin Kong, Yohan Jo · PDF
  51. Dyslexify: A Mechanistic Defense Against Typographic Attacks in CLIP

    Lorenz Hufe, Constantin Venhoff, Maximilian Dreyer, Erblina Purelku, Sebastian Lapuschkin, Wojciech Samek · PDF
  52. Eliciting Secret Knowledge from Language Models

    Bartosz Cywiński, Emil Ryd, Rowan Wang, Senthooran Rajamanoharan, Neel Nanda, Arthur Conmy, Samuel Marks · PDF
  53. Emergence of Linear Truth Encodings in Language Models

    Shauli Ravfogel, Gilad Yehudai, Tal Linzen, Joan Bruna, Alberto Bietti · PDF
  54. Emergent Specialization: Rare Token Neurons in Language Models

    Jing Liu, Yueheng Li, Haozheng Wang · PDF
  55. Emergent World Beliefs: Exploring Transformers in Stochastic Games

    Adam Kamel, Tanish Rastogi, Michael Ma, Kailash Ranganathan, Kevin Zhu · PDF
  56. Emerging Human-like Strategies for Semantic Memory Foraging in Large Language Models

    Eric Lacosse, Mariana Duarte, Peter Todd, Daniel C McNamee · PDF
  57. Enforcing Orderedness in SAEs to Improve Feature Consistency

    Sophie L. Wang, Alex Quach, Nithin Parsan, John Jingxuan Yang · PDF
  58. Entity Multiplexing Through Activation Strength: Understanding goals in A Maze Solving Agent

    Benjamin Sturgeon, Jonathan P. Shock · PDF
  59. Equivalent Linear Mappings of Large Language Models

    James Robert Golden · PDF
  60. Evaluating Explanatory Evaluations: An Explanatory Virtues Framework for Mechanistic Interpretability

    Kola Ayonrinde, Louis Jaburi · PDF
  61. Evaluating SAE interpretability without explanations

    Gonçalo Paulo, Nora Belrose · PDF
  62. Faithful and Stable Neuron Explanations for Trustworthy Mechanistic Interpretability

    Ge Yan, Tuomas Oikarinen, Tsui-Wei Weng · PDF
  63. False Sense of Security: Why Probing-based Malicious Input Detection Fails to Generalize

    Cheng Wang, Zeming Wei, Qin Liu, Wenxuan Zhou, Muhao Chen · PDF
  64. Feature Hedging: Correlated Features Break Narrow Sparse Autoencoders

    David Chanin, Tomáš Dulka, Adrià Garriga-Alonso · PDF
  65. Feature interactions in sparse crosscoders from compact proofs

    Dmitry Manning-Coe, Thomas Read, Anna Soligo, Oliver Clive-Griffin, Chun Hei Yip, Alex Gibson, Rajashree Agrawal, Jason Gross · PDF
  66. Feature-Guided SAE Steering for Refusal-Rate Control using Contrasting Prompts

    Samaksh Bhargav, Zining Zhu · PDF
  67. Finding Manifolds with Bilinear Autoencoders

    Thomas Dooms, Ward Gauderis · PDF
  68. Fluid Reasoning Representations

    Dmitrii Kharlapenko, Alessandro Stolfo, Arthur Conmy, Mrinmaya Sachan, Zhijing Jin · PDF
  69. From Black-box to Causal-box: Towards Building More Interpretable Models

    Inwoo Hwang, Yushu Pan, Elias Bareinboim · PDF
  70. From Indirect Object Identification to Syllogisms: Exploring Binary Mechanisms in Transformer Circuits

    Karim Saraipour, Shichang Zhang · PDF
  71. From Local to Contextually-Enriched Local Representations: A Mechanism for Holistic Processing in DINOv2 ViTs

    Fenil R. Doshi, Thomas Fel, Talia Konkle, George A. Alvarez · PDF
  72. From Narrow Unlearning to Emergent Misalignment: Causes, Consequences, and Containment in LLMs

    Erum Mushtaq, Anil Ramakrishna, Satyapriya Krishna, Sattvik Sahai, Prasoon Goyal, Kai-Wei Chang, Tao Zhang, Rahul Gupta · PDF
  73. From Tokens to Semantics: The Emergence and Stabilization of Polysemanticity in Language Models

    Sharvil Limaye, Aniruddhan Ramesh, Aiden Zhou, Akshay Bhaskar, Jonas Rohweder, Ashwinee Panda, Vasu Sharma · PDF
  74. Function Induction and Task Generalization: An Interpretability Study with Off-by-One Addition

    Qinyuan Ye, Robin Jia, Xiang Ren · PDF
  75. Group Equivariance Meets Mechanistic Interpretability: Equivariant Sparse Autoencoders

    Ege Erdogan, Ana Lucic · PDF
  76. Hallucination Reduction with CASAL: Contrastive Activation Steering for Amortized Learning

    Wannan Yang, Xinchi Qiu, Lei Yu, Yuchen Zhang, Aobo Yang, Narine Kokhlikyan, Nicola Cancedda, Diego Garcia-Olano · PDF
  77. Head Pursuit: Probing Attention Specialization in Multimodal Transformers

    Lorenzo Basile, Valentino Maiorca, Diego Doimo, Francesco Locatello, Alberto Cazzaniga · PDF
  78. Higher Embedding Dimension Creates a Stronger World Model for a Simple Sorting Task

    Brady Bhalla, Honglu Fan, Nancy Chen, Tony Yue YU · PDF
  79. Higher-Order Component Attribution via Kolmogorov–Arnold Networks

    Samy Mammeri, Christian Gagné · PDF
  80. How does Mamba Perform Associative Recall? A Mechanistic Study

    Grégoire LE CORRE, Ningyuan Huang, Alberto Bietti · PDF
  81. Instruction Following by Boosting Attention of Large Language Models

    Vitoria Guardieiro, Avishree Khare, Adam Stein, Eric Wong · PDF
  82. InterpDetect: Interpretable Signals for Detecting Hallucinations in Retrieval-Augmented Generation

    Likun Tan, Kuan-Wei Huang, Joy Shi, Kevin Wu · PDF
  83. Interpretability at the Network Level: Prior-Guided Drift Diffusion for Neural Circuit Analysis

    Tahereh Toosi · PDF
  84. Interpretability for Time Series Transformers using A Concept Bottleneck Framework

    Angela van Sprang, Erman Acar, Willem Zuidema · PDF
  85. Interpretable Embeddings with Sparse Autoencoders: A Data Analysis Toolkit

    Nicholas Jiang, Xiaoqing Sun, Lisa Dunlap, Lewis Smith, Neel Nanda · PDF
  86. Interpreting Attention Heads for Image-to-Text Information Flow in Large Vision–Language Models

    Jinyeong Kim, Seil Kang, Jiwoo Park, Junhyeok Kim, Seong Jae Hwang · PDF
  87. Interpreting ResNet-based CLIP via Neuron-Attention Decomposition

    Edmund Bu, Yossi Gandelsman · PDF
  88. Interpreting Vision Grounding in Vision-Language Models: A Case Study in Coordinate Prediction

    Clement Neo, Yongsen Zheng, Kwok-Yan Lam, Luke Ong · PDF
  89. Iterative Inference in a Chess-Playing Neural Network

    Elias Sandmann, Sebastian Lapuschkin, Wojciech Samek · PDF
  90. Just-in-time and distributed task representations in language models

    Yuxuan Li, Declan Iain Campbell, Stephanie C.Y. Chan, Andrew Kyle Lampinen · PDF
  91. Language Models use Lookbacks to Track Beliefs

    Nikhil Prakash, Natalie Shapira, Arnab Sen Sharma, Christoph Riedl, Yonatan Belinkov, Tamar Rott Shaham, David Bau, Atticus Geiger · PDF
  92. Latent Crystallographic Microscope: Probing the Emergent Crystallographic Knowledge in Large Language Models

    Jingru Gan, Yanqiao Zhu, Wei Wang · PDF
  93. Learned Structure in Cartridges: Keys as Shareable Routers in Self-Studied Representations

    Mauri Diaz · PDF
  94. Learning Interpretable Features in Audio Latent Spaces via Sparse Autoencoders

    Nathan Paek, Yongyi Zang, Qihui Yang, Randal Leistikow · PDF
  95. Learning to Steer: Input-dependent Steering for Multimodal LLMs

    Jayneel Parekh, Pegah KHAYATAN, Mustafa Shukor, Arnaud Dapogny, Alasdair Newson, Matthieu Cord · PDF
  96. LLM Pretraining with Continuous Concepts

    Jihoon Tack, Jack Lanchantin, Jane Yu, Andrew Cohen, Ilia Kulikov, Janice Lan, Shibo Hao, Yuandong Tian, Jason E Weston, Xian Li · PDF
  97. LLM Probing with Contrastive Eigenproblems: Improving Understanding and Applicability of CCS

    Stefan F. Schouten, Peter Bloem · PDF
  98. Localizing Reasoning Training-Induced Changes in Large Language Models

    Max Klabunde, Florian Lemmerich · PDF
  99. Looking into Black Box Code Language Models

    Muhammad Umair Haider, Umar Farooq, A.B. Siddique, Mark Marron · PDF
  100. Mapping Faithful Reasoning in Language Models

    Jiazheng Li, Andreas Damianou, J Rosser, Jose Luis Redondo Garcia, Konstantina Palla · PDF
  101. Measuring Sparse Autoencoder Feature Sensitivity

    Claire Tian, Katherine Tian, Nathan Zixia Hu · PDF
  102. Mechanistic Evaluation of Transformers and State-Space Models

    Aryaman Arora, Neil Rathi, Nikil Roashan Selvam, Róbert Csordás, Dan Jurafsky, Christopher Potts · PDF
  103. Mechanistic evidence that motif-gated domain recognition drives contact prediction in protein language models

    Jatin Nainani, Bryn Marie Reimer, Connor Watts, David Jensen, Anna G. Green · PDF
  104. Mechanistic Interpretability as Statistical Estimation: A Variance Analysis of EAP-IG

    Maxime Méloux, François Portet, Maxime Peyrard · PDF
  105. Mitigating Emergent Misalignment with Data Attribution

    Louis Jaburi, Gonçalo Paulo, Stepan Shabalin, Lucia Quirke, Nora Belrose · PDF
  106. Mitigating Sycophancy in Language Models via Sparse Activation Fusion and Multi-Layer Activation Steering

    Pyae Phoo Min, Avigya Paudel, Naufal Adityo, Arthur Zhu, Andrew Rufail, Cole Blondin, Kevin Zhu, Sunishchal Dev, Sean O'Brien · PDF
  107. Model Organisms for Emergent Misalignment

    Edward Turner, Anna Soligo, Mia Taylor, Senthooran Rajamanoharan, Neel Nanda · PDF
  108. Motifs in Attention Patterns of Large Language Models

    Michael Ivanitskiy, Cecilia Diniz Behn, Samy Wu Fung · PDF
  109. Multimodal Concept Bottleneck Models

    Tongqing Shi, Ge Yan, Tuomas Oikarinen, Tsui-Wei Weng · PDF
  110. Multiple Streams of Knowledge Retrieval: Enriching and Recalling in Transformers

    Todd Nief, David Reber, Sean M. Richardson, Ari Holtzman · PDF
  111. Narrow Finetuning Leaves Clearly Readable Traces in the Activation Differences

    Julian Minder, Clément Dumas, Stewart Slocum, Helena Casademunt, Cameron Holmes, Robert West, Neel Nanda · PDF
  112. Neurons Speak in Ranges: Breaking Free from Discrete Neuronal Attribution

    Muhammad Umair Haider, Hammad Rizwan, Hassan Sajjad, Peizhong Ju, A.B. Siddique · PDF
  113. nnterp: A Standardized Interface for Mechanistic Interpretability of Transformers

    Clément Dumas · PDF
  114. On the Geometry and Topology of Neural Circuits for Modular Addition

    Gabriela Moisescu-Pareja, Gavin McCracken, Harley Wiltzer, Colin Daniels, Vincent Létourneau, Jonathan Love · PDF
  115. On the Limits of Linear Representation Hypotheses in Large Language Models: A Dynamical Systems Analysis

    Abhinav Muraleedharan · PDF
  116. Open-Vocabulary Natural-Language Explanations of LLM Activations via Soft Prompts

    Bart Bussmann · PDF
  117. OpenMAIA: a Multimodal Automated Interpretability Agent based on open-source models

    Josep Lopez Camuñas, Christy Li, Tamar Rott Shaham, Antonio Torralba, Agata Lapedriza · PDF
  118. Overcoming Sparsity Artifacts in Crosscoders to Interpret Chat-Tuning

    Clément Dumas, Julian Minder, Caden Juang, Bilal Chughtai, Neel Nanda · PDF
  119. Path Channels and Plan Extension Kernels: a Mechanistic Description of Planning in a Sokoban RNN

    Mohammad Taufeeque, Aaron David Tucker, Adam Gleave, Adrià Garriga-Alonso · PDF
  120. Physics Steering: Causal Control of Cross-Domain Concepts in a Physics Foundation Model

    Rio Alexa Fear, Payel Mukhopadhyay, Michael McCabe, Alberto Bietti, Miles Cranmer · PDF
  121. PII Jailbreaking in LLMs via Activation Steering Reveals Personal Information Leakage

    Krishna Kanth Nakka, Dmitrii Usynin, Xue Jiang, Xuebing Zhou · PDF
  122. Pinpointing Attention-Causal Communication in Language Models

    Gabriel Franco, Mark Crovella · PDF
  123. Position: Mechanistic Interpretability Should Prioritize Feature Consistency in SAEs

    Xiangchen Song, Aashiq Muhamed, Yujia Zheng, Lingjing Kong, Zeyu Tang, Mona T. Diab, Virginia Smith, Kun Zhang · PDF
  124. Predicting Weak-to-Strong Generalization from Latent Representations

    Ben Wilop, Christian Schroeder de Witt, Yarin Gal, Philip Torr, Constantin Venhoff · PDF
  125. Probing by Analogy: Decomposing Probes into Activations for Better Interpretability and Inter-Model Generalization

    Patrick Leask, Noura Al Moubayed · PDF
  126. Quiet Feature Learning in Algorithmic Tasks

    Prudhviraj Naidu, Zixian Wang, Leon Bergen, Ramamohan Paturi · PDF
  127. Rank-1 LoRAs Encode Interpretable Reasoning Signals

    Jake Ward, Paul M. Riechers, Adam Shai · PDF
  128. ReFIne: A Framework for Trustworthy Large Reasoning Models with Reliability, Faithfulness, and Interpretability

    Chung-En Sun, Ge Yan, Akshay R. Kulkarni, Tsui-Wei Weng · PDF
  129. ReflCtrl: Controlling LLM Reflection via Representation Engineering

    Ge Yan, Chung-En Sun, Tsui-Wei Weng · PDF
  130. RelP: Faithful and Efficient Circuit Discovery in Language Models via Relevance Patching

    Farnoush Rezaei Jafari, Oliver Eberle, Ashkan Khakzar, Neel Nanda · PDF
  131. Representation Similarity Reveals Implicit Layer Grouping in Neural Networks

    Tian Gao, Amit Dhurandhar, Karthikeyan Natesan Ramamurthy, Dennis Wei · PDF
  132. Rethinking Crowd-Sourced Evaluation of Neuron Explanations

    Tuomas Oikarinen, Ge Yan, Akshay R. Kulkarni, Tsui-Wei Weng · PDF
  133. Rethinking Sparse Autoencoders: Select-and-Project for Fairness and Control from Encoder Features Alone

    Antonio Barbalau, Cristian Daniel Paduraru, Teodor Poncu, Alexandru Tifrea, Elena Burceanu · PDF
  134. Reverse Engineering a Stateful Reasoning Circuit

    Akshit Kumar, Dipti Sharma, Parameswari Krishnamurthy · PDF
  135. Reverse-Engineering Memory in DreamerV3: From Sparse Representations to Functional Circuits

    Jan Sobotka, Auke Ijspeert, Guillaume Bellegarda · PDF
  136. RippleBench: Capturing Ripple Effects by Leveraging Existing Knowledge Repositories

    Roy Rinberg, Usha Bhalla, Igor Shilov, Rohit Gandikota · PDF
  137. RL-Obfuscation: Can Language Models Learn to Evade Latent-Space Monitors?

    Rohan Gupta, Erik Jenner · PDF
  138. Robustly Improving LLM Fairness in Realistic Settings via Interpretability

    Adam Karvonen, Samuel Marks · PDF
  139. SAE-ception: Iteratively Using Sparse Autoencoders as a Training Signal

    Alex Bishka · PDF
  140. Same Task, Different Circuits: Disentangling Modality-Specific Mechanisms in VLMs

    Yaniv Nikankin, Dana Arad, Yossi Gandelsman, Yonatan Belinkov · PDF
  141. Scratchpad Thinking: Alternation Between Storage and Computation in Latent Reasoning Models

    Sayam Goyal, Brad Peters, María Emilia Granda, Akshath Vijayakumar Narmadha, Dharunish Yugeswardeenoo, Callum Stuart McDougall, Sean O'Brien, Ashwinee Panda, Kevin Zhu, Cole Blondin · PDF
  142. Shared Memorization Structures in Transformers Revealed by Loss Curvature

    Jack Merullo, Srihita Vatsavaya, Owen Lewis · PDF
  143. Shared Parameter Subspaces and Cross-Task Linearity in Emergently Misaligned Behaviour

    Daniel Aarao Reis Arturi, Eric Zhang, Andrew Adrian Ansah, Kevin Zhu, Ashwinee Panda, Aishwarya Balwani · PDF
  144. Signal in the Noise: Polysemantic Interference Transfers and Predicts Cross-Model Influence

    Bofan Gong, Shiyang Lai, Dawn Song · PDF
  145. Small Vectors, Big Effects: A Mechanistic Study of RL-Induced Reasoning via Steering Vectors

    Viacheslav Sinii, Nikita Balagansky, Yaroslav Aksenov, Vadim Kurochkin, Daniil Laptev, Alexey Gorbatovski, Boris Shaposhnikov, Daniil Gavrilov · PDF
  146. Some Attention is All You Need for Retrieval

    Felix Michalak, Steven Abreu · PDF
  147. Sparse Autoencoders Trained on the Same Data Learn Different Features

    Gonçalo Paulo, Nora Belrose · PDF
  148. Sparse but Wrong: Incorrect L0 Leads to Incorrect Features in Sparse Autoencoders

    David Chanin, Adrià Garriga-Alonso · PDF
  149. Spectral Dynamics in Neural Network Training: Mathematical Foundations for Understanding Representational Development

    Brian Richard Olsen, Sam Fatehmanesh, Frank Xiao, Adarsh Kumarappan, Anirudh Gajula · PDF
  150. Steering Evaluation-Aware Language Models to Act Like They Are Deployed

    Tim Tian Hua, Andrew Qin, Samuel Marks, Neel Nanda · PDF
  151. Steering Out-of-Distribution Generalization with Concept Ablation Fine-Tuning

    Helena Casademunt, Caden Juang, Adam Karvonen, Samuel Marks, Senthooran Rajamanoharan, Neel Nanda · PDF
  152. Stream: Scaling up Mechanistic Interpretability to Long Context in LLMs via Sparse Attention

    J Rosser, Jose Luis Redondo Garcia, Gustavo Penha, Konstantina Palla, Hugues Bouchard · PDF
  153. SuperActivators: Only the Tail of the Distribution Contains Reliable Concept Signals

    Cassandra Goldberg, Chaehyeon Kim, Adam Stein, Eric Wong · PDF
  154. Superposition in Mixture of Experts

    Marmik Chaudhari, Jeremi Nuer, Rome Thorstenson · PDF
  155. Symbolic Policy Distillation for Interpretable Reinforcement Learning

    Peilang Li, Umer Siddique, Yongcan Cao · PDF
  156. Symbolic vs. Continuous Features in Transformers: A Digital Communication System's Explanation

    Kan Deng · PDF
  157. The Anatomy of Alignment: Decomposing Preference Optimization by Steering Sparse Features

    Jeremias Lino Ferrao, Matthijs van der Lende, Ilija Lichkovski, Clement Neo · PDF
  158. The Dual Power of Interpretable Token Embeddings: Jailbreaking Attacks and Defenses for Diffusion Model Unlearning

    Siyi Chen, Yimeng Zhang, Sijia Liu, Qing Qu · PDF
  159. The Geometry of Self-Verification in a Task-Specific Reasoning Model

    Andrew Lee, Lihao Sun, Chris Wendler, Fernanda Viégas, Martin Wattenberg · PDF
  160. The Impossibility of Inverse Permutation Learning in Transformer Models

    Rohan Alur, Chris Hays, Manish Raghavan, Devavrat Shah · PDF
  161. Thinking Hard, Going Misaligned: Emergent Misalignment in LLMs

    Hanqi Yan, Hainiu Xu, Yulan He · PDF
  162. Thought Anchors: Which LLM Reasoning Steps Matter?

    Paul C. Bogdan, Uzay Macar, Neel Nanda, Arthur Conmy · PDF
  163. Thought Branches: Interpreting LLM Reasoning Requires Resampling

    Uzay Macar, Paul C. Bogdan, Senthooran Rajamanoharan, Neel Nanda · PDF
  164. Three Desiderata for Faithfulness in Machine Learning Explanations: The Case for Causal Abstraction

    Mette Friis Andersen, Maria Heuss, Ana Lucic · PDF
  165. Token Entanglement in Subliminal Learning

    Amir Zur, Zhuofan Ying, Alexander Russell Loftus, Kerem Şahin, Steven Yu, Lucia Quirke, Tamar Rott Shaham, Natalie Shapira, Hadas Orgad, David Bau · PDF
  166. TopKLoRA

    Marek Masiak, Lukas Vierling, Christian Schroeder de Witt, Nicola Cancedda, Constantin Venhoff · PDF
  167. Toward a Theory of Generalizability in LLM Mechanistic Interpretability Research

    Sean Trott · PDF
  168. Towards a Mechanistic Understanding of Robustness in Finetuned Reasoning Models

    Aashiq Muhamed, Xuandong Zhao, Mona T. Diab, Virginia Smith, Dawn Song · PDF
  169. Towards Understanding Multimodal Fine-Tuning: A Case Study into Spatial Features

    Lachin Naghashyar, Hunar Batra, Ashkan Khakzar, Philip Torr, Ronald Clark, Christian Schroeder de Witt, Constantin Venhoff · PDF
  170. Towards Understanding the Nature of Attention with Low-Rank Sparse Decomposition

    Zhengfu He, Junxuan Wang, Rui Lin, Xuyang Ge, Wentao Shu, Qiong Tang, Junping Zhang, Xipeng Qiu · PDF
  171. Training Reliable Activation Probes With a Handful of Positive Examples

    Riya Tyagi, Stefan Heimersheim · PDF
  172. Transformers Don’t Need LayerNorm at Inference Time: Scaling LayerNorm Removal to GPT-2 XL and Implications for Mechanistic Interpretability

    Luca Baroni, Galvin Khara, Joachim Schaeffer, Marat Subkhankulov, Stefan Heimersheim · PDF
  173. Trilemma of Truth in Large Language Models

    Germans Savcisens, Tina Eliassi-Rad · PDF
  174. Uncovering Object Localization Mechanisms in VLMs

    Timothy Schaumlöffel, Martina G. Vilas, Gemma Roig · PDF
  175. Understanding sparse autoencoder scaling in the presence of feature manifolds

    Eric J Michaud, Liv Gorton, Tom McGrath · PDF
  176. Universal Neurons in GPT-2: Emergence, Persistence, and Functional Impact

    Advey Nandan, Cheng-Ting Chou, Amrit Kurakula, Cole Blondin, Kevin Zhu, Vasu Sharma, Sean O'Brien · PDF
  177. Unsupervised decoding of encoded reasoning using language model interpretability

    Ching Fang, Samuel Marks · PDF
  178. Unveiling the Latent Directions of Reflection in Large Language Models

    Fu-Chieh Chang, Yu-Ting Lee, Pei-Yuan Wu · PDF
  179. Vector Arithmetic in Concept and Token Subspaces

    Sheridan Feucht, Byron C Wallace, David Bau · PDF
  180. Visual Exploration of Feature Relationships in Sparse Autoencoders with Curated Concepts

    Xinyuan Yan, Shusen Liu, Kowshik Thopalli, Bei Wang · PDF
  181. WASP: A Weight-Space Approach to Detecting Learned Spuriousness

    Cristian Daniel Paduraru, Antonio Barbalau, Radu Filipescu, Andrei Liviu Nicolicioiu, Elena Burceanu · PDF
  182. Watch the Weights: Unsupervised monitoring and control of fine-tuned LLMs

    Ziqian Zhong, Aditi Raghunathan · PDF
  183. What Affects the Effective Depth of Large Language Models?

    Yi Hu, Cai Zhou, Muhan Zhang · PDF
  184. What Do Refusal Tokens Learn? Fine-Grained Representations and Evidence for Downstream Steering

    Rishab Alagharu, Ishneet Sukhvinder Singh, Anjali Batta, Jaelyn S. Liang, Shaibi Shamsudeen, Arnav Sheth, Kevin Zhu, Ashwinee Panda, Zhen Wu · PDF
  185. When seeing Overrides Knowing: Disentangling Knowledge Conflicts in Vision-Language Models

    Francesco Ortu, Zhijing Jin, Diego Doimo, Alberto Cazzaniga · PDF
  186. Where's the Bug? Attention Probing for Scalable Fault Localization

    Adam Stein, Arthur Wayne, Aaditya Naik, Mayur Naik, Eric Wong · PDF
  187. Who is In Charge? Dissecting Role Conflicts in LLM Instruction Following

    Siqi Zeng · PDF