Mechanistic Interpretability Workshop at NeurIPS 2025
Mech Interp Workshop (NeurIPS 2025)
- Submission deadline: Aug 23, 2025, 11:59 UTC (imported from OpenReview; check the website for extensions)
- Submission portal: OpenReview
- Notes: Auto-imported from the OpenReview venue record on 2026-06-10; please verify and enrich (topics are keyword-guessed).
Accepted papers (187)
Fetched from OpenReview (v2) on 2026-06-10.
- Activation Steering in Generative Settings via Contrastive Causal Mediation Analysis
- Activation Transport Operators
- Adaptive Task Vectors for Large Language Models
- Adversarial Attacks Leverage Interference Between Features in Superposition
- Adversarial Examples Are Not Bugs, They Are Superposition
- Analyzing and Improving Chain-of-Thought Monitorability Through Information Theory
- Angular Steering: Behavior Control via Rotation in Activation Space
- Attention Layers Add Into Low-Dimensional Residual Subspaces
- Attention Pattern Discovery at Scale
- Attributing Response to Context: A Jensen–Shannon Divergence Driven Mechanistic Study of Context Attribution in Retrieval-Augmented Generation
- Automated Detection of Visual Attribute Reliance with a Self-Reflective Agent
- Automatically Finding Rule-Based Neurons in OthelloGPT
- Base Models Know How to Reason, Thinking Models Learn When
- Better Hessians Matter: Studying the Impact of Curvature Approximations in Influence Functions
- Better World Models Can Lead to Better Post-Training Performance
- Beyond the Black Box: Identifiable Interpretation and Control in Generative Models via Causal Minimality
- Beyond the Final Layer: Intermediate Representations for Better Multilingual Calibration in Large Language Models
- Bilinear Convolution Decomposition for Causal RL Interpretability
- Bimodality of Sparse Autoencoder Features is Still There and Can Be Fixed
- Breaking the Mirror: Activation-Based Mitigation of Self-Preference in LLM Evaluators
- Can Interpretation Predict Behavior on Unseen Data?
- Capturing Polysemanticity with PRISM: A Multi-Concept Feature Description Framework
- Causal Discovery and Inference through Next-Token Prediction
- Centroid Affinity: How Deep Networks Represent Features
- Circuit-Tracer: A New Library for Finding Feature Circuits
- Comparing Clinical and General LLMs on Knowledge Boundaries and Robustness
- Composable Sparse Subnetworks via Maximum-Entropy Principle
- Compressed Computation is (probably) not Computation in Superposition
- Compressed Computation: Dense Circuits in a Toy Model of the Universal-AND Problem
- Context Matters: Analyzing the Generalizability of Linear Probing and Steering Across Diverse Scenarios
- ContextBench: Modifying Contexts for Targeted Latent Activation and Behaviour Elicitation
- Control and Predictivity in Neural Interpretability
- Controlling Vision–Language–Action Policies through Sparse Latent Directions
- Convergent Linear Representations of Emergent Misalignment
- Correlations in the Data Lead to Semantically Rich Feature Geometry Under Superposition
- Cross-Architecture Model Diffing with Crosscoders: Unsupervised Discovery of Differences Between LLMs
- Death by a Thousand Directions: Exploring the Geometry of Harmfulness in LLMs through Subconcept Probing
- Decomposing Attention To Find Context-Sensitive Neurons
- Decomposing Representation Space into Interpretable Subspaces with Unsupervised Learning
- Decomposition of Small Transformer Models
- Demystifying Cipher-Following in Large Language Models via Activation Analysis
- Dense SAE Latents Are Features, Not Bugs
- Detecting and Characterizing Planning in Language Models
- Detecting Motivated Reasoning in the Internal Representations of Language Models
- Discovering Transformer Circuits via a Hybrid Attribution and Pruning Framework
- Do Natural Language Descriptions of Model Activations Convey Privileged Information?
- Do We Always Need Sampling? Eliciting Numerical Predictive Distributions of LLMs Without Auto-Regression
- Does FLUX Know What It’s Writing?
- Don't Believe the Belief Hype!
- Dual Mechanisms of Value Expression: Decomposing Intrinsic and Prompted Values in Language Models
- Dyslexify: A Mechanistic Defense Against Typographic Attacks in CLIP
- Eliciting Secret Knowledge from Language Models
- Emergence of Linear Truth Encodings in Language Models
- Emergent Specialization: Rare Token Neurons in Language Models
- Emergent World Beliefs: Exploring Transformers in Stochastic Games
- Emerging Human-like Strategies for Semantic Memory Foraging in Large Language Models
- Enforcing Orderedness in SAEs to Improve Feature Consistency
- Entity Multiplexing Through Activation Strength: Understanding goals in A Maze Solving Agent
- Equivalent Linear Mappings of Large Language Models
- Evaluating Explanatory Evaluations: An Explanatory Virtues Framework for Mechanistic Interpretability
- Evaluating SAE interpretability without explanations
- Faithful and Stable Neuron Explanations for Trustworthy Mechanistic Interpretability
- False Sense of Security: Why Probing-based Malicious Input Detection Fails to Generalize
- Feature Hedging: Correlated Features Break Narrow Sparse Autoencoders
- Feature interactions in sparse crosscoders from compact proofs
- Feature-Guided SAE Steering for Refusal-Rate Control using Contrasting Prompts
- Finding Manifolds with Bilinear Autoencoders
- Fluid Reasoning Representations
- From Black-box to Causal-box: Towards Building More Interpretable Models
- From Indirect Object Identification to Syllogisms: Exploring Binary Mechanisms in Transformer Circuits
- From Local to Contextually-Enriched Local Representations: A Mechanism for Holistic Processing in DINOv2 ViTs
- From Narrow Unlearning to Emergent Misalignment: Causes, Consequences, and Containment in LLMs
- From Tokens to Semantics: The Emergence and Stabilization of Polysemanticity in Language Models
- Function Induction and Task Generalization: An Interpretability Study with Off-by-One Addition
- Group Equivariance Meets Mechanistic Interpretability: Equivariant Sparse Autoencoders
- Hallucination Reduction with CASAL: Contrastive Activation Steering for Amortized Learning
- Head Pursuit: Probing Attention Specialization in Multimodal Transformers
- Higher Embedding Dimension Creates a Stronger World Model for a Simple Sorting Task
- Higher-Order Component Attribution via Kolmogorov–Arnold Networks
- How does Mamba Perform Associative Recall? A Mechanistic Study
- Instruction Following by Boosting Attention of Large Language Models
- InterpDetect: Interpretable Signals for Detecting Hallucinations in Retrieval-Augmented Generation
- Interpretability at the Network Level: Prior-Guided Drift Diffusion for Neural Circuit Analysis
- Interpretability for Time Series Transformers using A Concept Bottleneck Framework
- Interpretable Embeddings with Sparse Autoencoders: A Data Analysis Toolkit
- Interpreting Attention Heads for Image-to-Text Information Flow in Large Vision–Language Models
- Interpreting ResNet-based CLIP via Neuron-Attention Decomposition
- Interpreting Vision Grounding in Vision-Language Models: A Case Study in Coordinate Prediction
- Iterative Inference in a Chess-Playing Neural Network
- Just-in-time and distributed task representations in language models
- Language Models use Lookbacks to Track Beliefs
- Latent Crystallographic Microscope: Probing the Emergent Crystallographic Knowledge in Large Language Models
- Learned Structure in Cartridges: Keys as Shareable Routers in Self-Studied Representations
- Learning Interpretable Features in Audio Latent Spaces via Sparse Autoencoders
- Learning to Steer: Input-dependent Steering for Multimodal LLMs
- LLM Pretraining with Continuous Concepts
- LLM Probing with Contrastive Eigenproblems: Improving Understanding and Applicability of CCS
- Localizing Reasoning Training-Induced Changes in Large Language Models
- Looking into Black Box Code Language Models
- Mapping Faithful Reasoning in Language Models
- Measuring Sparse Autoencoder Feature Sensitivity
- Mechanistic Evaluation of Transformers and State-Space Models
- Mechanistic evidence that motif-gated domain recognition drives contact prediction in protein language models
- Mechanistic Interpretability as Statistical Estimation: A Variance Analysis of EAP-IG
- Mitigating Emergent Misalignment with Data Attribution
- Mitigating Sycophancy in Language Models via Sparse Activation Fusion and Multi-Layer Activation Steering
- Model Organisms for Emergent Misalignment
- Motifs in Attention Patterns of Large Language Models
- Multimodal Concept Bottleneck Models
- Multiple Streams of Knowledge Retrieval: Enriching and Recalling in Transformers
- Narrow Finetuning Leaves Clearly Readable Traces in the Activation Differences
- Neurons Speak in Ranges: Breaking Free from Discrete Neuronal Attribution
- nnterp: A Standardized Interface for Mechanistic Interpretability of Transformers
- On the Geometry and Topology of Neural Circuits for Modular Addition
- On the Limits of Linear Representation Hypotheses in Large Language Models: A Dynamical Systems Analysis
- Open-Vocabulary Natural-Language Explanations of LLM Activations via Soft Prompts
- OpenMAIA: a Multimodal Automated Interpretability Agent based on open-source models
- Overcoming Sparsity Artifacts in Crosscoders to Interpret Chat-Tuning
- Path Channels and Plan Extension Kernels: a Mechanistic Description of Planning in a Sokoban RNN
- Physics Steering: Causal Control of Cross-Domain Concepts in a Physics Foundation Model
- PII Jailbreaking in LLMs via Activation Steering Reveals Personal Information Leakage
- Pinpointing Attention-Causal Communication in Language Models
- Position: Mechanistic Interpretability Should Prioritize Feature Consistency in SAEs
- Predicting Weak-to-Strong Generalization from Latent Representations
- Probing by Analogy: Decomposing Probes into Activations for Better Interpretability and Inter-Model Generalization
- Quiet Feature Learning in Algorithmic Tasks
- Rank-1 LoRAs Encode Interpretable Reasoning Signals
- ReFIne: A Framework for Trustworthy Large Reasoning Models with Reliability, Faithfulness, and Interpretability
- ReflCtrl: Controlling LLM Reflection via Representation Engineering
- RelP: Faithful and Efficient Circuit Discovery in Language Models via Relevance Patching
- Representation Similarity Reveals Implicit Layer Grouping in Neural Networks
- Rethinking Crowd-Sourced Evaluation of Neuron Explanations
- Rethinking Sparse Autoencoders: Select-and-Project for Fairness and Control from Encoder Features Alone
- Reverse Engineering a Stateful Reasoning Circuit
- Reverse-Engineering Memory in DreamerV3: From Sparse Representations to Functional Circuits
- RippleBench: Capturing Ripple Effects by Leveraging Existing Knowledge Repositories
- RL-Obfuscation: Can Language Models Learn to Evade Latent-Space Monitors?
- Robustly Improving LLM Fairness in Realistic Settings via Interpretability
- SAE-ception: Iteratively Using Sparse Autoencoders as a Training Signal
- Same Task, Different Circuits: Disentangling Modality-Specific Mechanisms in VLMs
- Scratchpad Thinking: Alternation Between Storage and Computation in Latent Reasoning Models
- Shared Memorization Structures in Transformers Revealed by Loss Curvature
- Shared Parameter Subspaces and Cross-Task Linearity in Emergently Misaligned Behaviour
- Signal in the Noise: Polysemantic Interference Transfers and Predicts Cross-Model Influence
- Small Vectors, Big Effects: A Mechanistic Study of RL-Induced Reasoning via Steering Vectors
- Some Attention is All You Need for Retrieval
- Sparse Autoencoders Trained on the Same Data Learn Different Features
- Sparse but Wrong: Incorrect L0 Leads to Incorrect Features in Sparse Autoencoders
- Spectral Dynamics in Neural Network Training: Mathematical Foundations for Understanding Representational Development
- Steering Evaluation-Aware Language Models to Act Like They Are Deployed
- Steering Out-of-Distribution Generalization with Concept Ablation Fine-Tuning
- Stream: Scaling up Mechanistic Interpretability to Long Context in LLMs via Sparse Attention
- SuperActivators: Only the Tail of the Distribution Contains Reliable Concept Signals
- Superposition in Mixture of Experts
- Symbolic Policy Distillation for Interpretable Reinforcement Learning
- Symbolic vs. Continuous Features in Transformers: A Digital Communication System's Explanation
- The Anatomy of Alignment: Decomposing Preference Optimization by Steering Sparse Features
- The Dual Power of Interpretable Token Embeddings: Jailbreaking Attacks and Defenses for Diffusion Model Unlearning
- The Geometry of Self-Verification in a Task-Specific Reasoning Model
- The Impossibility of Inverse Permutation Learning in Transformer Models
- Thinking Hard, Going Misaligned: Emergent Misalignment in LLMs
- Thought Anchors: Which LLM Reasoning Steps Matter?
- Thought Branches: Interpreting LLM Reasoning Requires Resampling
- Three Desiderata for Faithfulness in Machine Learning Explanations: The Case for Causal Abstraction
- Token Entanglement in Subliminal Learning
- TopKLoRA
- Toward a Theory of Generalizability in LLM Mechanistic Interpretability Research
- Towards a Mechanistic Understanding of Robustness in Finetuned Reasoning Models
- Towards Understanding Multimodal Fine-Tuning: A Case Study into Spatial Features
- Towards Understanding the Nature of Attention with Low-Rank Sparse Decomposition
- Training Reliable Activation Probes With a Handful of Positive Examples
- Transformers Don’t Need LayerNorm at Inference Time: Scaling LayerNorm Removal to GPT-2 XL and Implications for Mechanistic Interpretability
- Trilemma of Truth in Large Language Models
- Uncovering Object Localization Mechanisms in VLMs
- Understanding sparse autoencoder scaling in the presence of feature manifolds
- Universal Neurons in GPT-2: Emergence, Persistence, and Functional Impact
- Unsupervised decoding of encoded reasoning using language model interpretability
- Unveiling the Latent Directions of Reflection in Large Language Models
- Vector Arithmetic in Concept and Token Subspaces
- Visual Exploration of Feature Relationships in Sparse Autoencoders with Curated Concepts
- WASP: A Weight-Space Approach to Detecting Learned Spuriousness
- Watch the Weights: Unsupervised monitoring and control of fine-tuned LLMs
- What Affects the Effective Depth of Large Language Models?
- What Do Refusal Tokens Learn? Fine-Grained Representations and Evidence for Downstream Steering
- When seeing Overrides Knowing: Disentangling Knowledge Conflicts in Vision-Language Models
- Where's the Bug? Attention Probing for Scalable Fault Localization
- Who is In Charge? Dissecting Role Conflicts in LLM Instruction Following