NeurIPS 2025 · Past · Interpretability · Neuroscience
First Workshop on CogInterp: Interpreting Cognition in Deep Learning Models
CogInterp @ NeurIPS 2025
- Submission deadline: Aug 28, 2025, 11:59 UTC (imported from OpenReview; check the website for extensions)
- Submission portal: OpenReview
- Notes: Auto-imported from the OpenReview venue record on 2026-06-10; please verify and enrich (topics are keyword-guessed).
Accepted papers (112)
Fetched from OpenReview (v2) on 2026-06-10.
- (How) Do LLMs Plan in One Forward Pass?
- A Cognitive Architecture for Probing Hierarchical Processing and Predictive Coding in Deep Vision Models
- A Computational Model for Binding by Enhanced Firing Rate: Implementing Smooth Power-law enhancement in Object-Centric Representations
- A Control-Theoretic Account of Cognitive Effort in Language Models
- A Few Bad Neurons: Isolating and Surgically Correcting Sycophancy
- A Multi-Method Interpretability Framework for Probing Cognitive Processing in Deep Neural Networks across Vision and Biomedical Domains
- A Neuroscience-Inspired Dual-Process Model of Compositional Generalization
- Acoustic Degradation Reweights Cortical and ASR Processing: A Brain-Model Alignment Study
- Actual or counterfactual? Asymmetric responsibility attributions in language models
- Are Humans Evolved Instruction Followers? An Underlying Inductive Bias Enables Rapid Instructed Task Learning
- Assessing Behavioral Effects of Reasoning (or the lack of) in LLMs
- Bitter Lesson of the ARC-AGI Challenge: Intelligence may look very different in machines and humans
- Bridging the Von Neuman Gap: Why LLMs Haven’t Made Novel Discoveries
- Can You Spot the Virtual Patient? Expert Review, Turing Test, and Linguistic–Semantic Analysis
- Causal Interventions on Continuous Features in LLMs: A Case Study in Verb Bias
- Causality $\neq$ Decodability, and Vice Versa: Lessons from Interpreting Counting ViTs
- Cognitive Behavior Modeling via Activation Steering
- Cognitive Load Traces as Symbolic and Visual Accounts of Deep Model Cognition
- Cognitive Machine Learning for Patient-First Modeling in Clinical Research
- Cognitive Maps in Language Models: A Mechanistic Analysis of Spatial Planning
- Conflict Adaptation in Vision-Language Models
- Context informs pragmatic interpretation in vision–language models
- CORE – Cognitive Observation of Reasoning Errors
- Culturally transmitted color categories in LLMs reflect a learning bias toward efficient compression
- CurLL: Curriculum Learning of Language Models
- DecepBench: Benchmarking Multimodal Deception Detection
- Decoding and Reconstructing Visual Experience from Brain Activity with Generative Latent Representations
- Deconstructing the Reasoning Process of a Neuro-Fuzzy Agent: From Learned Concepts to Natural Language Narratives
- Demystifying Emergent Exploration in Goal-conditioned RL
- Detecting Motivated Reasoning in the Internal Representations of Language Models
- Disaggregation Reveals Hidden Training Dynamics: The Case of Agreement Attraction
- Discovering Functionally Sufficient Projections with Functional Component Analysis
- Disentangling Interpretable Cognitive Variables That Support Human Generalization
- Do Cognitively Interpretable Reasoning Traces Improve LLM Performance?
- Do Large Language Models Show Biases in Causal Learning? Insights from Contingency Judgment
- Do Sparse Subnetworks Exhibit Cognitively Aligned Attention? Effects of Pruning on Saliency Map Fidelity, Sparsity, and Concept Coherence
- Does FLUX Know What It’s Writing?
- Don’t Think of the White Bear: Ironic Negation in Transformer Models under Cognitive Load
- Emergent World Beliefs: Exploring Transformers in Stochastic Games
- Exact Learning Dynamics of In-Context Learning in Linear Transformers and Its Application to Non-Linear Transformers
- Extracting Belief-Update Rules to Explain Theory-of-Mind Generalization Failures
- Forgetting as a Lens into Model Cognition: Selective Unlearning Reveals Cognitive Biases in Deep Neural Networks
- From Black Box to Bedside: Distilling Reinforcement Learning for Interpretable Sepsis Treatment
- From Cephalopods to Large Language Models: Conceptions of Intelligence and Reasoning
- From Comparison to Composition: Towards Understanding Machine Cognition of Unseen Categories
- Fuzzy, Symbolic, and Contextual: Enhancing LLM Instruction via Cognitive Scaffolding
- GBEval: A SHAP-based Interpretable Gender Bias Assessment Framework for LLMs
- Generating Compromises Between Two Points of View
- Gradual Forgetting: Logarithmic Compression for Extending Transformer Context Windows
- How Do LLMs Ask Questions? A Pragmatic Comparison with Human Question-Asking
- How Intrinsic Motivation Shapes Learned Representations in Decision Transformers: A Cognitive Interpretability Analysis
- I Am Large, I Contain Multitudes: Persona Transmission via Contextual Inference in LLMs
- Inside you are many wolves: Using cognitive models to interpret value trade-offs in LLMs
- InterpDetect: Interpretable Signals for Detecting Hallucinations in Retrieval-Augmented Generation
- Interpretable Hybrid Neural-Cognitive Models Discover Cognitive Strategies Underlying Flexible Reversal Learning
- Interpretable Traces, Unexpected Outcomes: Investigating the Disconnect in Trace-Based Knowledge Distillation
- Interpreting style–content parsing in vision–language models
- Kindness or Sycophancy? Understanding and Shaping Model Personality via Synthetic Games
- Language models can associate objects with their features without forming integrated representations
- Language Models use Lookbacks to Track Beliefs
- Language-Based Dementia Classification Should Consider Model Cognition for Interpretability
- Learning to Look: Cognitive Attention Alignment with Vision-Language Models
- Let's Think 一步一步: A Cognitive Framework for Characterizing Code-Switching in LLM Reasoning
- LLM Agents Beyond Utility: An Open-Ended Perspective
- LRP-CLIP: A Zero Shot Approach for the Explanation of the Cognitive Functions of Vision Models
- Measuring LLM Generation Spaces with EigenScore
- Mechanisms of Symbol Processing in Transformers
- Mechanistic Interpretability of GPT-2: Lexical and Contextual Layers in Sentiment Analysis
- Mechanistic Interpretability of Semantic Abstraction in Biomedical Text
- MetaCD: A Meta Learning Framework for Cognitive Diagnosis based on Continual Learning
- Metacognitive Sensitivity for Test-Time Dynamic Model Selection
- Mind Games Machines Play: Contrastive Cognitive Bias Detection in LLMs and Distilled Models
- Minimization of Boolean Complexity in In-Context Concept Learning
- Misalignment Between Vision-Language Representations in Vision-Language Models
- Modulation of temporal decision-making in a deep reinforcement learning agent under the dual-task paradigm
- NiceWebRL: a Python library for human subject experiments with reinforcement learning environments
- On the Role of Pretraining in Domain Adaptation in an Infant-Inspired Distribution Shift Task
- Pedagogical Alignment of LLMs requires Diverse Cognitively-Inspired Student Proxies
- Perceived vs. True Emergence: A Cognitive Account of Generalization in Clinical Time Series Models
- Personality Manipulation as a Cognitive Probe in Large Language Models
- PluriHarms: Benchmarking the Full Spectrum of Human Judgments on AI Harm
- Post-hoc Stochastic Concept Bottleneck Models
- Predicting the Formation of Induction Heads
- Priors in Time: A Generative View of Sparse Autoencoders for Sequential Representations
- Privileged Self-Access Matters for Introspection in AI
- Reverse-Engineering Memory in DreamerV3: From Sparse Representations to Functional Circuits
- RNNs reveal a new optimal stopping rule in sequential sampling for decision-making
- Sarc7: Evaluating Sarcasm Detection and Generation with Seven Types and Emotion-Informed Techniques
- Scratchpad Thinking: Alternation Between Storage and Computation in Latent Reasoning Models
- Shared Parameter Subspaces and Cross-Task Linearity in Emergently Misaligned Behaviour
- Signatures of human-like processing in Transformer forward passes
- Sparse Feature Coactivation Reveals Composable Semantic Modules in Large Language Models
- STAT: Skill-Targeted Adaptive Training
- Stop Anthropomorphizing Intermediate Tokens as Reasoning/Thinking Traces!
- Strategy and structure in Codenames: Comparing human and GPT-4 gameplay
- The Mechanistic Emergence of Symbol Grounding in Language Models
- The One Where They Brain-Tune for Social Cognition: Multi-Modal Brain-Tuning on Friends
- Theoretical Linguistics Constrains Hypothesis-Driven Causal Abstraction in Mechanistic Interpretability
- Towards Cognitively Plausible Concept Learning: Spatially Grounding Concepts with Anatomical Priors
- Towards finding consensus about similarity of symbolic encodings associated with concepts between LLMs and human brain
- Towards Visual Simulation in Multimodal Language Models
- Tracing the Development of Syntax and Semantics in a Model trained on Child-Directed Speech and Visual Input
- Understanding Pre-trained and Fine-tuned model behaviour using Model Diffing
- Understanding the Thinking Process of Reasoning Models: A Perspective from Schoenfeld's Episode Theory
- Unifying Gestalt Principles Through Inference-Time Prior Integration
- Unraveling the cognitive patterns of Large Language Models through module communities
- Value Entanglement: Conflation Between Moral and Grammatical Good In (Some) Large Language Models
- Video Finetuning Improves Reasoning Between Frames
- Visual symbolic mechanisms: Emergent symbol processing in vision language models
- What Comes to Mind? Interpretable Dimensions in Embedding Space Predict Human Ad Hoc Category Construction
- What is a Number, That a Large Language Model May Know It?
- When Researchers Say Mental Model/Theory of Mind of AI, What Are They Really Talking About?