NeurIPS 2025 · Past · Interpretability · Neuroscience

First Workshop on CogInterp: Interpreting Cognition in Deep Learning Models

CogInterp @ NeurIPS 2025

Submission deadline: Aug 28, 2025, 11:59 UTC (imported from OpenReview; check the website for extensions)
Submission portal: OpenReview
Notes: Auto-imported from the OpenReview venue record on 2026-06-10. Please verify and enrich (topics are keyword-guessed).

Accepted papers (112)

Fetched from OpenReview (v2) on 2026-06-10.

  1. (How) Do LLMs Plan in One Forward Pass?

    Michael Hanna, Emmanuel Ameisen · PDF
  2. A Cognitive Architecture for Probing Hierarchical Processing and Predictive Coding in Deep Vision Models

    Brennen Hill, Zhang Xinyu, Timothy Putra Prasetio · PDF
  3. A Computational Model for Binding by Enhanced Firing Rate: Implementing Smooth Power-law enhancement in Object-Centric Representations

    Ishanvir S. Choongh, Manu Madhav · PDF
  4. A Control-Theoretic Account of Cognitive Effort in Language Models

    Pranjal Garg · PDF
  5. A Few Bad Neurons: Isolating and Surgically Correcting Sycophancy

    Claire O'Brien, Jessica Seto, Dristi Roy, Aditya Dwivedi, Ryan Lagasse, Sunishchal Dev, Kevin Zhu, Sean O'Brien · PDF
  6. A Multi-Method Interpretability Framework for Probing Cognitive Processing in Deep Neural Networks across Vision and Biomedical Domains

    Harshini Suresha, Kavitha S H · PDF
  7. A Neuroscience-Inspired Dual-Process Model of Compositional Generalization

    Alexander Noviello, Claas Beger, Jacob Groner, Kevin Ellis, Weinan Sun · PDF
  8. Acoustic Degradation Reweights Cortical and ASR Processing: A Brain-Model Alignment Study

    Francis Pingfan Chien, Chia-Chun Dan Hsu, Po-Jang Hsieh, Yu Tsao · PDF
  9. Actual or counterfactual? Asymmetric responsibility attributions in language models

    Eric Bigelow, Yang Xiang, Tobias Gerstenberg, Tomer Ullman, Samuel J. Gershman · PDF
  10. Are Humans Evolved Instruction Followers? An Underlying Inductive Bias Enables Rapid Instructed Task Learning

    Anjishnu Kumar · PDF
  11. Assessing Behavioral Effects of Reasoning (or the lack of) in LLMs

    Arthur Buzelin, Samira Malaquias, Victoria Estanislau, Yan Aquino, Pedro Augusto Torres Bento, Lucas Dayrell, Arthur Chagas, Gisele L. Pappa, Wagner Meira Jr. · PDF
  12. Bitter Lesson of the ARC-AGI Challenge: Intelligence may look very different in machines and humans

    Soumya Banerjee · PDF
  13. Bridging the Von Neuman Gap: Why LLMs Haven’t Made Novel Discoveries

    Ashwin Saraswatula · PDF
  14. Can You Spot the Virtual Patient? Expert Review, Turing Test, and Linguistic–Semantic Analysis

    Reyhaneh Hosseinpourkhoshkbari, Wei-chen Huang, Suvel Muttreja, Richard M. Golden · PDF
  15. Causal Interventions on Continuous Features in LLMs: A Case Study in Verb Bias

    Zhenghao Zhou, R. Thomas McCoy, Robert Frank · PDF
  16. Causality $\neq$ Decodability, and Vice Versa: Lessons from Interpreting Counting ViTs

    Lianghuan Huang, Yingshan Chang · PDF
  17. Cognitive Behavior Modeling via Activation Steering

    Anthony Kuang, Ahmed Ismail, Ayo Akinkugbe, Kevin Zhu, Sean O'Brien · PDF
  18. Cognitive Load Traces as Symbolic and Visual Accounts of Deep Model Cognition

    Dong Liu, Yanxuan Yu · PDF
  19. Cognitive Machine Learning for Patient-First Modeling in Clinical Research

    Shashank Uttrani, Shruti Kaushik, Martin White · PDF
  20. Cognitive Maps in Language Models: A Mechanistic Analysis of Spatial Planning

    Caroline Baumgartner, Eleanor Spens, Neil Burgess, Petru Manescu · PDF
  21. Conflict Adaptation in Vision-Language Models

    Xiaoyang Hu · PDF
  22. Context informs pragmatic interpretation in vision–language models

    Alvin Wei Ming Tan, Ben Prystawski, Veronica Boyce, Michael Frank · PDF
  23. CORE – Cognitive Observation of Reasoning Errors

    Janos Horvath · PDF
  24. Culturally transmitted color categories in LLMs reflect a learning bias toward efficient compression

    Nathaniel Imel, Noga Zaslavsky · PDF
  25. CurLL: Curriculum Learning of Language Models

    Pavan Kalyan Tankala, Shubhra Mishra, Satya Lokam, Navin Goyal · PDF
  26. DecepBench: Benchmarking Multimodal Deception Detection

    Vittesh Maganti, Nysa Lalye, Ethan Braverman, Kevin Zhu, Vasu Sharma, Sean O'Brien · PDF
  27. Decoding and Reconstructing Visual Experience from Brain Activity with Generative Latent Representations

    Motokazu Umehara, Yoshihiro Nagano, Misato Tanaka, Yukiyasu Kamitani · PDF
  28. Deconstructing the Reasoning Process of a Neuro-Fuzzy Agent: From Learned Concepts to Natural Language Narratives

    Yumin Zhou, Whye Loon Tung, Hiok Quek · PDF
  29. Demystifying Emergent Exploration in Goal-conditioned RL

    Mahsa Bastankhah, Grace Liu, Dilip Arumugam, Thomas L. Griffiths, Benjamin Eysenbach · PDF
  30. Detecting Motivated Reasoning in the Internal Representations of Language Models

    Parsa Mirtaheri, Mikhail Belkin · PDF
  31. Disaggregation Reveals Hidden Training Dynamics: The Case of Agreement Attraction

    James A. Michaelov, Catherine Arnett · PDF
  32. Discovering Functionally Sufficient Projections with Functional Component Analysis

    Satchel Grant · PDF
  33. Disentangling Interpretable Cognitive Variables That Support Human Generalization

    Xinyue Zhu, Daniel L. Kimmel · PDF
  34. Do Cognitively Interpretable Reasoning Traces Improve LLM Performance?

    Siddhant Bhambri, Upasana Biswas, Subbarao Kambhampati · PDF
  35. Do Large Language Models Show Biases in Causal Learning? Insights from Contingency Judgment

    María Victoria Carro, Denise Alejandra Mester, Francisca Gauna Selasco, Giovanni Franco Gabriel Marraffini, Mario Leiva, Gerardo Simari, Maria Vanina Martinez · PDF
  36. Do Sparse Subnetworks Exhibit Cognitively Aligned Attention? Effects of Pruning on Saliency Map Fidelity, Sparsity, and Concept Coherence

    Sanish Suwal, Dipkamal Bhusal, Michael Clifford, Nidhi Rastogi · PDF
  37. Does FLUX Know What It’s Writing?

    Adrian Chang, Sheridan Feucht, Byron C Wallace, David Bau · PDF
  38. Don’t Think of the White Bear: Ironic Negation in Transformer Models under Cognitive Load

    Logan Mann, Nayan Saxena, Sarah Tandon, Chenhao Sun, Savar Toteja, Kevin Zhu · PDF
  39. Emergent World Beliefs: Exploring Transformers in Stochastic Games

    Tanish Rastogi, Michael Ma, Adam Kamel, Kailash Ranganathan · PDF
  40. Exact Learning Dynamics of In-Context Learning in Linear Transformers and Its Application to Non-Linear Transformers

    Nischal Mainali, Lucas Teixeira · PDF
  41. Extracting Belief-Update Rules to Explain Theory-of-Mind Generalization Failures

    Joel Phillips Michelson, Deepayan Sanyal, Maithilee Kunda · PDF
  42. Forgetting as a Lens into Model Cognition: Selective Unlearning Reveals Cognitive Biases in Deep Neural Networks

    Kaustubha V · PDF
  43. From Black Box to Bedside: Distilling Reinforcement Learning for Interpretable Sepsis Treatment

    Ella Lan, Andrea Yu, Sergio Charles · PDF
  44. From Cephalopods to Large Language Models: Conceptions of Intelligence and Reasoning

    Soumya Banerjee · PDF
  45. From Comparison to Composition: Towards Understanding Machine Cognition of Unseen Categories

    Minghao Fu, Sheng Zhang, Guangyi Chen, Zijian Li, Fan Feng, Yifan Shen, Shaoan Xie, Kun Zhang · PDF
  46. Fuzzy, Symbolic, and Contextual: Enhancing LLM Instruction via Cognitive Scaffolding

    Vanessa Figueiredo · PDF
  47. GBEval: A SHAP-based Interpretable Gender Bias Assessment Framework for LLMs

    Jayan Adhikari, Raj Dandekar, Rajat Dandekar, Sreedath Panat · PDF
  48. Generating Compromises Between Two Points of View

    Sumanta Bhattacharyya, Francine Chen, Scott Carter, Yan-Ying Chen, Tatiana Lau, Nayeli Suseth Bravo, Monica P Van, Kate Sieck, Charlene C. Wu · PDF
  49. Gradual Forgetting: Logarithmic Compression for Extending Transformer Context Windows

    Billy Dickson, Zoran Tiganj · PDF
  50. How Do LLMs Ask Questions? A Pragmatic Comparison with Human Question-Asking

    Chani Jung, Jimin Mun, Xuhui Zhou, Alice Oh, Maarten Sap, Hyunwoo Kim · PDF
  51. How Intrinsic Motivation Shapes Learned Representations in Decision Transformers: A Cognitive Interpretability Analysis

    Leonardo Guiducci, Antonio Rizzo, Giovanna Maria Dimitri · PDF
  52. I Am Large, I Contain Multitudes: Persona Transmission via Contextual Inference in LLMs

    Puria Radmard, Shi Feng · PDF
  53. Inside you are many wolves: Using cognitive models to interpret value trade-offs in LLMs

    Sonia Krishna Murthy, Rosie Zhao, Jennifer Hu, Sham M. Kakade, Markus Wulfmeier, Peng Qian, Tomer Ullman · PDF
  54. InterpDetect: Interpretable Signals for Detecting Hallucinations in Retrieval-Augmented Generation

    Likun Tan, Kuan-Wei Huang, Joy Shi, Kevin Wu · PDF
  55. Interpretable Hybrid Neural-Cognitive Models Discover Cognitive Strategies Underlying Flexible Reversal Learning

    Chonghao Cai, Liyuan Li, Yifei Cao, Maria K Eckstein · PDF
  56. Interpretable Traces, Unexpected Outcomes: Investigating the Disconnect in Trace-Based Knowledge Distillation

    Siddhant Bhambri, Upasana Biswas, Subbarao Kambhampati · PDF
  57. Interpreting style–content parsing in vision–language models

    Fan L. Cheng, Xin Jing · PDF
  58. Kindness or Sycophancy? Understanding and Shaping Model Personality via Synthetic Games

    Maya Okawa, Ekdeep Singh Lubana, Mai Uchida, Hidenori Tanaka · PDF
  59. Language models can associate objects with their features without forming integrated representations

    Simon Jerome Han, James Lloyd McClelland · PDF
  60. Language Models use Lookbacks to Track Beliefs

    Nikhil Prakash, Natalie Shapira, Arnab Sen Sharma, Christoph Riedl, Yonatan Belinkov, Tamar Rott Shaham, David Bau, Atticus Geiger · PDF
  61. Language-Based Dementia Classification Should Consider Model Cognition for Interpretability

    Yui Ishihara, Michelle Cohn, Kartik Patwari, Alyssa Weakley, Chen-Nee Chuah · PDF
  62. Learning to Look: Cognitive Attention Alignment with Vision-Language Models

    Ryan L. Yang, Dipkamal Bhusal, Nidhi Rastogi · PDF
  63. Let's Think 一步一步: A Cognitive Framework for Characterizing Code-Switching in LLM Reasoning

    Eleanor Lin, David Jurgens · PDF
  64. LLM Agents Beyond Utility: An Open-Ended Perspective

    Asen Nachkov, Xi Wang, Luc Van Gool · PDF
  65. LRP-CLIP: A Zero Shot Approach for the Explanation of the Cognitive Functions of Vision Models

    Malte Singerhoff, Viktor Matkovic, Torben Weis · PDF
  66. Measuring LLM Generation Spaces with EigenScore

    Sunny Yu, Myra Cheng, Ahmad Jabbar, Robert D. Hawkins, Dan Jurafsky · PDF
  67. Mechanisms of Symbol Processing in Transformers

    Paul Smolensky, Roland Fernandez, Zhenghao Zhou, Mattia Opper, Adam Davies, Jianfeng Gao · PDF
  68. Mechanistic Interpretability of GPT-2: Lexical and Contextual Layers in Sentiment Analysis

    Amartya Hatua · PDF
  69. Mechanistic Interpretability of Semantic Abstraction in Biomedical Text

    Nikhil Gourisetty, Vishnu Srinivas, Snata Mohanty, Soumil Jain, Kevin Zhu, Benjamin Liu, Sunishchal Dev, Sunith Vallabhaneni · PDF
  70. MetaCD: A Meta Learning Framework for Cognitive Diagnosis based on Continual Learning

    Jin Wu, Chanjin Zheng · PDF
  71. Metacognitive Sensitivity for Test-Time Dynamic Model Selection

    Le Tuan Minh Trinh, Le Minh Vu Pham, Thi Minh Anh Pham, An Duc Nguyen · PDF
  72. Mind Games Machines Play: Contrastive Cognitive Bias Detection in LLMs and Distilled Models

    Anusha Asim, Maryam Rifah · PDF
  73. Minimization of Boolean Complexity in In-Context Concept Learning

    Leroy Z. Wang, R. Thomas McCoy, Shane Steinert-Threlkeld · PDF
  74. Misalignment Between Vision-Language Representations in Vision-Language Models

    Yonatan Gideoni, Yoav Gelberg, Tim G. J. Rudner, Yarin Gal · PDF
  75. Modulation of temporal decision-making in a deep reinforcement learning agent under the dual-task paradigm

    Amrapali Pednekar, Álvaro Garrido Pérez, Yara Khaluf, Pieter Simoens · PDF
  76. NiceWebRL: a Python library for human subject experiments with reinforcement learning environments

    Wilka Carvalho, Vikram Srinivas Goddla, Ishaan Sinha, Hoon Shin, Kunal Jha · PDF
  77. On the Role of Pretraining in Domain Adaptation in an Infant-Inspired Distribution Shift Task

    Deepayan Sanyal, Joel Phillips Michelson, Maithilee Kunda · PDF
  78. Pedagogical Alignment of LLMs requires Diverse Cognitively-Inspired Student Proxies

    Suchir Salhan, Andrew Caines, Paula Buttery · PDF
  79. Perceived vs. True Emergence: A Cognitive Account of Generalization in Clinical Time Series Models

    Shashank Yadav · PDF
  80. Personality Manipulation as a Cognitive Probe in Large Language Models

    Gunmay Handa, Zekun Wu, Adriano Koshiyama, Philip Colin Treleaven · PDF
  81. PluriHarms: Benchmarking the Full Spectrum of Human Judgments on AI Harm

    Jing-Jing Li, Joel Mire, Eve Fleisig, Valentina Pyatkin, Maarten Sap, Sydney Levine · PDF
  82. Post-hoc Stochastic Concept Bottleneck Models

    Wiktor Hoffmann, Sonia Laguna, Moritz Vandenhirtz, Emanuele Palumbo, Julia E Vogt · PDF
  83. Predicting the Formation of Induction Heads

    Tatsuya Aoyama, Ethan Wilcox, Nathan Schneider · PDF
  84. Priors in Time: A Generative View of Sparse Autoencoders for Sequential Representations

    Ekdeep Singh Lubana, Sai Sumedh R. Hindupur, Can Rager, Valérie Costa, Oam Patel, Sonia Krishna Murthy, Thomas Fel, Greta Tuckute, Daniel Wurgaft, Demba E. Ba, Melanie Weber, Aaron Mueller · PDF
  85. Privileged Self-Access Matters for Introspection in AI

    Siyuan Song, Harvey Lederman, Jennifer Hu, Kyle Mahowald · PDF
  86. Reverse-Engineering Memory in DreamerV3: From Sparse Representations to Functional Circuits

    Jan Sobotka, Auke Ijspeert, Guillaume Bellegarda · PDF
  87. RNNs reveal a new optimal stopping rule in sequential sampling for decision-making

    Jialin Li, Kenway Louie, Paul W. Glimcher, Bo Shen · PDF
  88. Sarc7: Evaluating Sarcasm Detection and Generation with Seven Types and Emotion-Informed Techniques

    Lang Xiong, Raina Gao, Alyssa Jeong, Yicheng Fu, Kevin Zhu, Sean O'Brien, Vasu Sharma · PDF
  89. Scratchpad Thinking: Alternation Between Storage and Computation in Latent Reasoning Models

    Sayam Goyal, Brad Peters, María Emilia Granda, Akshath Vijayakumar Narmadha, Dharunish Yugeswardeenoo, Callum Stuart McDougall, Sean O'Brien, Ashwinee Panda, Kevin Zhu, Cole Blondin · PDF
  90. Shared Parameter Subspaces and Cross-Task Linearity in Emergently Misaligned Behaviour

    Eric Zhang, Daniel Aarao Reis Arturi, Andrew Adrian Ansah, Kevin Zhu, Ashwinee Panda, Aishwarya Balwani · PDF
  91. Signatures of human-like processing in Transformer forward passes

    Jennifer Hu, Michael A. Lepori, Michael Franke · PDF
  92. Sparse Feature Coactivation Reveals Composable Semantic Modules in Large Language Models

    Ruixuan Deng, Xiaoyang Hu, Miles Gilberti, Shane Storks, Aman Taxali, Mike Angstadt, Chandra Sripada, Joyce Chai · PDF
  93. STAT: Skill-Targeted Adaptive Training

    Yinghui He, Abhishek Panigrahi, Yong Lin, Sanjeev Arora · PDF
  94. Stop Anthropomorphizing Intermediate Tokens as Reasoning/Thinking Traces!

    Subbarao Kambhampati, Kaya Stechly, Karthik Valmeekam, Lucas Paul Saldyt, Siddhant Bhambri, Vardhan Palod, Atharva Gundawar, Soumya Rani Samineni, Durgesh Kalwar, Upasana Biswas · PDF
  95. Strategy and structure in Codenames: Comparing human and GPT-4 gameplay

    Noah Prescott, Tracey Mills, Jonathan Phillips · PDF
  96. The Mechanistic Emergence of Symbol Grounding in Language Models

    Ziqiao Ma, Shuyu Wu, Xiaoxi Luo, Yidong Huang, Josue Torres-Fonseca, Freda Shi, Joyce Chai · PDF
  97. The One Where They Brain-Tune for Social Cognition: Multi-Modal Brain-Tuning on Friends

    Nico Policzer, Cameron Braunstein, Mariya Toneva · PDF
  98. Theoretical Linguistics Constrains Hypothesis-Driven Causal Abstraction in Mechanistic Interpretability

    Suchir Salhan, Konstantinos Voudouris · PDF
  99. Towards Cognitively Plausible Concept Learning: Spatially Grounding Concepts with Anatomical Priors

    Yuyu Zhou · PDF
  100. Towards finding consensus about similarity of symbolic encodings associated with concepts between LLMs and human brain

    Sushma Anand Akoju · PDF
  101. Towards Visual Simulation in Multimodal Language Models

    Catherine Finegan-Dollak · PDF
  102. Tracing the Development of Syntax and Semantics in a Model trained on Child-Directed Speech and Visual Input

    Nina Schoener, Mahesh Srinivasan, Colin Conwell · PDF
  103. Understanding Pre-trained and Fine-tuned model behaviour using Model Diffing

    Mallikarjuna Tupakula · PDF
  104. Understanding the Thinking Process of Reasoning Models: A Perspective from Schoenfeld's Episode Theory

    Ming Li, Nan Zhang, Chenrui Fan, Hong Jiao, Tianyi Zhou · PDF
  105. Unifying Gestalt Principles Through Inference-Time Prior Integration

    Tahereh Toosi, Kenneth D. Miller · PDF
  106. Unraveling the cognitive patterns of Large Language Models through module communities

    Kushal Raj Bhandari, Pin-Yu Chen, Jianxi Gao · PDF
  107. Value Entanglement: Conflation Between Moral and Grammatical Good In (Some) Large Language Models

    Seong Hah Cho, Junyi Li, Anna Leshinskaya · PDF
  108. Video Finetuning Improves Reasoning Between Frames

    Ruiqi Yang, Tian Yun, Zihan Wang, Ellie Pavlick · PDF
  109. Visual symbolic mechanisms: Emergent symbol processing in vision language models

    Rim Assouel, Declan Iain Campbell, Taylor Whittington Webb · PDF
  110. What Comes to Mind? Interpretable Dimensions in Embedding Space Predict Human Ad Hoc Category Construction

    Alina Dracheva, Jonathan Phillips · PDF
  111. What is a Number, That a Large Language Model May Know It?

    Raja Marjieh, Veniamin Veselovsky, Thomas L. Griffiths, Ilia Sucholutsky · PDF
  112. When Researchers Say Mental Model/Theory of Mind of AI, What Are They Really Talking About?

    Xiaoyun Yin, Elmira Zahmat Doost, Shiwen Zhou, Garima Arya Yadav, Jamie Gorman · PDF