COLM 2025 · Past · Interpretability

The First Workshop on the Interplay of Model Behavior and Model Internals

INTERPLAY


Submission deadline: Jul 11, 2025, 07:55 UTC
imported from OpenReview — check the website for extensions
Submission portal: OpenReview

Accepted papers (22)

Fetched from OpenReview (v2) on 2026-06-11.

Analyzing Representational Shifts in Multimodal Models: A Study of Feature Dynamics in Gemma and PaliGemma
Aaron C Friedman, Trinabh Gupta, Raine Ma, Sean O'Brien, Kevin Zhu, Cole Blondin · PDF
Angular Steering: Behavior Control via Rotation in Activation Space
Hieu M. Vu, Tan Minh Nguyen · PDF
Attributing Response to Context: A Jensen–Shannon Divergence Driven Mechanistic Study of Context Attribution in Retrieval-Augmented Generation
Ruizhe Li, Chen Chen, Yuchen Hu, Yanjun Gao, Xi Wang, Emine Yilmaz · PDF
BERTology in the Modern World
Michael Li, Nishant Subramani · PDF
Causal Interventions Reveal Shared Structure Across English Filler–Gap Constructions
Sasha Boguraev, Christopher Potts, Kyle Mahowald · PDF
Comparing Prompt and Representation Engineering for Personality Control in Language Models: A Case Study
Pengrui Han · PDF
Death by a Thousand Directions: Exploring the Geometry of Harmfulness in LLMs through Subconcept Probing
McNair Shah, Saleena Angeline Sartawita, Adhitya Rajendra Kumar, Naitik Chheda, Kevin Zhu, Vasu Sharma, Sean O'Brien, Will Cai · PDF
Emotions Where Art Thou: Understanding and Characterizing the Emotional Latent Space of Large Language Models
Benjamin Reichman, Adar Avsian, Larry Heck · PDF
Evaluating Contrast Localizer for Identifying Causal Units in Social & Mathematical Tasks in Language Models
Yassine Jamaa, Badr AlKhamissi, Satrajit S Ghosh, Martin Schrimpf · PDF
From Indirect Object Identification to Syllogisms: Exploring Binary Mechanisms in Transformer Circuits
Karim Saraipour, Shichang Zhang · PDF
How Post-Training Reshapes LLMs: A Mechanistic View on Knowledge, Truthfulness, Refusal, and Confidence
Hongzhe Du, Weikai Li, Min Cai, Karim Saraipour, Zimin Zhang, Yizhou Sun, Himabindu Lakkaraju, Shichang Zhang · PDF
Interpreting the Latent Structure of Operator Precedence in Language Models
Dharunish Yugeswardeenoo, Harshil Nukala, Cole Blondin, Sean O'Brien, Vasu Sharma, Kevin Zhu · PDF
LLM Microscope: What Model Internals Reveal About Answer Correctness and Context Utilization
Jiarui Liu, Jivitesh Jain, Mona T. Diab, Nishant Subramani · PDF
Localizing Persona Representations in LLMs
Celia Cintas, Miriam Rateike, Erik Miehling, Elizabeth M. Daly, Skyler Speakman · PDF
On the Geometry of Semantics in Next-token Prediction
Yize Zhao, Christos Thrampoulidis · PDF
One-shot Optimized Steering Vectors Mediate Safety-relevant Behaviors in LLMs
Jacob Dunefsky, Arman Cohan · PDF
Predicting Success of Model Editing via Intrinsic Features
Yanay Soker, Martin Tutek, Yonatan Belinkov · PDF
Query-Focused Retrieval Heads Improve Long-Context Reasoning and Re-ranking
Wuwei Zhang, Fangcong Yin, Howard Yen, Danqi Chen, Xi Ye · PDF
Safety Subspaces are Not Distinct: A Fine-Tuning Case Study
Shaan Shah, Kaustubh Ponkshe, Raghav Singhal, Praneeth Vepakomma · PDF
Stochastic Chameleons: Irrelevant Context Hallucinations Reveal Class-Based (Mis)Generalization in LLMs
Ziling Cheng, Meng Cao, Marc-Antoine Rondeau, Jackie CK Cheung · PDF
Understanding In-context Learning of Addition via Activation Subspaces
Xinyan Hu, Kayo Yin, Michael I. Jordan, Jacob Steinhardt, Lijie Chen · PDF
Universal Neurons in GPT-2: Emergence, Persistence, and Functional Impact
Advey Nandan, Cheng-Ting Chou, Amrit Kurakula, Cole Blondin, Kevin Zhu, Vasu Sharma, Sean O'Brien · PDF
