
I Can't Believe It's Not Better: Where Large Language Models Need to Improve

ICLR 2026 Workshop ICBINB


Submission deadline: Feb 1, 2026, 11:59 UTC
OpenReview-synced 2026-02-01 11:59 UTC (as of 2026-06-23) — extensions on OpenReview are applied automatically; verify on the website.
Submission portal: OpenReview
Notes: Topics were auto-suggested and may be imprecise — edits welcome.

Accepted papers (56)

Fetched from OpenReview (v2) on 2026-06-10.

A Pilot Study on Doubt Robustness of LLMs in Clinical Prediction Explanation
Juhwan Choi, Sangchul Hahn, Eunho Yang · PDF
AI-rithmetic
Alex Bie, Travis Dick, Alex Kulesza, Prabhakar Raghavan, Vinod Raman, Sergei Vassilvitskii · PDF
Attention Sinks as Internal Signals for Hallucination Detection in Large Language Models
Jakub Binkowski, Kamil Adamczewski, Tomasz Jan Kajdanowicz · PDF
Barriers to Pareto Steerability in Preference-Conditioned LLM Alignment
Fatemeh Nourzad, Daouda Sow, Yingbin Liang, Ming Shi, Ming Zhang, Yunxuan Li, Eylem Ekici, Ness Shroff · PDF
Beyond Continuity: Challenges of Context Switching in Multi-Turn Dialogue with LLMs
Aditya Sinha, Harald Steck, Vito Claudio Ostuni, Matteo Rinaldi · PDF
Beyond Suffixes: Token Position in GCG Adversarial Attacks on Large Language Models
Hicham Eddoubi, Umar Faruk Abdullahi, Fadi Hassan · PDF
Bigger Is Not Better Under Differential Privacy: Optimization Failure at Eleven-Billion Scale in Vision–Language Model Fine-Tuning
Tzuen Su, Li-Hong Guo, Yangmi Su, Cheng-Yen Li · PDF
Can LLMs Perceive Time? An Empirical Investigation
Aniketh Garikaparthi · PDF
Can Multi-Modal LLMs Provide Live Step-by-Step Task Guidance?
Apratim Bhattacharyya, Bicheng Xu, Sanjay Haresh, Reza Pourreza, Litian Liu, Sunny Panchal, Pulkit Madan, Leonid Sigal, Roland Memisevic · PDF
Challenges in Inference-Time Scaling with Uncertainty-Aware Tree Search
Jacopo Minniti, Neil Band, Tim G. J. Rudner · PDF
Does Math Reasoning Improve General LLM Capabilities? Understanding Transferability of LLM Reasoning
Maggie Ziyu Huan, Yuetai Li, Tianyu Zheng, Xiaoyu Xu, Seungone Kim, Minxin Du, Radha Poovendran, Graham Neubig, Xiang Yue · PDF
Don't Retrieve, Generate: Prompting LLMs for Synthetic Training Data in Dense Retrieval
Aarush Sinha · PDF
EoRA: Fine-tuning-free Compensation for Compressed LLM with Eigenspace Low-Rank Approximation
Shih-Yang Liu, Maksim Khadkevich, Nai Chit FUNG, Charbel Sakr, Chao-Han Huck Yang, Chien-Yi Wang, Saurav Muralidharan, Hongxu Yin, Kwang-Ting Cheng, Jan Kautz, Yu-Chiang Frank Wang, Pavlo Molchanov, Min-Hung Chen · PDF
EsoLang-Bench: Evaluating Genuine Reasoning in Large Language Models via Esoteric Programming Languages
Aman Sharma, Paras Chopra · PDF
Evaluating Ill-Defined Tasks in Large Language Models
Yi Zhou, Basel Shbita · PDF
Evaluation-Conditioned Trojan Attack
Zihan Zhu, Hanlin Zhang, Giovanni D'Antonio, Anton Tsitsulin, Sham M. Kakade, Vahab Mirrokni · PDF
Fairness Failure Modes of Multimodal LLMs
Canyu Chen, Anglin Cai, Joan Nwatu, Yale Li, Han Liu, Jessica Hullman, Rada Mihalcea, Kathleen McKeown, Manling Li · PDF
FLUFFINJECTOR: Diagnosing Logical Consistency Failures in Chain-of-Thought Reward Models
Varshith Vijjapu, Krishiv Ray, Archana Vaidheeswaran · PDF
I Can't Believe It Can't Count: Vision-Language Models Fail at Basic Enumeration Beyond the Subitizing Range
Amirhossein Afsharrad, Seyed Shahabeddin Mousavi, Sanjay Lall · PDF
I Can't Believe It's Not Robust: Catastrophic Collapse of Safety Classifiers under Embedding Drift
Subramanyam Sahoo, Vinija Jain, Divya Chaudhary, Aman Chadha · PDF
I Can’t Believe It’s Not Safer: Preference–Safety Disassociation in Clinical LLM Evaluation
Fay Elhassan, David Sasu, Lars Henning Klein, Alexandra V. Kulinkina, Mary-Anne Hartley · PDF
I Can't Believe LLMs Still Can't Write Drama: Multi-Dimensional Failures in Script Continuation
Shijian Ma, Yunqi Huang, Lin Yan · PDF
Improving Proxy Transfer via Intermediate Proxy Tuning
Kevin Kuo, Ayush Sehgal, Robert Pare, Virginia Smith · PDF
Is Evaluation Awareness Just Format Sensitivity? Limitations of Probe-Based Evidence under Controlled Prompt Structure
Viliana Devbunova · PDF
Knowing Is Not Seeing. Limits of Physical Problem Solving in VLMs
Karim Elmaaroufi, Kevin Chon, Justin Svegliato, Lakshya A Agrawal, Matei Zaharia, Sanjit A. Seshia · PDF
Language-Dependent Miscalibration in Multilingual LLM Evaluators
Ej Zhou, Lucas Resck, Zheng Hui, Anna Korhonen · PDF
Learning State-Tracking from Code: REPL Traces and Probabilistic Automata
Julien Siems, Riccardo Grazzi, Kirill Kalinin, Hitesh Ballani, Babak Rahmani · PDF
Limits of Difficulty Scaling: Hard Samples Yield Diminishing Returns in GRPO-Tuned SLMs
Suraj Yadav, Siddharth Yadav, Parth Goyal · PDF
Lost in Translation: Why SOTA LLMs Struggle with French NLU Frontiers
David Beauchemin, Yan Tremblay, Mohamed Amine Youssef, Richard Khoury · PDF
More Than a Quick Glance: Overcoming the Greedy Bias in KV-Cache Compression
Aryan Sood, Tanvi Sharma, Vansh Agrawal · PDF
Non-Monotonicity and Catastrophic Risk of Prompt Interventions in Adversarial LLM Control
Koki Inoue, Naoya Takashima, Hayato Fujihara, Shuya Higuchi, Kota Shimomura, Ryuta Shimogauchi, Takayoshi Yamashita · PDF
Not All Time Is Gregorian: Evaluating LLMs on Cultural Calendar Systems
Deepon Halder, Adish Pandya, Raj Dabre · PDF
One Step Forward, Two Steps Back: Regression Errors and Cost Inefficiencies in LLM Iterative Refinement for Code Generation
Lucas Teixeira Borges, RICARDO RIOS · PDF
Probing and Steering Chain-of-Thought Unfaithfulness in Language Models
Giovanni Maria Occhipinti, Alessandro Abate, Nandi Schoots · PDF
QuanBench Plus: A Unified Multi-Framework Benchmark for LLM-Based Quantum Code Generation
Ali Slim, Haydar Hamieh, Jawad Kotaich, Yehya Ghosn, Mahdi Chehimi, Hasan Abed Al Kader Hammoud, Ammar Mohanna, Bernard Ghanem · PDF
Query Timing Produces Opposite Positional Biases Between LLMs and Humans
Jasin Cekinmez, Addison J. Wu, Thomas L. Griffiths · PDF
Random Is Hard to Beat: Active Selection in Online DPO with Modern LLMs
Giyeong Oh, Junghyun Lee, Jaehyun Park, Youngjae Yu, Wonho Bae, Junhyug Noh · PDF
Retrieval or Representation? Reassessing Benchmark Gaps in Multilingual and Visually Rich RAG
Martin Asenov, Kenza Benkirane, Daniel Goldwater, Aneiss Ghodsi · PDF
Sharpness-Aware Pretraining Mitigates Catastrophic Forgetting
Ishaan Watts, Catherine Li, Sachin Goyal, Jacob Mitchell Springer, Aditi Raghunathan · PDF
Spatial Reasoning is Not a Free Lunch: A Controlled Study on LLaVA
Nahid Alam, Leema Krishna Murali, Siddhant Bharadwaj, Patrick Liu, Timothy Chung, Drishti Sharma, Akshata, Kranthi Kiran GV, Wesley Tam, Bala Krishna S Vegesna · PDF
Style over Substance: LLM-as-a-Judge Fails to Evaluate Multi-Party Social Dialogue
Kunal Samanta, Faisal Tareque Shohan, Amine Trabelsi, Richard Khoury · PDF
Synthetic Error Injection Fails to Elicit Self-Correction In Language Models
David Xing Wu, Shreyas Kapur, Anant Sahai, Stuart Russell · PDF
The $\Psi$ Paradox in Extreme Superposition: When ETF Alignment Does Not Predict Language Model Generalization
Hyunjun Kim · PDF
The Anatomy of Uncertainty in LLMs
Aditya Taparia, Ransalu Senanayake, Kowshik Thopalli, Vivek Narayanaswamy · PDF
The Continuous Space Gap: Why VLMs Fail in Continuous Geometric Reasoning
Yikun Zong, Cheston Tan · PDF
The Cost of Consistency: Why Cross-Plane Contrastive Learning Fails to Bridge the Gap Between MedSAM-3 and nnU-Net
Madhu Shree Aravindan, Aaditi V Bajpai, Ramamoorthy Sriramulu · PDF
The Limits of Long-Context Reasoning in Automated Bug Fixing
Ravi Shanker Raju, Mengmeng Ji, Shubhangi Upasani, Bo Li, Urmish Thakker · PDF
The Low-Frequency Trap: Why Scaling Doesn't Solve Simple Temporal Counting
Sarvesh Baskar, Muhammad R. Islam, Zikui Cai, Ankit Nakhawa, Anirudh Satheesh, Tom Goldstein, Furong Huang · PDF
The Missing Red Line: How Commercial Pressure Erodes AI Safety Boundaries
Nora Petrova, John Burden · PDF
The Selective Safety Trap: How LLMs Scaling and Alignment Fail to Generalize Across Minority Demographics
Iago Alves Brito, Walcy Rios, Julia Soares Dollis, Diogo Fernandes Costa Silva, Arlindo Rodrigues Galvão Filho · PDF
Voice Evaluation of Reasoning Ability: Diagnosing the Modality-Induced Performance Gap
Yueqian Lin, Zhengmian Hu, Qinsi Wang, Yudong Liu, Hengfan Zhang, Jayakumar Subramanian, Nikos Vlassis, Hai Helen Li, Yiran Chen · PDF
When can you TRUST Large Language Models?
Radu Paradovschi, Darvin Yi, Andrew Rabinovich, Zhao Chen · PDF
When Lie Detectors Learn Model Identity: Confounds in Black-Box Sandbagging Detection
Lin Yulong, Pablo Bernabeu-Perez, Benjamin Arnav, Lennie Wells, Mary Phuong · PDF
When Rubrics Backfire: Systematic Preference Drift in LLM Judges
Ruomeng Ding, Yifei Pang, He Sun, Yizhong Wang, Steven Wu, Zhun Deng · PDF
When Stability Fails: Hidden Failure Modes of LLMs in Data-Constrained Scientific Decision-Making
Nazia Riasat · PDF
Why Large Language Models Fail for Hausa Educational Content: Cascading Errors from Translation to Speech to Comprehension
Honour-Jesus Bezaleel, Pearse Jim, Moses Daudu · PDF
