
I Can't Believe It's Not Better: Where Large Language Models Need to Improve

ICLR 2026 Workshop ICBINB

Submission deadline: Jan 31, 2026, 23:59 AoE (UTC−12) (imported from OpenReview; check the website for extensions)
Submission portal: OpenReview
Notes: Auto-imported from the OpenReview venue record on 2026-06-10; please verify and enrich (topics are keyword-guessed).

Accepted papers (56)

Fetched from OpenReview (v2) on 2026-06-10.

  1. A Pilot Study on Doubt Robustness of LLMs in Clinical Prediction Explanation

    Juhwan Choi, Sangchul Hahn, Eunho Yang · PDF
  2. AI-rithmetic

    Alex Bie, Travis Dick, Alex Kulesza, Prabhakar Raghavan, Vinod Raman, Sergei Vassilvitskii · PDF
  3. Attention Sinks as Internal Signals for Hallucination Detection in Large Language Models

    Jakub Binkowski, Kamil Adamczewski, Tomasz Jan Kajdanowicz · PDF
  4. Barriers to Pareto Steerability in Preference-Conditioned LLM Alignment

    Fatemeh Nourzad, Daouda Sow, Yingbin Liang, Ming Shi, Ming Zhang, Yunxuan Li, Eylem Ekici, Ness Shroff · PDF
  5. Beyond Continuity: Challenges of Context Switching in Multi-Turn Dialogue with LLMs

    Aditya Sinha, Harald Steck, Vito Claudio Ostuni, Matteo Rinaldi · PDF
  6. Beyond Suffixes: Token Position in GCG Adversarial Attacks on Large Language Models

    Hicham Eddoubi, Umar Faruk Abdullahi, Fadi Hassan · PDF
  7. Bigger Is Not Better Under Differential Privacy: Optimization Failure at Eleven-Billion Scale in Vision–Language Model Fine-Tuning

    Tzuen Su, Li-Hong Guo, Yangmi Su, Cheng-Yen Li · PDF
  8. Can LLMs Perceive Time? An Empirical Investigation

    Aniketh Garikaparthi · PDF
  9. Can Multi-Modal LLMs Provide Live Step-by-Step Task Guidance?

    Apratim Bhattacharyya, Bicheng Xu, Sanjay Haresh, Reza Pourreza, Litian Liu, Sunny Panchal, Pulkit Madan, Leonid Sigal, Roland Memisevic · PDF
  10. Challenges in Inference-Time Scaling with Uncertainty-Aware Tree Search

    Jacopo Minniti, Neil Band, Tim G. J. Rudner · PDF
  11. Does Math Reasoning Improve General LLM Capabilities? Understanding Transferability of LLM Reasoning

    Maggie Ziyu Huan, Yuetai Li, Tianyu Zheng, Xiaoyu Xu, Seungone Kim, Minxin Du, Radha Poovendran, Graham Neubig, Xiang Yue · PDF
  12. Don't Retrieve, Generate: Prompting LLMs for Synthetic Training Data in Dense Retrieval

    Aarush Sinha · PDF
  13. EoRA: Fine-tuning-free Compensation for Compressed LLM with Eigenspace Low-Rank Approximation

    Shih-Yang Liu, Maksim Khadkevich, Nai Chit FUNG, Charbel Sakr, Chao-Han Huck Yang, Chien-Yi Wang, Saurav Muralidharan, Hongxu Yin, Kwang-Ting Cheng, Jan Kautz, Yu-Chiang Frank Wang, Pavlo Molchanov, Min-Hung Chen · PDF
  14. EsoLang-Bench: Evaluating Genuine Reasoning in Large Language Models via Esoteric Programming Languages

    Aman Sharma, Paras Chopra · PDF
  15. Evaluating Ill-Defined Tasks in Large Language Models

    Yi Zhou, Basel Shbita · PDF
  16. Evaluation-Conditioned Trojan Attack

    Zihan Zhu, Hanlin Zhang, Giovanni D'Antonio, Anton Tsitsulin, Sham M. Kakade, Vahab Mirrokni · PDF
  17. Fairness Failure Modes of Multimodal LLMs

    Canyu Chen, Anglin Cai, Joan Nwatu, Yale Li, Han Liu, Jessica Hullman, Rada Mihalcea, Kathleen McKeown, Manling Li · PDF
  18. FLUFFINJECTOR: Diagnosing Logical Consistency Failures in Chain-of-Thought Reward Models

    Varshith Vijjapu, Krishiv Ray, Archana Vaidheeswaran · PDF
  19. I Can't Believe It Can't Count: Vision-Language Models Fail at Basic Enumeration Beyond the Subitizing Range

    Amirhossein Afsharrad, Seyed Shahabeddin Mousavi, Sanjay Lall · PDF
  20. I Can't Believe It's Not Robust: Catastrophic Collapse of Safety Classifiers under Embedding Drift

    Subramanyam Sahoo, Vinija Jain, Divya Chaudhary, Aman Chadha · PDF
  21. I Can’t Believe It’s Not Safer: Preference–Safety Disassociation in Clinical LLM Evaluation

    Fay Elhassan, David Sasu, Lars Henning Klein, Alexandra V. Kulinkina, Mary-Anne Hartley · PDF
  22. I Can't Believe LLMs Still Can't Write Drama: Multi-Dimensional Failures in Script Continuation

    Shijian Ma, Yunqi Huang, Lin Yan · PDF
  23. Improving Proxy Transfer via Intermediate Proxy Tuning

    Kevin Kuo, Ayush Sehgal, Robert Pare, Virginia Smith · PDF
  24. Is Evaluation Awareness Just Format Sensitivity? Limitations of Probe-Based Evidence under Controlled Prompt Structure

    Viliana Devbunova · PDF
  25. Knowing Is Not Seeing. Limits of Physical Problem Solving in VLMs

    Karim Elmaaroufi, Kevin Chon, Justin Svegliato, Lakshya A Agrawal, Matei Zaharia, Sanjit A. Seshia · PDF
  26. Language-Dependent Miscalibration in Multilingual LLM Evaluators

    Ej Zhou, Lucas Resck, Zheng Hui, Anna Korhonen · PDF
  27. Learning State-Tracking from Code: REPL Traces and Probabilistic Automata

    Julien Siems, Riccardo Grazzi, Kirill Kalinin, Hitesh Ballani, Babak Rahmani · PDF
  28. Limits of Difficulty Scaling: Hard Samples Yield Diminishing Returns in GRPO-Tuned SLMs

    Suraj Yadav, Siddharth Yadav, Parth Goyal · PDF
  29. Lost in Translation: Why SOTA LLMs Struggle with French NLU Frontiers

    David Beauchemin, Yan Tremblay, Mohamed Amine Youssef, Richard Khoury · PDF
  30. More Than a Quick Glance: Overcoming the Greedy Bias in KV-Cache Compression

    Aryan Sood, Tanvi Sharma, Vansh Agrawal · PDF
  31. Non-Monotonicity and Catastrophic Risk of Prompt Interventions in Adversarial LLM Control

    Koki Inoue, Naoya Takashima, Hayato Fujihara, Shuya Higuchi, Kota Shimomura, Ryuta Shimogauchi, Takayoshi Yamashita · PDF
  32. Not All Time Is Gregorian: Evaluating LLMs on Cultural Calendar Systems

    Deepon Halder, Adish Pandya, Raj Dabre · PDF
  33. One Step Forward, Two Steps Back: Regression Errors and Cost Inefficiencies in LLM Iterative Refinement for Code Generation

    Lucas Teixeira Borges, Ricardo Rios · PDF
  34. Probing and Steering Chain-of-Thought Unfaithfulness in Language Models

    Giovanni Maria Occhipinti, Alessandro Abate, Nandi Schoots · PDF
  35. QuanBench Plus: A Unified Multi-Framework Benchmark for LLM-Based Quantum Code Generation

    Ali Slim, Haydar Hamieh, Jawad Kotaich, Yehya Ghosn, Mahdi Chehimi, Hasan Abed Al Kader Hammoud, Ammar Mohanna, Bernard Ghanem · PDF
  36. Query Timing Produces Opposite Positional Biases Between LLMs and Humans

    Jasin Cekinmez, Addison J. Wu, Thomas L. Griffiths · PDF
  37. Random Is Hard to Beat: Active Selection in Online DPO with Modern LLMs

    Giyeong Oh, Junghyun Lee, Jaehyun Park, Youngjae Yu, Wonho Bae, Junhyug Noh · PDF
  38. Retrieval or Representation? Reassessing Benchmark Gaps in Multilingual and Visually Rich RAG

    Martin Asenov, Kenza Benkirane, Daniel Goldwater, Aneiss Ghodsi · PDF
  39. Sharpness-Aware Pretraining Mitigates Catastrophic Forgetting

    Ishaan Watts, Catherine Li, Sachin Goyal, Jacob Mitchell Springer, Aditi Raghunathan · PDF
  40. Spatial Reasoning is Not a Free Lunch: A Controlled Study on LLaVA

    Nahid Alam, Leema Krishna Murali, Siddhant Bharadwaj, Patrick Liu, Timothy Chung, Drishti Sharma, Akshata, Kranthi Kiran GV, Wesley Tam, Bala Krishna S Vegesna · PDF
  41. Style over Substance: LLM-as-a-Judge Fails to Evaluate Multi-Party Social Dialogue

    Kunal Samanta, Faisal Tareque Shohan, Amine Trabelsi, Richard Khoury · PDF
  42. Synthetic Error Injection Fails to Elicit Self-Correction In Language Models

    David Xing Wu, Shreyas Kapur, Anant Sahai, Stuart Russell · PDF
  43. The $\Psi$ Paradox in Extreme Superposition: When ETF Alignment Does Not Predict Language Model Generalization

    Hyunjun Kim · PDF
  44. The Anatomy of Uncertainty in LLMs

    Aditya Taparia, Ransalu Senanayake, Kowshik Thopalli, Vivek Narayanaswamy · PDF
  45. The Continuous Space Gap: Why VLMs Fail in Continuous Geometric Reasoning

    Yikun Zong, Cheston Tan · PDF
  46. The Cost of Consistency: Why Cross-Plane Contrastive Learning Fails to Bridge the Gap Between MedSAM-3 and nnU-Net

    Madhu Shree Aravindan, Aaditi V Bajpai, Ramamoorthy Sriramulu · PDF
  47. The Limits of Long-Context Reasoning in Automated Bug Fixing

    Ravi Shanker Raju, Mengmeng Ji, Shubhangi Upasani, Bo Li, Urmish Thakker · PDF
  48. The Low-Frequency Trap: Why Scaling Doesn't Solve Simple Temporal Counting

    Sarvesh Baskar, Muhammad R. Islam, Zikui Cai, Ankit Nakhawa, Anirudh Satheesh, Tom Goldstein, Furong Huang · PDF
  49. The Missing Red Line: How Commercial Pressure Erodes AI Safety Boundaries

    Nora Petrova, John Burden · PDF
  50. The Selective Safety Trap: How LLMs Scaling and Alignment Fail to Generalize Across Minority Demographics

    Iago Alves Brito, Walcy Rios, Julia Soares Dollis, Diogo Fernandes Costa Silva, Arlindo Rodrigues Galvão Filho · PDF
  51. Voice Evaluation of Reasoning Ability: Diagnosing the Modality-Induced Performance Gap

    Yueqian Lin, Zhengmian Hu, Qinsi Wang, Yudong Liu, Hengfan Zhang, Jayakumar Subramanian, Nikos Vlassis, Hai Helen Li, Yiran Chen · PDF
  52. When can you TRUST Large Language Models?

    Radu Paradovschi, Darvin Yi, Andrew Rabinovich, Zhao Chen · PDF
  53. When Lie Detectors Learn Model Identity: Confounds in Black-Box Sandbagging Detection

    Lin Yulong, Pablo Bernabeu-Perez, Benjamin Arnav, Lennie Wells, Mary Phuong · PDF
  54. When Rubrics Backfire: Systematic Preference Drift in LLM Judges

    Ruomeng Ding, Yifei Pang, He Sun, Yizhong Wang, Steven Wu, Zhun Deng · PDF
  55. When Stability Fails: Hidden Failure Modes of LLMs in Data-Constrained Scientific Decision-Making

    Nazia Riasat · PDF
  56. Why Large Language Models Fail for Hausa Educational Content: Cascading Errors from Translation to Speech to Comprehension

    Honour-Jesus Bezaleel, Pearse Jim, Moses Daudu · PDF