I Can't Believe It's Not Better: Where Large Language Models Need to Improve
ICLR 2026 Workshop ICBINB
- Submission deadline: Jan 31, 2026, 23:59 AoE (UTC−12). Imported from OpenReview; check the website for extensions.
- Submission portal: OpenReview
- Notes: Auto-imported from the OpenReview venue record on 2026-06-10; please verify and enrich (topics are keyword-guessed).
Accepted papers (56)
Fetched from OpenReview (v2) on 2026-06-10.
- A Pilot Study on Doubt Robustness of LLMs in Clinical Prediction Explanation
- AI-rithmetic
- Attention Sinks as Internal Signals for Hallucination Detection in Large Language Models
- Barriers to Pareto Steerability in Preference-Conditioned LLM Alignment
- Beyond Continuity: Challenges of Context Switching in Multi-Turn Dialogue with LLMs
- Beyond Suffixes: Token Position in GCG Adversarial Attacks on Large Language Models
- Bigger Is Not Better Under Differential Privacy: Optimization Failure at Eleven-Billion Scale in Vision–Language Model Fine-Tuning
- Can LLMs Perceive Time? An Empirical Investigation
- Can Multi-Modal LLMs Provide Live Step-by-Step Task Guidance?
- Challenges in Inference-Time Scaling with Uncertainty-Aware Tree Search
- Does Math Reasoning Improve General LLM Capabilities? Understanding Transferability of LLM Reasoning
- Don't Retrieve, Generate: Prompting LLMs for Synthetic Training Data in Dense Retrieval
- EoRA: Fine-tuning-free Compensation for Compressed LLM with Eigenspace Low-Rank Approximation
- EsoLang-Bench: Evaluating Genuine Reasoning in Large Language Models via Esoteric Programming Languages
- Evaluating Ill-Defined Tasks in Large Language Models
- Evaluation-Conditioned Trojan Attack
- Fairness Failure Modes of Multimodal LLMs
- FLUFFINJECTOR: DIAGNOSING LOGICAL CONSISTENCY FAILURES IN CHAIN-OF-THOUGHT REWARD MODELS
- I Can't Believe It Can't Count: Vision-Language Models Fail at Basic Enumeration Beyond the Subitizing Range
- I Can't Believe It's Not Robust: Catastrophic Collapse of Safety Classifiers under Embedding Drift
- I Can’t Believe It’s Not Safer: Preference–Safety Disassociation in Clinical LLM Evaluation
- I Can't Believe LLMs Still Can't Write Drama: Multi-Dimensional Failures in Script Continuation
- Improving Proxy Transfer via Intermediate Proxy Tuning
- Is Evaluation Awareness Just Format Sensitivity? Limitations of Probe-Based Evidence under Controlled Prompt Structure
- Knowing Is Not Seeing. Limits of Physical Problem Solving in VLMs
- Language-Dependent Miscalibration in Multilingual LLM Evaluators
- Learning State-Tracking from Code: REPL Traces and Probabilistic Automata
- Limits of Difficulty Scaling: Hard Samples Yield Diminishing Returns in GRPO-Tuned SLMs
- Lost in Translation: Why SOTA LLMs Struggle with French NLU Frontiers
- More Than a Quick Glance: Overcoming the Greedy Bias in KV-Cache Compression
- NON-MONOTONICITY AND CATASTROPHIC RISK OF PROMPT INTERVENTIONS IN ADVERSARIAL LLM CONTROL
- Not All Time Is Gregorian: Evaluating LLMs on Cultural Calendar Systems
- One Step Forward, Two Steps Back: Regression Errors and Cost Inefficiencies in LLM Iterative Refinement for Code Generation
- Probing and Steering Chain-of-Thought Unfaithfulness in Language Models
- QuanBench Plus: A Unified Multi-Framework Benchmark for LLM-Based Quantum Code Generation
- Query Timing Produces Opposite Positional Biases Between LLMs and Humans
- Random Is Hard to Beat: Active Selection in Online DPO with Modern LLMs
- Retrieval or Representation? Reassessing Benchmark Gaps in Multilingual and Visually Rich RAG
- Sharpness-Aware Pretraining Mitigates Catastrophic Forgetting
- Spatial Reasoning is Not a Free Lunch: A Controlled Study on LLaVA
- Style over Substance: LLM-as-a-Judge Fails to Evaluate Multi-Party Social Dialogue
- Synthetic Error Injection Fails to Elicit Self-Correction In Language Models
- The $\Psi$ Paradox in Extreme Superposition: When ETF Alignment Does Not Predict Language Model Generalization
- The Anatomy of Uncertainty in LLMs
- The Continuous Space Gap: Why VLMs Fail in Continuous Geometric Reasoning
- The Cost of Consistency: Why Cross-Plane Contrastive Learning Fails to Bridge the Gap Between MedSAM-3 and nnU-Net
- The Limits of Long-Context Reasoning in Automated Bug Fixing
- The Low-Frequency Trap: Why Scaling Doesn't Solve Simple Temporal Counting
- The Missing Red Line: How Commercial Pressure Erodes AI Safety Boundaries
- The Selective Safety Trap: How LLMs Scaling and Alignment Fail to Generalize Across Minority Demographics
- Voice Evaluation of Reasoning Ability: Diagnosing the Modality-Induced Performance Gap
- When can you TRUST Large Language Models?
- When Lie Detectors Learn Model Identity: Confounds in Black-Box Sandbagging Detection
- When Rubrics Backfire: Systematic Preference Drift in LLM Judges
- WHEN STABILITY FAILS: HIDDEN FAILURE MODES OF LLMS IN DATA-CONSTRAINED SCIENTIFIC DECISION-MAKING
- Why Large Language Models Fail for Hausa Educational Content: Cascading Errors from Translation to Speech to Comprehension