ICML 2026 · Past · Large language models · Theory · Evaluation & benchmarks

ICML 2026 Workshop on Combining Theory and Benchmarks: Towards A Virtuous Cycle to Understand and Guarantee Foundation Model Performance

CTB@ICML 2026

Submission deadline: May 8, 2026, 23:59 AoE (UTC−12) (imported from OpenReview; check the website for extensions)
Submission portal: OpenReview
Notes: Auto-imported from the OpenReview venue record on 2026-06-10; please verify and enrich (topics are keyword-guessed).

Accepted papers (111)

Fetched from OpenReview (v2) on 2026-06-10.

  1. A Benchmarked Diagnostic for Sparse Decomposability of Dense Causal Subspaces

    Socrates Osorio, Joy Zheyun Yang · PDF
  2. A Cognitive Battery for Foundation Models: Theory-Grounded Benchmarks for Attention, Learning, Metacognition, Executive Function, and Social Cognition

    Zacharie Bugaud · PDF
  3. A Controlled Benchmark for Lag-Structured Dependency Motifs

    Bowen Qi · PDF
  4. A Numerical Study of Robustness Verification for Lightning Self-Attention

    Yulia Alexandr, Hao Duan, Guido Montufar · PDF
  5. A Unified Perturbation Framework for Analyzing Leaderboard Stability and Manipulation

    Hosna Oyarhoseini, Jimmy Lin, Amir-Hossein Karimi · PDF
  6. Active probabilistic reasoning in humans and LLMs

    Gonçalo Guiomar, Elia Torre, Pehuen Moure, Victoria Shavina, Mario Giulianelli, Shih-Chii Liu, Valerio Mante · PDF
  7. Aggregate Metrics Hide Shortcut Regimes: A Complexity-Stratified Benchmark for Novel View Synthesis

    Han Lee, Rohan Keyur Dalal, Irene Tang · PDF
  8. AIE-Bench: Benchmarking Agents That Build Agents

    Abhishek Mishra, Selvam Palanimalai, Yogendra Manawat, Samuel Verboomen, Prannay Hebbar, Damir Vrabac, Deepak Nathani, Sumeet Ramesh Motwani, Kunal Bhatia, Vignesh Baskaran · PDF
  9. AMVICC: A Novel Benchmark for Cross-Modal Failure Mode Profiling for VLMs and IGMs

    Pranay Goel, Aahana Basappa, Anusri Karra, Anish Karra, Kevin Zhu, Asa Gilmore · PDF
  10. BARISTA: A Multi-Task Egocentric Benchmark for Compositional Visual Understanding

    Patrick Knab, Orgest Xhelili, Inis Buzi, Drago Andres Guggiana Nilo, Mohd Saquib Khan, Lorenz Kolb, Manuel Scherzer, Kerem Yildirir, Christian Bartelt, Philipp Johannes Schubert · PDF
  11. Bayesian Inference and Decision Audits for Public Archives of Frontier AI Evaluations

    Yanan Long · PDF
  12. Benchmark Scores Rank Methods, Not Capabilities: Theory, Evidence, and Protocols for the Saturation-Collapse Cycle

    Dipam Paul · PDF
  13. Benchmarks Are Not Atomic: Composition-Aware LLM Evaluation using BenchHub

    Eunsu Kim, Haneul Yoo, Guijin Son, Hitesh Laxmichand Patel, Amit Agarwal, Alice Oh · PDF
  14. Beyond Answer Correctness: Measuring and Reducing Explanation Faithfulness Gaps in Chart Understanding VLMs

    Kshitij Dahiya, Dr. Vinay Kumar Saini · PDF
  15. Bounding Compositional Incoherence in Foundation Models

    Anany Kotawala · PDF
  16. Capacity-Gated Forgetting in LoRA Fine-Tuning: Rank, Proximity, and Endogenous Replay in Medical LLMs

    Akanksha Narula, Aaditya Sharma, Dharya Jasuja, Aditya Dhawan · PDF
  17. CellARC: An Oracle-Calibrated Benchmark for Few-Shot Rule Induction

    Miroslav Lžičař · PDF
  18. Certifiable Evaluation: A Low-Rank Framework for Foundation Model Benchmarking with Formal Performance Guarantees

    Siddharth Karuturi, Kaustubh S. Bukkapatnam, Laksh Patel, Tanush Ajay Shastry, Akshath Sharma, Mithil Shah, Matthew Park · PDF
  19. Certified Evaluation for LLMs in Optimization Modeling: From Graph Isomorphism to Formulation Isomorphism

    Zhuohan Wang, Ziwei Zhu, Ziniu Li, Congliang Chen, Zhihang Lin, MingZhe Yang, Yizhou Han, Yufeng Lin, Angyang Gu, Xinglin Hu, Ruoyu Sun, Tian Ding · PDF
  20. Choosing Training-Time Calibration Objectives for Frozen Foundation-Model Features: A Linear-Probing Benchmark

    Heejin Choi · PDF
  21. CLIP Models Generalize Less Than Compositional Benchmarks Suggest

    Shuman Peng, Arnas Uselis, Darina Koishigarina, Martin Ester, Seong Joon Oh · PDF
  22. Collaborative Adaptive Labeling with Imperfect Labelers and Selective Expert Escalation

    Xinrui Ruan, Nanshan Jia, Waverly Wei, Sui Huang, Zhenyu Zhao, Zeyu Zheng, Jingshen Wang · PDF
  23. Combining Theory and Benchmarks for Length Generalisation: Formal Certificates Meet Large-Scale Evaluation

    Zacharie Bugaud · PDF
  24. Conformalized Scaling Laws: Distribution-Free Prediction Intervals for Out-of-Distribution Compute Regimes

    Kaustubh S. Bukkapatnam, Siddharth Karuturi · PDF
  25. Constructing Thunder Korean Benchmark Suite for Reliable Evaluation of Foundation Models

    Yeonkyoung So, Jongmin Kim, Sungmok Jung, Gyuseong Lee, Sangho Kim, Jongyeon Park, Joonhak Lee, Seho Pyo, Gyeongje Cho, Seorin Kim, Jisoo Kim, Suyoung Park, Hyunji M. Park, Yelim Ahn, Yeongho Seo, Jaejin Lee · PDF
  26. Context Over Content: Exposing Evaluation Faking in Automated Judges

    Manan Gupta · PDF
  27. Context Saturation in Zero-Shot Time-Series Foundation Models

    Miguel Nogales, Luca Butera, Alberto Ferrante, Cesare Alippi · PDF
  28. Contextual Observability and Grammar Singularity for Compositional Task Families

    Manoj Saravanan, Rohit Kumar Salla, Shrikar Reddy Kota · PDF
  29. ContinuityBench: A Framework and Taxonomy for Evaluating Agent Recovery from Interrupted State

    Aryan Gulati · PDF
  30. Correcting Mean Bias in Text Embeddings: A Refined Renormalization with Training-Free Improvements on MMTEB

    Xingyu Ren, Youran Sun, Haoyu Liang · PDF
  31. Correcting Optimizer Selection Bias via Large Deviation Hazards

    Andrea Zerio, Andres R Masegosa · PDF
  32. Correcting Prompt Dependence in LLM Benchmarks: A Bayesian Hierarchical Model with Embedding-Space Clustering

    Mary Llewellyn, Isobel Thornton, James Bishop, Annie Gray · PDF
  33. Cracks in the Foundation: Seemingly Minor Architectural Choices Impact Long Context Extension

    Amanda Bertsch, Luca Soldaini, Matthew R. Gormley, Graham Neubig, Hannaneh Hajishirzi, Kyle Lo, Dirk Groeneveld · PDF
  34. Cross-Language Evaluation of Prompt Inversion: Similarity Metrics, Decoding Strategies, and Prefix Sensitivity in Japanese and English

    Yusei Kitamura, Ahmad Akmal Aminuddin Mohd Kamal, Masaya Fujisawa · PDF
  35. DeflectBench: A Benchmark for Evaluating Rhetorical Fallacy Generation in LLMs

    Art Kanke · PDF
  36. Diagnosing LLM Judge Reliability: Conformal Prediction Sets and Transitivity Violations

    Manan Gupta, Dhruv Kumar · PDF
  37. EditCLEVR: A Paired-Scene Intervention Benchmark for Compositional Faithfulness of Object-Centric Representations

    Anuraag Gadehothur Karnam, Tarunesh Sathish · PDF
  38. Efficient Safety Benchmarking via Item Response Theory

    Fabio Spagliardi, Mírian Silva, Ayan Datta, Aiden Zhou, Vamshi Krishna Bonagiri, Diogo Cruz · PDF
  39. Ensuring Calibration Robustness in Split Conformal Prediction Under Adversarial Attacks

    Xunlei Qian, Yue Xing · PDF
  40. Estimating Pass@$k$ from Fewer Samples with Hierarchical Bayesian Priors

    Alexandre Verine, Florian Le Bronnec, Benjamin Negrevergne, Alexandre Allauzen · PDF
  41. Evaluating LLM Reasoning on Operating System Algorithms via Step-Level Verification

    Jalluri Mahesh Kumar, Junjunoori Sri Chakri, Yash Kothari, Murari Mandal, Yash Sinha, Dhruv Kumar · PDF
  42. Evaluator Failure Modes in Agentic Uncertainty Quantification

    Suresh Raghu, Satwik Pandey, Shashwat Pandey · PDF
  43. Executable Ground Truth: A Closed-Loop Benchmark for Evaluating LLM Agents on Microservice Incident Remediation

    Dhatri C, Tadisetty Sai Yashwanth · PDF
  44. Fast Inference via Hierarchical Speculative Decoding

    Clara Mohri, Amir Globerson, Haim Kaplan, Yishay Mansour, Tal Schuster · PDF
  45. Feedforward Mixing is as Sharp as it is Slow in Reverse

    Benedict Aaron Tjandra, Avi Wigderson, João G. M. Araújo, Alex Vitvitskyi, Federico Barbero, Petar Veličković · PDF
  46. FormalImG: Evaluating Structural Compositional Generalization for T2I Models

    Hong-Jie You, Jie-Jing Shao, Xiao-Wen Yang, Zhi-Fan Wu, Lin-Han Jia, Lan-Zhe Guo, Yu-Feng Li · PDF
  47. FRAME: Framework for Robotic Action and Motion Evaluation

    Ameya Wagh, Vishnu Rudrasamudram · PDF
  48. From Accuracy to Visual Dependence: Auditing and Filtering Modality Collapse in Traffic VideoQA

    Sena Korkut, Maria Alejandra Bravo, Sanghwan Kim, Zeynep Akata · PDF
  49. From Forecast Scores to Auditable Benchmarks: WorldFork for LLM Forecasting Evaluation

    Hanson Wen, James Gui · PDF
  50. From Theory to Decision Rule: Calibrating the Noisy-Label Crossover for VLM Weak Supervision Across Three Medical-Imaging Benchmarks

    Bruce Changlong Xu, Jose K. James, Alexander J Ryu · PDF
  51. Functional Subspace, where language models can use vector algebra to solve problems

    Jung H. Lee, Sujith Vijayan · PDF
  52. Fuzzy-Clustered Mixture-of-Experts with Relational Regularization for Interpretable Subgroup Modeling under Data Scarcity

    Chien-Hung Lai, Yuh-Shyan Hwang, Yi Lin · PDF
  53. GapPO: Gradient-Adaptive Pairwise Preference Optimization

    Michelle Chang, Xiaodi Sun, Ethan C. Chau, Zhaoqiong Huang, Arpita Das, Izzie Lau, Liyuan Zheng, Huancheng Chen, Jingwen Lu · PDF
  54. Generalized Priority-Aware Shapley Value

    Kiljae Lee, Ziqi Liu, Weijing Tang, Yuan Zhang · PDF
  55. Generative vs Discriminative? Revisiting the shortcut learning debate in text classification

    Siva Rajesh Kasa, Karthik Raavi, Sumegh Roychowdhury, Pattisapu Nikhil Priyatam, Ashutosh Kumar, Yaswanth Biruduraju, Santhosh Kumar Kasa, Ankith M S, Sumit Negi · PDF
  56. GraphStateEval: A Step-by-Step Evaluation Framework for Graph Algorithm Execution in Large Language Models via Intermediate State Tracing

    Kanav Kapoor, Dhruv Kumar, Jagat Sesh Challa, Murari Mandal, Yash Sinha · PDF
  57. Hidden in Plain Sight: Benchmarking Agent Safety Against Decomposition Attacks with DecompBench

    Vikhyath Kothamasu, Virginia Smith, Chhavi Yadav · PDF
  58. Hidden Sensitivity in Spatial Reasoning Evaluation: Diagnosis and Re-ranking with VSI-Bench

    Phillip Y. Lee, Jin Yoo, Minseo Kim, Leonidas Guibas, Minhyuk Sung · PDF
  59. Honeyval: A Comprehensive Evaluation Framework for LLM-powered HTTP Honeypots

    Mark Vero, Fabian Kaczmarczyck, Ivan Petrov, Ilia Shumailov, Niels Heinen, Jamie Hayes, Tianqi Fan, Luca Invernizzi, Martin Vechev · PDF
  60. How good is your harness?

    Jiwoo Han, Yuekai Sun · PDF
  61. How long is a piece of string? A brief empirical analysis of tokenizers

    Jonathan Roberts, Kai Han, Samuel Albanie · PDF
  62. Identifying Efficient Queries for Black-Box Model Classification

    Merrick Ohata, Carey Priebe, Hayden Helm · PDF
  63. Instance-Optimal Estimation with Multiple LLM Judges on a Budget

    Junghyun Lee, Sanghwa Kim, Yassir Jedra, Alexandre Proutiere, Se-Young Yun · PDF
  64. Instruction Bleed: A Theory-Anchored Benchmark for Cross-Module Interference in Prompt-Composed Agents

    Ching-Yu Lin, Yifan Liu · PDF
  65. Interactive Evaluation Requires a Design Science

    Keyang Xuan, Peiyang Song, Pan Lu, Pengrui Han, Wenkai Li, Zhenyu Zhang, Zexue He, Wenyue Hua, Manling Li, Jiaxuan You, Adrian Weller, Yizhong Wang, Jiaxin Pei · PDF
  66. Internal Data Repetition Destroys Language Models

    Jessica Chudnovsky, Joshua Kazdan, Noam Itzhak Levi, Rylan Schaeffer, Yegor Denisov-Blanch, Sanmi Koyejo, David L. Donoho · PDF
  67. Latent Phase-Shift Rollback: Inference-Time Error Correction via Residual Stream Monitoring and KV-Cache Steering

    Manan Gupta, Dhruv Kumar · PDF
  68. LoopNav: Benchmarking Spatial Consistency in World Models

    Kewei Lian, Shaofei Cai, Yitao Liang, Anji Liu · PDF
  69. m2sv: A Scalable Benchmark for Map-to-Street-View Spatial Reasoning

    Yosub Shin, Michael Buriek, Igor Molybog · PDF
  70. Measuring the Limits of Continual Learning for LLMs

    Nimit Kalra, Narutatsu Ri, Zerzar Bukhari, Ang Li, Sanae Lotfi, Liam H Fowl, Micah Goldblum · PDF
  71. Model Capacity Determines Grokking through Competing Memorisation and Generalisation Speeds

    YIDING SONG, Hanming Ye · PDF
  72. MultiVulnBench: A Large-Scale Benchmark for Count Bias in LLM-Based Multi-Vulnerability Detection

    Manan Gupta, Chinmay Pushkar, Sanchit Kabra, Dhruv Kumar, Jagat Sesh Challa · PDF
  73. Null-Calibrated Evaluation of Sparse Autoencoder Decoder Reproducibility

    Bright Liu · PDF
  74. On Cost-Effective LLM-as-a-Judge Improvement Techniques

    Ryan Lail, Luke Markham · PDF
  75. On the Rotation-Equivariance Geometry of Tabular Foundation Models

    Mert Ogul · PDF
  76. Operads for compositional reasoning in LLMs

    Nathaniel Bottman, Kyle Richardson · PDF
  77. Perplexity Cannot Always Tell Right from Wrong

    Petar Veličković, Federico Barbero, Christos Perivolaropoulos, Simon Osindero, Razvan Pascanu · PDF
  78. Pre-Registering the Detectable Effect: A Paired-MDE Budget for 4-bit Quantization Benchmarks, with a Pilot Audit

    Zexin Zhuang, Yanhang Li, Zhichao Fan · PDF
  79. Probabilistic Chain-of-Thought: Sequential Bayesian Inference over Latent Reasoning Correctness

    Suriya Dev Saravanakumar, Ezra Matiwos Wesenie, Kishore Nuthalapati, Laksh Patel · PDF
  80. Prompt Optimization Is a Coin Flip: Diagnosing When It Helps in Compound AI Systems

    Xing Zhang, Guanghui Wang, Yanwei CUI, Qucy Wei Qiu, Ziyuan Li, Bing Zhu, Peiyang He · PDF
  81. PromptSplit: Revealing Prompt-Level Disagreement in Generative Models

    Mehdi Lotfian, Mohammad Jalali, Farzan Farnia · PDF
  82. Quantifying Empirical Compute-Supervision Tradeoffs in RLVR

    Ryo Mitsuhashi, Patrick Chen, Isabelle Tseng, Jasin Cekinmez, Addison J. Wu · PDF
  83. Rethinking FID Through the Geometry of the Reference Dataset

    Yunghee Lee, Byeonghyun Pak · PDF
  84. Rethinking LLM Confidence: From Calibration to Coherence

    Krish Matta, Atharv Naphade, Andy Zou · PDF
  85. Rethinking Psychometric Evaluation of LLMs: When and Why Self-Reports Predict Behavior

    Rafal Kocielnik, Pengrui Han, Peiyang Song, Myrl G Marmarelis, Ramit Debnath, Dean Mobbs, Anima Anandkumar, R. Michael Alvarez · PDF
  86. Retrieval Dwelling: A Principled Sampling Strategy for Exploiting Spurious State Exploration

    Rohit Sinha, Saroj Kumar · PDF
  87. SafetyRepro: Configuration-Conditional Rank Instability on Alignment Benchmarks

    Yanhang Li, Zhichao Fan, Zexin Zhuang · PDF
  88. Scale Dependent Data Duplication

    Joshua Kazdan, Noam Itzhak Levi, Rylan Schaeffer, Jessica Chudnovsky, Abhay Puri, Bo He, Mehmet Donmez, Sanmi Koyejo, David L. Donoho · PDF
  89. Selective Perturbations as a Diagnostic for Benchmark-Based LLM Comparisons

    Ivan Dubrovsky, Anastasia Orlova, Nina Gubina, Illarion Iov, Irena Gureeva, Nikolay Nikitin, Alexey Zaytsev · PDF
  90. SemanticSRJudge: Spatially-Grounded VLM Evaluation for Super-Resolution Quality Assessment

    Vishwajeet Shukla, Ankit Dhankhar, Ajay Bedi · PDF
  91. ShiftBench: A Benchmark for Per-Cohort Certify-or-Abstain Decisions on Positive Predictive Value Under Covariate Shift

    Ananya Salian · PDF
  92. Simulating Field Experiments for Method Testing

    Enoch H. Kang · PDF
  93. Spectral Signatures of Large Language Models

    Zhuoying Zhang, Ishan Verma Prasad, Zihang Liu, Yuanzhe Hu, Hengrui Luo, Pu Ren, Yaoqing Yang · PDF
  94. Stress Tests REVEAL Fragile Temporal and Visual Grounding in Video-Language Models

    Sethuraman T V, Savya Khosla, Aditi Tiwari, Vidya Ganesh, Rakshana Jayaprakash, Aditya Jain, Vignesh Srinivasakumar, Onkar Kishor Susladkar, Joey Wang, Srinidhi Sunkara, Aditya Shanmugham, Abbaas Alif Mohamed Nishar, Rakesh Vaideeswaran, Simon Jenni, Derek Hoiem · PDF
  95. Stress-Testing Neural Network Verifiers with Provably Robust Instances

    David Troxell, Yulia Alexandr, Sofia Hunt, Stephanie Lei, Guido Montufar · PDF
  96. Style Conventions Override Performance Predictions in Coding LLMs

    Matthew Kotzbauer · PDF
  97. Symmetries of Functional Processes under Label Noise

    Abhra Chaudhuri, Pedro Gomes · PDF
  98. Task-Aligned Stability Analysis of Vision-Language Models for Autonomous Driving Hazard Detection

    Everett Richards · PDF
  99. The Complexity Ceiling Benchmark: A Multi-Domain Evaluation of Sequential Reasoning Under Depth Scaling

    Shubh Chapra, Dhruv Kumar, Murari Mandal, Yash Sinha · PDF
  100. The Propagation Field: A Geometric Substrate Theory of Deep Learning

    Xingrui Gu · PDF
  101. The Shape of Noise: Layer-Wise Perturbation Profiles for Diagnosing Vision Robustness

    Son Nguyen, Gia-Bao Vu, Quang Minh Phan, Trong P. Le · PDF
  102. Theory-Grounded Evaluation Exposes the Authorship Gap in LLM Personalization

    Yash Ganpat Sawant · PDF
  103. Toward Trustworthy LLM–GNN Fusion: A Fusion-Aware Evaluation and Reporting Framework

    Zhifei Hu, Alexandra I. Cristea · PDF
  104. Trace-Aware Routing for Cost-Effective Human–AI Collaborative Labeling

    Waverly Wei, Xinrui Ruan, Zhenyu Zhao, Sui Huang, Zeyu Zheng, Jingshen Wang · PDF
  105. Universality, Composition Generalization, and Algorithm Emulation All In-Context

    Jerry Yao-Chieh Hu, Hong-Yu Chen, Po-Chiao Lin, Maojiang Su, Han Liu · PDF
  106. Uplifting Human Decision Making in AI Evaluation by Automating Benchmark Validity Analysis

    Rodolfo Corona, Sang T. Truong, Ritwik Gupta, Nhi Ngoc Truong, Atnafu Lambebo Tonja, Mena Attia, Fahim Faisal, Kaushal Kumar Maurya, Fred Philippy, Belu Ticona, Sumaya Nur Adan, Fazl Barez, Omar Florez, Supheakmungkol Sarin, Aseem Srivastava, Xiaoyuan Yi, Nick Haber, Dan Klein, Thamar Solorio, Xing Xie, Sanmi Koyejo, Robert Trager · PDF
  107. When Agreement Becomes Unsafe: Loss-Aware Energy Control for Diagnostic Deliberation

    Yuting Yan, Yinghao Fu, Haozhou Gao, Tianjian Zhang, Aoxi Liu, Shuang Li · PDF
  108. When Does Polynomial Attention Concentrate? A Relative-Margin Diagnostic for Zero-Shot Softmax Substitution

    Sanny Kim · PDF
  109. Where’s the Plan? Locating Latent Planning in Language Models with Lightweight Mechanistic Interventions

    Nicole H. Ma, Nick Rui · PDF
  110. YC-Bench: Benchmarking AI Agents for Long-Term Planning and Consistent Execution

    Muyu He, Vincent Tu, Adit Jain, Anand Kumar, Sachin Patro, Soumyadeep Bakshi, Nazneen Rajani · PDF
  111. You're reading LLM leaderboards wrong: Disentangling models from pipelines in engineering benchmarks

    Marius Tacke, Shivam Suri, Matthias Busch, Mahish K. Guru, Christian J Cyron, Roland Aydin · PDF