ICML 2026 · Past · Large language models · Theory · Evaluation & benchmarks

ICML 2026 Workshop on Combining Theory and Benchmarks: Towards A Virtuous Cycle to Understand and Guarantee Foundation Model Performance

CTB@ICML 2026


Submission deadline: May 9, 2026, 11:59 UTC
Deadline synced from OpenReview: 2026-05-09 11:59 UTC (as of 2026-06-23). Extensions on OpenReview are applied automatically; verify on the website.
Submission portal: OpenReview
Notes: Topics were auto-suggested and may be imprecise; edits welcome.

Accepted papers (111)

Fetched from OpenReview (v2) on 2026-06-10.

A Benchmarked Diagnostic for Sparse Decomposability of Dense Causal Subspaces
Socrates Osorio, Joy Zheyun Yang · PDF
A Cognitive Battery for Foundation Models: Theory-Grounded Benchmarks for Attention, Learning, Metacognition, Executive Function, and Social Cognition
Zacharie Bugaud · PDF
A Controlled Benchmark for Lag-Structured Dependency Motifs
Bowen Qi · PDF
A Numerical Study of Robustness Verification for Lightning Self-Attention
Yulia Alexandr, Hao Duan, Guido Montufar · PDF
A Unified Perturbation Framework for Analyzing Leaderboard Stability and Manipulation
Hosna Oyarhoseini, Jimmy Lin, Amir-Hossein Karimi · PDF
Active probabilistic reasoning in humans and LLMs
Gonçalo Guiomar, Elia Torre, Pehuen Moure, Victoria Shavina, Mario Giulianelli, Shih-Chii Liu, Valerio Mante · PDF
Aggregate Metrics Hide Shortcut Regimes: A Complexity-Stratified Benchmark for Novel View Synthesis
Han Lee, Rohan Keyur Dalal, Irene Tang · PDF
AIE-Bench: Benchmarking Agents That Build Agents
Abhishek Mishra, Selvam Palanimalai, Yogendra Manawat, Samuel Verboomen, Prannay Hebbar, Damir Vrabac, Deepak Nathani, Sumeet Ramesh Motwani, Kunal Bhatia, Vignesh Baskaran · PDF
AMVICC: A Novel Benchmark for Cross-Modal Failure Mode Profiling for VLMs and IGMs
Pranay Goel, Aahana Basappa, Anusri Karra, Anish Karra, Kevin Zhu, Asa Gilmore · PDF
BARISTA: A Multi-Task Egocentric Benchmark for Compositional Visual Understanding
Patrick Knab, Orgest Xhelili, Inis Buzi, Drago Andres Guggiana Nilo, Mohd Saquib Khan, Lorenz Kolb, Manuel Scherzer, Kerem Yildirir, Christian Bartelt, Philipp Johannes Schubert · PDF
Bayesian Inference and Decision Audits for Public Archives of Frontier AI Evaluations
Yanan Long · PDF
Benchmark Scores Rank Methods, Not Capabilities: Theory, Evidence, and Protocols for the Saturation-Collapse Cycle
Dipam Paul · PDF
Benchmarks Are Not Atomic: Composition-Aware LLM Evaluation using BenchHub
Eunsu Kim, Haneul Yoo, Guijin Son, Hitesh Laxmichand Patel, Amit Agarwal, Alice Oh · PDF
Beyond Answer Correctness: Measuring and Reducing Explanation Faithfulness Gaps in Chart Understanding VLMs
Kshitij Dahiya, Dr. Vinay Kumar Saini · PDF
Bounding Compositional Incoherence in Foundation Models
Anany Kotawala · PDF
Capacity-Gated Forgetting in LoRA Fine-Tuning: Rank, Proximity, and Endogenous Replay in Medical LLMs
Akanksha Narula, Aaditya Sharma, Dharya Jasuja, Aditya Dhawan · PDF
CellARC: An Oracle-Calibrated Benchmark for Few-Shot Rule Induction
Miroslav Lžičař · PDF
Certifiable Evaluation: A Low-Rank Framework for Foundation Model Benchmarking with Formal Performance Guarantees
Siddharth Karuturi, Kaustubh S. Bukkapatnam, Laksh Patel, Tanush Ajay Shastry, Akshath Sharma, Mithil Shah, Matthew Park · PDF
Certified Evaluation for LLMs in Optimization Modeling: From Graph Isomorphism to Formulation Isomorphism
Zhuohan Wang, Ziwei Zhu, Ziniu Li, Congliang Chen, Zhihang Lin, MingZhe Yang, Yizhou Han, Yufeng Lin, Angyang Gu, Xinglin Hu, Ruoyu Sun, Tian Ding · PDF
Choosing Training-Time Calibration Objectives for Frozen Foundation-Model Features: A Linear-Probing Benchmark
Heejin Choi · PDF
CLIP Models Generalize Less Than Compositional Benchmarks Suggest
Shuman Peng, Arnas Uselis, Darina Koishigarina, Martin Ester, Seong Joon Oh · PDF
Collaborative Adaptive Labeling with Imperfect Labelers and Selective Expert Escalation
Xinrui Ruan, Nanshan Jia, Waverly Wei, Sui Huang, Zhenyu Zhao, Zeyu Zheng, Jingshen Wang · PDF
Combining Theory and Benchmarks for Length Generalisation: Formal Certificates Meet Large-Scale Evaluation
Zacharie Bugaud · PDF
Conformalized Scaling Laws: Distribution-Free Prediction Intervals for Out-of-Distribution Compute Regimes
Kaustubh S. Bukkapatnam, Siddharth Karuturi · PDF
Constructing Thunder Korean Benchmark Suite for Reliable Evaluation of Foundation Models
Yeonkyoung So, Jongmin Kim, Sungmok Jung, Gyuseong Lee, Sangho Kim, Jongyeon Park, Joonhak Lee, Seho Pyo, Gyeongje Cho, Seorin Kim, Jisoo Kim, Suyoung Park, Hyunji M. Park, Yelim Ahn, Yeongho Seo, Jaejin Lee · PDF
Context Over Content: Exposing Evaluation Faking in Automated Judges
Manan Gupta · PDF
Context Saturation in Zero-Shot Time-Series Foundation Models
Miguel Nogales, Luca Butera, Alberto Ferrante, Cesare Alippi · PDF
Contextual Observability and Grammar Singularity for Compositional Task Families
Manoj Saravanan, Rohit Kumar Salla, Shrikar Reddy Kota · PDF
ContinuityBench: A Framework and Taxonomy for Evaluating Agent Recovery from Interrupted State
Aryan Gulati · PDF
Correcting Mean Bias in Text Embeddings: A Refined Renormalization with Training-Free Improvements on MMTEB
Xingyu Ren, Youran Sun, Haoyu Liang · PDF
Correcting Optimizer Selection Bias via Large Deviation Hazards
Andrea Zerio, Andres R Masegosa · PDF
Correcting Prompt Dependence in LLM Benchmarks: A Bayesian Hierarchical Model with Embedding-Space Clustering
Mary Llewellyn, Isobel Thornton, James Bishop, Annie Gray · PDF
Cracks in the Foundation: Seemingly Minor Architectural Choices Impact Long Context Extension
Amanda Bertsch, Luca Soldaini, Matthew R. Gormley, Graham Neubig, Hannaneh Hajishirzi, Kyle Lo, Dirk Groeneveld · PDF
Cross-Language Evaluation of Prompt Inversion: Similarity Metrics, Decoding Strategies, and Prefix Sensitivity in Japanese and English
Yusei Kitamura, Ahmad Akmal Aminuddin Mohd Kamal, Masaya Fujisawa · PDF
DeflectBench: A Benchmark for Evaluating Rhetorical Fallacy Generation in LLMs
Art Kanke · PDF
Diagnosing LLM Judge Reliability: Conformal Prediction Sets and Transitivity Violations
Manan Gupta, Dhruv Kumar · PDF
EditCLEVR: A Paired-Scene Intervention Benchmark for Compositional Faithfulness of Object-Centric Representations
Anuraag Gadehothur Karnam, Tarunesh Sathish · PDF
Efficient Safety Benchmarking via Item Response Theory
Fabio Spagliardi, Mírian Silva, Ayan Datta, Aiden Zhou, Vamshi Krishna Bonagiri, Diogo Cruz · PDF
Ensuring Calibration Robustness in Split Conformal Prediction Under Adversarial Attacks
Xunlei Qian, Yue Xing · PDF
Estimating Pass@$k$ from Fewer Samples with Hierarchical Bayesian Priors
Alexandre Verine, Florian Le Bronnec, Benjamin Negrevergne, Alexandre Allauzen · PDF
Evaluating LLM Reasoning on Operating System Algorithms via Step-Level Verification
Jalluri Mahesh Kumar, Junjunoori Sri Chakri, Yash Kothari, Murari Mandal, Yash Sinha, Dhruv Kumar · PDF
Evaluator Failure Modes in Agentic Uncertainty Quantification
Suresh Raghu, Satwik Pandey, Shashwat Pandey · PDF
Executable Ground Truth: A Closed-Loop Benchmark for Evaluating LLM Agents on Microservice Incident Remediation
Dhatri C, Tadisetty Sai Yashwanth · PDF
Fast Inference via Hierarchical Speculative Decoding
Clara Mohri, Amir Globerson, Haim Kaplan, Yishay Mansour, Tal Schuster · PDF
Feedforward Mixing is as Sharp as it is Slow in Reverse
Benedict Aaron Tjandra, Avi Wigderson, João G. M. Araújo, Alex Vitvitskyi, Federico Barbero, Petar Veličković · PDF
FormalImG: Evaluating Structural Compositional Generalization for T2I Models
Hong-Jie You, Jie-Jing Shao, Xiao-Wen Yang, Zhi-Fan Wu, Lin-Han Jia, Lan-Zhe Guo, Yu-Feng Li · PDF
FRAME: Framework for Robotic Action and Motion Evaluation
Ameya Wagh, Vishnu Rudrasamudram · PDF
From Accuracy to Visual Dependence: Auditing and Filtering Modality Collapse in Traffic VideoQA
Sena Korkut, Maria Alejandra Bravo, Sanghwan Kim, Zeynep Akata · PDF
From Forecast Scores to Auditable Benchmarks: WorldFork for LLM Forecasting Evaluation
Hanson Wen, James Gui · PDF
From Theory to Decision Rule: Calibrating the Noisy-Label Crossover for VLM Weak Supervision Across Three Medical-Imaging Benchmarks
Bruce Changlong Xu, Jose K. James, Alexander J Ryu · PDF
Functional Subspace, where language models can use vector algebra to solve problems
Jung H. Lee, Sujith Vijayan · PDF
Fuzzy-Clustered Mixture-of-Experts with Relational Regularization for Interpretable Subgroup Modeling under Data Scarcity
Chien-Hung Lai, Yuh-Shyan Hwang, Yi Lin · PDF
GapPO: Gradient-Adaptive Pairwise Preference Optimization
Michelle Chang, Xiaodi Sun, Ethan C. Chau, Zhaoqiong Huang, Arpita Das, Izzie Lau, Liyuan Zheng, Huancheng Chen, Jingwen Lu · PDF
Generalized Priority-Aware Shapley Value
Kiljae Lee, Ziqi Liu, Weijing Tang, Yuan Zhang · PDF
Generative vs Discriminative? Revisiting the shortcut learning debate in text classification
Siva Rajesh Kasa, Karthik Raavi, Sumegh Roychowdhury, Pattisapu Nikhil Priyatam, Ashutosh Kumar, Yaswanth Biruduraju, Santhosh Kumar Kasa, Ankith M S, Sumit Negi · PDF
GraphStateEval: A Step-by-Step Evaluation Framework for Graph Algorithm Execution in Large Language Models via Intermediate State Tracing
Kanav Kapoor, Dhruv Kumar, Jagat Sesh Challa, Murari Mandal, Yash Sinha · PDF
Hidden in Plain Sight: Benchmarking Agent Safety Against Decomposition Attacks with DecompBench
Vikhyath Kothamasu, Virginia Smith, Chhavi Yadav · PDF
Hidden Sensitivity in Spatial Reasoning Evaluation: Diagnosis and Re-ranking with VSI-Bench
Phillip Y. Lee, Jin Yoo, Minseo Kim, Leonidas Guibas, Minhyuk Sung · PDF
Honeyval: A Comprehensive Evaluation Framework for LLM-powered HTTP Honeypots
Mark Vero, Fabian Kaczmarczyck, Ivan Petrov, Ilia Shumailov, Niels Heinen, Jamie Hayes, Tianqi Fan, Luca Invernizzi, Martin Vechev · PDF
How good is your harness?
Jiwoo Han, Yuekai Sun · PDF
How long is a piece of string? A brief empirical analysis of tokenizers
Jonathan Roberts, Kai Han, Samuel Albanie · PDF
Identifying Efficient Queries for Black-Box Model Classification
Merrick Ohata, Carey Priebe, Hayden Helm · PDF
Instance-Optimal Estimation with Multiple LLM Judges on a Budget
Junghyun Lee, Sanghwa Kim, Yassir Jedra, Alexandre Proutiere, Se-Young Yun · PDF
Instruction Bleed: A Theory-Anchored Benchmark for Cross-Module Interference in Prompt-Composed Agents
Ching-Yu Lin, Yifan Liu · PDF
Interactive Evaluation Requires a Design Science
Keyang Xuan, Peiyang Song, Pan Lu, Pengrui Han, Wenkai Li, Zhenyu Zhang, Zexue He, Wenyue Hua, Manling Li, Jiaxuan You, Adrian Weller, Yizhong Wang, Jiaxin Pei · PDF
Internal Data Repetition Destroys Language Models
Jessica Chudnovsky, Joshua Kazdan, Noam Itzhak Levi, Rylan Schaeffer, Yegor Denisov-Blanch, Sanmi Koyejo, David L. Donoho · PDF
Latent Phase-Shift Rollback: Inference-Time Error Correction via Residual Stream Monitoring and KV-Cache Steering
Manan Gupta, Dhruv Kumar · PDF
LoopNav: Benchmarking Spatial Consistency in World Models
Kewei Lian, Shaofei Cai, Yitao Liang, Anji Liu · PDF
m2sv: A Scalable Benchmark for Map-to-Street-View Spatial Reasoning
Yosub Shin, Michael Buriek, Igor Molybog · PDF
Measuring the Limits of Continual Learning for LLMs
Nimit Kalra, Narutatsu Ri, Zerzar Bukhari, Ang Li, Sanae Lotfi, Liam H Fowl, Micah Goldblum · PDF
Model Capacity Determines Grokking through Competing Memorisation and Generalisation Speeds
YIDING SONG, Hanming Ye · PDF
MultiVulnBench: A Large-Scale Benchmark for Count Bias in LLM-Based Multi-Vulnerability Detection
Manan Gupta, Chinmay Pushkar, Sanchit Kabra, Dhruv Kumar, Jagat Sesh Challa · PDF
Null-Calibrated Evaluation of Sparse Autoencoder Decoder Reproducibility
Bright Liu · PDF
On Cost-Effective LLM-as-a-Judge Improvement Techniques
Ryan Lail, Luke Markham · PDF
On the Rotation-Equivariance Geometry of Tabular Foundation Models
Mert Ogul · PDF
Operads for compositional reasoning in LLMs
Nathaniel Bottman, Kyle Richardson · PDF
Perplexity Cannot Always Tell Right from Wrong
Petar Veličković, Federico Barbero, Christos Perivolaropoulos, Simon Osindero, Razvan Pascanu · PDF
Pre-Registering the Detectable Effect: A Paired-MDE Budget for 4-bit Quantization Benchmarks, with a Pilot Audit
Zexin Zhuang, Yanhang Li, Zhichao Fan · PDF
Probabilistic Chain-of-Thought: Sequential Bayesian Inference over Latent Reasoning Correctness
Suriya Dev Saravanakumar, Ezra Matiwos Wesenie, Kishore Nuthalapati, Laksh Patel · PDF
Prompt Optimization Is a Coin Flip: Diagnosing When It Helps in Compound AI Systems
Xing Zhang, Guanghui Wang, Yanwei CUI, Qucy Wei Qiu, Ziyuan Li, Bing Zhu, Peiyang He · PDF
PromptSplit: Revealing Prompt-Level Disagreement in Generative Models
Mehdi Lotfian, Mohammad Jalali, Farzan Farnia · PDF
Quantifying Empirical Compute-Supervision Tradeoffs in RLVR
Ryo Mitsuhashi, Patrick Chen, Isabelle Tseng, Jasin Cekinmez, Addison J. Wu · PDF
Rethinking FID Through the Geometry of the Reference Dataset
Yunghee Lee, Byeonghyun Pak · PDF
Rethinking LLM Confidence: From Calibration to Coherence
Krish Matta, Atharv Naphade, Andy Zou · PDF
Rethinking Psychometric Evaluation of LLMs: When and Why Self-Reports Predict Behavior
Rafal Kocielnik, Pengrui Han, Peiyang Song, Myrl G Marmarelis, Ramit Debnath, Dean Mobbs, Anima Anandkumar, R. Michael Alvarez · PDF
Retrieval Dwelling: A Principled Sampling Strategy for Exploiting Spurious State Exploration
Rohit Sinha, Saroj Kumar · PDF
SafetyRepro: Configuration-Conditional Rank Instability on Alignment Benchmarks
Yanhang Li, Zhichao Fan, Zexin Zhuang · PDF
Scale Dependent Data Duplication
Joshua Kazdan, Noam Itzhak Levi, Rylan Schaeffer, Jessica Chudnovsky, Abhay Puri, Bo He, Mehmet Donmez, Sanmi Koyejo, David L. Donoho · PDF
Selective Perturbations as a Diagnostic for Benchmark-Based LLM Comparisons
Ivan Dubrovsky, Anastasia Orlova, Nina Gubina, Illarion Iov, Irena Gureeva, Nikolay Nikitin, Alexey Zaytsev · PDF
SemanticSRJudge: Spatially-Grounded VLM Evaluation for Super-Resolution Quality Assessment
Vishwajeet Shukla, Ankit Dhankhar, Ajay Bedi · PDF
ShiftBench: A Benchmark for Per-Cohort Certify-or-Abstain Decisions on Positive Predictive Value Under Covariate Shift
Ananya Salian · PDF
Simulating Field Experiments for Method Testing
Enoch H. Kang · PDF
Spectral Signatures of Large Language Models
Zhuoying Zhang, Ishan Verma Prasad, Zihang Liu, Yuanzhe Hu, Hengrui Luo, Pu Ren, Yaoqing Yang · PDF
Stress Tests REVEAL Fragile Temporal and Visual Grounding in Video-Language Models
Sethuraman T V, Savya Khosla, Aditi Tiwari, Vidya Ganesh, Rakshana Jayaprakash, Aditya Jain, Vignesh Srinivasakumar, Onkar Kishor Susladkar, Joey Wang, Srinidhi Sunkara, Aditya Shanmugham, Abbaas Alif Mohamed Nishar, Rakesh Vaideeswaran, Simon Jenni, Derek Hoiem · PDF
Stress-Testing Neural Network Verifiers with Provably Robust Instances
David Troxell, Yulia Alexandr, Sofia Hunt, Stephanie Lei, Guido Montufar · PDF
Style Conventions Override Performance Predictions in Coding LLMs
Matthew Kotzbauer · PDF
Symmetries of Functional Processes under Label Noise
Abhra Chaudhuri, Pedro Gomes · PDF
Task-Aligned Stability Analysis of Vision-Language Models for Autonomous Driving Hazard Detection
Everett Richards · PDF
The Complexity Ceiling Benchmark: A Multi-Domain Evaluation of Sequential Reasoning Under Depth Scaling
Shubh Chapra, Dhruv Kumar, Murari Mandal, Yash Sinha · PDF
The Propagation Field: A Geometric Substrate Theory of Deep Learning
Xingrui Gu · PDF
The Shape of Noise: Layer-Wise Perturbation Profiles for Diagnosing Vision Robustness
Son Nguyen, Gia-Bao Vu, Quang Minh Phan, Trong P. Le · PDF
Theory-Grounded Evaluation Exposes the Authorship Gap in LLM Personalization
Yash Ganpat Sawant · PDF
Toward Trustworthy LLM–GNN Fusion: A Fusion-Aware Evaluation and Reporting Framework
Zhifei Hu, Alexandra I. Cristea · PDF
Trace-Aware Routing for Cost-Effective Human–AI Collaborative Labeling
Waverly Wei, Xinrui Ruan, Zhenyu Zhao, Sui Huang, Zeyu Zheng, Jingshen Wang · PDF
Universality, Composition Generalization, and Algorithm Emulation All In-Context
Jerry Yao-Chieh Hu, Hong-Yu Chen, Po-Chiao Lin, Maojiang Su, Han Liu · PDF
Uplifting Human Decision Making in AI Evaluation by Automating Benchmark Validity Analysis
Rodolfo Corona, Sang T. Truong, Ritwik Gupta, Nhi Ngoc Truong, Atnafu Lambebo Tonja, Mena Attia, Fahim Faisal, Kaushal Kumar Maurya, Fred Philippy, Belu Ticona, Sumaya Nur Adan, Fazl Barez, Omar Florez, Supheakmungkol Sarin, Aseem Srivastava, Xiaoyuan Yi, Nick Haber, Dan Klein, Thamar Solorio, Xing Xie, Sanmi Koyejo, Robert Trager · PDF
When Agreement Becomes Unsafe: Loss-Aware Energy Control for Diagnostic Deliberation
Yuting Yan, Yinghao Fu, Haozhou Gao, Tianjian Zhang, Aoxi Liu, Shuang Li · PDF
When Does Polynomial Attention Concentrate? A Relative-Margin Diagnostic for Zero-Shot Softmax Substitution
Sanny Kim · PDF
Where’s the Plan? Locating Latent Planning in Language Models with Lightweight Mechanistic Interventions
Nicole H. Ma, Nick Rui · PDF
YC-Bench: Benchmarking AI Agents for Long-Term Planning and Consistent Execution
Muyu He, Vincent Tu, Adit Jain, Anand Kumar, Sachin Patro, Soumyadeep Bakshi, Nazneen Rajani · PDF
You're reading LLM leaderboards wrong: Disentangling models from pipelines in engineering benchmarks
Marius Tacke, Shivam Suri, Matthias Busch, Mahish K. Guru, Christian J Cyron, Roland Aydin · PDF

