ICML 2026 · Past · Large language models · Theory · Evaluation & benchmarks
ICML 2026 Workshop on Combining Theory and Benchmarks: Towards A Virtuous Cycle to Understand and Guarantee Foundation Model Performance
CTB@ICML 2026
- Submission deadline: May 8, 2026, 23:59 AoE (UTC−12). Imported from OpenReview; check the website for extensions.
- Submission portal: OpenReview
- Notes: Auto-imported from the OpenReview venue record on 2026-06-10; please verify and enrich (topics are keyword-guessed).
Accepted papers (111)
Fetched from OpenReview (v2) on 2026-06-10.
- A Benchmarked Diagnostic for Sparse Decomposability of Dense Causal Subspaces
- A Cognitive Battery for Foundation Models: Theory-Grounded Benchmarks for Attention, Learning, Metacognition, Executive Function, and Social Cognition
- A Controlled Benchmark for Lag-Structured Dependency Motifs
- A Numerical Study of Robustness Verification for Lightning Self-Attention
- A Unified Perturbation Framework for Analyzing Leaderboard Stability and Manipulation
- Active probabilistic reasoning in humans and LLMs
- Aggregate Metrics Hide Shortcut Regimes: A Complexity-Stratified Benchmark for Novel View Synthesis
- AIE-Bench: Benchmarking Agents That Build Agents
- AMVICC: A Novel Benchmark for Cross-Modal Failure Mode Profiling for VLMs and IGMs
- BARISTA: A Multi-Task Egocentric Benchmark for Compositional Visual Understanding
- Bayesian Inference and Decision Audits for Public Archives of Frontier AI Evaluations
- Benchmark Scores Rank Methods, Not Capabilities: Theory, Evidence, and Protocols for the Saturation-Collapse Cycle
- Benchmarks Are Not Atomic: Composition-Aware LLM Evaluation using BenchHub
- Beyond Answer Correctness: Measuring and Reducing Explanation Faithfulness Gaps in Chart Understanding VLMs
- Bounding Compositional Incoherence in Foundation Models
- Capacity-Gated Forgetting in LoRA Fine-Tuning: Rank, Proximity, and Endogenous Replay in Medical LLMs
- CellARC: An Oracle-Calibrated Benchmark for Few-Shot Rule Induction
- Certifiable Evaluation: A Low-Rank Framework for Foundation Model Benchmarking with Formal Performance Guarantees
- Certified Evaluation for LLMs in Optimization Modeling: From Graph Isomorphism to Formulation Isomorphism
- Choosing Training-Time Calibration Objectives for Frozen Foundation-Model Features: A Linear-Probing Benchmark
- CLIP Models Generalize Less Than Compositional Benchmarks Suggest
- Collaborative Adaptive Labeling with Imperfect Labelers and Selective Expert Escalation
- Combining Theory and Benchmarks for Length Generalisation: Formal Certificates Meet Large-Scale Evaluation
- Conformalized Scaling Laws: Distribution-Free Prediction Intervals for Out-of-Distribution Compute Regimes
- Constructing Thunder Korean Benchmark Suite for Reliable Evaluation of Foundation Models
- Context Over Content: Exposing Evaluation Faking in Automated Judges
- Context Saturation in Zero-Shot Time-Series Foundation Models
- Contextual Observability and Grammar Singularity for Compositional Task Families
- ContinuityBench: A Framework and Taxonomy for Evaluating Agent Recovery from Interrupted State
- Correcting Mean Bias in Text Embeddings: A Refined Renormalization with Training-Free Improvements on MMTEB
- Correcting Optimizer Selection Bias via Large Deviation Hazards
- Correcting Prompt Dependence in LLM Benchmarks: A Bayesian Hierarchical Model with Embedding-Space Clustering
- Cracks in the Foundation: Seemingly Minor Architectural Choices Impact Long Context Extension
- Cross-Language Evaluation of Prompt Inversion: Similarity Metrics, Decoding Strategies, and Prefix Sensitivity in Japanese and English
- DeflectBench: A Benchmark for Evaluating Rhetorical Fallacy Generation in LLMs
- Diagnosing LLM Judge Reliability: Conformal Prediction Sets and Transitivity Violations
- EditCLEVR: A Paired-Scene Intervention Benchmark for Compositional Faithfulness of Object-Centric Representations
- Efficient Safety Benchmarking via Item Response Theory
- Ensuring Calibration Robustness in Split Conformal Prediction Under Adversarial Attacks
- Estimating Pass@$k$ from Fewer Samples with Hierarchical Bayesian Priors
- Evaluating LLM Reasoning on Operating System Algorithms via Step-Level Verification
- Evaluator Failure Modes in Agentic Uncertainty Quantification
- Executable Ground Truth: A Closed-Loop Benchmark for Evaluating LLM Agents on Microservice Incident Remediation
- Fast Inference via Hierarchical Speculative Decoding
- Feedforward Mixing is as Sharp as it is Slow in Reverse
- FormalImG: Evaluating Structural Compositional Generalization for T2I Models
- FRAME: Framework for Robotic Action and Motion Evaluation
- From Accuracy to Visual Dependence: Auditing and Filtering Modality Collapse in Traffic VideoQA
- From Forecast Scores to Auditable Benchmarks: WorldFork for LLM Forecasting Evaluation
- From Theory to Decision Rule: Calibrating the Noisy-Label Crossover for VLM Weak Supervision Across Three Medical-Imaging Benchmarks
- Functional Subspace, where language models can use vector algebra to solve problems
- Fuzzy-Clustered Mixture-of-Experts with Relational Regularization for Interpretable Subgroup Modeling under Data Scarcity
- GapPO: Gradient-Adaptive Pairwise Preference Optimization
- Generalized Priority-Aware Shapley Value
- Generative vs Discriminative? Revisiting the shortcut learning debate in text classification
- GraphStateEval: A Step-by-Step Evaluation Framework for Graph Algorithm Execution in Large Language Models via Intermediate State Tracing
- Hidden in Plain Sight: Benchmarking Agent Safety Against Decomposition Attacks with DecompBench
- Hidden Sensitivity in Spatial Reasoning Evaluation: Diagnosis and Re-ranking with VSI-Bench
- Honeyval: A Comprehensive Evaluation Framework for LLM-powered HTTP Honeypots
- How good is your harness?
- How long is a piece of string? A brief empirical analysis of tokenizers
- Identifying Efficient Queries for Black-Box Model Classification
- Instance-Optimal Estimation with Multiple LLM Judges on a Budget
- Instruction Bleed: A Theory-Anchored Benchmark for Cross-Module Interference in Prompt-Composed Agents
- Interactive Evaluation Requires a Design Science
- Internal Data Repetition Destroys Language Models
- Latent Phase-Shift Rollback: Inference-Time Error Correction via Residual Stream Monitoring and KV-Cache Steering
- LoopNav: Benchmarking Spatial Consistency in World Models
- m2sv: A Scalable Benchmark for Map-to-Street-View Spatial Reasoning
- Measuring the Limits of Continual Learning for LLMs
- Model Capacity Determines Grokking through Competing Memorisation and Generalisation Speeds
- MultiVulnBench: A Large-Scale Benchmark for Count Bias in LLM-Based Multi-Vulnerability Detection
- Null-Calibrated Evaluation of Sparse Autoencoder Decoder Reproducibility
- On Cost-Effective LLM-as-a-Judge Improvement Techniques
- On the Rotation-Equivariance Geometry of Tabular Foundation Models
- Operads for compositional reasoning in LLMs
- Perplexity Cannot Always Tell Right from Wrong
- Pre-Registering the Detectable Effect: A Paired-MDE Budget for 4-bit Quantization Benchmarks, with a Pilot Audit
- Probabilistic Chain-of-Thought: Sequential Bayesian Inference over Latent Reasoning Correctness
- Prompt Optimization Is a Coin Flip: Diagnosing When It Helps in Compound AI Systems
- PromptSplit: Revealing Prompt-Level Disagreement in Generative Models
- Quantifying Empirical Compute-Supervision Tradeoffs in RLVR
- Rethinking FID Through the Geometry of the Reference Dataset
- Rethinking LLM Confidence: From Calibration to Coherence
- Rethinking Psychometric Evaluation of LLMs: When and Why Self-Reports Predict Behavior
- Retrieval Dwelling: A Principled Sampling Strategy for Exploiting Spurious State Exploration
- SafetyRepro: Configuration-Conditional Rank Instability on Alignment Benchmarks
- Scale Dependent Data Duplication
- Selective Perturbations as a Diagnostic for Benchmark-Based LLM Comparisons
- SemanticSRJudge: Spatially-Grounded VLM Evaluation for Super-Resolution Quality Assessment
- ShiftBench: A Benchmark for Per-Cohort Certify-or-Abstain Decisions on Positive Predictive Value Under Covariate Shift
- Simulating Field Experiments for Method Testing
- Spectral Signatures of Large Language Models
- Stress Tests REVEAL Fragile Temporal and Visual Grounding in Video-Language Models
- Stress-Testing Neural Network Verifiers with Provably Robust Instances
- Style Conventions Override Performance Predictions in Coding LLMs
- Symmetries of Functional Processes under Label Noise
- Task-Aligned Stability Analysis of Vision-Language Models for Autonomous Driving Hazard Detection
- The Complexity Ceiling Benchmark: A Multi-Domain Evaluation of Sequential Reasoning Under Depth Scaling
- The Propagation Field: A Geometric Substrate Theory of Deep Learning
- The Shape of Noise: Layer-Wise Perturbation Profiles for Diagnosing Vision Robustness
- Theory-Grounded Evaluation Exposes the Authorship Gap in LLM Personalization
- Toward Trustworthy LLM–GNN Fusion: A Fusion-Aware Evaluation and Reporting Framework
- Trace-Aware Routing for Cost-Effective Human–AI Collaborative Labeling
- Universality, Composition Generalization, and Algorithm Emulation All In-Context
- Uplifting Human Decision Making in AI Evaluation by Automating Benchmark Validity Analysis
- When Agreement Becomes Unsafe: Loss-Aware Energy Control for Diagnostic Deliberation
- When Does Polynomial Attention Concentrate? A Relative-Margin Diagnostic for Zero-Shot Softmax Substitution
- Where’s the Plan? Locating Latent Planning in Language Models with Lightweight Mechanistic Interventions
- YC-Bench: Benchmarking AI Agents for Long-Term Planning and Consistent Execution
- You're reading LLM leaderboards wrong: Disentangling models from pipelines in engineering benchmarks