NeurIPS 2025 Workshop on Evaluating the Evolving LLM Lifecycle: Benchmarks, Emergent Abilities, and Scaling
NeurIPS 2025 LLM Evaluation Workshop
- Submission deadline: Sep 5, 2025, 11:59 UTC (imported from OpenReview; check the website for extensions)
- Submission portal: OpenReview
- Notes: Auto-imported from the OpenReview venue record on 2026-06-10; please verify and enrich (topics are keyword-guessed).
Accepted papers (186)
Fetched from OpenReview (v2) on 2026-06-10.
-
"It Doesn’t Know Anything About my Work": Participatory Benchmarking and AI Evaluation in Applied Settings
-
A Benchmark for Description-Based Evaluation of Social Bias in LLMs
-
A Case for Centaur Evaluations
-
A Multi-Aspect Evaluation of Dialogue in Pythia
-
A Protocol-Driven Platform for Agent-Agnostic Evaluation of LLM Agents
-
A Statistical Framework for Game-Based AI Evaluation
-
A Systematic Evaluation of Preference Aggregation in Federated RLHF for Pluralistic Alignment of LLMs
-
Active Model Selection for Large Language Models
-
ADCA: Artifact-Based Dataset Creativity Assessment
-
Adversarial Behavior in Research Settings: Conducting Sabotage Evaluations with RE-Bench
-
AgentCaster: Reasoning-Guided Tornado Forecasting
-
Agentic Lean Autoformalization (ALA) v1: An LLM collaborative approach to autoformalization in LEAN
-
An Evaluation Study of Hybrid Methods for Multilingual PII Detection
-
Analyzing Uncertainty of LLM-as-a-Judge: Interval Evaluations with Conformal Prediction
-
ASCII-Bench: Evaluating Language-Model-Based Understanding of Visually-Oriented Text
-
AssertBench: A Benchmark for LLM Resistance to User-Induced Factual Bias
-
Attention, Please: Single-Head Cross-Attention for Unified LLM Routing
-
Automated Capability Evaluation of Foundation Models
-
Automatic agent chaining for multimodal task support
-
Automatically Extracting Scientific Metrics with LLMs: A Case Study of ImageNet Papers
-
Bayesian Evaluation of Blackbox LLM Behavior
-
BEAR: Benchmarking Multimodal Language Models for Atomic Embodied Reasoning Abilities
-
Benchmark Agreement Testing Done Right: A Guide for LLM Benchmark Evaluation
-
Benchmarking and Standardization of Evaluation Protocols: A Feedback-Driven Framework Using LLM Judges to Gatekeep and Iteratively Improve Synthetic Benchmarks
-
Benchmarking Overton Pluralism in LLMs
-
Beyond Accuracy: A Diagnostic Protocol for Fairly Evaluating Multimodal Reasoning
-
Beyond Fertility: Analyzing STRR as a Metric for Multilingual Tokenization Evaluation
-
Beyond Steering: Evaluating Fine-Grained and Multi-Concept Control in LLMs
-
Beyond the Singular: Revealing the Value of Multiple Generations in Benchmark Evaluation
-
Beyond Western Politics: Cross-Cultural Benchmarks for Evaluating Partisan Associations in LLMs
-
Bias in the Picture: Benchmarking VLMs with Social-Cue News Images and LLM-as-Judge Assessment
-
BloomXplain: A Framework and Benchmark Dataset for Pedagogically Sound LLM-Generated Explanations Based on Bloom’s Taxonomy
-
Born with a SilverSpoon? Investigating Socioeconomic Bias in LLMs
-
Breaking Down Video LLM Benchmarks: Knowledge, Spatial Perception, or True Temporal Understanding?
-
Breaking the Mirror: Examining Self-Preference in LLM Evaluators through Activation-Based Representations
-
Building More Accountable Multi-Modal LLMs Through Spatially-Informed Visual Reasoning
-
Carbon- and System-Aware LoRA Scaling for On-Device LLMs via Hierarchical Multi-Objective Reinforcement Learning
-
Causally Quantifying the Effect of Test Set Contamination on Generative Benchmarks
-
CAVE: Detecting and Explaining Commonsense Anomalies in Visual Environments
-
CCWise: Carbon–Cost Aware Regional LLM Orchestration for Next-Gen Sustainable AI
-
ChatChecker: A Framework for Dialogue System Testing Through Non-cooperative User Simulation
-
ChEmREF: Evaluating Language Model Readiness for Chemical Emergency Response Assistance
-
CHEMSETS: How Capable Are Chemistry LLMs?
-
ChinaTravel: An Open-Ended Benchmark for Language Agents in Chinese Travel Planning
-
CivicParse: A Benchmark and Pipeline for Structured Online Deliberation
-
Confident or Seek Stronger: Exploring Uncertainty-Based Small LM Routing From Benchmarking to Generalization
-
Context-Masked Meta-Prompting for Privacy-Preserving LLM Adaptation in Finance
-
Culturally-Aware Conversations: A Framework & Benchmark for LLMs
-
Data Centric Guard (DC-Guard) - A Framework for Trustworthy LLM Evaluation
-
DeepScholar-Bench: A Live Benchmark and Automated Evaluation for Generative Research Synthesis
-
Demystify the Potential of Large Language Models as General-Purpose Surrogate Code Executors
-
Depth as a Scaling Vector: Simple Pruning and Evaluation of Emergent Abilities in Pruned LLMs
-
Detecting Data Contamination in LLMs via In-Context Learning
-
Detecting Foreign Content in Self-Generated Text: A Recognition Study of Large Language Models
-
Detecting Training Data of Large Language Models via Expectation Maximization
-
DHP Benchmark: Measuring Discernment Ability of LLM-as-a-Judge
-
Discovering Knowledge Deficiencies of Language Models on Massive Knowledge Base
-
Do Large Language Models Know What They Are Capable Of?
-
Domain-Aware Scaling Laws Uncover Data Synergy
-
DyePack: Provably Flagging Test Set Contamination in LLMs Using Backdoors
-
Editing as Unlearning: Are Knowledge Editing Methods Strong Baselines for Large Language Model Unlearning?
-
Evaluating AI Alignment Using Adapted Clinical Empathy Assessments
-
Evaluating Cultural and Linguistic Alignment Across the LLMs
-
Evaluating Evaluation Metrics – The Mirage of Hallucination Detection
-
Evaluating Language Models' Evaluations of Games
-
Evaluating LLM Story Generation through Large-scale Network Analysis on Social Structures
-
Evaluating LLM-as-a-Judge under Multilingual, Multimodal and Multi-domain Constraints
-
Evaluating LLMs for Combinatorial Optimization: One-Phase and Two-Phase Heuristics for 2D Bin-Packing
-
Evaluating LLMs' Language Confusion in Code-switching Context
-
Evaluation and Benchmarking Suite for Financial Large Language Models and Agents
-
Evaluation Under Imperfect Benchmarks and Ratings: A Case Study in Text Simplification
-
Extending AutoCompressors via Surprisal-Based Dynamic Segmentation
-
FEval-TTC: Fair Evaluation Protocol for Test-Time Compute
-
From Acceleration to Saturation: Scaling Behavior of Bootstrapped Language Model Pretraining
-
From Bias to Balance: How Multilingual Dataset Composition Affects Tokenizer Performance Across Languages
-
From Many Voices to One: Statistically Principled Aggregation of LLM Judges
-
GASLIGHTBENCH: Quantifying LLM Susceptibility to Social Prompting
-
Generation-Time vs. Post-hoc Citation: A Holistic Evaluation of LLM Attribution
-
GermanPartiesQA: Benchmarking Commercial Large Language Models and AI Companions for Political Alignment and Sycophancy
-
GUARD: Guiding Unbiased Alignment through Reward Debiasing
-
Haystack Engineering: Context Engineering Meets the Long-Context Challenge in Large Language Models
-
HORIZON: A Benchmark for In-the-wild User Behaviour Modeling
-
How Benchmark Prediction from Fewer Data Misses the Mark
-
How Many Instructions Can LLMs Follow at Once?
-
How to Get Your LLM to Generate Challenging Problems for Evaluation
-
Human-Centric Framework for Large Multimodal Models Evaluation
-
Husky Hold'em Benchmark: Can LLMs Design Competitive Poker Bots?
-
HypoTermInstruct: Instructing Large Language Models not to Hallucinate
-
Improving Automated LLM Evaluation by Introducing Personas in LLM Red-Teaming
-
In-Context Learning for Esoteric Programming Languages: Evaluating and Enhancing LLM Reasoning Without Fine-Tuning
-
In-Context Meta-Learning with Large Language Models for Automated Model and Hyperparameter Selection
-
JOINTMMSAFE: A Combinatorial Safety Benchmark for Multimodal Foundation Models
-
Justice in Judgment: Unveiling (Hidden) Bias in LLM-Assisted Peer Reviews
-
Know Me, Respond to Me: Benchmarking LLMs for Dynamic User Profiling and Personalized Responses at Scale
-
Knowledge Collapse in LLMs: When Fluency Survives but Facts Fail under Recursive Synthetic Training
-
LaTeXBench: Judge-Only Evaluation of LaTeX Generation, Minimal-Edit Compliance, and Blind Contrast Errors
-
Learning from Generalization Patterns: An Evaluation-Driven Approach to Enhanced Data Augmentation for Fine-Tuning Small Language Models
-
LinguaMark: Do Multimodal Models Speak Fairly? A Benchmark-Based Evaluation
-
LLMs as Judges for Domain-Specific Text: Evidence from Drilling Reports
-
LLMs Show Surface-Form Brittleness Under Paraphrase Stress Tests
-
LLMs vs. Traditional Sentiment Tools in Psychology: An Evaluation on Belgian-Dutch Narratives
-
MAGNET: Mathematical Assurance of Generative AI Network Evaluation Toolkit
-
MC-Search: Benchmarking Multimodal Agentic RAG with Structured Reasoning Chains
-
MEAL: A Multi-dimensional Evaluation of Alignment Techniques for LLMs
-
Measurement to Meaning: A Validity-Centered Framework for AI Evaluation
-
MedBrowseComp: Benchmarking Medical Deep Research and Computer Use
-
Medical AI Consensus: A Multi-Agent Framework for Radiology Report Generation and Evaluation
-
MermaidSeqBench: An Evaluation Benchmark for LLM-to-Mermaid Sequence Diagram Generation
-
Metrics for Holistic Evaluation of LLM Reasoning about Action, Change, and Planning
-
Mind the Gap: Evaluating Model- and Agentic-Level Vulnerabilities in LLMs with Action Graphs
-
MiSCHiEF: A Benchmark in Minimal-Pairs of Safety and Culture for Holistic Evaluation of Fine-Grained Image-Caption Alignment
-
Mitigating Self-Preference by Authorship Obfuscation
-
MonitorLLM: Real-Time Structural and Bias Evaluation of Generative AI through Knowledge Graphs
-
MORPHOGEN: A Multilingual Benchmark for Evaluating Gender-Aware Morphological Generation
-
Narrow RL Induces Broad Behavior Changes in LLMs
-
Network Dynamics Reasoning: A Novel Benchmark for Evaluating Multi-Step Inference in Large Language Models
-
No Question, No Passage, No Problem: Investigating Artifact Exploitation and Reasoning in Multiple-Choice Reading Comprehension
-
No-Human in the Loop: Agentic Evaluation at Scale for Recommendation
-
On Evaluating Methods vs. Evaluating Models
-
OpenGovCorpus: Evaluating LLMs on Citizen Query Tasks
-
OPTiCAL: An Abstract Positional Reasoning Benchmark for Vision Language Models
-
Paraphrasing Away Malicious Tokens: Improving LLM Finetuning Safety by Filtering Spurious Correlation
-
PEBBLE: A Pedagogical and SRL-Aware Benchmark for Evaluating LLM Tutors
-
Personality as a Probe for LLM Evaluation: Method Trade-offs and Downstream Effects
-
Physics Supernova: AI Agent Matches Elite Gold Medalists at IPhO 2025
-
PosterSum: A Multimodal Benchmark for Scientific Poster Summarization
-
Precision Shapes Personality: The Hidden Cost of Quantization in Sub-Billion-LLMs
-
Precursors, Proxies, and Predictive Models for Long-Horizon Tasks
-
Predicting Emergent Software Engineering Capabilities by Fine-tuning
-
Probe-Rewrite-Evaluate: A Workflow for Reliable Benchmarks and Quantifying Evaluation Awareness
-
Probing Reasoning Flaws and Safety Hierarchies with Chain-of-Thought Difference Amplification
-
Progress over Points: Reframing LM Benchmarks Around Scientific Objectives
-
Prompt Genotyping: Quantifying the Evaluation Gap Between Synthetic Benchmarks and Real LLM Performance
-
PULSE: Practical Evaluation Scenarios for Large Multimodal Model Unlearning
-
R3: Robust Rubric-Agnostic Reward Models
-
RADAR: Mechanistic Pathways for Detecting Data Contamination in LLM Evaluation
-
Recovery-Bench: Evaluating Agentic Recovery from Mistakes
-
Reference-Specific Unlearning Metrics Can Hide the Truth: A Reality Check
-
RefusalBench: Generative Evaluation of Selective Refusal in Grounded Language Models
-
RELIC: Evaluating Compositional Instruction Following via Language Recognition
-
RepliBench: Evaluating the Autonomous Replication Capabilities of Language Model Agents
-
Rethinking Kernel Program Repair: Benchmarking and Enhancing LLMs with RGym
-
Rethinking MCQ Benchmarks: Mandatory Reasoning Evaluation Reveals Significant Performance Drops in Large Language Models
-
Retrieval Capabilities of Large Language Models Scale with Pretraining FLOPs
-
Reward Model Overoptimisation in Iterated RLHF
-
RIMO: An Easy-to-Evaluate, Hard-to-Solve Olympiad Benchmark for Advanced Mathematical Reasoning
-
RULERv2: From Basic Retrieval to Complex Reasoning, A Bottom-Up Benchmark for Long-Context Evaluation
-
SAGE: A Realistic Benchmark for Semantic Understanding
-
SATBench: Benchmarking LLMs' Logical Reasoning via Automated Puzzle Generation from SAT Formulas
-
Scaling Laws for Upcycling Mixture-of-Experts Language Models
-
Schema Lineage Extraction at Scale: Multilingual Pipelines, Composite Evaluation, and Language-Model Benchmarks
-
Search-Time Data Contamination
-
Self-Correction Bench: Revealing the Self-Correction Blind Spot in LLMs
-
Self-Evaluating LLMs for Multi-Step Tasks: Stepwise Confidence Estimation for Failure Detection
-
Silent Tokens, Loud Effects: Padding in LLMs
-
Small Changes, Large Consequences: Analyzing the Allocational Fairness of LLMs in Hiring Contexts
-
Smarter Sampling for LLM Judges: Reliable Evaluation on a Budget
-
SWE-InfraBench: Evaluating Language Models on Cloud Infrastructure Code
-
Sycophancy Claims about Language Models: The Missing Human-in-the-Loop
-
T-FIX: Text-Based Explanations with Features Interpretable to eXperts
-
Talking with Oompa Loompas: A novel framework for evaluating linguistic acquisition of LLMs
-
The Contamination Paradox: Why Test Set Leakage Can Be Both Potent and Negligible
-
The Impact of Post-training on Data Contamination
-
The Measure of All Measures: Quantifying LLM Benchmark Quality
-
The Narcissus Hypothesis: Descending to the Rung of Illusion
-
The Ouroboros of Benchmarking: Reasoning Evaluation in an Era of Saturation
-
The Price of Progress: Algorithmic Efficiency and the Falling Cost of AI Inference
-
The Shepherd Test: How Will SuperIntelligent Agents Balance Care and Control in Asymmetric Relationships?
-
The Social Laboratory: A Psychometric Framework for Multi-Agent LLM Evaluation
-
Towards Dynamic KV-Cache Compression: Fine-Grained Evaluation of Key and Value Ranks in LLMs
-
Towards Multilingual Mechanistic Interpretability
-
Towards Real-World Evaluation of Agentic Work in Freelance Marketplaces
-
Towards Transparent Reasoning: What Drives Faithfulness in Large Language Models?
-
Train-before-Test Harmonizes Language Model Rankings
-
TrolleyBench: Evaluating Emergent Moral Reasoning and Consistency in LLMs
-
Uncertainty Quantification for Language Models: Standardizing and Evaluating Black-Box, White-Box, LLM Judge, and Ensemble Scorers
-
UQ: Assessing Language Models on Unsolved Questions
-
VLM-SlideEval: Evaluating VLMs on Structured Comprehension and Perturbation Sensitivity in PPT
-
When Benchmarks Age: Temporal Misalignment through Large Language Model Factuality Evaluation
-
When LLM Meets Time Series: Can LLMs Perform Multistep Time Series Reasoning and Inference
-
Where Did It All Go Wrong? A Hierarchical Look into Multi-Agent Error Attribution
-
Who Routes the Router: Rethinking the Evaluation of LLM Routing Systems
-
Who’s the Impostor? Multi‑Agent Social Deduction for Evaluating LLM Social Reasoning
-
Whose Personae? Synthetic Persona Experiments in LLM Research and Pathways to Transparency
-
Why Do Multi-Agent LLM Systems Fail?
-
YKSBench: Stress-Testing Multimodal Models with Exam-Style Questions