NeurIPS 2025 · Past · Large language models · Evaluation & benchmarks

NeurIPS 2025 Workshop on Evaluating the Evolving LLM Lifecycle: Benchmarks, Emergent Abilities, and Scaling

NeurIPS 2025 LLM Evaluation Workshop

Submission deadline
Sep 5, 2025, 11:59 UTC
imported from OpenReview — check the website for extensions
Submission portal
OpenReview
Notes
Auto-imported from the OpenReview venue record on 2026-06-10 — please verify and enrich (topics are keyword-guessed).

Accepted papers (186)

Fetched from OpenReview (v2) on 2026-06-10.

  1. "It Doesn’t Know Anything About my Work": Participatory Benchmarking and AI Evaluation in Applied Settings

    Elizabeth Anne Watkins, Emanuel Moss, Ramesh Manuvinakurike, Christopher Persaud, Giuseppe Raffa, Lama Nachman · PDF
  2. A Benchmark for Description-Based Evaluation of Social Bias in LLMs

    Jinhao Pan, Kyle Li, Bowen Wei, Ziwei Zhu · PDF
  3. A Case for Centaur Evaluations

    Andreas Haupt, Erik Brynjolfsson · PDF
  4. A Multi-Aspect Evaluation of Dialogue in Pythia

    Zixun Chen, Petr Babkin, Akshat Gupta, Gopala Anumanchipalli, Xiaomo Liu · PDF
  5. A Protocol-Driven Platform for Agent-Agnostic Evaluation of LLM Agents

    Cong Minh Tran, Issam Falih, Hatim CHAHDI, Romain DE LA SOUCHERE · PDF
  6. A Statistical Framework for Game-Based AI Evaluation

    Felipe Maia Polo, Leshem Choshen, Yuekai Sun, Kristjan Greenewald · PDF
  7. A Systematic Evaluation of Preference Aggregation in Federated RLHF for Pluralistic Alignment of LLMs

    Mahmoud Srewa, Tianyu Zhao, Salma Elmalaki · PDF
  8. Active Model Selection for Large Language Models

    Yavuz Durmazkeser, Patrik Okanovic, Andreas Kirsch, Torsten Hoefler, Nezihe Merve Gürel · PDF
  9. ADCA: Artifact-Based Dataset Creativity Assessment

    Harrison Sims, Gabriel Ganberg, Robert McCormack, Svitlana Volkova · PDF
  10. Adversarial Behavior in Research Settings: Conducting Sabotage Evaluations with RE-Bench

    Harini Rajakumar, Vanessa Nwauwa, Kevin Zhu, Ashwinee Panda, Sunishchal Dev · PDF
  11. AgentCaster: Reasoning-Guided Tornado Forecasting

    Michael Chen · PDF
  12. Agentic Lean Autoformalization (ALA) v1: An LLM collaborative approach to autoformalization in LEAN

    Patricio Gallardo, Maziar Raissi, Ke Zhang, Sudhir Murthy · PDF
  13. An Evaluation Study of Hybrid Methods for Multilingual PII Detection

    Harshit Rajgarhia, Suryam Gupta, Asif Shaik, Gulipalli Praveen Kumar, Y Santhoshraj, Sanka Nithya Tanvy Nishitha, Abhishek Mukherji · PDF
  14. Analyzing Uncertainty of LLM-as-a-Judge: Interval Evaluations with Conformal Prediction

    Huanxin Sheng, Xinyi Liu, Hangfeng He, Jieyu Zhao, Jian Kang · PDF
  15. ASCII-Bench: Evaluating Language-Model-Based Understanding of Visually-Oriented Text

    Kerry Luo, Joshua Peguero, Anvay Patil, Megan Van Overborg, Ryan Sarmiento, Kevin Zhu · PDF
  16. AssertBench: A Benchmark for LLM Resistance to User-Induced Factual Bias

    Jaeho Lee, Atharv Chowdhary · PDF
  17. Attention, Please: Single-Head Cross-Attention for Unified LLM Routing

    Roshini Pulishetty, Mani Kishan Ghantasala, Keerthy Kaushik Dasoju, Niti Mangwani, Vishal Garimella, Aditya Mate, Somya Chatterjee, Yue Kang, Ehi Nosakhare, Sadid A. Hasan, Soundararajan Srinivasan · PDF
  18. Automated Capability Evaluation of Foundation Models

    Arash Afkanpour, Omkar Dige, Fatemeh Tavakoli · PDF
  19. Automatic agent chaining for multimodal task support

    Ramesh Manuvinakurike, Celal Savur, Emanuel Moss, Elizabeth Anne Watkins, Saurav Sahay, Giuseppe Raffa · PDF
  20. Automatically Extracting Scientific Metrics with LLMs: A Case Study of ImageNet Papers

    Mengli Duan, Michael Guerzhoy · PDF
  21. Bayesian Evaluation of Blackbox LLM Behavior

    Rachel Longjohn, Shang Wu, Saatvik Kher, Catarina G Belém, Padhraic Smyth · PDF
  22. BEAR: Benchmarking Multimodal Language Models for Atomic Embodied Reasoning Abilities

    Yu Qi, Haibo Zhao, Ziyu Guo, Siyuan Ma, Ziyan Chen, Yaokun Han, Renrui Zhang, Zitiantao Lin, Shiji Xin, Yijian Huang, Kai Cheng, Peiheng Wang, jiazheng liu, Jiayi Zhang, Yizhe Zhu, Wenqing Wang, Yiran Qin, Xupeng Zhu, Haojie Huang, Lawson L.S. Wong · PDF
  23. Benchmark Agreement Testing Done Right: A Guide for LLM Benchmark Evaluation

    Yotam Perlitz, Ariel Gera, Ofir Arviv, Asaf Yehudai, Elron Bandel, Eyal Shnarch, Michal Shmueli-Scheuer, Leshem Choshen · PDF
  24. Benchmarking and Standardization of Evaluation Protocols: A Feedback-Driven Framework Using LLM Judges to Gatekeep and Iteratively Improve Synthetic Benchmarks

    FadillAmir · PDF
  25. Benchmarking Overton Pluralism in LLMs

    Elinor Poole-Dayan, Jiayi Wu, Jiaxin Pei, Michiel A. Bakker · PDF
  26. Beyond Accuracy: A Diagnostic Protocol for Fairly Evaluating Multimodal Reasoning

    Shohreh Ghorbani, Chenyu Zhang, Minsol Kim, Jingyao Wu · PDF
  27. Beyond Fertility: Analyzing STRR as a Metric for Multilingual Tokenization Evaluation

    Mir Tafseer Nayeem, Sawsan Alqahtani, Md Tahmid Rahman Laskar, Tasnim Mohiuddin, M Saiful Bari · PDF
  28. Beyond Steering: Evaluating Fine-Grained and Multi-Concept Control in LLMs

    Arya Labroo, Ivaxi Sheth, Vyas Raina, Amaani Ahmed, Mario Fritz · PDF
  29. Beyond the Singular: Revealing the Value of Multiple Generations in Benchmark Evaluation

    Wenbo Zhang, Hengrui Cai, Wenyu Chen · PDF
  30. Beyond Western Politics: Cross-Cultural Benchmarks for Evaluating Partisan Associations in LLMs

    Divyanshu Kumar, Ishita Gupta, Nitin Aravind Birur, Tanay Baswa, Sahil Agarwal, Prashanth Harshangi · PDF
  31. Bias in the Picture: Benchmarking VLMs with Social-Cue News Images and LLM-as-Judge Assessment

    Aravind Narayanan, Vahid Reza Khazaie, Shaina Raza · PDF
  32. BloomXplain: A Framework and Benchmark Dataset for Pedagogically Sound LLM-Generated Explanations Based on Bloom’s Taxonomy

    Maria-Eleni Zoumpoulidi, Eleni Batsi, Georgios Paraskevopoulos, Vassilis Katsouros, Alexandros Potamianos · PDF
  33. Born with a SilverSpoon? Investigating Socioeconomic Bias in LLMs

    Smriti Singh, Shuvam Keshari, Vinija Jain, Aman Chadha · PDF
  34. Breaking Down Video LLM Benchmarks: Knowledge, Spatial Perception, or True Temporal Understanding?

    Bo Feng, Zhengfeng Lai, Shiyu Li, Zizhen Wang, Xiaoming Simon Wang, Ping Huang, Meng Cao · PDF
  35. Breaking the Mirror: Examining Self-Preference in LLM Evaluators through Activation-Based Representations

    Dani Roytburg, Matthew Bozoukov, Hongyu Fu, Matthew Nguyen, Jou Barzdukas, Narmeen Fatimah Oozeer · PDF
  36. Building More Accountable Multi-Modal LLMs Through Spatially-Informed Visual Reasoning

    Jing Wu, Suiyao Chen, Alexander Gutfraind, Inseok Heo, Shengjie Liu, Chen Li, Jeremy Curuksu, Michael Sharps · PDF
  37. Carbon- and System-Aware LoRA Scaling for On-Device LLMs via Hierarchical Multi-Objective Reinforcement Learning

    Dongqi Zheng, Wenjin Fu · PDF
  38. Causally Quantifying the Effect of Test Set Contamination on Generative Benchmarks

    Rylan Schaeffer, Brando Miranda, Joshua Kazdan, Ken Liu, Ahmed M Ahmed, Niloofar Mireshghallah, Sanmi Koyejo · PDF
  39. CAVE: Detecting and Explaining Commonsense Anomalies in Visual Environments

    Rishika Bhagwatkar, Syrielle Montariol, Angelika Romanou, Beatriz Borges, Irina Rish, Antoine Bosselut · PDF
  40. CCWise: Carbon–Cost Aware Regional LLM Orchestration for Next-Gen Sustainable AI

    Ratul Kishore Saha, Dheeraj Chahal, Rekha Singhal, Manoj Nambiar · PDF
  41. ChatChecker: A Framework for Dialogue System Testing Through Non-cooperative User Simulation

    Roman Mayr, Michel Schimpf, Thomas Bohné · PDF
  42. ChEmREF: Evaluating Language Model Readiness for Chemical Emergency Response Assistance

    Risha Surana, Qinyuan Ye, Swabha Swayamdipta · PDF
  43. CHEMSETS: How Capable Are Chemistry LLMs?

    Christoph Bartmann, Mykyta Ielanskyi, Johannes Schimunek, Philipp Seidl, Günter Klambauer, Sohvi Luukkonen · PDF
  44. ChinaTravel: An Open-Ended Benchmark for Language Agents in Chinese Travel Planning

    Jie-Jing Shao, Bo-Wen Zhang, Xiao-Wen Yang, Baizhi Chen, Siyu Han, Wen-Da Wei, Guohao Cai, Zhenhua Dong, Lan-Zhe Guo, Yu-Feng Li · PDF
  45. CivicParse: A Benchmark and Pipeline for Structured Online Deliberation

    Abhay Gupta, Mark Klein · PDF
  46. Confident or Seek Stronger: Exploring Uncertainty-Based Small LM Routing From Benchmarking to Generalization

    Yu-Neng Chuang, Leisheng Yu, Guanchu Wang, Lizhe Zhang, Ling Chang, Hongyi Liu, Zirui Liu, Xuanting Cai, Yang Sui, Vladimir Braverman, Xia Hu · PDF
  47. Context-Masked Meta-Prompting for Privacy-Preserving LLM Adaptation in Finance

    Sayash Raaj Hiraou · PDF
  48. Culturally-Aware Conversations: A Framework & Benchmark for LLMs

    Shreya Havaldar, Young Min Cho, Sunny Rai, Lyle Ungar · PDF
  49. Data Centric Guard (DC-Guard) - A Framework for Trustworthy LLM Evaluation

    Vishnu Vardhan Yadoji · PDF
  50. DeepScholar-Bench: A Live Benchmark and Automated Evaluation for Generative Research Synthesis

    Liana Patel, Negar Arabzadeh, Harshit Gupta, Ankita Sundar, Ion Stoica, Matei Zaharia, Carlos Guestrin · PDF
  51. Demystify the Potential of Large Language Models as General-Purpose Surrogate Code Executors

    Bohan Lyu, Siqiao Huang, Zichen Liang, Wenjia Yang, Qian Sun, Jiaming Zhang · PDF
  52. Depth as a Scaling Vector: Simple Pruning and Evaluation of Emergent Abilities in Pruned LLMs

    Chang Liu, Arjun Choudhry, Yifu Cai, Nina Żukowska, Mononito Goswami, Artur Dubrawski · PDF
  53. Detecting Data Contamination in LLMs via In-Context Learning

    Michał Zawalski, Meriem Boubdir, Klaudia Bałazy, Besmira Nushi, Pablo Ribalta · PDF
  54. Detecting Foreign Content in Self-Generated Text: A Recognition Study of Large Language Models

    Shengyu Zhu, Tamika Bassman, Dat Tran, Aryaman Arora · PDF
  55. Detecting Training Data of Large Language Models via Expectation Maximization

    Gyuwan Kim, Yang Li, Evangelia Spiliopoulou, Jie Ma, Miguel Ballesteros, William Yang Wang · PDF
  56. DHP Benchmark: Measuring Discernment Ability of LLM-as-a-Judge

    Jiayi Yuan, Yicheng Wang, Yu-Neng Chuang, Zhuoer Wang, Mark Cusick, Param Kulkarni, Zhengping Ji, Xia Hu · PDF
  57. Discovering Knowledge Deficiencies of Language Models on Massive Knowledge Base

    Linxin Song, Xuwei Ding, Jieyu Zhang, Taiwei Shi, Ryotaro Shimizu, Rahul Gupta, Yang Liu, Jian Kang, Jieyu Zhao · PDF
  58. Do Large Language Models Know What They Are Capable Of?

    Casey O. Barkan, Sidney Black, Oliver Sourbut · PDF
  59. Domain-Aware Scaling Laws Uncover Data Synergy

    Kimia Hamidieh, Lester Mackey, David Alvarez-Melis · PDF
  60. DyePack: Provably Flagging Test Set Contamination in LLMs Using Backdoors

    Yize Cheng, Wenxiao Wang, Mazda Moayeri, Soheil Feizi · PDF
  61. Editing as Unlearning: Are Knowledge Editing Methods Strong Baselines for Large Language Model Unlearning?

    Zexi Li, Xiangzhu Wang, William F. Shen, Meghdad Kurmanji, Xinchi Qiu, Dongqi Cai, Chao Wu, Nicholas D. Lane · PDF
  62. Evaluating AI Alignment Using Adapted Clinical Empathy Assessments

    Cassandra Feilbach · PDF
  63. Evaluating Cultural and Linguistic Alignment Across the LLMs

    Yunxi Liu, Fuxiao Liu, Clara Fangfang Ma · PDF
  64. Evaluating Evaluation Metrics – The Mirage of Hallucination Detection

    Atharva Kulkarni, Yuan Zhang, Joel Ruben Antony Moniz, Xiou Ge, Bo-Hsiang Tseng, Dhivya Piraviperumal, Swabha Swayamdipta, Hong Yu · PDF
  65. Evaluating Language Models' Evaluations of Games

    Katherine M. Collins, Cedegao E. Zhang, Graham Todd, Lance Ying, Mauricio Barba da Costa, Ryan Liu, Adrian Weller, Ionatan Kuperwajs, Lionel Wong, Joshua B. Tenenbaum, Thomas L. Griffiths · PDF
  66. Evaluating LLM Story Generation through Large-scale Network Analysis on Social Structures

    Hiroshi Nonaka, K. E. Perry · PDF
  67. Evaluating LLM-as-a-Judge under Multilingual, Multimodal and Multi-domain Constraints

    Shreyansh Padarha, Elizaveta Semenova, Bertie Vidgen, Adam Mahdi, Scott A. Hale · PDF
  68. Evaluating LLMs for Combinatorial Optimization: One-Phase and Two-Phase Heuristics for 2D Bin-Packing

    Syed Mahbubul Huq, Daniel Brito-Pacheco, Daniel Sikar, RAJESH MOJUMDER, Christopher Child, Tillman Weyde · PDF
  69. Evaluating LLMs' Language Confusion in Code-switching Context

    Juhyun Oh, Haneul Yoo, Alice Oh · PDF
  70. Evaluation and Benchmarking Suite for Financial Large Language Models and Agents

    Shengyuan Lin, Jaisal Patel, Qinchuan Zhang, Kaiwen He, Keyi Wang, Yan Wang, Matt White, Kairong Xiao, Xiao-Yang Liu · PDF
  71. Evaluation Under Imperfect Benchmarks and Ratings: A Case Study in Text Simplification

    Joseph Liu, Yoonsoo Nam, Xinyue Cui, Swabha Swayamdipta · PDF
  72. Extending AutoCompressors via Surprisal-Based Dynamic Segmentation

    Srivishnu Ramamurthi, Richard Xu, Raine Ma, Dawson Park, David Guo, Charles Duong, Vasu Sharma, Sean O'Brien, Kevin Zhu · PDF
  73. FEval-TTC: Fair Evaluation Protocol for Test-Time Compute

    Pavel Rumiantsev, Soumyasundar Pal, Yingxue Zhang, Mark Coates · PDF
  74. From Acceleration to Saturation: Scaling Behavior of Bootstrapped Language Model Pretraining

    Seng Pei Liew, Takuya Kato · PDF
  75. From Bias to Balance: How Multilingual Dataset Composition Affects Tokenizer Performance Across Languages

    Aishwarya Selvamurugan, Raj Dandekar, Rajat Dandekar, Sreedath Panat · PDF
  76. From Many Voices to One: Statistically Principled Aggregation of LLM Judges

    Jitian Zhao, Changho Shin, Tzu-Heng Huang, Satya Sai Srinath Namburi GNVV, Frederic Sala · PDF
  77. GASLIGHTBENCH: Quantifying LLM Susceptibility to Social Prompting

    Xuanzhe Yao, Sahil Ghosh, Gareth Lee, William H. Logian, Lening Nick Cui, Ellie Podoshev, Swarit Srivastava, Michael Li, Aaron Sandoval, Sean O'Brien, Michael Saxon, Sunishchal Dev, Kevin Zhu · PDF
  78. Generation-Time vs. Post-hoc Citation: A Holistic Evaluation of LLM Attribution

    Yash Saxena, Raviteja Bommireddy, Ankur Padia, Manas Gaur · PDF
  79. GermanPartiesQA: Benchmarking Commercial Large Language Models and AI Companions for Political Alignment and Sycophancy

    Jan Batzner, Volker Stocker, Stefan Schmid, Gjergji Kasneci · PDF
  80. GUARD: Guiding Unbiased Alignment through Reward Debiasing

    Advay Samnerkar, Doelle Bhattacharya, Kailash Ranganathan, Ashwinee Panda, Kevin Zhu · PDF
  81. Haystack Engineering: Context Engineering Meets the Long-Context Challenge in Large Language Models

    Mufei Li, Dongqi Fu, Limei Wang, Si Zhang, Hanqing Zeng, Kaan Sancak, Ruizhong Qiu, Haoyu Peter Wang, Xiaoxin He, Xavier Bresson, Yinglong Xia, Chonglin Sun, Pan Li · PDF
  82. HORIZON: A Benchmark for In-the-wild User Behaviour Modeling

    Arnav Goel, Pranjal A Chitale, Bhawna Paliwal, Bishal Santra, Amit Sharma · PDF
  83. How Benchmark Prediction from Fewer Data Misses the Mark

    Guanhua Zhang, Florian E. Dorner, Moritz Hardt · PDF
  84. How Many Instructions Can LLMs Follow at Once?

    Daniel Jaroslawicz, Brendan Whiting, Parth Shah, Karime Maamari · PDF
  85. How to Get Your LLM to Generate Challenging Problems for Evaluation

    Arkil Patel, Siva Reddy, Dzmitry Bahdanau · PDF
  86. Human-Centric Framework for Large Multimodal Models Evaluation

    Shaina Raza, Aravind Narayanan, Vahid Reza Khazaie, Ashmal Vayani, Mukund Sayeeganesh Chettiar, Deval Pandya · PDF
  87. Husky Hold'em Benchmark: Can LLMs Design Competitive Poker Bots?

    Bhavesh Kumar, Hoang Doan Nguyen, Roger Jin, Ryan Teknium, Jeffrey Quesnelle · PDF
  88. HypoTermInstruct: Instructing Large Language Models not to Hallucinate

    CEM ULUOGLAKCI, Tugba Taskaya Temizel · PDF
  89. Improving Automated LLM Evaluation by Introducing Personas in LLM Red-Teaming

    Wesley Deng, Sunnie S. Y. Kim, Akshita Jha, Ken Holstein, Motahhare Eslami, Lauren Wilcox, Leon Alexander Gatys · PDF
  90. In-Context Learning for Esoteric Programming Languages: Evaluating and Enhancing LLM Reasoning Without Fine-Tuning

    Saraswathy Amjith, Michael X. Wang, Arul Kolla, Jayson Lynch, Neil Thompson · PDF
  91. In-Context Meta-Learning with Large Language Models for Automated Model and Hyperparameter Selection

    Youssef Attia El Hili, Albert Thomas, Abdelhakim Benechehab, Corentin Léger, Corinne Ancourt, Balázs Kégl · PDF
  92. JOINTMMSAFE: A Combinatorial Safety Benchmark for Multimodal Foundation Models

    Shruti Palaskar, Leon Alexander Gatys, Mona Abdelrahman, Mar Jacobo, Laurence F Lindsey, Gunnar Lund, Yang Xu, Navid Shiee, Jeffrey P. Bigham, Charles Maalouf, Joseph Yitan Cheng · PDF
  93. Justice in Judgment: Unveiling (Hidden) Bias in LLM-Assisted Peer Reviews

    Sai Suresh Macharla Vasu, Ivaxi Sheth, Hui-Po Wang, Ruta Binkyte, Mario Fritz · PDF
  94. Know Me, Respond to Me: Benchmarking LLMs for Dynamic User Profiling and Personalized Responses at Scale

    Bowen Jiang, Zhuoqun Hao, Young Min Cho, Bryan Li, Yuan Yuan, Sihao Chen, Lyle Ungar, Camillo Jose Taylor, Dan Roth · PDF
  95. Knowledge Collapse in LLMs: When Fluency Survives but Facts Fail under Recursive Synthetic Training

    Figarri Keisha, Zekun Wu, Ze Wang, Adriano Koshiyama, Philip Colin Treleaven · PDF
  96. LaTeXBench: Judge-Only Evaluation of LaTeX Generation, Minimal-Edit Compliance, and Blind Contrast Errors

    Ishaan Gangwani, Soham Sen, Aayam Bansal · PDF
  97. Learning from Generalization Patterns: An Evaluation-Driven Approach to Enhanced Data Augmentation for Fine-Tuning Small Language Models

    Huan Song, Deeksha Razdan, Yiyue Qian, Arijit Ghosh Chowdhury, Parth Patwa, Aman Chadha, Shinan Zhang, Sharlina Keshava, Hannah R Marlowe · PDF
  98. LinguaMark: Do Multimodal Models Speak Fairly? A Benchmark-Based Evaluation

    Ananya Raval, Aravind Narayanan, Vahid Reza Khazaie, Shaina Raza · PDF
  99. LLMs as Judges for Domain-Specific Text: Evidence from Drilling Reports

    Abdallah Benzine, Soumyadipta Sengupta, Sebastiaan Buiting, Imane Khaouja, Yahia Salaheldin Shaaban, Amine EL KHAIR · PDF
  100. LLMs Show Surface-Form Brittleness Under Paraphrase Stress Tests

    Juan Miguel Navarro Carranza · PDF
  101. LLMs vs. Traditional Sentiment Tools in Psychology: An Evaluation on Belgian-Dutch Narratives

    Ratna Kandala, Katie Hoemann · PDF
  102. MAGNET: Mathematical Assurance of Generative AI Network Evaluation Toolkit

    Jon Crall, David Joy, Roderic Collins, Benjamin Fenelon, Anthony Hoogs, Brian H Hu · PDF
  103. MC-Search: Benchmarking Multimodal Agentic RAG with Structured Reasoning Chains

    Xuying Ning, Dongqi Fu, Tianxin Wei, Mengting Ai, Jiaru Zou, Ting-Wei Li, Jingrui He · PDF
  104. MEAL: A Multi-dimensional Evaluation of Alignment Techniques for LLMs

    Muneeza Azmat, Momin Abbas, Maysa Macedo, Marcelo Carpinette Grave, Luan Soares de Souza, Tiago Lemos de Araujo Machado, Rogério Abreu de Paula, Raya Horesh, Yixin Chen, Heloisa Candello, Rebecka Nordenlöw, Aminat Adebiyi · PDF
  105. Measurement to Meaning: A Validity-Centered Framework for AI Evaluation

    Olawale Elijah Salaudeen, Anka Reuel, Ahmed M Ahmed, Suhana Bedi, Zachary Robertson, Sudharsan Sundar, Benjamin W. Domingue, Angelina Wang, Sanmi Koyejo · PDF
  106. MedBrowseComp: Benchmarking Medical Deep Research and Computer Use

    Shan Chen, Pedro José Ferreira Moreira, Yuxin Xiao, Samuel Schmidgall, Jeremy L. Warner, Hugo Aerts, Thomas Hartvigsen, Jack Gallifant, Danielle Bitterman · PDF
  107. Medical AI Consensus: A Multi-Agent Framework for Radiology Report Generation and Evaluation

    Ahmed Tamer El Boardy, Ghada Khoriba, Essam Rashed · PDF
  108. MermaidSeqBench: An Evaluation Benchmark for LLM-to-Mermaid Sequence Diagram Generation

    Basel Shbita, Farhan Ahmed, Chad DeLuca · PDF
  109. Metrics for Holistic Evaluation of LLM Reasoning about Action, Change, and Planning

    Anil B Murthy, Jaron Mink, Lindsay Sanneman · PDF
  110. Mind the Gap: Evaluating Model- and Agentic-Level Vulnerabilities in LLMs with Action Graphs

    Ilham Wicaksono, Zekun Wu, Rahul Patel, Theo King, Adriano Koshiyama, Philip Colin Treleaven · PDF
  111. MiSCHiEF: A Benchmark in Minimal-Pairs of Safety and Culture for Holistic Evaluation of Fine-Grained Image-Caption Alignment

    Sagarika Banerjee, Tangatar Madi, Advait Swaminathan, Nguyen Dao Minh Anh, Shivank Garg, Kevin Zhu, Vasu Sharma · PDF
  112. Mitigating Self-Preference by Authorship Obfuscation

    Taslim Mahbub, Shi Feng · PDF
  113. MonitorLLM: Real-Time Structural and Bias Evaluation of Generative AI through Knowledge Graphs

    Mohd Ariful Haque, kishor datta gupta, Roy George · PDF
  114. MORPHOGEN: A Multilingual Benchmark for Evaluating Gender-Aware Morphological Generation

    Aditya Aggarwal, Mehul Agarwal, Arnav Goel, Medha Hira, Anubha Gupta · PDF
  115. Narrow RL Induces Broad Behavior Changes in LLMs

    Jo J. Jiao, Austin C. Kozlowski, James Evans · PDF
  116. Network Dynamics Reasoning: A Novel Benchmark for Evaluating Multi-Step Inference in Large Language Models

    Andrew Bae, Saaketh Bhojanam, Laksh Patel · PDF
  117. No Question, No Passage, No Problem: Investigating Artifact Exploitation and Reasoning in Multiple-Choice Reading Comprehension

    Anthony Cui, Rohan Raj Butani, Theodore Oltean · PDF
  118. No-Human in the Loop: Agentic Evaluation at Scale for Recommendation

    Tao Zhang, Kehui Yao, Luyi Ma, Reza Yousefi Maragheh, Jiao Chen, Kai Zhao, Jianpeng Xu, Evren Korpeoglu, Sushant Kumar, Kannan Achan · PDF
  119. On Evaluating Methods vs. Evaluating Models

    Olawale Elijah Salaudeen, Florian E. Dorner, Peter Hase · PDF
  120. OpenGovCorpus: Evaluating LLMs on Citizen Query Tasks

    Neil Majithia, Rajat Shinde, Manil Maskey, Elena Simperl · PDF
  121. OPTiCAL: An Abstract Positional Reasoning Benchmark for Vision Language Models

    Christopher Driggers-Ellis, Gabriel Ayoubi, Christan Grant · PDF
  122. Paraphrasing Away Malicious Tokens: Improving LLM Finetuning Safety by Filtering Spurious Correlation

    Marcel Mateos Salles, Praney Goyal, Pradyut Sekhsaria, Hai Huang, Randall Balestriero · PDF
  123. PEBBLE: A Pedagogical and SRL-Aware Benchmark for Evaluating LLM Tutors

    Ishaan Gangwani, Harrish Ayyanar, Arjun Rawal · PDF
  124. Personality as a Probe for LLM Evaluation: Method Trade-offs and Downstream Effects

    Gunmay Handa, Zekun Wu, Adriano Koshiyama, Philip Colin Treleaven · PDF
  125. Physics Supernova: AI Agent Matches Elite Gold Medalists at IPhO 2025

    Jiahao Qiu, Jingzhe Shi, Xinzhe Juan, Zelin Zhao, Jiayi Geng, Shilong Liu, Hongru WANG, Sanfeng Wu, Mengdi Wang · PDF
  126. PosterSum: A Multimodal Benchmark for Scientific Poster Summarization

    Rohit Saxena, Pasquale Minervini, Frank Keller · PDF
  127. Precision Shapes Personality: The Hidden Cost of Quantization in Sub-Billion-LLMs

    Soham Sen, Ishaan Gangwani · PDF
  128. Precursors, Proxies, and Predictive Models for Long-Horizon Tasks

    Samuel F. Brown, Jaco Du Toit, Leo Hyams, Daniil Anisimov · PDF
  129. Predicting Emergent Software Engineering Capabilities by Fine-tuning

    Jason J Jackson, Terry Huang, Henry Velasquez, Kevin Zhu, Sunishchal Dev · PDF
  130. Probe-Rewrite-Evaluate: A Workflow for Reliable Benchmarks and Quantifying Evaluation Awareness

    Lang Xiong, Nishant Bhargava, Jeremy Chang, Jianhang Hong, Haihao Liu, Vasu Sharma, Kevin Zhu · PDF
  131. Probing Reasoning Flaws and Safety Hierarchies with Chain-of-Thought Difference Amplification

    Kamesh R · PDF
  132. Progress over Points: Reframing LM Benchmarks Around Scientific Objectives

    Alwin Jin, Sean M. Hendryx, Vaskar Nath · PDF
  133. Prompt Genotyping: Quantifying the Evaluation Gap Between Synthetic Benchmarks and Real LLM Performance

    Sohum Mehta, Saaketh Bhojanam · PDF
  134. PULSE: Practical Evaluation Scenarios for Large Multimodal Model Unlearning

    Tatsuki Kawakami, Kazuki Egashira, Atsuyuki Miyai, Go Irie, Kiyoharu Aizawa · PDF
  135. R3: Robust Rubric-Agnostic Reward Models

    David Anugraha, Zilu Tang, Lester James Validad Miranda, Hanyang Zhao, Shou-Yi Hung, Mohammad Rifqi Farhansyah, Garry Kuwanto, Derry Tanti Wijaya, Genta Indra Winata · PDF
  136. RADAR: Mechanistic Pathways for Detecting Data Contamination in LLM Evaluation

    Ashish Kattamuri, Harshwardhan Fartale, Arpita Vats, Rahul Raja, Ishita Prasad · PDF
  137. Recovery-Bench: Evaluating Agentic Recovery from Mistakes

    Shangyin Tan, Kevin Lin, Koushik Sen, Matei Zaharia · PDF
  138. Reference-Specific Unlearning Metrics Can Hide the Truth: A Reality Check

    Sungjun Cho, Dasol Hwang, Frederic Sala, Sangheum Hwang, Kyunghyun Cho, Sungmin Cha · PDF
  139. RefusalBench: Generative Evaluation of Selective Refusal in Grounded Language Models

    Aashiq Muhamed, Leonardo F. R. Ribeiro, Markus Dreyer, Virginia Smith, Mona T. Diab · PDF
  140. RELIC: Evaluating Compositional Instruction Following via Language Recognition

    Jackson Petty, Michael Y. Hu, Wentao Wang, Shauli Ravfogel, William Merrill, Tal Linzen · PDF
  141. RepliBench: Evaluating the Autonomous Replication Capabilities of Language Model Agents

    Sidney Black, Asa Cooper Stickland, Jake Pencharz, Oliver Sourbut, Michael Schmatz, Jay Bailey, Ollie Matthews, Ben Millwood, Alex Remedios, Alan Cooney · PDF
  142. Rethinking Kernel Program Repair: Benchmarking and Enhancing LLMs with RGym

    Kareem Shehada, Yifan Wu, Wyatt D. Feng, Adithya Iyer, Gryphon Kumfert, Yangruibo Ding, Zhiyun Qian · PDF
  143. Rethinking MCQ Benchmarks: Mandatory Reasoning Evaluation Reveals Significant Performance Drops in Large Language Models

    Yue Zhang, Nhan Nguyen · PDF
  144. Retrieval Capabilities of Large Language Models Scale with Pretraining FLOPs

    Jacob Portes, Connor Jennings, Erica Ji Yuen, Sasha Doubov, Michael Carbin · PDF
  145. Reward Model Overoptimisation in Iterated RLHF

    Lorenz Wolf, Robert Kirk, Mirco Musolesi · PDF
  146. RIMO: An Easy-to-Evaluate, Hard-to-Solve Olympiad Benchmark for Advanced Mathematical Reasoning

    Ziye Chen, Chengwei Qin, Yao Shu · PDF
  147. RULERv2: From Basic Retrieval to Complex Reasoning, A Bottom-Up Benchmark for Long-Context Evaluation

    Cheng-Ping Hsieh, Faisal Ladhak, Krishna C Puvvada, Boris Ginsburg · PDF
  148. SAGE: A Realistic Benchmark for Semantic Understanding

    Samarth Goel, Reagan Lee, Kannan Ramchandran · PDF
  149. SATBench: Benchmarking LLMs' Logical Reasoning via Automated Puzzle Generation from SAT Formulas

    Anjiang Wei, Yuheng Wu, Yingjia Wan, Tarun Suresh, Huanmi Tan, Zhanke Zhou, Sanmi Koyejo, Ke Wang, Alex Aiken · PDF
  150. Scaling Laws for Upcycling Mixture-of-Experts Language Models

    Seng Pei Liew, Takuya Kato, Sho Takase · PDF
  151. Schema Lineage Extraction at Scale: Multilingual Pipelines, Composite Evaluation, and Language-Model Benchmarks

    Jiaqi Yin, Yi-Wei Chen, MENG-LUNG LEE, Xiya Liu · PDF
  152. Search-Time Data Contamination

    Ziwen Han, Meher Mankikar, Julian Michael, Zifan Wang · PDF
  153. Self-Correction Bench: Revealing the Self-Correction Blind Spot in LLMs

    Ken Tsui · PDF
  154. Self-Evaluating LLMs for Multi-Step Tasks: Stepwise Confidence Estimation for Failure Detection

    Vaibhav Mavi, Shubh Jaroria, Weiqi Sun · PDF
  155. Silent Tokens, Loud Effects: Padding in LLMs

    Rom Himelstein, Amit LeVi, Yonatan Belinkov, Avi Mendelson · PDF
  156. Small Changes, Large Consequences: Analyzing the Allocational Fairness of LLMs in Hiring Contexts

    Preethi Seshadri, Hongyu Chen, Sameer Singh, Seraphina Goldfarb-Tarrant · PDF
  157. Smarter Sampling for LLM Judges: Reliable Evaluation on a Budget

    Alyssa Unell, Natalie Dullerud, Nigam Shah, Sanmi Koyejo · PDF
  158. SWE-InfraBench: Evaluating Language Models on Cloud Infrastructure Code

    Natalia Tarasova, Enrique Balp-Straffon, Aleksei Iancheruk, Yevhenii Sielskyi, Nikita Kozodoi, Liam H. Byrne, Jack Butler, Dayuan jiang, Marcin Czelej, Andrew Ang, Yash Shah, Roi Blanco, Sergei Ivanov · PDF
  159. Sycophancy Claims about Language Models: The Missing Human-in-the-Loop

    Jan Batzner, Volker Stocker, Stefan Schmid, Gjergji Kasneci · PDF
  160. T-FIX: Text-Based Explanations with Features Interpretable to eXperts

    Shreya Havaldar, Helen Jin, Chaehyeon Kim, Anton Xue, Weiqiu You, Gary E. Weissman, Rajat Deo, Sameed Ahmed M. Khatana, Helen Qu, Marco Gatti, Daniel A Hashimoto, Amin Madani, Masao Sako, Bhuvnesh Jain, Lyle Ungar, Eric Wong · PDF
  161. Talking with Oompa Loompas: A novel framework for evaluating linguistic acquisition of LLMs

    Sankalp Tattwadarshi Swain, Anshika Krishnatray, Dhruv Kumar, Jagat Sesh Challa · PDF
  162. The Contamination Paradox: Why Test Set Leakage Can Be Both Potent and Negligible

    Rylan Schaeffer, Ken Liu, Brando Miranda, Ahmed M Ahmed, Niloofar Mireshghallah, Sanmi Koyejo · PDF
  163. The Impact of Post-training on Data Contamination

    Muhammed Yusuf Kocyigit · PDF
  164. The Measure of All Measures: Quantifying LLM Benchmark Quality

    Jihan Yao, Peter Jin, Ke Bao, Qiaolin Yu, Khushi Bhardwaj, Chang Su, Jialei Wang, YIKAI ZHU, Sugam Devare, Damon Mosk-Aoyama, Zhen Dong, Venkat Krishna Srinivasan, Yineng Zhang, Oleksii Kuchaiev, Jiantao Jiao, Banghua Zhu · PDF
  165. The Narcissus Hypothesis: Descending to the Rung of Illusion

    Riccardo Cadei, Christian Internò · PDF
  166. The Ouroboros of Benchmarking: Reasoning Evaluation in an Era of Saturation

    İbrahim Ethem Deveci, Duygu Ataman · PDF
  167. The Price of Progress: Algorithmic Efficiency and the Falling Cost of AI Inference

    Hans Gundlach, Jayson Lynch, Matthias Mertens, Neil Thompson · PDF
  168. The Shepherd Test: How Will SuperIntelligent Agents Balance Care and Control in Asymmetric Relationships?

    Djallel Bouneffouf, Matthew Riemer, Kush R. Varshney · PDF
  169. The Social Laboratory: A Psychometric Framework for Multi-Agent LLM Evaluation

    Zarreen Reza · PDF
  170. Towards Dynamic KV-Cache Compression: Fine-Grained Evaluation of Key and Value Ranks in LLMs

    Jian Chen, Zhuoran Wang, Jiayu Qin, Ming Li, Meng Wang, Changyou Chen, Yin Chen, Qizhen Weng, Yirui Liu · PDF
  171. Towards Multilingual Mechanistic Interpretability

    Yanan Long · PDF
  172. Towards Real-World Evaluation of Agentic Work in Freelance Marketplaces

    Mattie Terzolo, Darvin Yi, Teng Liu, Lance Hasson, Ayan Sinha, Pablo N. Mendes, Andrew Rabinovich · PDF
  173. Towards Transparent Reasoning: What Drives Faithfulness in Large Language Models?

    Teague McMillan, Gabriele Dominici, Martin Gjoreski, Marc Langheinrich · PDF
  174. Train-before-Test Harmonizes Language Model Rankings

    Guanhua Zhang, Ricardo Dominguez-Olmedo, Moritz Hardt · PDF
  175. TrolleyBench: Evaluating Emergent Moral Reasoning and Consistency in LLMs

    Andrew Zhu · PDF
  176. Uncertainty Quantification for Language Models: Standardizing and Evaluating Black-Box, White-Box, LLM Judge, and Ensemble Scorers

    Dylan Bouchard, Mohit Singh Chauhan · PDF
  177. UQ: Assessing Language Models on Unsolved Questions

    Fan Nie, Ken Liu, Zihao Wang, Rui Sun, Wei Liu, Weijia Shi, Huaxiu Yao, Linjun Zhang, Andrew Y. Ng, James Zou, Sanmi Koyejo, Yejin Choi, Percy Liang, Niklas Muennighoff · PDF
  178. VLM-SlideEval: Evaluating VLMs on Structured Comprehension and Perturbation Sensitivity in PPT

    Hyeonsu B Kang, Yuwei Bao, Anjan Goswami · PDF
  179. When Benchmarks Age: Temporal Misalignment through Large Language Model Factuality Evaluation

    Xunyi Jiang, Dingyi Chang, Xin Xu · PDF
  180. When LLM Meets Time Series: Can LLMs Perform Multistep Time Series Reasoning and Inference

    Wen Ye, Jinbo Liu, Defu Cao, Wei Yang, Yan Liu · PDF
  181. Where Did It All Go Wrong? A Hierarchical Look into Multi-Agent Error Attribution

    Adi Banerjee, Anirudh Nair, Tarik Borogovac · PDF
  182. Who Routes the Router: Rethinking the Evaluation of LLM Routing Systems

    Jiayi Yuan, Yifan Lu, Rixin Liu, Yu-Neng Chuang, Hongyi Liu, Shaochen Zhong, Yang Sui, Guanchu Wang, Jiarong Xing, Xia Hu · PDF
  183. Who’s the Impostor? Multi-Agent Social Deduction for Evaluating LLM Social Reasoning

    Xiang Fu · PDF
  184. Whose Personae? Synthetic Persona Experiments in LLM Research and Pathways to Transparency

    Jan Batzner, Volker Stocker, Bingjun Tang, Anusha Natarajan, Qinhao Chen, Stefan Schmid, Gjergji Kasneci · PDF
  185. Why Do Multi-Agent LLM Systems Fail?

    Mert Cemri, Melissa Z Pan, Shuyi Yang, Lakshya A Agrawal, Bhavya Chopra, Rishabh Tiwari, Kurt Keutzer, Aditya Parameswaran, Dan Klein, Kannan Ramchandran, Matei Zaharia, Joseph E. Gonzalez, Ion Stoica · PDF
  186. YKSBench: Stress-Testing Multimodal Models with Exam-Style Questions

    Egemen Sert, Seyda Ertekin · PDF