ICLR 2026 Workshop: VerifAI-2: The Second Workshop on AI Verification in the Wild

Submission deadline: Feb 9, 2026, 11:59 UTC
Deadline synced from OpenReview: 2026-02-09 11:59 UTC (as of 2026-06-23). Extensions posted on OpenReview are applied automatically; verify on the official website.
Submission portal: OpenReview
Notes: Topics were auto-suggested and may be imprecise.

Accepted papers (39)

Fetched from OpenReview (v2) on 2026-06-10.

A NASH EQUILIBRIUM FRAMEWORK FOR TRAINING FREE MULTIMODAL STEP VERIFICATION
Rohit Sinha, Kunal Tilaganji, Tanuja Ganu, Nagarajan Natarajan, Amit Sharma, Vineeth N. Balasubramanian
A Minimal Agent for Automated Theorem Proving
Borja Requena, Austin Letson, Krystian Nowakowski, Izan Beltran Ferreiro, Leopoldo Sarra
Agentic Uncertainty Reveals Agentic Overconfidence
Jean Kaddour, Srijan Patel, Gbetondji Jean-Sebastien Dovonon, Leo Richter, Pasquale Minervini, Matt J. Kusner
Autoformalizing Memory Device Specifications with Agents
Jan Ole Ernst, Dmitri Michelangelo Saberi, Thomas Zimmermann, Derek Christ, Rajath Salegame, Suhaas M Bhat, Stanislav Levental, Thomas Dybdahl Ahle, Matthias Jung
Beaver: An Efficient Deterministic LLM Verifier
Tarun Suresh, Nalin Wadhwa, Debangshu Banerjee, Gagandeep Singh
Benchmarking Code Verification Strategies with LLMs-as-a-judge
Arnav Kumar Jain, Justin T Chiu, Tom Sherborne, Matthias Gallé
Beyond Self-Checking: Fragment-Level Verification Across Diverse LLMs
Ken Mueller, Arihant Choudhary, David Perez, Scott Mueller
Computational Arbitrage in AI Model Markets
Ricardo Olmedo, Bernhard Schölkopf, Moritz Hardt
Conv-to-Bench: Evaluating Language Models Via User–Assistant Dialogues In Code Tasks
Victor Moreli dos Santos, André Cerqueira Castro, Samuel Lopes de Souza Toledo, Bruno Moreira Lavalli Calura, Lisandra Cristina de Moura Menezes, Raul César Reis Mata, Telma Woerle de Lima Soares, Bryan Lincoln Marques de Oliveira
DafnyLLM: Pre-training Dafny Representations with Large Language Models for Code Verification
Shentong Mo
Do LLMs Game Formalization? Evaluating Faithfulness in Logical Reasoning
Kyuhee Kim, Auguste Poiroux, Antoine Bosselut
Do LLMs Really Struggle at NL-FOL Translation? Revealing their Strengths via a Novel Benchmarking Strategy
Andrea Brunello, Luca Geatti, Michele Mignani, Angelo Montanari, Nicola Saccomanno
Enforcing Temporal Constraints for LLM Agents
Adharsh Kamath, Sishen Zhang, Changming Xu, Shubham Ugare, Gagandeep Singh, Sasa Misailovic
Epigraph-Guided Flow Matching for Safe and Performant Offline Reinforcement Learning
Manan Tayal, Mumuksh Tayal
Escaping Model Collapse via Synthetic Data Verification: Near-term Improvements and Long-term Convergence
Bingji Yi, Qiyuan Liu, Yuwei Cheng, Haifeng Xu
Evaluating Agentic Optimization on Large Codebases
Atharva Sehgal, James Hou, Akanksha Sarkar, Ishaan Mantripragada, Swarat Chaudhuri, Jennifer J. Sun, Yisong Yue
FormalProofBench: Can Models Write Graduate Level Math Proofs That Are Formally Verified?
Nikil Ravi, Kexing Ying, Vasilii Nesterov, Rayan Krishnan, Elif Uskuplu, Bingyu Xia, Janitha Aswedige, Langston Nashold
Geometry of Reason: Probabilistic Spectral Verification for Mathematical Reasoning
Valentin NOËL
GLEAN: Guideline-Grounded Evidence Accumulation for High-Stakes Agent Verification
Yichi Zhang, Nabeel Seedat, Yinpeng Dong, Peng Cui, Jun Zhu, Mihaela van der Schaar
Grounding Long-Horizon Agent Coordination in GUI Environments via Contract-based Structural Planning
Hao Yu, Weiming Li, Yueming Lyu, Jie-Jing Shao, Yulei Sui, Ivor Tsang, Haiyan Yin
Identifying and Mitigating Reasoning Errors in VLM Verifiers via Activation Decomposition
Joonhyuk Cha, Moises Andrade, Zsolt Kira
interwhen: A Generalizable Framework for Verifiable Reasoning with Test-time Monitors
Vishak K Bhat, Prateek Chanda, Ashmit Khandelwal, Maitreyi Swaroop, Subbarao Kambhampati, Vineeth N. Balasubramanian, Nagarajan Natarajan, Amit Sharma
ISO-Bench: Can Coding Agents Optimize Real-World Inference Workloads?
Ayush Nangia, Shikhar Mishra, Aman Gokrani, Paras Chopra
Judging What We Cannot Solve: A Consequence-Based Approach for Oracle-Free Evaluation of Research-Level Math
Guijin Son, Donghun Yang, Hitesh Laxmichand Patel, Hyunwoo Ko, Amit Agarwal, Sunghee Ahn, Kyong-Ha Lee, Youngjae Yu
Learning from Synthetic Data Improves Multi-hop Reasoning
Anmol Kabra, Yilun Yin, Albert Gong, Kamilė Stankevičiūtė, Dongyoung Go, Johann Lee, Katie Z Luo, Carla P Gomes, Kilian Q Weinberger
Learning to Rank the Initial Branching Order of SAT Solvers
Arvid Eriksson, Gabriel Poesia, Roman Bresson, Karl Henrik Johansson, David Broman
Learning to Repair Lean Proofs from Compiler Feedback
Evan Wang, Simon Chess, Daniel Lee, Siyuan Ge, Ajit Mallavarapu, Vasily Ilin
MM-Eval: A Multilingual Meta-Evaluation Benchmark for LLM-as-a-Judge and Reward Models
Guijin Son, Dongkeun Yoon, Juyoung Suk, Javier Aula-Blasco, Mano Aslan, Kim Vu, Shayekh Bin Islam, Jaume Prats-Cristià, Lucía Tormo-Bañuelos, Seungone Kim
NANOZK: Layerwise Zero-Knowledge Proofs for Verifiable Large Language Model Inference
Zhaohui Geoffrey Wang
ProofRepairBench: Exploring Proof Repair in Lean
Manooshree Patel, Bartosz Piotrowski, Leopold Haller, Hugh James Leather
Quokka: Accelerating Program Verification with LLMs via Invariant Synthesis
Anjiang Wei, Tarun Suresh, Tianran Sun, Haoze Wu, Ke Wang, Alex Aiken
ROC-n-reroll: How verifier imperfection affects test-time scaling
Florian E. Dorner, Yatong Chen, André F Cruz, Fanny Yang
RocqSmith: Can Automatic Optimization Forge Better Proof Agents?
Andrei Kozyrev, Nikita Khramov, Denis Lochmelis, Valerio Morelli, Gleb Solovev, Anton Podkopaev
Scaling Evaluation-Time Compute with Reasoning Models as Process Evaluators
Seungone Kim, Ian Wu, Jinu Lee, Xiang Yue, Seongyun Lee, Minkyeong Moon, Carolin Lawrence, Kiril Gashteovski, Julia Hockenmaier, Graham Neubig, Sean Welleck
SorryDB: Can AI Provers Complete Real-World Lean Theorems?
Austin Letson, Leopoldo Sarra, Auguste Poiroux, Oliver Dressler, Paul Lezeau, Dhyan Aranha, Frederick Pu, Aaron Hill, Miguel Corredera Hidalgo, Julian Berman, George Tsoukalas, Lenny Taelman
The Dual Nature of Unlearning: Impact of Fact Salience and Model Fine-Tuning
Anna Borisiuk, Andrey Savchenko, Alexander Panchenko, Elena Tutubalina
ToolPRMBench: Evaluating and Advancing Process Reward Models for Tool-using Agents
Dawei Li, Yuguang Yao, Zhen Tan, Huan Liu, Ruocheng Guo
Unified Operational Formalism for LLM-based Theorem-proving Systems
Avaljot Singh, Shaurya Gomber, Yasmin Sarita, José Meseguer, Gagandeep Singh
Verification Limits Code LLM Training
Srishti Gureja, Marzieh Fadaee, Sara Hooker, Matthias Gallé, Jingyi He, Elena Tommasone