NeurIPS 2025 | Past | Math & reasoning | Large language models
The 5th Workshop on Mathematical Reasoning and AI
MATH-AI
Unverified seed entry. Some fields are estimates — confirm everything on the official website before planning a submission.
- Submission deadline: Aug 22, 2025, 23:59 AoE (UTC−12). SEED estimate of the historical deadline; verify.
- Workshop day: Dec 6, 2025
- Submission portal: OpenReview
- Notes: SEED DATA. Name/website/date taken from the OpenReview venue record; verify remaining fields.
Accepted papers (150)
Fetched from OpenReview (v2) on 2026-06-10.
- Gambit: Generating Automated Mathematical Bounds, Inequalities, and Theorems
- A Matter of Interest: Understanding Interestingness of Math Problems in Humans and Language Models
- A NUMA Aware Compiler Framework for Large Scale Mathematical Reasoning Inference on PCIe Based Multi Accelerator Systems
- A Small Math Model: Recasting Strategy Choice Theory in an LLM-Inspired Architecture
- A Toolbox, Not a Hammer -- Multi-TAG: Scaling Math Reasoning with Multi-Tool Aggregation
- Adaptive Control for Test-time Scaling
- Adaptive Coopetition: Leveraging Coarse Verifier Signals for Resilient Multi-Agent LLM Reasoning
- AI Impact on Human Proof Formalization Workflows
- AI-Driven Mathematical Discovery for the Andrews–Curtis Conjecture
- AlphaOne: Reasoning Models Thinking Slow and Fast at Test Time
- Amortized Latent Steering: Low-Cost Alternative to Test-Time Optimization
- Analytical Lyapunov Function Discovery: An RL-based Generative Approach
- AntiderivBench: Evaluating language models on indefinite integration
- ARM: Discovering Agentic Reasoning Modules for Mathematical Problem-Solving
- Aryabhata: An exam-focused language model for JEE Math
- Automated Discovery of Conservation Laws via Hybrid Neural ODE-Transformers
- Axiom-Aware FunSearch for Non-Constructive Mathematics
- Babel-formal: Translation of Proofs between Lean and Rocq
- Best-of-L: Cross-Lingual Reward Modeling for Mathematical Reasoning
- Beyond Accuracy: Evaluating Multimodal Mathematical and Scientific Reasoning Through Error Analysis and Self-Correction
- Beyond Correctness: Harmonizing Process and Outcome Rewards through RL Training
- Blind Spot Navigation in Large Language Model Reasoning with Thought Space Explorer
- Bridging Vision, Language, and Mathematics: Pictographic Character Reconstruction with Bézier Curves
- BrokenMath: A Benchmark for Sycophancy in Theorem Proving with LLMs
- Can Large Language Models Learn Formal Logic? A Data-Driven Training and Evaluation Framework
- CauSciBench: Assessing LLM Causal Reasoning for Scientific Research
- CayleyPy Growth: Efficient growth computations and hundreds of new conjectures on Cayley graphs
- CDE: Curiosity-Driven Exploration for Efficient Reinforcement Learning in Large Language Models
- CircuitSense: A Hierarchical Circuit System Benchmark Bridging Visual Comprehension and Symbolic Reasoning in Engineering Design Process
- Climbing the Ladder of Reasoning: What LLMs Can—and Still Can’t—Solve after SFT?
- Co-rewarding: Stable Self-supervised RL for Eliciting Reasoning in Large Language Models
- CoDaPO: Confidence and Difficulty-Adaptive Policy Optimization for Language Models
- CombiGraph-Vis: A Curated Multimodal Olympiad Benchmark for Discrete Mathematical Reasoning
- Combining Textual and Structural Information for Premise Selection in Lean
- Compute as Teacher: Turning Inference Compute Into Reference-Free Supervision
- Concept Generalization in Humans and Large Language Models: Insights from the Number Game
- Controllable Mathematical Reasoning via Self-Optimizing Thought Vectors
- Credit Cards, Confusion, Computation, and Consequences: How Well Do LLMs Reason About Financial Literacy?
- Curiosity-driven RL for symbolic equation solving
- DAG-Math: Graph-Guided Mathematical Reasoning in LLMs
- Decompose, Adapt, and Evolve: Towards Efficient Scientific Equation Discovery with Large Language Models
- Decoupling Reasoning from Proving: A New Framework for Tackling Olympiad-Level Mathematics
- DELTA: How Does RL Unlock and Transfer New Algorithms in LLMs?
- DiagramIR: An Automatic Pipeline for Educational Math Diagram Evaluation
- DreamPRM: Domain-Reweighted Process Reward Model for Multimodal Reasoning
- EchoRL: Learning to Plan through Experience for Efficient Reinforcement Learning
- Evaluating Spatial Reasoning in Language Models
- Exact Learning of Arithmetic with Differentiable Agents
- Expanding the Action Space of LLMs to Reason Beyond Language
- Faults in our Formal Benchmarks
- FoCus: Improving Faithfulness in Chain-of-Thoughts by Training on Structured Reasoning Data
- FormalML: A Benchmark for Evaluating Formal Subgoal Completion in Machine Learning Theory
- FractalBench: Diagnosing Visual-Mathematical Reasoning Through Recursive Program Synthesis
- Fractional Reasoning via Latent Steering Vectors Improves Inference Time Compute
- From EduVisBench to EduVisAgent: A Benchmark and Multi-Agent Framework for Reasoning-Driven Pedagogical Visualization
- HeuriGym: An Agentic Benchmark for LLM-Crafted Heuristics in Combinatorial Optimization
- Hilbert: Recursively Building Formal Proofs with Informal Reasoning
- How does RL induce skill composition? A Case Study using Countdown
- HYBRIDMIND: Meta Selection of Natural Language and Symbolic Language for Enhanced LLM Reasoning
- I-RAVEN-X: Benchmarking Generalization and Robustness of Analogical and Mathematical Reasoning in Large Language and Reasoning Models
- IMProofBench: Benchmarking AI on Research-Level Mathematical Proof Generation
- Improving autoformalization via cycle consistency and incremental type-checking using language-model probabilistic programs
- Improving ML attacks on LWE with data repetition and stepwise regression
- In Good GRACES: Principled Teacher Selection for Knowledge Distillation
- In-the-Flow Agentic System Optimization for Effective Planning and Tool Use
- Infinite-Dimensional HiPPO Provides an Explicit Formula for LSSLs
- Inpainting-Guided Policy Optimization for Diffusion Large Language Models
- Investigating the interaction of linguistic and mathematical reasoning in language models using multilingual number puzzles
- Kimina Lean Server: A High-Performance Lean Server for Large-Scale Verification
- Landscape of Thoughts: Visualizing the Reasoning Process of Large Language Models
- Layer Importance for Mathematical Reasoning is Forged in Pre-Training and Invariant after Post-Training
- LeanDojo-v2: A Comprehensive Library for AI-Assisted Theorem Proving in Lean
- Learning How to Use Tools, Not Just When: Pattern-Aware Tool-Integrated Reasoning
- Learning Modular Exponentiation with Transformers
- Learning Permuted Congruential Sequences with Transformers
- Learning to Reason on Hard Problems with Privileged On-Policy Exploration
- Limits of Generalization in RLVR: Two Case Studies in Mathematical Reasoning
- Limits of PRM-Guided Tree Search for Mathematical Reasoning with LLMs
- LLM-Generated Search Heuristics Can Solve Open Instances of Combinatorial Design Problems
- Local Coherence or Global Validity? Investigating RLVR Traces in Math Domains
- MathBode: Understanding LLM Reasoning with Dynamical Systems
- MathNet: a Global Multimodal Benchmark for Mathematical Reasoning and Retrieval
- MathSticks: A Benchmark for Visual Symbolic Compositional Reasoning with Matchstick Puzzles
- Measuring Off-Trajectory Math Reasoning of LLMs
- Meta Thinker: Thinking What AI Thinks
- Minif2f in Rocq: Automatic Translation Between Proof Assistants — A Case Study
- Mirage or Method? How Model–Task Alignment Induces Divergent RL Conclusions
- Modeling Chain-of-Thought Collapse in Pruned Language Models: Fidelity and Similarity Analysis for Mathematical Reasoning
- Nested Depth Generalization in Transformers
- Numbers Already Carry Their Own Embeddings
- OMEGA: Can LLMs Reason Outside the Box in Math? Evaluating Exploratory, Compositional, and Transformative Generalization
- On the Design of KL-Regularized Policy Gradient Algorithms for LLM Reasoning
- On the Evolution of Language Models without Labels: Majority Drives Selection, Novelty Promotes Variation
- One Token to Fool LLM-as-a-Judge
- OpenVLThinker: Complex Vision-Language Reasoning via Iterative SFT-RL Cycles
- PAG: Multi-Turn Reinforced LLM Self-Correction with Policy as Generative Verifier
- Patching Gaps In LLM Reasoning With Interventional Training
- Pretraining Scaling Laws for Generative Evaluations of Language Models
- PRISM-Physics: Causal DAG-Based Process Evaluation for Physics Reasoning
- Probabilistic Soundness Guarantees in LLM Reasoning Chains
- Process-Verified Reinforcement Learning for Theorem Proving via Lean
- ProofGym: Unifying LLM-Based Theorem Proving Across Formal Systems
- ProofOptimizer: Training Language Models to Simplify Proofs without Human Demonstrations
- ProxyThinker: Test-Time Guidance through Small Visual Reasoners
- PVSGym: A Proof Learning Environment
- Quagmires in SFT-RL Post-Training: When High SFT Scores Mislead and What to Use Instead
- R-Zero: Self-Evolving Reasoning LLM from Zero Data
- RADAR: Reasoning–Ability and Difficulty-Aware Routing for Reasoning LLMs
- RAISE: Enhancing Scientific Reasoning in LLMs via Step-by-Step Retrieval
- RefGrader: Automated Grading of Mathematical Competition Proofs using Agentic Workflows
- Reinforcement Learning for Hierarchical Proof Generation in Lean 4
- Reliable Fine-Grained Evaluation of Natural Language Math Proofs
- Restructuring the Corpus Makes RAG Work for Math
- Revisiting the Uniform Information Density Hypothesis in LLM Reasoning Traces
- Risk-Sensitive Reinforcement Learning for Alleviating Exploration Dilemmas in Large Language Models
- RLVR vs. Distillation: Understanding Accuracy and Capability in LLM Mathematical Reasoning
- SAND-Math: Using LLMs to Generate Novel, Difficult and Useful Mathematics Questions and Answers
- SATBench: Benchmarking LLMs' Logical Reasoning via Automated Puzzle Generation from SAT Formulas
- Scaling Generative Verifiers For Natural Language Mathematical Proof Verification And Selection
- Scaling up Multi-Turn Off-Policy RL and Multi-Agent Tree Search for LLM Step-Provers
- Scheherazade: Evaluating Chain-of-Thought Math Reasoning in LLMs with Chain-of-Problems
- SciML Agents: Write the Solver, Not the Solution
- Scratchpad Thinking: Alternation Between Storage and Computation in Latent Reasoning Models
- Simultaneous Multi-objective Alignment Across Verifiable and Non-verifiable Rewards
- Single-stream Policy Optimization
- Skill-Aware Data Selection and Fine-Tuning for Data-Efficient Reasoning Distillation
- Specifying exact circuit algorithms in universal transformers
- SPG: Sandwiched Policy Gradient for Mask Diffusion Language Models
- SpotIt: Evaluating Text-to-SQL Evaluation with Formal Verification
- STAT: Skill-Targeted Adaptive Training
- STELAR-VISION: Self-Topology-Aware Efficient Learning for Aligned Reasoning in Vision
- Stoic Reasoner: Dual-Mode Transformers that Compress to Think and Decompress to Speak
- StreetMath: Study of LLMs’ Approximation Behaviors
- Systematic Diagnosis of Brittle Reasoning in Large Language Models
- Tales from a Graph: a Pipeline for Mathematical Problem Generation
- Think, Align, Select: Query–Key Scores for LLM Reasoning
- ThinkEdit: Interpretable Weight Editing to Mitigate Overly Short Thinking in Reasoning Models
- Tool-Assisted Multi-Turn Theorem Proving with LLMs
- Towards Scaling Laws for Symbolic Regression
- Towards Understanding Self-play for LLM Reasoning
- TRACE: A Framework for Analyzing and Enhancing Stepwise Reasoning in Vision-Language Models
- Tree-OPO: Off-policy Monte Carlo Tree-Guided Advantage Optimization for Multistep Reasoning
- Understanding Tool-Integrated Reasoning
- Unspoken Logic: Understanding and bridging the gap between free-form and LLM-interpretable natural language mathematical proofs
- Usefulness-Driven Learning of Formal Mathematics
- VeriBench-FTP: A Formal Theorem Proving Benchmark in Lean 4 for Code Verification
- Why GRPO Needs Normalization: A Local-Curvature Perspective on Adaptive Gradients
- Why Reinforcement Learning Struggles with Expression Simplification: A Reward Analysis
- Winning Gold at IMO 2025 with a Model-Agnostic Verification-and-Refinement Pipeline
- You Need Reasoning to Learn Reasoning: The Limitations of Label-Free RL in Weak Base Models