ICML 2025 Past Math & reasoningLarge language modelsTheory
2nd AI for Math Workshop
AI4Math
Unverified seed entry. Some fields are estimates — confirm everything on the official website before planning a submission.
- Submission deadline
- May 25, 2025, 23:59 AoE (UTC−12) SEED estimate of the historical deadline — verify
- Workshop day
- Jul 18, 2025
- Submission portal
- OpenReview
- Notes
- SEED DATA — name/website/date taken from the OpenReview venue record; verify remaining fields.
Accepted papers (112)
Fetched from OpenReview (v2) on 2026-06-10.
-
$Q\sharp$: Provably Optimal Distributional RL for LLM Post-Training
-
A Comprehensive Evaluation of Contemporary Machine-Learning-Based Solvers for CO
-
A Compute-Matched Re-Evaluation of TroVE on MATH
-
A Markov Categorical Framework for Language Modeling
-
A Practical Two-Stage Recipe for Mathematical LLMs: Maximizing Accuracy with SFT and Efficiency with Reinforcement Learning
-
A Survey on Large Language Model Reasoning Failures
-
Ada-R1: Hybrid CoT via Bi-Level Adaptive Reasoning Optimization
-
Beyond Accuracy: A Policy Gradient Reweighting Approach for Pass@K Maximization in LLMs
-
Beyond Single-Task: Robust Multi-Task Length Generalization for LLMs
-
Boolformer: Symbolic Regression of Logic Functions with Transformers
-
Boosting LLM Reasoning via Spontaneous Self-Correction
-
Chain of Thought in Order: Discovering Learning-Friendly Orders for Arithmetic
-
Chain-of-Thought Reasoning for Math: Theoretical Foundation and Applications
-
CLEVER: A Curated Benchmark for Formally Verified Code Generation
-
COAST: Intelligent Time-Adaptive Neural Operators
-
CoDaPO: Confidence and Difficulty-Adaptive Policy Optimization for Post-Training Language Models
-
DeduCE: Deductive Consistency as a Framework to Evaluate LLM Reasoning
-
Direct Induction Proof Challenge: Evaluating Large Language Models on Deeply Nested Mathematical Induction
-
Discovering Hidden Algebraic Structures via Transformers with Rank-Aware Beam GRPO
-
Discrete Feynman-Kac Correctors
-
Do Not Let Low-Probability Tokens Over-Dominate in RL for LLMs
-
Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?
-
Done Is Better than Perfect: Unlocking Efficient Reasoning by Structured Multi-Turn Decomposition
-
DSR-Bench: Evaluating the Structural Reasoning Abilities of LLMs via Data Structures
-
e3: Learning to Explore Enables Extrapolation of Test-Time Compute for LLMs
-
Enhancing Graph Neural Network for Boolean Satisfiability Solving via Data Augmentation
-
Entropy-Based Adaptive Weighting for Self-Training
-
EquivaMap: Leveraging LLMs for Automatic Equivalence Checking of Optimization Formulations
-
FMC: Formalization of Natural Language Mathematical Competition Problems
-
Forget Less, Solve More: Sequential Fine-Tuning with Adapter Shrinking for Math Word Problems
-
From Narrative to Formalism: A Case Study in the Origin of Molecular Translation System
-
Generalized Tree Edit Distance (GTED): A Faithful Evaluation Metric for Statement Autoformalization
-
GenSelect: A Generative Approach to Best-of-N
-
Governing Equation Discovery from Data Based on Differential Invariants
-
Graph Neural Networks for Tensor Product Decompositions of Lie Algebra Representations
-
How Is LLM Reasoning Distracted by Irrelevant Context? An Analysis Using a Controlled Benchmark
-
Inequality Ranking and Inference System ($\texttt{\textbf{IRIS}}$): Giving Mathematical Conjectures Numerical Value
-
Inferring Loop Invariants for Program Verification: an Abductive Learning Perspective
-
Instilling Parallel Reasoning into Language Models
-
IntegralBench: Benchmarking LLMs with Definite Integral Problems
-
InternLM2.5-StepProver: Advancing Automated Theorem Proving via Critic-Guided Search
-
KELPS: A Framework for Verified Multi-Language Autoformalization via Semantic-Syntactic Alignment
-
Landscape of Thoughts: Visualizing the Reasoning Process of Large Language Models
-
Lean Finder: Semantic Search for Mathlib That Understands User Intents
-
Lean Meets Theoretical Computer Science: Scalable Synthesis of Theorem Proving Challenges in Formal-Informal Pairs
-
LeanTree: Accelerating White-Box Proof Search with Factorized States in Lean 4
-
LeanTutor: A Lean-Verified Tutor for Mathematical Proofs
-
Learning an Effective Premise Retrieval Model for Efficient Mathematical Formalization
-
Learning Moderately Input-Sensitive Functions: A Case Study in QR Code Decoding
-
Learning to Discover Abstractions for LLM Reasoning
-
Learning to Reason without External Rewards
-
Learning to Solve Complex Problems via Dataset Decomposition
-
Learning-Guided Local Search for Asymmetric Traveling Salesman Problem
-
Lemmanaid: Neuro-Symbolic Lemma Conjecturing
-
Let’s Try Again: Eliciting Multi-Turn Reasoning in Language Models via Simplistic Feedback
-
Machine Learning and LLM-Boost Symbolic Regression for Predicting $\mathbb{Q}$-Gonality of Modular Curves
-
Majority of the Bests: Improving Best-of-N via Bootstrapping
-
MIRB: Mathematical Information Retrieval Benchmark
-
MM-Agent: LLM as Agents for Real-world Mathematical Modeling Problem
-
NeSyGeo: A Neuro-Symbolic Framework for Multimodal Geometric Reasoning Data Generation
-
NoisyRollout: Reinforcing Visual Reasoning with Data Augmentation
-
Not All Votes Count! Translated Program for Verification Improves Self-Consistency of Language Models for Math Reasoning
-
O1-Pruner: Length-Harmonizing Fine-Tuning for O1-Like Reasoning Pruning
-
OctoThinker: Mid-Training Incentivizes Reinforcement Learning Scaling
-
Omni-Thinker: Scaling Cross-Domain Generalization in LLMs via Multi-Task RL with Hybrid Rewards
-
On the Effect of Negative Gradient in Group Relative Deep Reinforcement Optimization
-
Optimal Sparsity of Mixture-of-Experts Language Models for Reasoning Tasks
-
Optimizing Anytime Reasoning via Budget Relative Policy Optimization
-
PADRE: Pseudo-Likelihood based Alignment of Diffusion Language Models
-
Physics-Constrained Symbolic Regression from Imagery
-
Plane Geometry Diagram Formalization via Vision-Language Models
-
POD-KAN-NO: a physically interpretable neural operator
-
Proof or Bluff? Evaluating LLMs on 2025 USA Math Olympiad
-
ProofCompass: Enhancing Specialized Provers with LLM Guidance
-
ProofWala: Multilingual Proof Data Synthesis and Theorem-Proving
-
Prover Agent: An Agent-based Framework for Formal Mathematical Proofs
-
Putnam-AXIOM: A Functional and Static Benchmark
-
Putting the Value Back in RL: Better Test-Time Scaling by Unifying LLM Reasoners With Verifiers
-
README: Rapid Equation Discovery with Multimodal Encoders
-
RealMath: A Continuous Benchmark for Evaluating Language Models on Research-Level Mathematics
-
Reinforcement Learning Teachers of Test Time Scaling
-
Reward Inside the Model: A Lightweight Hidden‑State Reward Model for LLM's Best-of-N sampling
-
Reward Under Attack: Evaluating the Sensitivity of Process Reward Models
-
RL‑QESA: Reinforcement‑Learning Quasi‑Equilibrium Simulated Annealing
-
Scalable Best-of-N Selection for Large Language Models via Self-Certainty
-
Scaling Mathematical Reasoning through Data, Tools, and Generative Selection
-
Scaling Natural-Language Graph-Based Test Time Compute for Automated Theorem Proving
-
Simple, Scalable Reasoning via Iterated Summarization
-
Small Models Struggle to Learn from Strong Reasoners
-
Solving Inequality Proofs with Large Language Models
-
SoS1: O1 and R1-Like Reasoning LLMs are Sum-of-Square Solvers
-
Spectral Policy Optimization: Coloring your Incorrect Reasoning in GRPO
-
SPEED-RL: Faster Training of Reasoning Models via Online Curriculum Learning
-
Target-Based Automated Conjecturing for Neural Theorem Proving
-
Temporal Sampling for Forgotten Reasoning in LLMs
-
The Challenge of Teaching Reasoning to LLMs Without RL or Distillation
-
The Invisible Leash: Why RLVR May Not Escape Its Origin
-
The Open Proof Corpus: A Large-Scale Study of LLM-Generated Mathematical Proofs
-
The Surprising Effectiveness of Negative Reinforcement in LLM Reasoning
-
TinyV: Reducing False Negatives in Verification Improves RL for LLM Reasoning
-
Token Hidden Reward: Steering Exploration-Exploitation in GRPO Training
-
Towards Geometry Problem Solving in the Large Model Era: A Survey
-
Training Language Models to Reason Efficiently
-
Understanding R1-Zero-Like Training: A Critical Perspective
-
Value-Guided Search for Efficient Chain-of-Thought Reasoning
-
VeriBench: End-to-End Formal Verification Benchmark for AI Code Generation in Lean 4
-
VerifierQ: Enhancing LLM Test Time Compute with Q-Learning-based Verifiers
-
Verifying Prompt-Induced Search-Space Shifts in LLM-Generated Mathematical Functions
-
Verina: Benchmarking Verifiable Code Generation
-
Vision Language Models are Biased: Counting legs of an animal is surprisingly hard
-
When To Solve, When To Verify: Compute-Optimal Problem Solving and Generative Verification for LLM Reasoning
-
Which Data Attributes Stimulate Math and Code Reasoning? An Investigation via Influence Functions