ICLR 2025 | Past | Math & reasoning | Large language models
Workshop on Reasoning and Planning for Large Language Models
LLM_Reason_and_Plan
- Submission deadline: Feb 9, 2025, 21:59 UTC (imported from OpenReview; check the website for extensions)
- Submission portal: OpenReview
- Notes: Auto-imported from the OpenReview venue record on 2026-06-10; please verify and enrich (topics are keyword-guessed).
Accepted papers (110)
Fetched from OpenReview (v2) on 2026-06-10.
- A Simple Model of Inference Scaling Laws
- Adaptive Self-improvement LLM Agentic System for ML Library Development
- Advancing Multimodal In-Context Learning in Large Vision-Language Models with Task-aware Demonstrations
- Agentic Knowledgeable Self-awareness
- Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction
- ARIES: Stimulating Self-Refinement of Large Language Models with and for Iterative Preference Optimization
- Assessing Dialect Fairness and Robustness of Large Language Models in Reasoning Tasks
- Automating Evaluation of Creativity in LLMs with Semantic Entropy and Efficient Multi-Agent Judge
- AutoToM: Automated Bayesian Inverse Planning and Model Discovery for Open-ended Theory of Mind
- Benchmarking Agentic Workflow Generation
- BOLT: Bootstrap Long Chain-of-Thought in Language Models without Distillation
- Can 1B LLM Surpass 405B LLM? Rethinking Compute-Optimal Test-Time Scaling
- Can Large Language Models Reason? A Characterization via 3-SAT
- Chain-of-Thought Reasoning in the Wild is not Always Faithful
- Chain-of-Timeline: Enhancing LLM Zero-Shot Temporal Reasoning with SQL-Style Timeline Formalization
- CodeSteer: Symbolic-Augmented Language Models via Code/Text Guidance
- Cutting Through the Noise: Boosting LLM Performance on Math Word Problems
- Decoupling the components of geometric understanding
- DEDUCE: DEDUCTIVE CONSISTENCY AS A FRAMEWORK TO EVALUATE LLM REASONING
- DetectiveQA: Evaluating Long-Context Reasoning on Detective Novels
- Disentangling Exploration of Large Language Models by Optimal Exploitation
- Divide, Reweight, and Conquer: A Logit Arithmetic Approach for In-Context Learning
- Diving into Self-Evolve Training for Multimodal Reasoning
- EcoAct: Economic Agent Determines When to Register What Action
- Enhancing Mathematical Reasoning in Language Models Through Focused Differentiation Training
- ErrorRadar: Benchmarking Complex Mathematical Reasoning of Multimodal Large Language Models Via Error Detection
- Evolutionary Prompt Optimization Discovers Emergent Multimodal Reasoning Strategies in Vision-Language Models
- Feedback-Aware Monte Carlo Tree Search for Efficient Information Seeking in Goal-Oriented Conversations
- FLEX-TRAVELPLANNER: A BENCHMARK FOR FLEXIBLE PLANNING WITH LANGUAGE AGENTS
- GRAPE: Generalizing Robot Policy via Preference Alignment
- IGDA: INTERACTIVE GRAPH DISCOVERY THROUGH LARGE LANGUAGE MODEL AGENTS
- Implicit Language Models are RNNs: Balancing Parallelization and Expressivity
- Improving Test-Time Search for LLMs with Backtracking Against In-Context Value Verifiers
- InductionBench: LLMs Fail in the Simplest Complexity Class
- Landscape of Thoughts: Visualizing the Reasoning Process of Large Language Models
- Language Models Use Trigonometry to Do Addition
- Large Language Model-Enhanced Multi-Armed Bandits
- Large Language Models to Diffusion Finetuning
- Learning to Defer for Causal Discovery with Imperfect Experts
- LightTransfer: Your Long-Context LLM is Secretly a Hybrid Model with Effortless Adaptation
- Limits of Deep Learning: Sequence Modeling through the Lens of Complexity Theory
- LLMs Are Not Good Strategists, Yet Memory-Enhanced Agency Boosts Reasoning
- LLMs on the Line: Data Determines Loss-to-Loss Scaling Laws
- LM2: Large Memory Models for Long Context Reasoning
- LMAct: A Benchmark for In-Context Imitation Learning with Long Multimodal Demonstrations
- Local Look-Ahead Guidance via Verifier-in-the-Loop for Automated Theorem Proving
- LogitGaze: Predicting Human Attention Using Semantic Information from Vision-Language Models
- LookPlanGraph: Embodied instruction following method with VLM graph augmentation
- Lost in Time: Clock and Calendar Understanding Challenges in Multimodal LLMs
- MALT: Improving Reasoning with Multi-Agent LLM Training
- MAS-GPT: Training LLMs To Build LLM-Based Multi-Agent Systems
- MastermindEval: A Simple But Scalable Reasoning Benchmark
- MATH-Perturb: Benchmarking LLMs' Math Reasoning Abilities against Hard Perturbations
- Meta-Prompt Optimization for LLM-Based Sequential Decision Making
- MFC-Bench: Benchmarking Multimodal Fact-Checking with Large Vision-Language Models
- MINDSTORES: Memory-Informed Neural Decision Synthesis for Task-Oriented Reinforcement in Embodied Systems
- MIR-Bench: Benchmarking LLM's Long-Context Intelligence via Many-Shot In-Context Inductive Reasoning
- MMCode: Benchmarking Multimodal Large Language Models in Code Generation with Visually Rich Programming Problems
- Multi-Agent Verification: Scaling Test-Time Compute with Multiple Verifiers (Abridged)
- Multi-modal Agent Tuning: Building a VLM-Driven Agent for Efficient Tool Usage
- Multi-Turn Code Generation Through Single-Step Rewards
- Navigating Solution Spaces in Large Language Models through Controlled Embedding Exploration
- OctoTools: An Agentic Framework with Extensible Tools for Complex Reasoning
- Offline Reinforcement Learning for LLM Multi-Step Reasoning
- On the Language of Thoughts in Large Language Models
- Optimizing Test-Time Compute via Meta Reinforcement Finetuning
- PC-Agent: A Hierarchical Agentic Framework for Complex Task Automation on PC
- PDE-Controller: LLMs for Autoformalization and Reasoning of PDEs
- PHYSICS: Benchmarking Foundation Models for Problem Solving in Physics
- Plan$^\ast$RAG: Efficient Test-Time Planning for Retrieval Augmented Generation
- QLASS: Boosting Language Agent Inference via Q-Guided Stepwise Search
- Rationalization Models for Text-to-SQL
- Re-Imagine: Symbolic Benchmark Synthesis for Reasoning Evaluation
- Reasoning Effort and Problem Complexity: A Scaling Analysis in LLMs
- Reasoning3D - Grounding and Reasoning in 3D: Fine-Grained Zero-Shot Open-Vocabulary 3D Reasoning Part Segmentation via Large Vision-Language Models
- Refining Answer Distributions for Improved Large Language Model Reasoning
- Reinforcement Learning in Inference Time: A Perspective from Successive Policy Iterations
- Resolving Ambiguity through Personalization in LLM chat systems
- Rethinking Fine-tuning when Scaling Test-Time Compute: Limiting Confidence Improves Mathematical Reasoning
- Reveal the Mystery of DPO: The Connection between DPO and RL Algorithms
- Revealing chemical reasoning in LLMs through search on complex planning tasks
- ReVISE: Learning to Refine at Test-Time via Intrinsic Self-Verification
- RL-STaR: Theoretical Analysis of Reinforcement Learning Frameworks for Self-Taught Reasoner
- RuleArena: A Benchmark for LLM Rule-Guided Reasoning in Real-World Scenarios
- s1: Simple test-time scaling
- Scaling Flaws of Verifier-guided Search in Mathematical Reasoning
- Scaling Inference-Time Search with Vision Value Model for Improved Visual Comprehension
- ScreenSpot-Pro: GUI Grounding for Professional High-Resolution Computer Use
- Self-Reasoning Language Models: Unfold Hidden Reasoning Chains with Few Reasoning Catalyst
- SiriuS: Self-improving Multi-agent Systems via Bootstrapped Reasoning
- Spectral Journey: How Transformers Predict the Shortest Path
- StochasTok: Improving Fine-Grained Subword Understanding in LLMs
- Strategic LLM Decoding through Bayesian Games
- TACO: Learning Multi-modal Models to Reason and Act with Synthetic Chains-of-Thought-and-Action
- Teaching Transformers Causal Reasoning through Axiomatic Training
- The in-context inductive biases of vision-language models differ across modalities
- Think Smarter not Harder: Adaptive Reasoning with Inference Aware Optimization
- Think to Ground: Improving Spatial Reasoning in LLMs for better Visual Grounding
- Thinking Slow, Fast: Scaling Inference Compute with Distilled Reasoners
- Towards Hierarchical Multi-Agent Workflows for Zero-Shot Prompt Optimization
- Training Large Language Models to Reason in a Continuous Latent Space
- TRIG-Bench: A Benchmark for Text-Rich Image Grounding
- Understanding Financial Reasoning in AI: A Multimodal Benchmark and Error Learning Approach
- UNDERSTANDING INFERENCE SCALING LAWS FOR MIXTURES OF LLMS
- Understanding Reasoning in Thinking Language Models via Steering Vectors
- Unraveling Arithmetic in Large Language Models: The Role of Algebraic Structures
- Value-Based Deep RL Scales Predictably
- WebWalker: Benchmarking LLMs in Web Traversal
- When Debate Fails: Bias Reinforcement in Large Language Models
- When More is Less: Understanding Chain-of-Thought Length in LLMs