NeurIPS 2025 | Past | Math & reasoning | Large language models
First Workshop on Foundations of Reasoning in Language Models
FoRLM 2025
- Submission deadline: Sep 9, 2025, 11:59 UTC (imported from OpenReview; check the website for extensions)
- Submission portal: OpenReview
- Notes: Auto-imported from the OpenReview venue record on 2026-06-10; please verify and enrich (topics are keyword-guessed).
Accepted papers (101)
Fetched from OpenReview (v2) on 2026-06-10.
- "Think Just Enough: Sequence-Level Entropy as a Confidence Signal for LLM Reasoning"
- ActivationReasoning: Logical Reasoning in Latent Activation Spaces
- Agentic Reasoning and Tool Integration for LLMs via Reinforcement Learning
- ARS: Adaptive Reasoning Suppression for Efficient Large Reasoning Language Models
- Asking the Missing Piece: Context-Driven Clarification for Ambiguous VQA
- Benchmarking Temporal Reasoning: Can Large Language Models Navigate Time When Stories Refuse to Follow a Straight Line?
- Benefits and Limitations of Communication in Multi-Agent Reasoning
- Beyond 'Aha!': Toward Systematic Meta-Abilities Alignment in Large Reasoning Models
- Beyond Introspection: Reinforcing Thinking via Externalist Behavioral Feedback
- Beyond Pass@k: Breadth-Depth Metrics for Reasoning Boundaries
- Beyond Semantics: The Unreasonable Effectiveness of Reasonless Intermediate Tokens
- Can Reasoning Models Obfuscate Reasoning? Stress-Testing Chain-of-Thought Monitorability
- Can We Predict Alignment Before Models Finish Thinking? Towards Monitoring Misaligned Reasoning Models
- CARE: Turning LLMs Into Causal Reasoning Expert
- CaRT: Teaching LLM Agents to Know When They Know Enough
- Chain-of-Frames: Advancing Video Understanding in Multimodal LLMs via Frame-Aware Reasoning
- Characterizing Deep Research: A Benchmark and Formal Definition
- Characterizing good teachers for distillation using gradient features
- Correct Reasoning Paths Visit Shared Decision Pivots
- COSMIR: Chain Orchestrated Structured Memory for Iterative Reasoning over Long Context
- Data Diversification Methods In Alignment Enhance Math Performance In LLMs
- Decoupling the "What" and "Where" With Polar Coordinate Positional Embedding
- Deep sequence models tend to memorize geometrically, we do not understand why.
- Deep Thinking via Recursive Self-Aggregation
- Demystifying Scientific Problem-Solving in LLMs by Probing Knowledge and Reasoning
- Diagnosing Moral Reasoning: A Benchmark for Evaluating Consistency and Robustness in Large Language Models
- Discerning What Matters: A Multi-Dimensional Assessment of Moral Competence in LLMs
- Dormant Reasoning Circuits in RL-Trained Language Models
- EAT: Entropy After $\langle/ \tt Think \rangle$ for reasoning model early exiting
- Efficient First-Order Logic-Based Method for Enhancing Logical Reasoning Capabilities of LLMs
- Executable Counterfactuals: Improving LLMs' Causal Reasoning Through Code
- Exploration with Foundation Models: Capabilities, Limitations, and Hybrid Approaches
- Exploring System 1 and 2 communication for latent reasoning in LLMs
- Fathom-Search-4B: Unlocking Long-Horizon DeepSearch via RL
- FlagEval Findings Report: A Preliminary Evaluation of Large Reasoning Models on Automatically Verifiable Textual and Visual Questions
- FRIT: Using Causal Importance to Improve Chain-of-Thought Faithfulness
- GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning
- GraphARC: A Comprehensive Benchmark for Graph-Based Abstract Reasoning
- Grounding LLM Reasoning with Knowledge Graphs
- Hessian-Enhanced Token Attribution (HETA): Interpreting Autoregressive Language Models
- Hop, Skip, and Overthink: Diagnosing Why Reasoning Models Fumble during Multi-Hop Analysis
- How Instruction and Reasoning Data shape Post-Training: Data Quality through the Lens of Layer-wise Gradients
- How to Train Your Advisor: Steering Black-Box LLMs with Advisor Models
- Influence Functions for Efficient Data Selection in Reasoning
- Investigating Advanced Reasoning of Large Language Models via Black-Box Interaction
- Is Chain-of-Thought Reasoning of LLMs a Mirage? A Data Distribution Lens
- Is Random Attention Sufficient for Sequence Modeling?
- Label-Invariant Hessian Regularization Mitigates Grokking in Mathematical Reasoning
- Landscape of Thoughts: Visualizing the Reasoning Process of Large Language Models
- Language Models That Think, Chat Better
- Learning Composable Chains-of-Thought
- Learning to Answer from Correct Demonstrations
- Limits of Emergent Reasoning of Large Language Models in Agentic Frameworks for Deterministic Games
- LLM CHESS: Benchmarking Reasoning and Instruction-Following in LLMs through Chess
- Lost at the Beginning of Reasoning
- MagiC: Evaluating Multimodal Cognition Toward Grounded Visual Reasoning
- Monitor-Generate-Verify (MGV): Formalising Metacognitive Theory for Language Model Reasoning
- Multi-module GRPO: Composing Policy Gradients and Prompt Optimization for Language Model Programs
- Multiple Token Divergence: A Measure of In-Context Computation Density
- Murphy: Reflective Multi-Turn Reinforcement Learning for Self-Correcting Code Generation in Large Language Models
- Not All Thoughts Matter: Selective Attention for Efficient Reasoning
- Observer, Not Player: Simulating Theory of Mind in Large Language Models through Game Observation
- OIBench: Benchmarking Strong Reasoning Models with Olympiad in Informatics
- On the generalization of language models from in-context learning and finetuning: a controlled study
- On the Optimization Dynamics of RLVR: Gradient Gap and Step Size Scaling
- On the Role of Temperature Sampling in Test-Time Scaling
- OpenThoughts: Data Recipes for Reasoning Models
- Peek-a-Boo Reasoning: Contrastive Region Masking in MLLMs
- POLYMATH: A Challenging Multi-modal Mathematical Reasoning Benchmark
- R1-Compress: Long Chain-of-Thought Compression via Chunk Compression and Search
- Reasoning Large Language Model Errors Arise from Hallucinating Critical Problem Features
- Reasoning Through Chess: How Reasoning Evolves from Data Through Fine-Tuning and Reinforcement Learning
- Reasoning Up the Instruction Ladder for Controllable Language Models
- Reasoning with Preference Constraints: A Benchmark for Language Models in Many-to-One Matching Markets
- ReFIne: A Framework for Trustworthy Large Reasoning Models with Reliability, Faithfulness, and Interpretability
- Reinforcement Learning Improves Traversal of Hierarchical Knowledge in LLMs
- Representational Homomorphism Error Predicts Compositional Generalization In Language Models
- Reverse-KL Reinforcement Learning Can Sample From Multiple Diverse Modes
- RL Fine-Tuning Heals the OOD Forgetting in SFT
- RL in Name Only? Analyzing the Structural Assumptions in RL post-training for LLMs
- RoBoN: Routed Online Best-of-n for Test-Time Scaling with Multiple LLMs
- Sample More to Think Less: Group Filtered Policy Optimization for Concise Reasoning
- Scratchpad Thinking: Alternation Between Storage and Computation in Latent Reasoning Models
- Skill-Targeted Adaptive Training
- SLR: Automated Synthesis for Scalable Logical Reasoning
- Smaller Models, Smarter Rewards: A Two-Sided Approach to Process and Outcome Rewards
- Steering LLMs’ Reasoning With Activation State Machines
- T-FIX: Text-Based Explanations with Features Interpretable to eXperts
- TATTO: Tool-Augmented Thinking PRM for Tabular Reasoning
- Test-Time Alignment for Large Language Models via Textual Model Predictive Control
- The Ends Justify the Thoughts: RL-Induced Motivated Reasoning in LLMs
- To Think or Not to Think: Exploring the Unthinking Vulnerability in Large Reasoning Models
- TokUR: Token-Level Uncertainty Estimation for Large Language Model Reasoning
- Towards Understanding the Optimization Landscape of GRPO and its Variants
- Tracing the Traces: Latent-Space Metrics for Efficient and Accurate Reasoning
- TS-Agent: A Time Series Reasoning Agent with Iterative Statistical Insight Gathering
- Understanding the Test-Time Computing of Transformers: A Theoretical Study on In-Context Linear Regression
- UQ: Assessing Language Models on Unsolved Questions
- Variation in Verification: Understanding Verification Dynamics in Large Language Models
- Visual-TableQA: Open-Domain Benchmark for Reasoning over Table Images
- Weak-to-Strong Generalization with Failure Trajectories