ICLR 2025 · Past · Math & reasoning · Large language models

Workshop on Reasoning and Planning for Large Language Models

LLM_Reason_and_Plan

Submission deadline: Feb 9, 2025, 21:59 UTC (imported from OpenReview; check the website for extensions)
Submission portal: OpenReview
Notes: Auto-imported from the OpenReview venue record on 2026-06-10; please verify and enrich (topics are keyword-guessed).

Accepted papers (110)

Fetched from OpenReview (v2) on 2026-06-10.

  1. A Simple Model of Inference Scaling Laws

    Noam Itzhak Levi · PDF
  2. Adaptive Self-improvement LLM Agentic System for ML Library Development

    Genghan Zhang, Weixin Liang, Olivia Hsu, Kunle Olukotun · PDF
  3. Advancing Multimodal In-Context Learning in Large Vision-Language Models with Task-aware Demonstrations

    Yanshu Li · PDF
  4. Agentic Knowledgeable Self-awareness

    Shuofei Qiao, Zhisong Qiu, Baochang Ren, Xiaobin Wang, Xiangyuan Ru, Ningyu Zhang, Xiang Chen, Yong Jiang, Pengjun Xie, Fei Huang, Huajun Chen · PDF
  5. Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction

    Yiheng Xu, Zekun Wang, Junli Wang, Dunjie Lu, Tianbao Xie, Amrita Saha, Doyen Sahoo, Tao Yu, Caiming Xiong · PDF
  6. ARIES: Stimulating Self-Refinement of Large Language Models with and for Iterative Preference Optimization

    Yongcheng Zeng, Xuanfa Jin, Guoqing Liu, Quan He, Dong Li, Jianye Hao, Haifeng Zhang, Jun Wang · PDF
  7. Assessing Dialect Fairness and Robustness of Large Language Models in Reasoning Tasks

    Fangru Lin, Shaoguang Mao, Emanuele La Malfa, Valentin Hofmann, Adrian de Wynter, Xun Wang, Si-Qing Chen, Michael J. Wooldridge, Janet B. Pierrehumbert, Furu Wei · PDF
  8. Automating Evaluation of Creativity in LLMs with Semantic Entropy and Efficient Multi-Agent Judge

    Tan Min Sen, Zachary Choy Kit Chun, Swaagat Bikash Saikia, Syed Ali Redha Alsagoff, Banerjee Mohor, Nadya Yuki Wangsajaya, Alvin Chan · PDF
  9. AutoToM: Automated Bayesian Inverse Planning and Model Discovery for Open-ended Theory of Mind

    Zhining Zhang, Chuanyang Jin, Mung Yao Jia, Tianmin Shu · PDF
  10. Benchmarking Agentic Workflow Generation

    Shuofei Qiao, Runnan Fang, Zhisong Qiu, Xiaobin Wang, Ningyu Zhang, Yong Jiang, Pengjun Xie, Fei Huang, Huajun Chen · PDF
  11. BOLT: Bootstrap Long Chain-of-Thought in Language Models without Distillation

    Bo Pang, Hanze Dong, Jiacheng Xu, Silvio Savarese, Yingbo Zhou, Caiming Xiong · PDF
  12. Can 1B LLM Surpass 405B LLM? Rethinking Compute-Optimal Test-Time Scaling

    Runze Liu, Junqi Gao, Jian Zhao, Kaiyan Zhang, Xiu Li, Biqing Qi, Wanli Ouyang, Bowen Zhou · PDF
  13. Can Large Language Models Reason? A Characterization via 3-SAT

    Rishi Hazra, Gabriele Venturato, Pedro Zuidberg Dos Martires, Luc De Raedt · PDF
  14. Chain-of-Thought Reasoning in the Wild is not Always Faithful

    Iván Arcuschin, Jett Janiak, Robert Krzyzanowski, Senthooran Rajamanoharan, Neel Nanda, Arthur Conmy · PDF
  15. Chain-of-Timeline: Enhancing LLM Zero-Shot Temporal Reasoning with SQL-Style Timeline Formalization

    Jiaying Wu, Bryan Hooi · PDF
  16. CodeSteer: Symbolic-Augmented Language Models via Code/Text Guidance

    Yongchao Chen, Yilun Hao, Yueying Liu, Yang Zhang, Chuchu Fan · PDF
  17. Cutting Through the Noise: Boosting LLM Performance on Math Word Problems

    Ujjwala Anantheswaran, Himanshu Gupta, Kevin Scaria, Shreyas Verma, Chitta Baral, Swaroop Mishra · PDF
  18. Decoupling the components of geometric understanding

    Eliza Kosoy, Annya Dahmani, Andrew Kyle Lampinen, Iulia Maria Comsa, Soojin Jeong, Ishita Dasgupta, Kelsey R Allen · PDF
  19. DEDUCE: Deductive Consistency as a Framework to Evaluate LLM Reasoning

    Atharva Pandey, Kshitij Dubey, Rahul Sharma, Amit Sharma · PDF
  20. DetectiveQA: Evaluating Long-Context Reasoning on Detective Novels

    Zhe Xu, Jiasheng Ye, Xiaoran Liu, Xiangyang Liu, Tianxiang Sun, Zhigeng Liu, Qipeng Guo, Linlin Li, Qun Liu, Xuanjing Huang, Xipeng Qiu · PDF
  21. Disentangling Exploration of Large Language Models by Optimal Exploitation

    Tim Grams, Patrick Betz, Christian Bartelt · PDF
  22. Divide, Reweight, and Conquer: A Logit Arithmetic Approach for In-Context Learning

    Chengsong Huang, Langlin Huang, Jiaxin Huang · PDF
  23. Diving into Self-Evolve Training for Multimodal Reasoning

    Wei Liu, Junlong Li, Xiwen Zhang, Fan Zhou, Yu Cheng, Junxian He · PDF
  24. EcoAct: Economic Agent Determines When to Register What Action

    Shaokun Zhang, Jieyu Zhang, Dujian Ding, Jiale Liu, Mirian Del Carmen Hipolito Garcia, Ankur Mallick, Daniel Madrigal, Menglin Xia, Victor Rühle, Qingyun Wu, Chi Wang · PDF
  25. Enhancing Mathematical Reasoning in Language Models Through Focused Differentiation Training

    Zhiyu Zhao, Yongcheng Zeng, Ning Yang, Zihan Zhao, Haifeng Zhang, Jun Wang, Guoqing Liu · PDF
  26. ErrorRadar: Benchmarking Complex Mathematical Reasoning of Multimodal Large Language Models Via Error Detection

    Yibo Yan, Shen Wang, Jiahao Huo, Hang Li, Boyan Li, Jiamin Su, Xiong Gao, YiFan Zhang, Tianlong Xu, Zhendong Chu, Aoxiao Zhong, Kun Wang, Hui Xiong, Philip S. Yu, Xuming Hu, Qingsong Wen · PDF
  27. Evolutionary Prompt Optimization Discovers Emergent Multimodal Reasoning Strategies in Vision-Language Models

    Sid Bharthulwar, John Rho, Katrina Brown · PDF
  28. Feedback-Aware Monte Carlo Tree Search for Efficient Information Seeking in Goal-Oriented Conversations

    Harshita Chopra, Chirag Shah · PDF
  29. Flex-TravelPlanner: A Benchmark for Flexible Planning with Language Agents

    Juhyun Oh, Eunsu Kim, Alice Oh · PDF
  30. GRAPE: Generalizing Robot Policy via Preference Alignment

    Zijian Zhang, Kaiyuan Zheng, Zhaorun Chen, Joel Jang, Yi Li, Siwei Han, Chaoqi Wang, Mingyu Ding, Dieter Fox, Huaxiu Yao · PDF
  31. IGDA: Interactive Graph Discovery through Large Language Model Agents

    Alexander Havrilla, David Alvarez-Melis, Nicolo Fusi · PDF
  32. Implicit Language Models are RNNs: Balancing Parallelization and Expressivity

    Mark Schöne, Babak Rahmani, Heiner Kremer, Fabian Falck, Hitesh Ballani, Jannes Gladrow · PDF
  33. Improving Test-Time Search for LLMs with Backtracking Against In-Context Value Verifiers

    Anikait Singh, Kushal Arora, Sedrick Keh, Jean Mercat, Tatsunori Hashimoto, Chelsea Finn, Aviral Kumar · PDF
  34. InductionBench: LLMs Fail in the Simplest Complexity Class

    Wenyue Hua, Fei Sun, Liangming Pan, Adam Jardine, William Yang Wang · PDF
  35. Landscape of Thoughts: Visualizing the Reasoning Process of Large Language Models

    Zhanke Zhou, Xuan Li, Zhaocheng Zhu, Mikhail Galkin, Xiao Feng, Sanmi Koyejo, Jian Tang, Bo Han · PDF
  36. Language Models Use Trigonometry to Do Addition

    Subhash Kantamneni, Max Tegmark · PDF
  37. Large Language Model-Enhanced Multi-Armed Bandits

    Jiahang Sun, Zhiyong Wang, Runhan Yang, Chenjun Xiao, John C.S. Lui, Zhongxiang Dai · PDF
  38. Large Language Models to Diffusion Finetuning

    Edoardo Cetin, Tianyu Zhao, Yujin Tang · PDF
  39. Learning to Defer for Causal Discovery with Imperfect Experts

    Oscar Clivio, Divyat Mahajan, Perouz Taslakian, Sara Magliacane, Ioannis Mitliagkas, Valentina Zantedeschi, Alexandre Drouin · PDF
  40. LightTransfer: Your Long-Context LLM is Secretly a Hybrid Model with Effortless Adaptation

    Xuan Zhang, Fengzhuo Zhang, Cunxiao Du, Chao Du, Tianyu Pang, Wei Gao, Min Lin · PDF
  41. Limits of Deep Learning: Sequence Modeling through the Lens of Complexity Theory

    Nikola Zubic, Federico Soldà, Aurelio Sulser, Davide Scaramuzza · PDF
  42. LLMs Are Not Good Strategists, Yet Memory-Enhanced Agency Boosts Reasoning

    Yi Wu, Zhimin Hu · PDF
  43. LLMs on the Line: Data Determines Loss-to-Loss Scaling Laws

    Prasanna Mayilvahanan, Thaddäus Wiedemer, Sayak Mallick, Matthias Bethge, Wieland Brendel · PDF
  44. LM2: Large Memory Models for Long Context Reasoning

    Jikun Kang, Wenqi Wu, Filippos Christianos, Alex James Chan, Fraser David Greenlee, George Thomas, Marvin Purtorab, Andrew Toulis · PDF
  45. LMAct: A Benchmark for In-Context Imitation Learning with Long Multimodal Demonstrations

    Anian Ruoss, Fabio Pardo, Harris Chan, Bonnie Li, Volodymyr Mnih, Tim Genewein · PDF
  46. Local Look-Ahead Guidance via Verifier-in-the-Loop for Automated Theorem Proving

    Sara Rajaee, Kumar Pratik, Gabriele Cesa, Arash Behboodi · PDF
  47. LogitGaze: Predicting Human Attention Using Semantic Information from Vision-Language Models

    Dmitry Lvov, Ilya Pershin · PDF
  48. LookPlanGraph: Embodied instruction following method with VLM graph augmentation

    Anatoly Onishchenko, Alexey Kovalev, Aleksandr Panov · PDF
  49. Lost in Time: Clock and Calendar Understanding Challenges in Multimodal LLMs

    Rohit Saxena, Aryo Pradipta Gema, Pasquale Minervini · PDF
  50. MALT: Improving Reasoning with Multi-Agent LLM Training

    Sumeet Ramesh Motwani, Chandler Smith, Rocktim Jyoti Das, Rafael Rafailov, Ivan Laptev, Philip Torr, Fabio Pizzati, Ronald Clark, Christian Schroeder de Witt · PDF
  51. MAS-GPT: Training LLMs To Build LLM-Based Multi-Agent Systems

    Rui Ye, Shuo Tang, Rui Ge, Yaxin Du, Zhenfei Yin, Jing Shao, Siheng Chen · PDF
  52. MastermindEval: A Simple But Scalable Reasoning Benchmark

    Jonas Golde, Patrick Haller, Fabio Barth, Alan Akbik · PDF
  53. MATH-Perturb: Benchmarking LLMs' Math Reasoning Abilities against Hard Perturbations

    Kaixuan Huang, Jiacheng Guo, Zihao Li, Xiang Ji, Jiawei Ge, Wenzhe Li, Yingqing Guo, Tianle Cai, Hui Yuan, Runzhe Wang, Yue Wu, Ming Yin, Shange Tang, Yangsibo Huang, Chi Jin, Xinyun Chen, Chiyuan Zhang, Mengdi Wang · PDF
  54. Meta-Prompt Optimization for LLM-Based Sequential Decision Making

    Mingze Kong, Zhiyong Wang, Yao Shu, Zhongxiang Dai · PDF
  55. MFC-Bench: Benchmarking Multimodal Fact-Checking with Large Vision-Language Models

    Shengkang Wang, Hongzhan Lin, Ziyang Luo, Zhen Ye, Guang Chen, Jing Ma · PDF
  56. MINDSTORES: Memory-Informed Neural Decision Synthesis for Task-Oriented Reinforcement in Embodied Systems

    Anirudh Chari, Suraj Marpadga Reddy, Aditya Tiwari, Richard Lian, Brian Lee Zhou · PDF
  57. MIR-Bench: Benchmarking LLM's Long-Context Intelligence via Many-Shot In-Context Inductive Reasoning

    Kai Yan, Zhan Ling, Kang Liu, Yifan Yang, Ting-Han Fan, Lingfeng Shen, Zhengyin Du, Jiecao Chen · PDF
  58. MMCode: Benchmarking Multimodal Large Language Models in Code Generation with Visually Rich Programming Problems

    Kaixin Li, Yuchen Tian, Qisheng Hu, Ziyang Luo, Zhiyong Huang, Jing Ma · PDF
  59. Multi-Agent Verification: Scaling Test-Time Compute with Multiple Verifiers (Abridged)

    Shalev Lifshitz, Sheila A. McIlraith, Yilun Du · PDF
  60. Multi-modal Agent Tuning: Building a VLM-Driven Agent for Efficient Tool Usage

    Zhi Gao, Bofei Zhang, Pengxiang Li, Xiaojian Ma, Tao Yuan, Yue Fan, Yuwei Wu, Yunde Jia, Song-Chun Zhu, Qing Li · PDF
  61. Multi-Turn Code Generation Through Single-Step Rewards

    Arnav Kumar Jain, Gonzalo Gonzalez-Pumariega, Wayne Chen, Alexander M Rush, Wenting Zhao, Sanjiban Choudhury · PDF
  62. Navigating Solution Spaces in Large Language Models through Controlled Embedding Exploration

    Qinglin Zhu, Runcong Zhao, Hanqi Yan, Yulan He, Yudong Chen, Lin Gui · PDF
  63. OctoTools: An Agentic Framework with Extensible Tools for Complex Reasoning

    Pan Lu, Bowen Chen, Sheng Liu, Rahul Thapa, Joseph Boen, James Zou · PDF
  64. Offline Reinforcement Learning for LLM Multi-Step Reasoning

    Huaijie Wang, Shibo Hao, Hanze Dong, Shenao Zhang, Yilin Bao, Ziran Yang, Yi Wu · PDF
  65. On the Language of Thoughts in Large Language Models

    Chenxi Liu, Yongqiang Chen, Tongliang Liu, James Cheng, Bo Han, Kun Zhang · PDF
  66. Optimizing Test-Time Compute via Meta Reinforcement Finetuning

    Yuxiao Qu, Matthew Y. R. Yang, Amrith Setlur, Lewis Tunstall, Edward Emanuel Beeching, Ruslan Salakhutdinov, Aviral Kumar · PDF
  67. PC-Agent: A Hierarchical Agentic Framework for Complex Task Automation on PC

    Haowei Liu, Xi Zhang, Haiyang Xu, Yuyang Wanyan, Junyang Wang, Ming Yan, Ji Zhang, Chunfeng Yuan, Changsheng Xu, Weiming Hu, Fei Huang · PDF
  68. PDE-Controller: LLMs for Autoformalization and Reasoning of PDEs

    Mauricio Soroco, Jialin Song, Mengzhou Xia, Kye Emond, Weiran Sun, Wuyang Chen · PDF
  69. PHYSICS: Benchmarking Foundation Models for Problem Solving in Physics

    Kaiyue Feng, Yilun Zhao, Yixin Liu, Tianyu Yang, Chen Zhao, John Sous, Arman Cohan · PDF
  70. Plan*RAG: Efficient Test-Time Planning for Retrieval Augmented Generation

    Prakhar Verma, Sukruta Prakash Midigeshi, Gaurav Sinha, Arno Solin, Nagarajan Natarajan, Amit Sharma · PDF
  71. QLASS: Boosting Language Agent Inference via Q-Guided Stepwise Search

    Zongyu Lin, Yao Tang, Xingcheng Yao, Da Yin, Ziniu Hu, Yizhou Sun, Kai-Wei Chang · PDF
  72. Rationalization Models for Text-to-SQL

    Gaetano Rossiello, Nhan H Pham, Michael Glass, Junkyu Lee, Dharmashankar Subramanian · PDF
  73. Re-Imagine: Symbolic Benchmark Synthesis for Reasoning Evaluation

    Xinnuo Xu, Rachel Lawrence, Kshitij Dubey, Atharva Pandey, Fabian Falck, Risa Ueno, Aditya V. Nori, Rahul Sharma, Amit Sharma, Javier Gonzalez · PDF
  74. Reasoning Effort and Problem Complexity: A Scaling Analysis in LLMs

    Benjamin Estermann, Roger Wattenhofer · PDF
  75. Reasoning3D - Grounding and Reasoning in 3D: Fine-Grained Zero-Shot Open-Vocabulary 3D Reasoning Part Segmentation via Large Vision-Language Models

    Tianrun Chen, Chunan Yu, Jing Li, Jianqi Zhang, Lanyun Zhu, Deyi Ji, Yong Zhang, Ying Zang, Lingyun Sun, Zejian Li · PDF
  76. Refining Answer Distributions for Improved Large Language Model Reasoning

    Soumyasundar Pal, Didier Chételat, Yingxue Zhang, Mark Coates · PDF
  77. Reinforcement Learning in Inference Time: A Perspective from Successive Policy Iterations

    Xinnan Zhang, Chenliang Li, Siliang Zeng, Jiaxiang Li, Zhongruo Wang, Songtao Lu, Alfredo Garcia, Mingyi Hong · PDF
  78. Resolving Ambiguity through Personalization in LLM chat systems

    Sophia Huiwen Sun, Abishek Sankararaman, Balakrishnan Murali Narayanaswamy · PDF
  79. Rethinking Fine-tuning when Scaling Test-Time Compute: Limiting Confidence Improves Mathematical Reasoning

    Feng Chen, Allan Raventos, Nan Cheng, Surya Ganguli, Shaul Druckmann · PDF
  80. Reveal the Mystery of DPO: The Connection between DPO and RL Algorithms

    Xuerui Su, Yue Wang, Jinhua Zhu, Mingyang Yi, Feng Xu, Zhi-Ming Ma, Yuting Liu · PDF
  81. Revealing chemical reasoning in LLMs through search on complex planning tasks

    Andres M Bran, Théo A. Neukomm, Daniel P Armstrong, Zlatko Jončev, Philippe Schwaller · PDF
  82. ReVISE: Learning to Refine at Test-Time via Intrinsic Self-Verification

    Hyunseok Lee, Seunghyuk Oh, Jihoon Tack, Jaehyung Kim, Jinwoo Shin · PDF
  83. RL-STaR: Theoretical Analysis of Reinforcement Learning Frameworks for Self-Taught Reasoner

    Fu-Chieh Chang, Yu-Ting Lee, Hui-Ying Shih, Yi Hsuan Tseng, Pei-Yuan Wu · PDF
  84. RuleArena: A Benchmark for LLM Rule-Guided Reasoning in Real-World Scenarios

    Ruiwen Zhou, Wenyue Hua, Liangming Pan, Sitao Cheng, Xiaobao Wu, En Yu, William Yang Wang · PDF
  85. s1: Simple test-time scaling

    Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Candes, Tatsunori Hashimoto · PDF
  86. Scaling Flaws of Verifier-guided Search in Mathematical Reasoning

    Fei Yu, Yingru Li, Benyou Wang · PDF
  87. Scaling Inference-Time Search with Vision Value Model for Improved Visual Comprehension

    Xiyao Wang, Zhengyuan Yang, Linjie Li, Hongjin Lu, Yuancheng Xu, Chung-Ching Lin, Kevin Lin, Furong Huang, Lijuan Wang · PDF
  88. ScreenSpot-Pro: GUI Grounding for Professional High-Resolution Computer Use

    Kaixin Li, Meng Ziyang, Hongzhan Lin, Ziyang Luo, Yuchen Tian, Jing Ma, Zhiyong Huang, Tat-Seng Chua · PDF
  89. Self-Reasoning Language Models: Unfold Hidden Reasoning Chains with Few Reasoning Catalyst

    Hongru Wang, Deng Cai, Wanjun Zhong, Shijue Huang, Jeff Z. Pan, Zeming Liu, Kam-Fai Wong · PDF
  90. SiriuS: Self-improving Multi-agent Systems via Bootstrapped Reasoning

    Wanjia Zhao, Mert Yuksekgonul, Shirley Wu, James Zou · PDF
  91. Spectral Journey: How Transformers Predict the Shortest Path

    Andrew Cohen, Andrey Gromov, Kaiyu Yang, Yuandong Tian · PDF
  92. StochasTok: Improving Fine-Grained Subword Understanding in LLMs

    Anya Sims, Cong Lu, Klara Kaleb, Jakob Nicolaus Foerster, Yee Whye Teh · PDF
  93. Strategic LLM Decoding through Bayesian Games

    Weitong Zhang, Chengqi Zang, Bernhard Kainz · PDF
  94. TACO: Learning Multi-modal Models to Reason and Act with Synthetic Chains-of-Thought-and-Action

    Zixian Ma, Jianguo Zhang, Zhiwei Liu, Jieyu Zhang, Juntao Tan, Manli Shu, Juan Carlos Niebles, Shelby Heinecke, Huan Wang, Caiming Xiong, Ranjay Krishna, Silvio Savarese · PDF
  95. Teaching Transformers Causal Reasoning through Axiomatic Training

    Aniket Vashishtha, Abhinav Kumar, Atharva Pandey, Abbavaram Gowtham Reddy, Kabir Ahuja, Vineeth N. Balasubramanian, Amit Sharma · PDF
  96. The in-context inductive biases of vision-language models differ across modalities

    Kelsey R Allen, Eliza Kosoy, Ishita Dasgupta, Andrew Kyle Lampinen · PDF
  97. Think Smarter not Harder: Adaptive Reasoning with Inference Aware Optimization

    Zishun Yu, Tengyu Xu, Di Jin, Karthik Abinav Sankararaman, Yun He, Wenxuan Zhou, Zhouhao Zeng, Eryk Helenowski, Chen Zhu, Sinong Wang, Hao Ma, Han Fang · PDF
  98. Think to Ground: Improving Spatial Reasoning in LLMs for better Visual Grounding

    Karun Sharma, Vidushee Vats · PDF
  99. Thinking Slow, Fast: Scaling Inference Compute with Distilled Reasoners

    Daniele Paliotta, Junxiong Wang, Matteo Pagliardini, Kevin Li, Aviv Bick, Albert Gu, François Fleuret, Tri Dao · PDF
  100. Towards Hierarchical Multi-Agent Workflows for Zero-Shot Prompt Optimization

    Yuchi Liu, Jaskirat Singh, Gaowen Liu, Ali Payani, Liang Zheng · PDF
  101. Training Large Language Models to Reason in a Continuous Latent Space

    Shibo Hao, Sainbayar Sukhbaatar, DiJia Su, Xian Li, Zhiting Hu, Jason E Weston, Yuandong Tian · PDF
  102. TRIG-Bench: A Benchmark for Text-Rich Image Grounding

    Ming Li, Ruiyi Zhang, Jian Chen, Tianyi Zhou · PDF
  103. Understanding Financial Reasoning in AI: A Multimodal Benchmark and Error Learning Approach

    Shuangyan Deng, Haizhou Peng, Jiachen Xu, Chunhou Liu, Ciprian Doru Giurcaneanu, Jiamou Liu · PDF
  104. Understanding Inference Scaling Laws for Mixtures of LLMs

    Alexander Havrilla, Srishti Gureja · PDF
  105. Understanding Reasoning in Thinking Language Models via Steering Vectors

    Constantin Venhoff, Iván Arcuschin, Philip Torr, Arthur Conmy, Neel Nanda · PDF
  106. Unraveling Arithmetic in Large Language Models: The Role of Algebraic Structures

    Fu-Chieh Chang, You-Chen Lin, Pei-Yuan Wu · PDF
  107. Value-Based Deep RL Scales Predictably

    Oleh Rybkin, Michal Nauman, Preston Fu, Charlie Victor Snell, Pieter Abbeel, Sergey Levine, Aviral Kumar · PDF
  108. WebWalker: Benchmarking LLMs in Web Traversal

    Jialong Wu, Wenbiao Yin, Yong Jiang, Zhenglin Wang, Zekun Xi, Runnan Fang, Linhai Zhang, Yulan He, Deyu Zhou, Pengjun Xie, Fei Huang · PDF
  109. When Debate Fails: Bias Reinforcement in Large Language Models

    Jihwan Oh, Minchan Jeong, Jongwoo Ko, Se-Young Yun · PDF
  110. When More is Less: Understanding Chain-of-Thought Length in LLMs

    Yuyang Wu, Yifei Wang, Tianqi Du, Stefanie Jegelka, Yisen Wang · PDF