
ICML 2025 Workshop on Assessing World Models

ICML 2025 World Models Workshop


Submission deadline: May 22, 2025, 11:59 UTC
Imported from OpenReview; check the website for extensions.
Submission portal: OpenReview

Accepted papers (36)

Fetched from OpenReview (v2) on 2026-06-10.

Adapting Vision-Language Models for Evaluating World Models
Mariya Hendriksen, Tabish Rashid, David Bignell, Raluca Georgescu, Abdelhak Lemkhenter, Katja Hofmann, Sam Devlin, Sarah Parisot · PDF
APOD: Adaptive PDE-Observation Diffusion for Physics-Constrained Sampling
Ruichen Xu, Haochun Wang, Georgios Kementzidis, Chenhao Si, Yuefan Deng · PDF
Aquilon: Towards Building Multimodal Weather LLMs
Sumanth Varambally, Veeramakali Vignesh Manivannan, Yasaman Jafari, Luyu Han, Zachary Novack, Zhirui Xia, Salva Rühling Cachay, Srikar Eranky, Ruijia Niu, Taylor Berg-Kirkpatrick, Duncan Watson-Parris, Yian Ma, Rose Yu · PDF
Are LLM Belief Updates Consistent with Bayes’ Theorem?
Sohaib Imran, Ihor Kendiukhov, Matthew Broerman, Aditya Thomas, Riccardo Campanella, Rob Lamb, Peter M. Atkinson · PDF
Beyond Behavioural Evaluations for Assessing World Models
Kola Ayonrinde · PDF
Cards Against Contamination: TCG-Bench for Difficulty-Scalable Multilingual LLM Reasoning
Sultan AlRashed, Jianghui Wang, Francesco Orabona · PDF
Contextual Effects in LLM and Human Causal Reasoning
Zach Studdiford, Gary Lupyan · PDF
Deep Koopman operator framework for causal discovery in nonlinear dynamical systems
Juan Nathaniel, Carla Roesch, Jatan Buch, Derek DeSantis, Adam Rupe, Kara D Lamb, Pierre Gentine · PDF
Do Vision Language Models infer human intention without visual perspective-taking? Towards a scalable "One-Image-Probe-All" dataset
Bingyang Wang, Yijiang Li, Qingyang Zhou, Hui Yi Leong, Tianwei Zhao, Letian Ye, Hokin Deng, Dezhi Luo, Nuno Vasconcelos · PDF
Eliminating Discriminative Shortcuts in Multiple Choice Evaluations with Answer Matching
Nikhil Chandak, Shashwat Goel, Ameya Prabhu, Moritz Hardt, Jonas Geiping · PDF
Evaluating Forecasting is More Difficult than Other LLM Evaluations
Daniel Paleka, Shashwat Goel, Jonas Geiping, Florian Tramèr · PDF
Evaluating Self-Orienting in Language and Reasoning Models
Eric J Bigelow, Zergham Ahmed, Tomer Ullman · PDF
FRED: Financial Retrieval-Enhanced Detection and Editing of Hallucinations in Language Models
Likun Tan, Kuan-Wei Huang, Kevin Wu · PDF
GeoChain: Multimodal Chain-of-Thought for Geographic Reasoning
Sahiti Yerramilli, Nilay Pande, Jayant Sravan Tamarapalli, Rynaa Grover · PDF
HueManity: Probing Fine-Grained Visual Perception in MLLMs
Rynaa Grover, Jayant Sravan Tamarapalli, Sahiti Yerramilli, Nilay Pande · PDF
I Have No Mouth, and I Must Rhyme: Uncovering Internal Phonetic Representations in LLaMA 3.2
Oliver McLaughlin, Jack Merullo, Arjun Khurana · PDF
Let’s Simulate Frame-by-Frame: In-Context Physical Simulations with Vision-Language Models
YingQiao Wang, Eric J Bigelow, Tomer Ullman · PDF
Leveraging the Sequential Nature of Language for Interpretability
Usha Bhalla, Alex Oesterling, Claudio Mayrink Verdun, Flavio Calmon, Himabindu Lakkaraju · PDF
Measuring Belief Updates in Curious Agents
Joschka Strüber, Ilze Amanda Auzina, Shashwat Goel, Susanne Keller, Jonas Geiping, Ameya Prabhu, Matthias Bethge · PDF
Measuring Rule-Following in Language Models
· PDF
MET-Bench: Multimodal Entity Tracking for Evaluating the Limitations of Vision-Language and Reasoning Models
Vanya Cohen, Ray Mooney · PDF
Newfluence: Boosting Model Interpretability and Understanding in High Dimensions
Haolin Zou, Arnab Auddy, Yongchan Kwon, Kamiar Rahnama Rad, Arian Maleki · PDF
On the Emergence of "Useless" Features in Next Token Predictors
Mark Rofin, Jalal Naghiyev, Michael Hahn · PDF
Open World Scene Graph Generation using Vision Language Models
· PDF
Probing the Limits of Mathematical World Models in LLMs
Henry Kvinge, Elizabeth Coda, Eric Yeats, Davis Brown, John Buckheit, Sarah McGuire Scullen, Brendan Kennedy, Loc Truong, William Kay, Cliff Joslyn, Tegan Emerson, Michael J. Henry, John Anthony Emanuello · PDF
ReviseQA: A Benchmark for Belief Revision in Multi-Turn Logical Reasoning
Chadi Helwe, Sultan AlRashed, Francesco Orabona · PDF
RMA: Reward Model Alignment with Human preference
Ashish Gupta, Manjunatha Naik MC · PDF
Testing LLM Understanding of Scientific Literature through Expert-Driven Question Answering: Insights from High-Temperature Superconductivity
Haoyu Guo, Maria Tikhanovskaya, Paul Raccuglia, Alexey Vlaskin, Chris Co, Daniel J. Liebling, Scott Ellsworth, Matthew Abraham, Elizabeth Dorfman, Peter Armitage, John Tranquada, Senthil Todadri, Antoine Georges, Subir Sachdev, Steven Kivelson, Brad Ramshaw, Dominik Kiese, Chunhan Feng, Olivier Gingras, Vadim Oganesyan, Michael Brenner, Subhashini Venugopalan, Eun-Ah Kim · PDF
Tracking World States with Language Models: State-Based Evaluation Using Chess
Romain Harang, Jason Naradowsky, Yaswitha Gujju, Yusuke Miyao · PDF
Unbounded Memory and Consistent Imagination via Unified Diffusion–SSM World Models
Jia-Hua Lee, Bor-Jiun Lin, Wei-Fang Sun, Chun-Yi Lee · PDF
Uncertainty Quantification for LLM-Based Survey Simulations
Chengpiao Huang, Yuhang Wu, Kaizheng Wang · PDF
Understanding Large Language Models' Ability on Interdisciplinary Research
Yuanhao Shen, Daniel Xavier de Sousa, Ricardo Marçal de Andrade Nascimento, Ali Asad, Hongyu Guo, Xiaodan Zhu · PDF
Virtue Semantics: Probing the Consistency of Moral Values of Large Language Models
Em Smullen, Srihari Thirumaligai, Anna Leshinskaya · PDF
What if Othello-Playing Language Models Could See?
Xinyi Chen, Yifei Yuan, Jiaang Li, Serge Belongie, Maarten de Rijke, Anders Søgaard · PDF
World Models and Consistent Mistakes in LLMs
Christopher Wolfram, Aaron Schein · PDF
WorldPrediction: A Benchmark for High-level World Modeling and Long-horizon Procedural Planning
Delong Chen, Willy Chung, Yejin Bang, Ziwei Ji, Pascale Fung · PDF
