ICLR 2026 Workshop on Multimodal Intelligence: Next Token Prediction & Beyond

ICLR 2026 Workshop MM Intelligence

Submission deadline: Feb 6, 2026, 13:00 UTC (imported from OpenReview; check the website for extensions)
Submission portal: OpenReview
Notes: Auto-imported from the OpenReview venue record on 2026-06-10; please verify and enrich (topics are keyword-guessed).

Accepted papers (60)

Fetched from OpenReview (v2) on 2026-06-10.

  1. A Single Image and Multimodality Is All You Need for Novel View Synthesis

    Amirhosein Javadi, Chi-Shiang Gau, Konstantinos D. Polyzos, Tara Javidi · PDF
  2. A Systematic Study of Behavioral Cloning for Scientific Data Annotation

    Ishaan Singh Chandok, Core Francisco Park · PDF
  3. AdaTS: Adaptive Token Sampling for Efficient Speech Language Models

    Sonal Sannigrahi, Giuseppe Attanasio, Andre Martins · PDF
  4. An Efficient Training Pipeline for Reasoning Graphical User Interface Agents

    Georgios Pantazopoulos, Eda B. Ozyigit · PDF
  5. Beyond Seeing: Evaluating Multimodal LLMs on Tool-Enabled Image Perception, Transformation, and Reasoning

    Xingang Guo, Utkarsh Tyagi, Advait Gosai, Paula Vergara, Jayeon Park, Ernesto Gabriel Hernández Montoya, Chen Bo Calvin Zhang, Bin Hu, Yunzhong He, Bing Liu, Rakshith Sharma Srinivasa · PDF
  6. BlockGen: Flexible Blockwise Sequence Modeling with Hybrid Samplers

    Justin Deschenaux, Caglar Gulcehre · PDF
  7. Bridging Generative and Predictive Paradigms via Hidden-Self-Distillation

    Scott C. Lowe, Anthony Fuller, Sageev Oore, Evan Shelhamer, Graham W. Taylor · PDF
  8. Can Multi-Modal LLMs Provide Live Step-by-Step Task Guidance?

    Apratim Bhattacharyya, Bicheng Xu, Sanjay Haresh, Reza Pourreza, Litian Liu, Sunny Panchal, Pulkit Madan, Leonid Sigal, Roland Memisevic · PDF
  9. Can Vision Models Process Physiological Signals? Exploring Visual Tokenization as a Representation Interface

    Frida M. E. Westby, Li Meng, Anis Yazidi, Ali Ramezani-Kebrya · PDF
  10. CARPE: Context-Aware Image Representation Prioritization via Ensemble for Large Vision-Language Models

    Dong Hee Lee, Rui Cai, Zhe Zhao · PDF
  11. City Navigation in the Wild: Exploring Emergent Navigation from Web-Scale Knowledge in MLLMs

    Dwip Dalal, Utkarsh Mishra, Narendra Ahuja, Nebojsa Jojic · PDF
  12. CMRAG: Co-modality-based visual document retrieval and question answering

    Wang Chen, Wenhan Yu, Guanqiang Qi, Weikang Li, Yang Li, Lei Sha, Deguo Xia, Jizhou Huang · PDF
  13. CRAFT-GUI: Curriculum-Reinforced Agent For GUI Tasks

    Songqin Nong, Xiaoxuan Tang, Jingxuan Xu, Sheng Zhou, Jianfeng Chen, Tao Jiang, Wenhao Xu · PDF
  14. Data Provenance for Image Auto-Regressive Generation

    Bihe Zhao, Louis Kerner, Michel Meintz, Tameem Bakr, Franziska Boenisch, Adam Dziedzic · PDF
  15. DeFacto: Counterfactual Thinking with Images for Enforcing Evidence-Grounded and Faithful Reasoning

    Tianrun Xu, Haoda Jing, Ye Li, Yuquan Wei, Jun Feng, Guanyu Chen, Tianren Zhang, Jing Liu, Haichuan Gao, Feng Chen · PDF
  16. Depth Over Specialization in Small Multimodal Transformers

    Jakub Mroz, Henry Ndubuaku · PDF
  17. Diagnosing the Curse: A Scale-Consistent and All-Phase Metric for Modality Bias in MLLMs

    Jinlin He, Chenfei Liao, Xu Zheng, Mengyu Jin, Xuming Hu · PDF
  18. DiffuMamba: High-Throughput Diffusion LMs with Mamba Backbone

    Vaibhav Singh, Oleksiy Ostapenko, Pierre-Andre Noel, Eugene Belilovsky, Torsten Scholak · PDF
  19. DISCO: Document Intelligence Suite for COmparative Evaluation

    Kenza Benkirane, Martin Asenov, Daniel Goldwater, Aneiss Ghodsi · PDF
  20. Does Semantic Noise Initialization Transfer from Images to Videos? A Paired Diagnostic Study

    Yixiao Jing, Chaoyu Zhang, Zixuan Zhong, Peizhou Huang · PDF
  21. Efficient Multimodal Generation via Redundancy-Aware Mixture-of-Experts

    Raman Dutt, Harleen Hanspal, Petru-Daniel Tudosiu, Alexander Black, Yongxin Yang, Steven McDonagh, Sarah Parisot · PDF
  22. ENACT: Evaluating Embodied Cognition with World Modeling of Egocentric Interaction

    Qineng Wang, Wenlong Huang, Yu Zhou, Hang Yin, Tianwei Bao, Jianwen Lyu, Weiyu Liu, Ruohan Zhang, Jiajun Wu, Li Fei-Fei, Manling Li · PDF
  23. Esoteric Language Models: Bridging Autoregressive and Masked Diffusion LLMs

    Subham Sekhar Sahoo, Zhihan Yang, Yash Akhauri, Johnna Liu, Deepansha Singh, Zhoujun Cheng, Zhengzhong Liu, Eric P. Xing, John Thickstun, Arash Vahdat · PDF
  24. Fine-Tuning Masked Diffusion for Provable Self-Correction

    Jaeyeon Kim, Seunggeun Kim, Taekyun Lee, David Z. Pan, Hyeji Kim, Sham M. Kakade, Sitan Chen · PDF
  25. GHVL: Geometry-Grounded Hyperbolic Vision-Language Models for Hierarchical Multimodal Representation Learning

    Kathy Wu, Sarthak Srivastava · PDF
  26. GRAID: Enhancing Spatial Reasoning of VLMs through High-Fidelity Data Generation

    Karim Elmaaroufi, Liheng Lai, Justin Svegliato, Yutong Bai, Sanjit A. Seshia, Matei Zaharia · PDF
  27. Growing Visual Generative Capacity for Pre-Trained MLLMs

    Hanyu Wang, Jiaming Han, Ziyan Yang, Qi Zhao, Shanchuan Lin, Xiangyu Yue, Abhinav Shrivastava, Zhenheng Yang, Hao Chen · PDF
  28. Index-Preserving Lightweight Token Pruning for Efficient Document Understanding in Vision-Language Models

    Jaemin Son, Sujin Choi, Inyong Yun · PDF
  29. Infinity and Beyond: Compositional Alignment in VAR and Diffusion T2I Models

    Hossein Shahabadi, Niki Sepasian, Arash Marioriyad, Ali Sharifi-Zarchi, Mahdieh Soleymani Baghshah · PDF
  30. Is Extending Modality The Right Path Towards Omni-Modality?

    Tinghui Zhu, Kai Zhang, Muhao Chen, Yu Su · PDF
  31. LanteRn: Latent Visual Structured Reasoning

    Andre G. Viveiros, Nuno Gonçalves, Matthias Lindemann, Andre Martins · PDF
  32. MapQA: A Map-Question-Answering Benchmark for Visual Language Model Reasoning

    Christian Michael Arnold, Andrew Alini, Jonathan Wang, Pieter M Feenstra, Conner Arnold, Jan DeWitt, Natalie C Ritsema, Jung Hyun Yae, Boris Katz, Andrei Barbu, Brian Cheung · PDF
  33. MLLMs are Deeply Affected by Modality Bias

    Xu Zheng, Chenfei Liao, Yuqian Fu, Kaiyu Lei, Yuanhuiyi Lyu, Lutao Jiang, Bin Ren, Jialei Chen, Jiawen Wang, Chengxin Li, Linfeng Zhang, Danda Pani Paudel, Xuanjing Huang, Yu-Gang Jiang, Nicu Sebe, Dacheng Tao, Luc Van Gool, Xuming Hu · PDF
  34. Movie Facts and Fibs (MF$^2$): A Benchmark for Long Movie Understanding

    Emmanouil Zaranis, António Farinhas, Saul Santos, Beatriz Canaverde, Miguel Moura Ramos, Wafaa Mohammed, Giuseppe Attanasio, Chrysoula Zerva, Nithin Sivakumaran, Shoubin Yu, Elena Bueno-Benito, Aditya K Surikuchi, Ben Peters, Danae Sanchez Villegas, Andre G. Viveiros, Pavlo Vasylenko, Baohao Liao, Sonal Sannigrahi, Jaehong Yoon, Elias Stengel-Eskin, Mariella Dimiccoli, Oswald Lanz, Alessandro Suglia, Mohit Bansal, Sandro Pezzelle, Stella Frank, Vlad Niculae, Desmond Elliott, Raffaella Bernardi, Raquel Fernández, Andre Martins · PDF
  35. Multimodal Language Models Cannot Spot Spatial Inconsistencies

    Om Khangaonkar, Hadi J. Rad, Hamed Pirsiavash · PDF
  36. Neural Signals Generate Clinical Notes in the Wild

    Jathurshan Pradeepkumar, Zheng Chen, Jimeng Sun · PDF
  37. Next-Scale Autoregression on Spectrograms for Sound Generation

    Eleonora Ristori, Luca Bindini, Paolo Frasconi · PDF
  38. Not All Features Are Created Equal: A Mechanistic Study of Vision-Language-Action Models

    Bryce Grant, Xijia Zhao, Peng Wang · PDF
  39. Reinforce Your Layout: Online Reward-Guided Diffusion for Layout-to-Image Generation

    Ruijie Li, Xiaojun Shan, Zheng Ding, Zeyuan Chen, Zhuowen Tu · PDF
  40. Rethinking Visual Information Processing in Multimodal LLMs

    Dongwan Kim, Viresh Ranjan, Takashi Nagata, Arnab Dhua, Amit Kumar K C · PDF
  41. RigidBench: Evaluating Rigid-Body Physics in Video Generation Models

    Swarnim Jain, Shangzhe Wu · PDF
  42. Scaling Next-Brain-Token Prediction for MEG

    Richard Csaky · PDF
  43. SCOPE: Selective Cross-modal Orchestration of Visual Perception Experts

    Tianyu Zhang, Suyuchen Wang, Chao Wang, Juan A. Rodriguez, Ahmed Masry, Xiangru Jian, Yoshua Bengio, Perouz Taslakian · PDF
  44. Seeing to Generalize: How Visual Data Corrects Binding Shortcuts

    Nicolas Buzeta, Felipe del Rio, Cristian Hinostroza, Denis Parra, Hans Lobel, Rodrigo Toro Icarte · PDF
  45. StarFlow: Generating Structured Workflow Outputs From Sketch Images

    Patrice Bechard, Chao Wang, Amirhossein Abaskohi, Juan A. Rodriguez, Christopher Pal, David Vazquez, Spandana Gella, Sai Rajeswar, Perouz Taslakian · PDF
  46. Stop Training for the Worst: Progressive Unmasking Accelerates Masked Diffusion Training

    Jaeyeon Kim, Jonathan Geuter, David Alvarez-Melis, Sham M. Kakade, Sitan Chen · PDF
  47. The Efficiency Gap in Byte Modeling

    Celine Lee, Jing Nathan Yan, Chen Liang, Jiaxin Shi, Yin Zhang, Jeremiah Zhe Liu, Pengcheng Yin, Ed H. Chi, Fernando Pereira, Derek Zhiyuan Cheng, Alexander M Rush, Ruoxi Wang · PDF
  48. Towards Mitigating Hallucinations in Large Vision-Language Models by Refining Textual Embeddings

    Aakriti Agrawal, Gouthaman KV, Rohith Aralikatti, Gauri Jagatap, Jiaxin Yuan, Vijay Kamarshi, Andrea Fanelli, Furong Huang · PDF
  49. TowerVision: Understanding and Improving Multilinguality in Vision-Language Models

    Andre G. Viveiros, Patrick Fernandes, Saul Santos, Sonal Sannigrahi, Emmanouil Zaranis, Nuno M Guerreiro, Amin Farajian, Graham Neubig, Andre Martins · PDF
  50. UniFusion: Vision-Language Model as Unified Encoder in Image Generation and Editing

    Yu-Teng Li, Manuel Brack, Sudeep Katakol, Hareesh Ravi, Ajinkya Kale · PDF
  51. Unifying Autoregressive and Discrete Diffusion Language Modeling via Cross-Regressive Decoding

    Dmitry Abulkhanov, Daniil Strizhakov, Maxim Panov · PDF
  52. Vid2Sid: Videos Can Help Close the Sim2Real Gap

    Kevin Qiu, Yu Zhang, Marek Cygan, Josie Hughes · PDF
  53. VisGym: Diverse, Customizable, Scalable Environments for Multimodal Agents

    Zirui Wang, Junyi Zhang, Jiaxin Ge, Long Lian, Letian Fu, Lisa Dunlap, Ken Goldberg, XuDong Wang, Ion Stoica, David M. Chan, Sewon Min, Joseph E. Gonzalez · PDF
  54. Visual Representation Alignment for Multimodal Large Language Models

    Heeji Yoon, Jaewoo Jung, Junwan Kim, Hyungyu Choi, Heeseong Shin, Sangbeom Lim, Honggyu An, Chaehyun Kim, Jisang Han, Donghyun Kim, Chanho Eom, Sunghwan Hong, Seungryong Kim · PDF
  55. Visuals Lie, Consistency Speaks: Disentangling Spatial Attention from Reliability in Vision-Language Models

    Logan Mann, Yi Xia, Ajit Saravanan, Ishan Dave, Saadullah Ismail, Shikhar Shiromani, Emily Huang, Ruizhe Li, Kevin Zhu · PDF
  56. VLM-RobustBench: A Robustness Benchmark for Vision-Language Models

    Rohit Saxena, Alessandro Suglia, Pasquale Minervini · PDF
  57. Worse Together: Understanding the Brittleness of Multimodal Models on Rare Concept Pairs

    Helen Qu, Sang Michael Xie · PDF
  58. You Can Learn Tokenization End-to-End with Reinforcement Learning

    Sam Dauncey, Roger Wattenhofer · PDF
  59. Your Autoregressive Visual Model is a Natively Multi-Token Predictor: Speculative Coupled Decoding for Fast Autoregressive Visual Generation

    Junhyuk So, Hyunho Kook, Chaeyeon Jang, Eunhyeok Park · PDF
  60. Zoom-Zero: Reinforced Coarse-to-Fine Video Understanding via Temporal Zoom-in

    Xiaoqian Shen, Min-Hung Chen, Yu-Chiang Frank Wang, Mohamed Elhoseiny, Ryo Hachiuma · PDF