ICLR 2026 Workshop on Multimodal Intelligence: Next Token Prediction & Beyond
- Submission deadline: Feb 6, 2026, 13:00 UTC (imported from OpenReview; check the website for extensions)
- Submission portal: OpenReview
- Notes: Auto-imported from the OpenReview venue record on 2026-06-10; please verify and enrich (topics are keyword-guessed).
Accepted papers (60)
Fetched from OpenReview (v2) on 2026-06-10.
-
A Single Image and Multimodality Is All You Need for Novel View Synthesis
-
A Systematic Study of Behavioral Cloning for Scientific Data Annotation
-
AdaTS: Adaptive Token Sampling for Efficient Speech Language Models
-
An Efficient Training Pipeline for Reasoning Graphical User Interface Agents
-
Beyond Seeing: Evaluating Multimodal LLMs on Tool-Enabled Image Perception, Transformation, and Reasoning
-
BlockGen: Flexible Blockwise Sequence Modeling with Hybrid Samplers
-
Bridging Generative and Predictive Paradigms via Hidden-Self-Distillation
-
Can Multi-Modal LLMs Provide Live Step-by-Step Task Guidance?
-
Can Vision Models Process Physiological Signals? Exploring Visual Tokenization as a Representation Interface
-
CARPE: Context-Aware Image Representation Prioritization via Ensemble for Large Vision-Language Models
-
City Navigation in the Wild: Exploring Emergent Navigation from Web-Scale Knowledge in MLLMs
-
CMRAG: Co-modality-based visual document retrieval and question answering
-
CRAFT-GUI: Curriculum-Reinforced Agent For GUI Tasks
-
Data Provenance for Image Auto-Regressive Generation
-
DeFacto: Counterfactual Thinking with Images for Enforcing Evidence-Grounded and Faithful Reasoning
-
Depth Over Specialization in Small Multimodal Transformers
-
Diagnosing the Curse: A Scale-Consistent and All-Phase Metric for Modality Bias in MLLMs
-
DiffuMamba: High-Throughput Diffusion LMs with Mamba Backbone
-
DISCO: Document Intelligence Suite for COmparative Evaluation
-
Does Semantic Noise Initialization Transfer from Images to Videos? A Paired Diagnostic Study
-
Efficient Multimodal Generation via Redundancy-Aware Mixture-of-Experts
-
ENACT: Evaluating Embodied Cognition with World Modeling of Egocentric Interaction
-
Esoteric Language Models: Bridging Autoregressive and Masked Diffusion LLMs
-
Fine-Tuning Masked Diffusion for Provable Self-Correction
-
GHVL: Geometry-Grounded Hyperbolic Vision-Language Models for Hierarchical Multimodal Representation Learning
-
GRAID: Enhancing Spatial Reasoning of VLMs through High-Fidelity Data Generation
-
Growing Visual Generative Capacity for Pre-Trained MLLMs
-
Index-Preserving Lightweight Token Pruning for Efficient Document Understanding in Vision-Language Models
-
Infinity and Beyond: Compositional Alignment in VAR and Diffusion T2I Models
-
Is Extending Modality The Right Path Towards Omni-Modality?
-
LanteRn: Latent Visual Structured Reasoning
-
MapQA: A Map-Question-Answering Benchmark for Visual Language Model Reasoning
-
MLLMs are Deeply Affected by Modality Bias
-
Movie Facts and Fibs (MF$^2$): A Benchmark for Long Movie Understanding
-
Multimodal Language Models Cannot Spot Spatial Inconsistencies
-
Neural Signals Generate Clinical Notes in the Wild
-
Next-Scale Autoregression on Spectrograms for Sound Generation
-
Not All Features Are Created Equal: A Mechanistic Study of Vision-Language-Action Models
-
Reinforce Your Layout: Online Reward-Guided Diffusion for Layout-to-Image Generation
-
Rethinking Visual Information Processing in Multimodal LLMs
-
RigidBench: Evaluating Rigid-Body Physics in Video Generation Models
-
Scaling Next-Brain-Token Prediction for MEG
-
SCOPE: Selective Cross-modal Orchestration of Visual Perception Experts
-
Seeing to Generalize: How Visual Data Corrects Binding Shortcuts
-
StarFlow: Generating Structured Workflow Outputs From Sketch Images
-
Stop Training for the Worst: Progressive Unmasking Accelerates Masked Diffusion Training
-
The Efficiency Gap in Byte Modeling
-
Towards Mitigating Hallucinations in Large Vision-Language Models by Refining Textual Embeddings
-
TowerVision: Understanding and Improving Multilinguality in Vision-Language Models
-
UniFusion: Vision-Language Model as Unified Encoder in Image Generation and Editing
-
Unifying Autoregressive and Discrete Diffusion Language Modeling via Cross-Regressive Decoding
-
Vid2Sid: Videos Can Help Close the Sim2Real Gap
-
VisGym: Diverse, Customizable, Scalable Environments for Multimodal Agents
-
Visual Representation Alignment for Multimodal Large Language Models
-
Visuals Lie, Consistency Speaks: Disentangling Spatial Attention from Reliability in Vision-Language Models
-
VLM-RobustBench: A Robustness Benchmark for Vision-Language Models
-
Worse Together: Understanding the Brittleness of Multimodal Models on Rare Concept Pairs
-
You Can Learn Tokenization End-to-End with Reinforcement Learning
-
Your Autoregressive Visual Model is a Natively Multi-Token Predictor: Speculative Coupled Decoding for Fast Autoregressive Visual Generation
-
Zoom-Zero: Reinforced Coarse-to-Fine Video Understanding via Temporal Zoom-in