ICML 2024 · Past · Large language models · Robotics · Multimodal

Multi-modal Foundation Model meets Embodied AI Workshop @ ICML2024

MFM-EAI@ICML2024

Submission deadline: May 31, 2024, 23:59 UTC
imported from OpenReview — check the website for extensions
Submission portal: OpenReview
Notes: Topics were auto-suggested and may be imprecise.

Accepted papers (23)

Fetched from OpenReview (v2) on 2026-06-10.

An Embodied Generalist Agent in 3D World
Jiangyong Huang, Silong Yong, Xiaojian Ma, Xiongkun Linghu, Puhao Li, Yan Wang, Qing Li, Song-Chun Zhu, Baoxiong Jia, Siyuan Huang · PDF
BEDD: The MineRL BASALT Evaluation and Demonstrations Dataset for Training and Benchmarking Agents that Solve Fuzzy Tasks
Stephanie Milani, Anssi Kanervisto, Karolis Jucys, Sander V Schulhoff, Brandon Houghton, Rohin Shah · PDF
Behavior Generation with Latent Actions
Seungjae Lee, Yibin Wang, Haritheja Etukuru, H. Jin Kim, Nur Muhammad Mahi Shafiullah, Lerrel Pinto · PDF
DecisionNCE: Embodied Multimodal Representations via Implicit Preference Learning
Jianxiong Li, Jinliang Zheng, Yinan Zheng, Liyuan Mao, Xiao Hu, Sijie Cheng, Haoyi Niu, Jihao Liu, Yu Liu, Jingjing Liu, Ya-Qin Zhang, Xianyuan Zhan · PDF
DigiRL: Training In-The-Wild Device-Control Agents with Autonomous Reinforcement Learning
Hao Bai, Yifei Zhou, Mert Cemri, Jiayi Pan, Alane Suhr, Sergey Levine, Aviral Kumar · PDF
DPO-Finetuned Large Multi-Modal Planner with Retrieval-Augmented Generation @ EgoPlan Challenge ICML 2024
Kwanghyeon Lee, Mina Kang, Hyungho Na, HeeSun Bae, Byeonghu Na, Doyun Kwon, Seungjae Shin, Yeongmin Kim, Kim Taewoo, Seungmin Yun, Il-chul Moon · PDF
EPD: Long-term Memory Extraction, Context-aware Planning and Multi-iteration Decision @ EgoPlan Challenge ICML 2024
Letian Shi, Qi Lv, Xiang Deng, Liqiang Nie · PDF
GROOT-1.5: Learning to Follow Multi-Modal Instructions from Weak Supervision
Shaofei Cai, Bowei Zhang, Zihao Wang, Xiaojian Ma, Anji Liu, Yitao Liang · PDF
Hierarchical State Space Models for Continuous Sequence-to-Sequence Modeling
Raunaq Bhirangi, Chenyu Wang, Venkatesh Pattabiraman, Carmel Majidi, Abhinav Gupta, Tess Hellebrekers, Lerrel Pinto · PDF
Instruction-Guided Visual Masking
Jinliang Zheng, Jianxiong Li, Sijie Cheng, Yinan Zheng, Jiaming Li, Jihao Liu, Yu Liu, Jingjing Liu, Xianyuan Zhan · PDF
Jina CLIP: Your CLIP Model Is Also Your Text Retriever
Han Xiao, Georgios Mastrapas, Bo Wang · PDF
LEGENT: Open Platform for Embodied Agents
Zhili Cheng, Jinyi Hu, Zhitong Wang, Yuge Tu, Shengding Hu, An Liu, Pengkai Li, Lei Shi, Zhiyuan Liu, Maosong Sun · PDF
LLM3: Large Language Model-based Task and Motion Planning with Motion Failure Reasoning
Shu Wang, Muzhi Han, Ziyuan Jiao, Zeyu Zhang, Ying Nian Wu, Song-Chun Zhu, Hangxin Liu · PDF
Long-Horizon Planning for Multi-Agent Robots in Partially Observable Environments
Siddharth Nayak, Adelmo Morrison Orozco, Marina Ten Have, Jackson Zhang, Vittal Thirumalai, Darren Chen, Aditya Kapoor, Eric Robinson, Karthik Gopalakrishnan, James Harrison, Anuj Mahajan, Brian Ichter, Hamsa Balakrishnan · PDF
MAP-THOR: Benchmarking Long-Horizon Multi-Agent Planning Frameworks in Partially Observable Environments
Siddharth Nayak, Adelmo Morrison Orozco, Marina Ten Have, Vittal Thirumalai, Jackson Zhang, Darren Chen, Aditya Kapoor, Eric Robinson, Karthik Gopalakrishnan, Brian Ichter, James Harrison, Anuj Mahajan, Hamsa Balakrishnan · PDF
Multimodal foundation world models for generalist embodied agents
Pietro Mazzaglia, Tim Verbelen, Bart Dhoedt, Aaron Courville, Sai Rajeswar · PDF
OmniJARVIS: Unified Vision-Language-Action Tokenization Enables Open-World Instruction Following Agents
Zihao Wang, Shaofei Cai, Zhancun Mu, Haowei Lin, Ceyao Zhang, Xuejie Liu, Qing Li, Anji Liu, Xiaojian Ma, Yitao Liang · PDF
RISE: 3D Perception Makes Real-World Robot Imitation Simple and Effective
Chenxi Wang, Hongjie Fang, Hao-Shu Fang, Cewu Lu · PDF
RoboGolf: Mastering Real-World Minigolf with a Reflective Multi-Modality Vision-Language Model
Hantao Zhou, Tianying Ji, Lukas Sommerhalder, Michael Görner, Norman Hendrich, Fuchun Sun, Jianwei Zhang, Huazhe Xu · PDF
STREAM: Embodied Reasoning through Code Generation
Daniil Cherniavskii, Phillip Lippe, Andrii Zadaianchuk, Efstratios Gavves · PDF
The Embodied World Model Based on LLM with Visual Information and Prediction-Oriented Prompts
Wakana Haijima, Kou Nakakubo, Masahiro Suzuki, Yutaka Matsuo · PDF
Vision-Language Models Provide Promptable Representations for Reinforcement Learning
William Chen, Oier Mees, Aviral Kumar, Sergey Levine · PDF
What can VLMs Do for Zero-shot Embodied Task Planning?
Xian Fu, Min Zhang, Jianye Hao, Peilong Han, Hao Zhang, Lei Shi, Hongyao Tang · PDF
