ICML 2024 | Past | Large language models, Robotics, Multimodal
Multi-modal Foundation Model meets Embodied AI Workshop @ ICML2024
MFM-EAI@ICML2024
- Submission deadline: May 31, 2024, 23:59 UTC (imported from OpenReview; check the website for extensions)
- Submission portal: OpenReview
- Notes: Auto-imported from the OpenReview venue record on 2026-06-10; please verify and enrich (topics are keyword-guessed).
Accepted papers (23)
Fetched from OpenReview (v2) on 2026-06-10.
- An Embodied Generalist Agent in 3D World
- BEDD: The MineRL BASALT Evaluation and Demonstrations Dataset for Training and Benchmarking Agents that Solve Fuzzy Tasks
- Behavior Generation with Latent Actions
- DecisionNCE: Embodied Multimodal Representations via Implicit Preference Learning
- DigiRL: Training In-The-Wild Device-Control Agents with Autonomous Reinforcement Learning
- DPO-Finetuned Large Multi-Modal Planner with Retrieval-Augmented Generation @ EgoPlan Challenge ICML 2024
- EPD: Long-term Memory Extraction, Context-aware Planning and Multi-iteration Decision @ EgoPlan Challenge ICML 2024
- GROOT-1.5: Learning to Follow Multi-Modal Instructions from Weak Supervision
- Hierarchical State Space Models for Continuous Sequence-to-Sequence Modeling
- Instruction-Guided Visual Masking
- Jina CLIP: Your CLIP Model Is Also Your Text Retriever
- LEGENT: Open Platform for Embodied Agents
- LLM3: Large Language Model-based Task and Motion Planning with Motion Failure Reasoning
- Long-Horizon Planning for Multi-Agent Robots in Partially Observable Environments
- MAP-THOR: Benchmarking Long-Horizon Multi-Agent Planning Frameworks in Partially Observable Environments
- Multimodal foundation world models for generalist embodied agents
- OmniJARVIS: Unified Vision-Language-Action Tokenization Enables Open-World Instruction Following Agents
- RISE: 3D Perception Makes Real-World Robot Imitation Simple and Effective
- RoboGolf: Mastering Real-World Minigolf with a Reflective Multi-Modality Vision-Language Model
- STREAM: Embodied Reasoning through Code Generation
- The Embodied World Model Based on LLM with Visual Information and Prediction-Oriented Prompts
- Vision-Language Models Provide Promptable Representations for Reinforcement Learning
- What can VLMs Do for Zero-shot Embodied Task Planning?