NeurIPS 2025 | Past | Robotics | Computer vision
NeurIPS 2025 Workshop on Space in Vision, Language, and Embodied AI
SpaVLE
- Submission deadline: Sep 3, 2025, 11:59 UTC (imported from OpenReview; check the website for extensions)
- Submission portal: OpenReview
- Notes: Auto-imported from the OpenReview venue record on 2026-06-10; please verify and enrich (topics are keyword-guessed).
Accepted papers (56)
Fetched from OpenReview (v2) on 2026-06-10.
- An Emergent Symbolic Representation of Space as a Bridge Between Language and Reinforcement Learning in Continuous Environments
- Avi: A 3D Vision-Language Action Model Architecture generating Action from Volumetric Inference
- BEAST: Efficient Tokenization of B-Splines Encoded Action Sequences for Imitation Learning
- Breaking Down and Building Up: Mixture of Skill-Based Vision-and-Language Navigation Agents
- Bridging Embodiment Gaps: Deploying Vision-Language-Action Models on Soft Robots
- COREVQA: Spatial Reasoning and Multi-Step Visual Entailment in Crowded Environments
- DenseScan: Advancing 3D Scene Understanding with 2D Dense Annotation
- Evaluation of Vision-LLMs in Surveillance Video
- Every Camera Effect, Every Time, All at Once: 4D Gaussian Ray Tracing for Physics-based Camera Effect Data Generation
- FINDINGDORY: A Benchmark to Evaluate Memory in Embodied Agents
- Flow Equivariant World Models: Structured Dynamics Outside the Field of View
- FoR-SALE: Frame of Reference-guided Spatial Adjustment in LLM-based Diffusion Editing
- From Static Domain Adaptation to State-Adaptive Perception in Embodied Agents
- GeoGrid-Bench: Can Foundation Models Understand Multimodal Gridded Geo-Spatial Data?
- Grounding Foundational Vision Models with 3D Human Poses for Robust Action Recognition
- Hierarchical Equivariant Policy via Frame Transfer
- Hierarchical Object-Oriented POMDP Planning for Object Rearrangement
- I Know Kung Fu: Synthetic Dexterous Hand Demonstration Collection via VR Teleoperation
- Improving Vision-and-Language Navigation with Explicit Sub-Instruction Alignment
- LayerFusion: Harmonized Multi-Layer Text-to-Image Generation with Generative Priors
- LayoutAgent: A Vision-Language Agent Guided Compositional Diffusion for Spatial Layout Planning
- Learning Dynamics of Multitask Training Data in Vision Language Models
- Look and Tell: A Dataset for Multimodal Grounding Across Egocentric and Exocentric Views
- Maestro: Orchestrating Robotics Modules with Vision-Language Models for Zero-Shot Generalist Robots
- Measuring Physical-World Privacy Awareness of Large Language Models: An Evaluation Benchmark
- MetaVLA: Unified Meta Co-Training for Efficient Embodied Adaptation
- Motion as Language: Towards a Situation–Motion Language for Spatio-Temporal Learning
- NePTune: A Neuro-Pythonic Framework for Tunable Compositional Reasoning on Vision-Language
- NinA: Normalizing Flows in Action. Training VLA Models with Normalizing Flows
- Object-Centric Agentic Robot Policies
- Probing the Limits of Embodied Spatial Planning in LLMs
- Rethinking the Simulation vs. Rendering Dichotomy: No Free Lunch in Spatial World Modelling
- Revisiting Depth Representations for Feed-Forward 3D Gaussian Splatting
- RIV-CoT: Retrieval-Based Interleaved Visual Chain-of-Thought for Multimodal Reasoning
- RoboMemory: A Brain-inspired Multi-memory Agentic Framework for Lifelong Learning in Physical Embodied Systems
- ROSE: Reconstructing Objects, Scenes, and Trajectories from Casual Videos for Robotic Manipulation
- See it. Say it. Sorted: Agentic System for Compositional Diagram Generation
- Seeing Beyond the Scene: Analyzing and Mitigating Background Bias in Action Recognition
- Self-Augmented Learning of Differentiable Object Models for Compositional Interpretation of Complex Scenes
- SITCOM: Scaling Inference-Time COMpute for VLAs
- Spatial Reasoning in Foundation Models: Benchmarking Object-Centric Spatial Understanding
- SpatialBoost: Enhancing Visual Representation through Language-Guided Reasoning
- SpatialThinker: Reinforcing 3D Reasoning in Multimodal LLMs via Spatial Rewards
- Spatio-Temporal Grounding of Large Language Models from Perception Streams
- SpinBench: Perspective and Rotation as a Lens on Spatial Reasoning in VLMs
- TC-LoRA: Temporally Modulated Conditional LoRA for Adaptive Diffusion Control
- Think, Remember, Navigate: Zero-Shot Object-Goal Navigation with VLM-Powered Reasoning
- TimeRewarder: Learning Dense Reward from Passive Videos via Frame-wise Temporal Distance
- Towards Understanding Multimodal Fine-Tuning: A Case Study into Spatial Features
- TriFusion-AE: Language-Guided Depth and LiDAR Fusion for Robust Point Cloud Processing
- VFSI: Validity First Spatial Intelligence for Constraint-Guided Traffic Diffusion
- Viewpoint-Invariant Latent Action Learning from Human Video Demonstrations
- VIKI-R: Coordinating Embodied Multi-Agent Cooperation via Reinforcement Learning
- ViPRA: Video Prediction for Robot Actions
- Weakly-supervised Latent Models for Task-specific Visual-Language Control
- Wholly Unsupervised! Segmenting Objects by Contrast and Context