NeurIPS 2024 · Past · Large language models · Computer vision
Workshop on Video-Language Models @ NeurIPS 2024
Video-Language Models
- Submission deadline
- Oct 11, 2024, 11:59 UTC (imported from OpenReview; check the website for extensions)
- Submission portal
- OpenReview
- Notes
- Auto-imported from the OpenReview venue record on 2026-06-10 — please verify and enrich (topics are keyword-guessed).
Accepted papers (27)
Fetched from OpenReview (v2) on 2026-06-10.
- Can Video Large Language Models Comprehend Language in Videos?
- CinePile: A Long Video Question Answering Dataset and Benchmark
- Click & Describe: Multimodal Grounding and Tracking for Aerial Objects
- Dual-Model Distillation for Efficient Action Classification with Hybrid Edge-Cloud Solution
- Eliciting In-Context Learning in Vision-Language Models for Videos Through Curated Data Distributional Properties
- Generative Timelines for Instructed Visual Assembly
- GUI-WORLD: A GUI-oriented Video Dataset for Multimodal LLM-based Agents
- HiMemFormer: Hierarchical Memory-Aware Transformer for Multi-Agent Action Anticipation
- IFCap: Image-like Retrieval and Frequency-based Entity Filtering for Zero-shot Captioning
- In-Context Ensemble Learning from Pseudo Labels Improves Video-Language Models in Low-Level Workflow Understanding
- Language Repository for Long Video Understanding
- LLAVIDAL: Benchmarking Large Language Vision Models for Daily Activities of Living
- Matryoshka Multimodal Models
- MMWorld: Towards Multi-discipline Multi-faceted World Model Evaluation in Videos
- Mobile OS Task Procedure Extraction from YouTube
- MuMA-ToM: Multi-modal Multi-Agent Theory of Mind
- Quo Vadis, Video Understanding with Vision-Language Foundation Models?
- RACCooN: Remove, Add, and Change Video Content with Auto-Generated Narratives
- Read, Watch and Scream! Sound Generation from Text and Video
- TALC: Time-Aligned Captions for Multi-Scene Text-to-Video Generation
- Taskverse: A Benchmark Generation Engine for Multi-modal Language Model
- TemporalBench: Benchmarking Fine-grained Temporal Understanding for Multimodal Video Models
- Too Many Frames, Not All Useful: Efficient Strategies for Long-Form Video QA
- VIA: A Spatiotemporal Video Adaptation Framework for Global and Local Video Editing
- VideoPhy: Evaluating Physical Commonsense for Video Generation
- VURF: A General-purpose Reasoning and Self-refinement Framework for Video Understanding
- Wolf: Captioning Everything with a World Summarization Framework