NeurIPS 2024 · Past · Large language models · Computer vision

Workshop on Video-Language Models @ NeurIPS 2024

Video-Language Models

Submission deadline: Oct 11, 2024, 11:59 UTC (imported from OpenReview; check the website for extensions)
Submission portal: OpenReview
Notes: Auto-imported from the OpenReview venue record on 2026-06-10; please verify and enrich (topics are keyword-guessed).

Accepted papers (27)

Fetched from OpenReview (v2) on 2026-06-10.

  1. Can Video Large Language Models Comprehend Language in Videos?

    Minjoon Jung, Junbin Xiao, Byoung-Tak Zhang, Angela Yao · PDF
  2. CinePile: A Long Video Question Answering Dataset and Benchmark

    Ruchit Rawal, Khalid Saifullah, Ronen Basri, David Jacobs, Gowthami Somepalli, Tom Goldstein · PDF
  3. Click & Describe: Multimodal Grounding and Tracking for Aerial Objects

    Rupanjali Kukal, Jay Patravali, Fuxun Yu, Simranjit Singh, Nikolaos Karianakis, Rishi Madhok · PDF
  4. Dual-Model Distillation for Efficient Action Classification with Hybrid Edge-Cloud Solution

    Timothy Wei, Hsien Xin Peng, Elaine Xu, Bryan Zhao, Lei Ding, Diji Yang · PDF
  5. Eliciting In-Context Learning in Vision-Language Models for Videos Through Curated Data Distributional Properties

    Keunwoo Peter Yu, Zheyuan Zhang, Fengyuan Hu, Shane Storks, Joyce Chai · PDF
  6. Generative Timelines for Instructed Visual Assembly

    Alejandro Pardo, Jui-Hsien Wang, Bernard Ghanem, Josef Sivic, Bryan Russell, Fabian Caba Heilbron · PDF
  7. GUI-WORLD: A GUI-oriented Video Dataset for Multimodal LLM-based Agents

    Dongping Chen, Yue Huang, Siyuan Wu, Jingyu Tang, Huichi Zhou, Qihui Zhang, Zhigang He, Yilin Bai, Chujie Gao, Liuyi Chen, Yiqiang Li, Chenlong Wang, Yue Yu, Tianshuo Zhou, Zhen Li, Yi Gui, Yao Wan, Pan Zhou, Jianfeng Gao, Lichao Sun · PDF
  8. HiMemFormer: Hierarchical Memory-Aware Transformer for Multi-Agent Action Anticipation

    Zirui Wang, Xinran Zhao, Simon Stepputtis, Woojun Kim, Tongshuang Wu, Katia P. Sycara, Yaqi Xie · PDF
  9. IFCap: Image-like Retrieval and Frequency-based Entity Filtering for Zero-shot Captioning

    Soeun Lee, Si-Woo Kim, Taewhan Kim, Dong-Jin Kim · PDF
  10. In-Context Ensemble Learning from Pseudo Labels Improves Video-Language Models in Low-Level Workflow Understanding

    Moucheng Xu, Evangelos Chatzaroulas, Luc McCutcheon, Abdul Ahad, Hamzah Azeem, Janusz Marecki, Ammar Anwar · PDF
  11. Language Repository for Long Video Understanding

    Kumara Kahatapitiya, Kanchana Ranasinghe, Jongwoo Park, Michael S Ryoo · PDF
  12. LLAVIDAL: Benchmarking Large Language Vision Models for Daily Activities of Living

    Rajatsubhra Chakraborty, Arkaprava Sinha, Dominick Reilly, Manish Kumar Govind, Pu Wang, Francois Bremond, Srijan Das · PDF
  13. Matryoshka Multimodal Models

    Mu Cai, Jianwei Yang, Jianfeng Gao, Yong Jae Lee · PDF
  14. MMWorld: Towards Multi-discipline Multi-faceted World Model Evaluation in Videos

    Xuehai He, Weixi Feng, Kaizhi Zheng, Yujie Lu, Wanrong Zhu, Jiachen Li, Yue Fan, Jianfeng Wang, Linjie Li, Zhengyuan Yang, Kevin Lin, William Yang Wang, Lijuan Wang, Xin Eric Wang · PDF
  15. Mobile OS Task Procedure Extraction from YouTube

    Yunseok Jang, Yeda Song, Sungryull Sohn, Lajanugen Logeswaran, Tiange Luo, Honglak Lee · PDF
  16. MuMA-ToM: Multi-modal Multi-Agent Theory of Mind

    Haojun Shi, Suyu Ye, Xinyu Fang, Chuanyang Jin, Leyla Isik, Yen-Ling Kuo, Tianmin Shu · PDF
  17. Quo Vadis, Video Understanding with Vision-Language Foundation Models?

    Mahmoud ALI, Di Yang, Arkaprava Sinha, Dominick Reilly, Srijan Das, Gianpiero Francesca, Francois Bremond · PDF
  18. RACCooN: Remove, Add, and Change Video Content with Auto-Generated Narratives

    Jaehong Yoon, Shoubin Yu, Mohit Bansal · PDF
  19. Read, Watch and Scream! Sound Generation from Text and Video

    Yujin Jeong, Yunji Kim, Sanghyuk Chun, Jiyoung Lee · PDF
  20. TALC: Time-Aligned Captions for Multi-Scene Text-to-Video Generation

    Hritik Bansal, Yonatan Bitton, Michal Yarom, Idan Szpektor, Aditya Grover, Kai-Wei Chang · PDF
  21. Taskverse: A Benchmark Generation Engine for Multi-modal Language Model

    Jieyu Zhang, Weikai Huang, Zixian Ma, Oscar Michel, Dong He, Tanmay Gupta, Wei-Chiu Ma, Ali Farhadi, Aniruddha Kembhavi, Ranjay Krishna · PDF
  22. TemporalBench: Benchmarking Fine-grained Temporal Understanding for Multimodal Video Models

    Mu Cai, Reuben Tan, Jianrui Zhang, Bocheng Zou, Kai Zhang, Yao Feng, Fangrui Zhu, Jing Gu, Yiwu Zhong, Yuzhang Shang, Yao Dou, Jaden Park, Jianfeng Gao, Yong Jae Lee, Jianwei Yang · PDF
  23. Too Many Frames, Not All Useful: Efficient Strategies for Long-Form Video QA

    Jongwoo Park, Kanchana Ranasinghe, Kumara Kahatapitiya, Wonjeong Ryu, Donghyun Kim, Michael S Ryoo · PDF
  24. VIA: A Spatiotemporal Video Adaptation Framework for Global and Local Video Editing

    Jing Gu, Yuwei Fang, Ivan Skorokhodov, Peter Wonka, Xinya Du, Sergey Tulyakov, Xin Eric Wang · PDF
  25. VideoPhy: Evaluating Physical Commonsense for Video Generation

    Hritik Bansal, Zongyu Lin, Tianyi Xie, Zeshun Zong, Michal Yarom, Yonatan Bitton, Chenfanfu Jiang, Yizhou Sun, Kai-Wei Chang, Aditya Grover · PDF
  26. VURF: A General-purpose Reasoning and Self-refinement Framework for Video Understanding

    Ahmad Mahmood, Ashmal Vayani, Muzammal Naseer, Salman Khan, Fahad Shahbaz Khan · PDF
  27. Wolf: Captioning Everything with a World Summarization Framework

    Boyi Li, Ligeng Zhu, Ran Tian, Shuhan Tan, Yuxiao Chen, Yao Lu, Yin Cui, Sushant Veer, Max Ehrlich, Jonah Philion, Xinshuo Weng, Fuzhao Xue, Andrew Tao, Ming-Yu Liu, Sanja Fidler, Boris Ivanovic, Trevor Darrell, Jitendra Malik, Song Han, Marco Pavone · PDF