CVPR 2025 · Past · Large language models · Computer vision

CVPR 2025 Workshop Vision Language Models For All

VLMs4All 2025


Submission deadline: May 1, 2025, 12:00 UTC
Imported from OpenReview; check the website for extensions.
Submission portal: OpenReview
Notes: Topics were auto-suggested and may be imprecise.

Accepted papers (18)

Fetched from OpenReview (v2) on 2026-06-10.

Behind Maya: Building a Multilingual Vision Language Model
Nahid Alam, Karthik Reddy Kanjula, Surya Guthikonda, Timothy Chung, Bala Krishna S Vegesna, Abhipsha Das, Anthony Susevski, Ryan Sze-Yin Chan, S M Iftekhar Uddin, Shayekh Bin Islam, Roshan Santhosh, Snegha A, Drishti Sharma, Chen Cecilia Liu, Isha Chaturvedi, Genta Indra Winata, Ashvanth.S, Snehanshu Mukherjee, Alham Fikri Aji · PDF
Beyond Words: Exploring Cultural Value Sensitivity in Multimodal Models
Srishti Yadav, Zhi Zhang, Daniel Hershcovich, Ekaterina Shutova · PDF
Challenging Multimodal LLMs with African Standardized Exams: A Document VQA Evaluation
Victor Tolulope Olufemi, Oreoluwa Boluwatife Babatunde, Emmanuel Bolarinwa, Kausar Yetunde Moshood · PDF
Chitrakshara: A Large Multilingual Multimodal Dataset for Indian languages
Shaharukh Khan, Ali Faraz, Abhinav Ravi, Mohd Nauman, Mohd Sarfraz, Akshat Patidar, Raja Kolla, Chandra Khatri, Shubham Agarwal · PDF
CONCAP: Seeing Beyond English with Retrieval-Augmented Captioning
George Ibrahim, Rita Ramos, Yova Kementchedjhieva · PDF
Cultural Awareness in Vision-Language Models: A Cross-Country Exploration
Avinash Madasu, Vasudev Lal, Phillip Howard · PDF
Culturally-Aware Financial Fraud Detection Using Vision-Language Models
Huangqi Jiang · PDF
CultureShift: Mapping Temporal Cultural Evolution in Vision-Language Models
Gautam Jajoo, Harsh Deshpande, Hamna, Pranjal A Chitale · PDF
CultureVLM: Characterizing and Improving Cultural Understanding of Vision-Language Models for over 100 Countries
Shudong Liu, Yiqiao Jin, Cheng Li, Derek F. Wong, Qingsong Wen, Lichao Sun, Haipeng Chen, Xing Xie, Jindong Wang · PDF
Enhancing Cultural Awareness in Vision-Language Models: The Power of Multimodal Few-Shot Prompting
Pitikorn Khlaisamniang · PDF
Enhancing Vision-Language Models for Global Cultural Understanding through Semantic Expansion and Diversity Reranking
Zirui Hou, Xiangzhe Yin, Guangyu Gao · PDF
GeoDiv: Measuring Concept Diversity of Images Across Geographical Regions
Abhipsa Basu, Mohana Singh, Venkatesh Babu Radhakrishnan · PDF
JEEM: Vision-Language Understanding in Four Arabic Dialects
Karima Kadaoui, Hanin Atwany, Hamdan Al-Ali, Abdelrahman Mohamed, Ali Mekky, Sergei Tilga, Natalia Fedorova, Ekaterina Artemova, Hanan Aldarmaki, Yova Kementchedjhieva · PDF
Nayana: A Foundation for Document-Centric Vision-Language Models via Multi-Task, Multimodal, and Multilingual Data Synthesis
Adithya S Kolavi, Samarth P, Vyoman Jain · PDF
RoadSocial: A Diverse VideoQA Dataset and Benchmark for Road Event Understanding from Social Video Narratives
Chirag Parikh, Deepti Rawat, Rakshitha R. T., Tathagata Ghosh, Ravi Kiran Sarvadevabhatla · PDF
Synthetic Document Question Answering in Hungarian
Jonathan Lingjie Li, Zoltan Csaki, Nidhi Hiremath, Etash Kumar Guha, Fenglu Hong, King Chun Ma, Urmish Thakker · PDF
The use of multi-modal models and machine learning techniques to improve the efficiency and accuracy of geospatial data analysis
Matthew C Gaskins · PDF
Why do LLaVA Vision-Language Models Reply to Images in English?
Musashi Hinck, Carolin Holtermann, Matthew Lyle Olson, Florian Schneider, Sungduk Yu, Anahita Bhiwandiwalla, Anne Lauscher, Shao-Yen Tseng, Vasudev Lal · PDF
