From Local Concepts to Universals: Evaluating the Multicultural Understanding of Vision-Language Models

Mehar Bhatia, Sahithya Ravi^*, Aditya Chinchure^*, Eunjeong Hwang, Vered Shwartz

University of British Columbia, Vector Institute of AI
Published at EMNLP'24 Main
^*Indicates Equal Contribution

Paper Code Data (Task1 Retrieval on HF) Data (Task2 Grounding on HF)

An example instance from each task in GlobalRG benchmark: (i) Retrieval Across Universals measures the ability of VLMs to retrieve culturally diverse images for a query q. (ii) Cultural Visual Grounding aims to evaluate the ability of VLMs to identify a cultural concept q.

Research Questions Discussed

Are able to (TASK 1) retrieve relevant and culturally diverse images for universals AND (TASK2) ground culture-specific local concepts?
Do VLMs exhibit biases towards images from specific cultures?
What are the challenges faced by VLMs in achieving (Task 1) high cultural diversity AND (TASK 2) grounding culture-specific concepts??

Task 1: Retrieval Across Universals

Task Definition

Let \( Q = \{q_1, q_2, \ldots, q_n\} \) be a set of textual queries representing universal concepts, and \( I = \{I_1, I_2, \ldots, I_m\} \) the set of images from different cultures. Given a query \( q \in Q \), the goal is to retrieve a ranked list of images \( R(q, I) = \{I_{r1}, I_{r2}, \ldots, I_{rk} \} \subset I \) that maximizes both relevance and cultural diversity.

Relevance: \( \text{Rel}(q, I) \) refers to how well the image \( I \) matches the query \( q \) captured by the standard precision@k.
Diversity: \( \text{Div}(R(q, I)) \) measures the cultural diversity of the retrieved images using the formula,

\[ \text{diversity@k} = - \frac{1}{\log m} \sum_{i=1}^{m} p_i \log(p_i) \]

where \( p_i \) is the proportion of images from the \( i \)-th culture in the top \( k \) retrieved images \( R(q) \),
and \( m \) is the total number of cultures in the top \( k \).

A high entropy value (∼ 100) indicates high diversity, retrieved images are well-distributed across different cultures.
Conversely, a low entropy value (∼ 0) indicates low diversity, retrieved images are biased towards specific cultures.

Dataset Statistics

List of 50 cultures covered in the retrieval task.

Human universals used as textual queries in 'Retrieval across Universals' task.

Model Performance

Average performance of various VLMs on the Retrieval across universals task, in terms of Relevance and Diversity.

Top-5 images retrieved for a sample of universals by models CLIP, CoCA and BLIP-2.

Third image description.

Task 2: Cultural Visual Grounding

From Local Concepts to Universals: Evaluating the Multicultural Understanding of Vision-Language Models

Abstract

An example instance from each task in GlobalRG benchmark: (i) Retrieval Across Universals measures the ability of VLMs to retrieve culturally diverse images for a query q. (ii) Cultural Visual Grounding aims to evaluate the ability of VLMs to identify a cultural concept q.

Research Questions Discussed

Task 1: Retrieval Across Universals

Task Definition

Dataset Statistics

List of 50 cultures covered in the retrieval task.

Human universals used as textual queries in 'Retrieval across Universals' task.

Model Performance

Average performance of various VLMs on the Retrieval across universals task, in terms of Relevance and Diversity.

Top-5 images retrieved for a sample of universals by models CLIP, CoCA and BLIP-2.

Third image description.

Task 2: Cultural Visual Grounding

Dataset Statistics

Detailed statistics of annotated images across different cultural groups and regions for Cultural Visual Grounding task.

Model Performance

Country-level Accuracy of each model on Cultural Visual Grounding task.

Country group-level Accuracy of eachmodel on Cultural Visual Grounding task.

Examples

Citation