No Filter: Cultural and Socioeconomic Diversity in Contrastive Vision-Language Models

Source

Raw Markdown: paper_no-filter-cultural-socioeconomic-diversity-2024.md
PDF: paper_no-filter-cultural-socioeconomic-diversity-2024.pdf
Preprint: arXiv 2405.13777
OpenReview: NeurIPS 2024 poster

No dedicated official code repository was found during ingestion. The paper reports that models were developed in the Google Research big_vision codebase and trained on Google Cloud TPUs.

Status And Credibility

Submitted to arXiv on 2024-05-22, updated as v3 on 2024-10-23, and published as a NeurIPS 2024 poster. This is credible peer-reviewed evidence from an ETH Zürich / Google DeepMind team: Angéline Pouget, Lucas Beyer, Emanuele Bugliarello, Xiao Wang, Andreas Peter Steiner, Xiaohua Zhai, and Ibrahim Alabdulmohsin. Although the paper is older than one year relative to this wiki update, it remains directly relevant because it is a tier-1 venue result about data filtering, benchmark mismatch, and distributional diversity in contrastive vision-language model pretraining.

Core Claim

Filtering image-text pretraining data to English-only pairs improves performance on popular Western-oriented benchmarks such as ImageNet and COCO, but it degrades model performance on culturally and socioeconomically diverse data. The paper argues that contrastive VLMs should pretrain on global, minimally filtered data, then use English-only fine-tuning or data mixing if standard Western-centric benchmark performance is also required.

The key lesson is not that every filter is bad. It is that a filter optimized for convenience or popular aggregate benchmarks can remove the distributional support needed for underrepresented regions, languages, income levels, objects, and landmarks.

Key Contributions

Demonstrates that English-only image-text filtering hurts cultural understanding and disproportionately affects lower-socioeconomic-status communities.
Shows that this damage can be invisible, or even inverted, under popular ImageNet and COCO metrics.
Introduces few-shot geo-localization as an evaluation task for cultural diversity in VLM image representations.
Separates cultural diversity from multilinguality by comparing raw multilingual global data (globe), English-only data (en), and global images with captions machine-translated to English (globe-tl).
Shows that pretraining on global data followed by short English-only fine-tuning gives a better trade-off than English-only pretraining: it keeps more of the cultural-diversity gains while recovering standard benchmark performance.
Tests quality-filtered variants and reports that the core conclusion remains: English-only filtering still harms cultural diversity even under quality filtering.
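
The few-shot geo-localization idea listed above can be illustrated with a small probe over frozen image embeddings. This is a minimal sketch, not the paper's protocol: it swaps in a nearest-centroid classifier for whatever few-shot probe the authors used, and the names `fit_centroids`, `predict_country`, and `geo_accuracy` are hypothetical. The embeddings are assumed to come from a frozen image encoder.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def fit_centroids(support: List[Tuple[Vector, str]]) -> Dict[str, List[float]]:
    """Average the few labeled embeddings available for each country."""
    sums: Dict[str, List[float]] = {}
    counts: Dict[str, int] = defaultdict(int)
    for emb, country in support:
        if country not in sums:
            sums[country] = [0.0] * len(emb)
        sums[country] = [s + x for s, x in zip(sums[country], emb)]
        counts[country] += 1
    return {c: [s / counts[c] for s in vec] for c, vec in sums.items()}


def predict_country(centroids: Dict[str, List[float]], emb: Vector) -> str:
    """Return the country whose centroid is closest in squared L2 distance."""
    def sq_dist(a: Vector, b: Vector) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda c: sq_dist(centroids[c], emb))


def geo_accuracy(centroids: Dict[str, List[float]],
                 queries: List[Tuple[Vector, str]]) -> float:
    """Fraction of query embeddings assigned to their true country."""
    correct = sum(predict_country(centroids, e) == c for e, c in queries)
    return correct / len(queries)


# Toy 2-D embeddings for two made-up countries "A" and "B".
support = [([1.0, 0.0], "A"), ([0.9, 0.1], "A"),
           ([0.0, 1.0], "B"), ([0.1, 0.9], "B")]
queries = [([0.8, 0.2], "A"), ([0.2, 0.8], "B")]
centroids = fit_centroids(support)
accuracy = geo_accuracy(centroids, queries)
```

Because the encoder stays frozen, a probe of this kind measures what the representation already encodes about location rather than what a classifier can learn from scratch.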

Method Notes

The paper studies SigLIP-style contrastive VLMs trained on WebLI-based data variants:

globe: raw multilingual global data with minimal filtering, such as sensitive and personally identifiable information removal.
en: the English-caption subset, mirroring common filtering in CLIP/ALIGN/SigLIP-like pipelines and LAION/DataComp usage.
globe-tl: the same images as globe, with non-English captions machine-translated to English.
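
The relationship between the three variants can be sketched as follows. This is illustrative only: the `Pair` record, its field names, and the caller-supplied `translate` function are assumptions made for the example, not the WebLI schema or the paper's pipeline.

```python
from dataclasses import dataclass, replace
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Pair:
    image_id: str  # hypothetical field names, not the WebLI schema
    caption: str
    lang: str      # language code of the caption, e.g. "en" or "hi"


def make_globe(pairs: Iterable[Pair]) -> List[Pair]:
    """globe: keep every pair, whatever the caption language."""
    return list(pairs)


def make_en(pairs: Iterable[Pair]) -> List[Pair]:
    """en: keep only pairs with English captions (images are dropped too)."""
    return [p for p in pairs if p.lang == "en"]


def make_globe_tl(pairs: Iterable[Pair],
                  translate: Callable[[str, str], str]) -> List[Pair]:
    """globe-tl: same images as globe; non-English captions are translated."""
    return [
        p if p.lang == "en"
        else replace(p, caption=translate(p.caption, p.lang), lang="en")
        for p in pairs
    ]


# Stand-in translator for the example; a real system would call an MT model.
def fake_translate(text: str, lang: str) -> str:
    return f"[{lang}->en] {text}"


pairs = [Pair("img-1", "a dog", "en"), Pair("img-2", "ek kutta", "hi")]
globe = make_globe(pairs)
en = make_en(pairs)
globe_tl = make_globe_tl(pairs, fake_translate)
```

The point of the third variant is visible in the image sets: `en` loses `img-2` entirely, while `globe_tl` keeps the same images as `globe` and changes only the caption language.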

Models are ViT-B/16 image encoders plus Transformer text encoders trained with a SigLIP sigmoid contrastive loss. Each model sees 10B image-text pairs, with three random seeds per data variant. Evaluation covers culturally diverse datasets such as Dollar Street, Google Landmarks Dataset v2, GeoDE, MaRVL, and XM3600, plus popular Western-oriented benchmarks ImageNet and COCO.
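
For readers unfamiliar with the sigmoid contrastive loss mentioned above, here is a minimal pure-Python sketch. It follows the published SigLIP formulation as understood here (each image-text pair in the batch is an independent binary classification, positive on the diagonal and negative elsewhere). The `temperature` and `bias` defaults are illustrative assumptions; in practice both are learnable parameters, and this is not the big_vision implementation.

```python
import math
from typing import List, Sequence


def _normalize(v: Sequence[float]) -> List[float]:
    """Scale a non-zero vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def _log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def sigmoid_contrastive_loss(img: Sequence[Sequence[float]],
                             txt: Sequence[Sequence[float]],
                             temperature: float = 10.0,
                             bias: float = -10.0) -> float:
    """Sum of pairwise binary log-losses over the batch, divided by batch size.

    img[i] and txt[i] are assumed to be a matching pair; all other
    combinations in the batch are treated as negatives.
    """
    img_n = [_normalize(v) for v in img]
    txt_n = [_normalize(v) for v in txt]
    n = len(img_n)
    total = 0.0
    for i in range(n):
        for j in range(n):
            sim = sum(a * b for a, b in zip(img_n[i], txt_n[j]))
            logit = temperature * sim + bias
            label = 1.0 if i == j else -1.0
            total += _log_sigmoid(label * logit)
    return -total / n


# Toy batch: matched pairs should score a lower loss than mismatched ones.
images = [[1.0, 0.0], [0.0, 1.0]]
loss_aligned = sigmoid_contrastive_loss(images, [[1.0, 0.0], [0.0, 1.0]])
loss_swapped = sigmoid_contrastive_loss(images, [[0.0, 1.0], [1.0, 0.0]])
```

Unlike a softmax contrastive loss, no normalization across the batch is needed, which is why each pair can be scored independently in the double loop.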

flowchart LR
  WebLI["WebLI global image-text pairs"] --> Globe["globe: multilingual/global"]
  WebLI --> En["en: English-only filter"]
  WebLI --> TL["globe-tl: global images + English translations"]
  Globe --> Train["SigLIP pretraining"]
  En --> Train
  TL --> Train
  Train --> Western["ImageNet / COCO"]
  Train --> Diverse["Dollar Street / GLDv2 / GeoDE / MaRVL / geo-localization"]

Evidence And Results

The paper reports a consistent trade-off. English-only models perform best on prevalent Western-oriented benchmarks in the main table: for example, en is higher than globe and globe-tl on ImageNet zero-shot and COCO retrieval. But this comes at a cost: globally trained or translated-global models perform better on culturally diverse zero-shot evaluations and geo-localization.

Examples of the reported pattern:

Dollar Street, GLDv2, GeoDE, and MaRVL improve when moving away from English-only pretraining, while ImageNet/COCO often decline.
Dollar Street disaggregation shows a larger performance gap for lower-income households under English-only filtering.
Few-shot geo-localization strongly favors globe or globe-tl, suggesting that English-only pretraining loses country- and region-specific image representation information.
XM3600 retrieval is not sufficient as a cultural-diversity metric because it can miss the geographically and culturally specific visual distribution shifts that geo-localization captures.
Fine-tuning globe-tl on en data is a better trade-off than the reverse: pretraining on global data preserves cultural representation better, and later English fine-tuning can recover standard benchmark performance.
Data mixing can also navigate the trade-off, but it requires training from scratch and is more expensive than short fine-tuning in their setup.
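
The disaggregation step behind the Dollar Street finding above can be sketched as a per-group accuracy breakdown. The group labels and the toy records below are made up for illustration and are not the paper's data or results; `accuracy_by_group` and `worst_group_gap` are hypothetical helper names.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def accuracy_by_group(records: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Accuracy per group from (group_label, prediction_was_correct) records."""
    hits: Dict[str, int] = defaultdict(int)
    totals: Dict[str, int] = defaultdict(int)
    for group, ok in records:
        totals[group] += 1
        hits[group] += int(ok)
    return {g: hits[g] / totals[g] for g in totals}


def aggregate_accuracy(records: List[Tuple[str, bool]]) -> float:
    """Overall accuracy, ignoring group membership."""
    return sum(int(ok) for _, ok in records) / len(records)


def worst_group_gap(per_group: Dict[str, float]) -> float:
    """Difference between the best and worst group accuracy."""
    return max(per_group.values()) - min(per_group.values())


# Toy example: a large, well-served group hides a small, poorly served one.
records = ([("high_income", True)] * 9 + [("high_income", False)]
           + [("low_income", True)] + [("low_income", False)])
overall = aggregate_accuracy(records)
per_group = accuracy_by_group(records)
gap = worst_group_gap(per_group)
```

In this toy case the aggregate number sits above 0.8 while the smaller group is at 0.5, which is the kind of mismatch an aggregate-only benchmark cannot reveal.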

Relation To Dynamic Curriculum Learning

This source is a benchmark-hygiene and distribution-support warning for Dynamic Curriculum Learning For JEPA.

The dynamic-curriculum idea often starts from a corpus dominated by repetitive normal behavior. That makes filtering or reweighting attractive. This paper warns that the filter target matters: if the filter uses a narrow proxy, it can improve a familiar aggregate metric while removing exactly the long-tail distribution needed for robust representations.

The time-series analogue is direct:

English-only filtering corresponds to keeping only the easiest or most benchmark-aligned slice of a corpus.
Cultural and socioeconomic coverage corresponds to rare regimes, tail tenants, unusual devices, low-frequency faults, regional seasonality, and underrepresented operating modes.
ImageNet/COCO mismatch corresponds to average forecast error or aggregate anomaly scores that can look good while tail-state understanding degrades.
Geo-localization corresponds to per-regime probes that ask whether the representation still encodes where a sample comes from in the underlying distribution.

For dynamic curriculum learning, this suggests a stricter contract:

filter repeated or corrupt windows,
but preserve distributional diversity and tail support.

The curriculum should therefore include uniform floors, per-regime quotas or soft constraints, embedding-coverage diagnostics, tail-slice probes, and normal-retention checks. Filtering should be evaluated by whether it preserves useful state diversity, not only by whether it improves average loss faster.
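
One way to make the uniform floor and per-regime quota concrete is sketched below. This is a minimal illustration under stated assumptions: curriculum scores are non-negative, regimes are given as per-sample labels, and the function names and the `floor` and `min_mass` parameters are hypothetical rather than taken from any existing curriculum implementation.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def floor_mixed_probs(scores: Sequence[float], floor: float = 0.1) -> List[float]:
    """Mix score-proportional sampling with a uniform distribution.

    Every sample keeps probability of at least floor / N, so a curriculum
    that scores a window at zero can down-weight it but never remove it.
    """
    if not scores:
        raise ValueError("scores must be non-empty")
    if not 0.0 <= floor <= 1.0:
        raise ValueError("floor must lie in [0, 1]")
    n = len(scores)
    total = sum(scores)
    if total <= 0:
        return [1.0 / n] * n  # no usable signal: fall back to uniform
    return [(1.0 - floor) * s / total + floor / n for s in scores]


def regime_mass(probs: Sequence[float], regimes: Sequence[str]) -> Dict[str, float]:
    """Total sampling probability assigned to each regime."""
    mass: Dict[str, float] = defaultdict(float)
    for p, r in zip(probs, regimes):
        mass[r] += p
    return dict(mass)


def under_quota(probs: Sequence[float], regimes: Sequence[str],
                min_mass: float) -> List[str]:
    """Regimes whose total sampling mass falls below the quota."""
    return sorted(r for r, m in regime_mass(probs, regimes).items() if m < min_mass)


# Toy example: the curriculum scores rare "fault" windows at zero.
scores = [1.0, 1.0, 0.0, 0.0]
regimes = ["normal", "normal", "fault", "fault"]
probs = floor_mixed_probs(scores, floor=0.2)
flagged = under_quota(probs, regimes, min_mass=0.15)
```

The floor guarantees support for every window, while the quota check acts as a diagnostic: here the fault regime keeps non-zero mass but is still flagged as under-sampled, which would prompt a reweighting rather than a silent loss of the tail.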

Foundation TSFM Relevance

Agenda slot	Verdict	Evidence	Missing pieces
Data diversity, curriculum, and long tail	warning	Shows that a common filter can improve standard benchmarks while hurting underrepresented distribution slices.	Need time-series curricula that report per-regime and tail-tenant metrics, not only aggregate forecasting or anomaly scores.
Benchmark level	warning	Demonstrates that ImageNet/COCO can be at odds with cultural-diversity metrics; XM3600 retrieval can miss cultural visual diversity.	Need TSFM benchmark suites with region/regime/tail slices and representation probes analogous to geo-localization.
Representation quality	adjacent	Few-shot geo-localization shows that image representations trained on global data preserve country/region information better than English-filtered ones.	Need analogous probes for latent state: regime, topology, tenant, device type, intervention phase, and rare-event context.
Data mixture and adaptation	adjacent	Global pretraining followed by English fine-tuning gives a better trade-off than English pretraining followed by global fine-tuning.	Need TSFM tests of broad/diverse pretraining followed by domain-specific fine-tuning versus narrow pretraining followed by broad adaptation.

Links Into The Wiki

Limitations And Gotchas

The source is about contrastive VLMs, not language-only LLMs, multivariate time series, event streams, JEPA, or action-conditioned world models.
The training data is WebLI-derived and not fully public, so reproduction outside Google DeepMind may be limited.
The evidence is primarily SigLIP-style encoder-only contrastive training, not generative VLMs.
Cultural diversity is approximated through region, country, language, landmark, object, and income-slice evaluations; it is not a complete definition of culture.
The paper argues against English-only filtering and benchmark-narrow filtering, not against privacy, safety, PII, deduplication, or quality filters that preserve global support.
Translation can preserve some global coverage while improving English-prompt usability, but it can also remove culturally meaningful linguistic context.

Open Questions

What is the time-series equivalent of geo-localization: tenant identification, topology region, device class, regime probe, intervention phase, or environment metadata prediction?
Can dynamic curriculum learning preserve tail-regime representations while still reducing repeated normal windows?
Which benchmark slices reveal distributional damage that average forecast error hides?
Should time-series pretraining start broad and diverse, then fine-tune on target domains, rather than pretraining narrowly and hoping later broad adaptation recovers missing regimes?
How should quality filters distinguish harmful samples from culturally, regionally, or operationally rare samples?
Can translation-like normalization in time series, such as unit conversion, channel metadata normalization, or topology remapping, preserve diversity without erasing local context?
