No Filter: Cultural and Socioeconomic Diversity in Contrastive Vision-Language Models
Source
- Raw Markdown: paper_no-filter-cultural-socioeconomic-diversity-2024.md
- PDF: paper_no-filter-cultural-socioeconomic-diversity-2024.pdf
- Preprint: arXiv 2405.13777
- OpenReview: NeurIPS 2024 poster
No dedicated official code repository was found during ingestion. The paper reports that models were developed in the Google Research big_vision codebase and trained on Google Cloud TPUs.
Status And Credibility
Submitted to arXiv on 2024-05-22, updated to v3 on 2024-10-23, and published as a NeurIPS 2024 poster. This is credible peer-reviewed evidence from an ETH Zürich / Google DeepMind team: Angéline Pouget, Lucas Beyer, Emanuele Bugliarello, Xiao Wang, Andreas Peter Steiner, Xiaohua Zhai, and Ibrahim Alabdulmohsin. Although the paper was more than a year old at the time of this wiki update, it remains directly relevant: it is a tier-1 venue result about data filtering, benchmark mismatch, and distributional diversity in contrastive vision-language model pretraining.
Core Claim
Filtering image-text pretraining data to English-only pairs improves popular Western-oriented benchmarks such as ImageNet and COCO, but it damages cultural and socioeconomic diversity. The paper argues that contrastive VLMs should pretrain on global, minimally filtered data, then use English-only fine-tuning or data mixing if standard Western-centric benchmark performance is also required.
The key lesson is not that every filter is bad. It is that a filter optimized for convenience or popular aggregate benchmarks can remove the distributional support needed for underrepresented regions, languages, income levels, objects, and landmarks.
Key Contributions
- Demonstrates that English-only image-text filtering hurts cultural understanding and disproportionately affects lower-socioeconomic-status communities.
- Shows that this damage can be invisible, or even inverted, under popular ImageNet and COCO metrics.
- Introduces few-shot geo-localization as an evaluation task for cultural diversity in VLM image representations.
- Separates cultural diversity from multilinguality by comparing raw multilingual global data (globe), English-only data (en), and global images with captions machine-translated to English (globe-tl).
- Shows that pretraining on global data followed by short English-only fine-tuning can improve cultural metrics without sacrificing standard benchmarks as much as English-only pretraining.
- Tests quality-filtered variants and reports that the core conclusion remains: English-only filtering still harms cultural diversity even under quality filtering.
Method Notes
The paper studies SigLIP-style contrastive VLMs trained on WebLI-based data variants:
- globe: raw multilingual global data with minimal filtering, such as sensitive and personally identifiable information removal.
- en: the English-caption subset, mirroring common filtering in CLIP/ALIGN/SigLIP-like pipelines and LAION/DataComp usage.
- globe-tl: the same images as globe, with non-English captions machine-translated to English.
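The relationship between the three variants can be sketched in a few lines. This is a toy illustration, not the WebLI pipeline: the `Pair` record, its `lang` tag, and the `toy_translate` stub are all hypothetical, and safety/PII filtering is assumed to happen upstream.

```python
from dataclasses import dataclass, replace
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Pair:
    image_id: str
    caption: str
    lang: str  # hypothetical per-caption language tag


def make_globe(pairs: Iterable[Pair]) -> List[Pair]:
    """globe: keep every pair, whatever the caption language."""
    return list(pairs)


def make_en(pairs: Iterable[Pair]) -> List[Pair]:
    """en: keep only pairs whose caption is tagged English."""
    return [p for p in pairs if p.lang == "en"]


def make_globe_tl(
    pairs: Iterable[Pair], translate: Callable[[str, str], str]
) -> List[Pair]:
    """globe-tl: same images as globe, non-English captions translated."""
    out = []
    for p in pairs:
        if p.lang == "en":
            out.append(p)
        else:
            out.append(replace(p, caption=translate(p.caption, p.lang), lang="en"))
    return out


def toy_translate(caption: str, lang: str) -> str:
    # Placeholder for a machine-translation system; it only tags the text.
    return f"[{lang}->en] {caption}"


pairs = [
    Pair("img0", "a red bus", "en"),
    Pair("img1", "ein roter Bus", "de"),
    Pair("img2", "basi jekundu", "sw"),
]
globe = make_globe(pairs)
en = make_en(pairs)
globe_tl = make_globe_tl(pairs, toy_translate)
```

The point of the sketch is that en drops images, while globe-tl keeps the full image set and changes only the text side. That is what lets the paper separate visual diversity from multilinguality.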
Models are ViT-B/16 image encoders plus Transformer text encoders trained with a SigLIP sigmoid contrastive loss. Each model sees 10B image-text pairs, with three random seeds per data variant. Evaluation covers culturally diverse datasets such as Dollar Street, Google Landmarks Dataset v2, GeoDE, MaRVL, and XM3600, plus popular Western-oriented benchmarks ImageNet and COCO.
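For orientation, the sigmoid contrastive loss treats every image-text pair in a batch as an independent binary classification problem. The sketch below is a minimal pure-Python version of the pairwise sigmoid objective, not the big_vision implementation. The temperature `t` and bias `b` are fixed illustrative values here (in SigLIP they are learnable), and the toy embeddings are made up.

```python
import math
from typing import List, Sequence


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def l2_normalize(v: Sequence[float]) -> List[float]:
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]


def sigmoid_contrastive_loss(img, txt, t: float = 10.0, b: float = -10.0) -> float:
    """Pairwise sigmoid loss, summed over all pairs and divided by batch size.

    img[i] and txt[i] form the positive pair; every other (i, j) is a negative.
    """
    img = [l2_normalize(v) for v in img]
    txt = [l2_normalize(v) for v in txt]
    n = len(img)
    total = 0.0
    for i in range(n):
        for j in range(n):
            logit = t * sum(a * c for a, c in zip(img[i], txt[j])) + b
            label = 1.0 if i == j else -1.0
            total -= log_sigmoid(label * logit)
    return total / n


# Toy check: matched pairs should score a lower loss than swapped pairs.
aligned = sigmoid_contrastive_loss([[1, 0], [0, 1]], [[1, 0], [0, 1]])
swapped = sigmoid_contrastive_loss([[1, 0], [0, 1]], [[0, 1], [1, 0]])
```

Because the loss depends only on which pairs appear in a batch, the choice of data variant changes what the encoder can learn without any change to the objective.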
flowchart LR
    WebLI["WebLI global image-text pairs"] --> Globe["globe: multilingual/global"]
    WebLI --> En["en: English-only filter"]
    WebLI --> TL["globe-tl: global images + English translations"]
    Globe --> Train["SigLIP pretraining"]
    En --> Train
    TL --> Train
    Train --> Western["ImageNet / COCO"]
    Train --> Diverse["Dollar Street / GLDv2 / GeoDE / MaRVL / geo-localization"]
Evidence And Results
The paper reports a consistent trade-off. English-only models perform best on the prevalent Western-oriented benchmarks in the main table: for example, en scores higher than both globe and globe-tl on ImageNet zero-shot classification and COCO retrieval. That gain comes at a cost: globally trained or translated-global models perform better on culturally diverse zero-shot evaluations and on geo-localization.
Examples of the reported pattern:
- Dollar Street, GLDv2, GeoDE, and MaRVL improve when moving away from English-only pretraining, while ImageNet/COCO often decline.
- Dollar Street disaggregation shows a larger performance gap for lower-income households under English-only filtering.
- Few-shot geo-localization strongly favors globe or globe-tl, suggesting that English-only pretraining loses country- and region-specific image representation information.
- XM3600 retrieval is not sufficient as a cultural-diversity metric because it can miss the geographically and culturally specific visual distribution shifts that geo-localization captures.
- Fine-tuning globe-tl on en data is a better trade-off than the reverse: pretraining on global data preserves cultural representation better, and later English fine-tuning can recover standard benchmark performance.
- Data mixing can also navigate the trade-off, but it requires training from scratch and is more expensive than short fine-tuning in their setup.
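The few-shot geo-localization idea can be illustrated with a small probe over frozen embeddings. The sketch below uses a nearest-centroid classifier as a stand-in; the paper's exact probe may differ. The country codes, embeddings, and function names are hypothetical.

```python
import math
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Vec = Sequence[float]


def fit_centroids(support: List[Tuple[Vec, str]]) -> Dict[str, List[float]]:
    """Average the few labelled embeddings available for each country."""
    grouped = defaultdict(list)
    for emb, country in support:
        grouped[country].append(emb)
    return {
        c: [sum(col) / len(vs) for col in zip(*vs)]
        for c, vs in grouped.items()
    }


def predict(centroids: Dict[str, List[float]], emb: Vec) -> str:
    """Return the country whose centroid is closest in Euclidean distance."""
    return min(centroids, key=lambda c: math.dist(centroids[c], emb))


def probe_accuracy(support, query) -> float:
    """Fit on the few-shot support set, then score held-out query embeddings."""
    centroids = fit_centroids(support)
    hits = sum(predict(centroids, emb) == country for emb, country in query)
    return hits / len(query)


# Toy encoder A keeps country information in its embeddings.
rich_support = [([1.0, 0.0], "KE"), ([0.9, 0.1], "KE"),
                ([0.0, 1.0], "PE"), ([0.1, 0.9], "PE")]
rich_query = [([0.8, 0.2], "KE"), ([0.2, 0.8], "PE")]

# Toy encoder B maps every image to the same point, so the probe can only guess.
flat_support = [([0.5, 0.5], c) for _, c in rich_support]
flat_query = [([0.5, 0.5], c) for _, c in rich_query]

acc_rich = probe_accuracy(rich_support, rich_query)
acc_flat = probe_accuracy(flat_support, flat_query)
```

The probe never touches the text encoder, so it measures what the image representation alone retains about where a sample comes from.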
Relation To Dynamic Curriculum Learning
This source is a benchmark-hygiene and distribution-support warning for Dynamic Curriculum Learning For JEPA.
The dynamic-curriculum idea often starts from a corpus dominated by repetitive normal behavior. That makes filtering or reweighting attractive. This paper warns that the filter target matters: if the filter uses a narrow proxy, it can improve a familiar aggregate metric while removing exactly the long-tail distribution needed for robust representations.
The time-series analogue is direct:
- English-only filtering corresponds to keeping only the easiest or most benchmark-aligned slice of a corpus.
- Cultural and socioeconomic coverage corresponds to rare regimes, tail tenants, unusual devices, low-frequency faults, regional seasonality, and underrepresented operating modes.
- ImageNet/COCO mismatch corresponds to average forecast error or aggregate anomaly scores that can look good while tail-state understanding degrades.
- Geo-localization corresponds to per-regime probes that ask whether the representation still encodes where a sample comes from in the underlying distribution.
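The aggregate-versus-slice mismatch in the list above is easy to reproduce with toy numbers. The sketch below assumes each evaluation record carries a regime label; the regime names and error values are invented for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Record = Tuple[str, float, float]  # (regime, y_true, y_pred)


def per_slice_mae(records: List[Record]) -> Dict[str, float]:
    """Mean absolute error computed separately for each regime."""
    errs = defaultdict(list)
    for regime, y_true, y_pred in records:
        errs[regime].append(abs(y_true - y_pred))
    return {r: sum(v) / len(v) for r, v in errs.items()}


def aggregate_mae(records: List[Record]) -> float:
    """Mean absolute error over all records, ignoring regime labels."""
    return sum(abs(t - p) for _, t, p in records) / len(records)


def worst_slice_gap(records: List[Record]) -> float:
    """Worst per-regime MAE minus aggregate MAE; large values flag tail damage."""
    return max(per_slice_mae(records).values()) - aggregate_mae(records)


# 98 common windows with small error, 2 rare-fault windows with large error.
records = [("normal", 1.0, 1.5)] * 98 + [("rare_fault", 1.0, 6.0)] * 2
agg = aggregate_mae(records)
slices = per_slice_mae(records)
gap = worst_slice_gap(records)
```

In this toy case the aggregate error sits close to the normal-regime error, while the rare-fault slice is an order of magnitude worse. The paper reports the same shape for ImageNet/COCO versus the culturally diverse evaluations.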
For dynamic curriculum learning, this suggests a stricter contract:
filter repeated or corrupt windows,
but preserve distributional diversity and tail support.
The curriculum should therefore include uniform floors, per-regime quotas or soft constraints, embedding-coverage diagnostics, tail-slice probes, and normal-retention checks. Filtering should be evaluated by whether it preserves useful state diversity, not only by whether it improves average loss faster.
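One way to combine a uniform floor with per-regime quotas is to mix three sampling distributions. This is a minimal sketch under assumed inputs, not a method from the paper: `scores` stands for any non-negative curriculum score per window, `regimes` for a per-window regime label, and the mixing weights are arbitrary.

```python
from collections import Counter
from typing import List, Sequence


def curriculum_probs(
    scores: Sequence[float],
    regimes: Sequence[str],
    uniform_floor: float = 0.1,
    regime_quota: float = 0.2,
) -> List[float]:
    """Mix score-proportional, uniform, and regime-balanced sampling.

    Every window gets at least uniform_floor / n probability, and every
    regime gets at least regime_quota / n_regimes total probability.
    """
    if not 0.0 <= uniform_floor + regime_quota <= 1.0:
        raise ValueError("floor + quota must lie in [0, 1]")
    n = len(scores)
    if n == 0 or min(scores) < 0 or sum(scores) <= 0:
        raise ValueError("scores must be non-negative with a positive sum")
    total = sum(scores)
    counts = Counter(regimes)
    n_regimes = len(counts)
    w_score = 1.0 - uniform_floor - regime_quota
    return [
        w_score * s / total                        # curriculum preference
        + uniform_floor / n                        # uniform floor per window
        + regime_quota / (n_regimes * counts[r])   # equal mass per regime
        for s, r in zip(scores, regimes)
    ]


# A score that ignores the rare regime entirely would otherwise drop it.
scores = [1.0] * 8 + [0.0] * 2
regimes = ["normal"] * 8 + ["rare_fault"] * 2
probs = curriculum_probs(scores, regimes)
rare_mass = sum(p for p, r in zip(probs, regimes) if r == "rare_fault")
```

The resulting weights can be passed to `random.choices(range(len(probs)), weights=probs, k=batch_size)`. The guarantees are lower bounds only, so tail-slice probes and coverage diagnostics are still needed to check that the retained mass is enough.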
Foundation TSFM Relevance
| Agenda slot | Verdict | Evidence | Missing pieces |
|---|---|---|---|
| Data diversity, curriculum, and long tail | warning | Shows that a common filter can improve standard benchmarks while hurting underrepresented distribution slices. | Need time-series curricula that report per-regime and tail-tenant metrics, not only aggregate forecasting or anomaly scores. |
| Benchmark level | warning | Demonstrates that ImageNet/COCO can be at odds with cultural-diversity metrics; XM3600 retrieval can miss cultural visual diversity. | Need TSFM benchmark suites with region/regime/tail slices and representation probes analogous to geo-localization. |
| Representation quality | adjacent | Few-shot geo-localization shows that image representations trained on global data preserve country/region information better than English-filtered ones. | Need analogous probes for latent state: regime, topology, tenant, device type, intervention phase, and rare-event context. |
| Data mixture and adaptation | adjacent | Global pretraining followed by English fine-tuning gives a better trade-off than English pretraining followed by global fine-tuning. | Need TSFM tests of broad/diverse pretraining followed by domain-specific fine-tuning versus narrow pretraining followed by broad adaptation. |
Links Into The Wiki
- Dynamic Curriculum Learning For JEPA
- Foundation Time-Series Model Research Agenda
- Time-Series Benchmark Hygiene
- Distribution Priors In Self-Supervised Learning
- Latent-State Time-Series Modeling
- Time-Series Foundation Models
Limitations And Gotchas
- The source is about contrastive VLMs, not language-only LLMs, multivariate time series, event streams, JEPA, or action-conditioned world models.
- The training data is WebLI-derived and not fully public, so reproduction outside Google DeepMind may be limited.
- The evidence is primarily SigLIP-style encoder-only contrastive training, not generative VLMs.
- Cultural diversity is approximated through region, country, language, landmark, object, and income-slice evaluations; it is not a complete definition of culture.
- The paper argues against English-only filtering and benchmark-narrow filtering, not against privacy, safety, PII, deduplication, or quality filters that preserve global support.
- Translation can preserve some global coverage while improving English-prompt usability, but it can also remove culturally meaningful linguistic context.
Open Questions
- What is the time-series equivalent of geo-localization: tenant identification, topology region, device class, regime probe, intervention phase, or environment metadata prediction?
- Can dynamic curriculum learning preserve tail-regime representations while still reducing repeated normal windows?
- Which benchmark slices reveal distributional damage that average forecast error hides?
- Should time-series pretraining start broad and diverse, then fine-tune on target domains, rather than pretraining narrowly and hoping later broad adaptation recovers missing regimes?
- How should quality filters distinguish harmful samples from culturally, regionally, or operationally rare samples?
- Can translation-like normalization in time series, such as unit conversion, channel metadata normalization, or topology remapping, preserve diversity without erasing local context?