Wikipedia Representation Gaps

VidaminT edited this page Nov 24, 2025 · 1 revision

Wikipedia Representation Gaps Analysis (2015-2025)

A Hack for LA civic technology project quantifying systemic bias in English Wikipedia biographies


📊 At a Glance

Key Statistics

What We Found:

  • 68.6% of Wikipedia biographies are about men
  • 60% of biographies cover Western subjects (Europe + North America)
  • 10.5× penalty for women even in the most favorable conditions
  • 47pp gender gap unchanged across 40 years of birth cohorts
  • Geographic concentration quadrupled from 2015 to 2025

What It Means:
Wikipedia doesn't just document history — it actively shapes whose stories get told. And right now, those stories overwhelmingly favor Western men in traditionally male-dominated fields.


🎯 Project Overview

This project analyzes 1.1 million English Wikipedia biographies created between 2015 and 2025, using data from the MediaWiki API and Wikidata. We apply formal statistical methods to move beyond simple observation to quantified, testable evidence of systemic bias.

Why This Matters

Wikipedia is the 5th most-visited website globally and a major source of training data for AI systems. When Wikipedia's coverage is biased, that bias:

  • Shapes public knowledge about who matters in history
  • Trains AI models that perpetuate inequality
  • Influences education through ubiquitous citation in schools
  • Defines "notability" in ways that systematically exclude marginalized groups

🔍 Key Findings

1. The Gender Gap That Won't Budge

[Figure: Gender Distribution]

The Numbers:

  • Male: 68.6%
  • Female: 30.8%
  • Other/Non-binary: 0.3%

Despite modest gains since 2015, the fundamental 2:1 male dominance remains entrenched. But the story gets more troubling when we dig deeper.

The "Pipeline Problem" is a Myth

[Figure: Gender Over Time]

A common defense claims gaps will naturally close as younger, more gender-balanced generations enter the historical record. Our birth cohort analysis of 715,000 biographies definitively disproves this:

| Birth Cohort | Male % | Female % | Gender Gap |
|---|---|---|---|
| Born 1940s-1950s | 72.9% | 26.0% | +46.9 pp |
| Born 1960s-1970s | 74.0% | 25.1% | +48.9 pp |
| Born 1970s-1980s | 73.6% | 26.4% | +47.2 pp |
| Born 1990s-2000s | 73.7% | 26.3% | +47.4 pp |

The gap for people born in the 1990s-2000s — who came of age during #MeToo — is statistically unchanged from those born 40 years earlier.

This proves bias is ongoing, not historical. Generational replacement won't fix the problem because each new cohort replicates the same structural inequality.
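The cohort comparison boils down to a simple per-cohort computation. A minimal sketch, with the percentages copied from the cohort table above:

```python
# Gender gap per birth cohort, in percentage points (male % minus female %).
# Figures are taken from the cohort table above.
cohorts = {
    "1940s-1950s": (72.9, 26.0),
    "1960s-1970s": (74.0, 25.1),
    "1970s-1980s": (73.6, 26.4),
    "1990s-2000s": (73.7, 26.3),
}

def gender_gap_pp(male_pct, female_pct):
    """Gap in percentage points, rounded to one decimal."""
    return round(male_pct - female_pct, 1)

gaps = {cohort: gender_gap_pp(m, f) for cohort, (m, f) in cohorts.items()}
# The youngest cohort's gap is within half a point of the oldest's.
```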


2. Wikipedia Mirrors American Gender Politics

Our statistical changepoint detection algorithms identified 2017 and 2023 as years when representation trends fundamentally shifted — mathematical confirmation that these aren't just narratives but detectable structural breaks in the data.

The Timeline:

  • 2015-2016 (Pre-#MeToo): Female representation improving at +3.2 pp/year (p=0.033)
  • 2017-2019 (#MeToo Peak): Accelerated progress, female share rose from 28% → 32%
  • 2020-2025 (Backlash Era): Progress stalled, female share plateaued at ~34%

Even historic "firsts" like Kamala Harris becoming the first female, Black, and South Asian Vice President didn't reverse the trend. Wikipedia representation is reactive to cultural pressure, not independent of it.
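For intuition only, a single structural break can be located by choosing the split that minimizes within-segment squared error. This toy detector is a simplified stand-in for the project's actual changepoint algorithms, and the yearly series below is synthetic:

```python
# Toy single-changepoint detector: pick the split index that minimizes
# total within-segment squared error. Simplified stand-in for the
# project's changepoint algorithms; the data below are synthetic.
def sse(xs):
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs)

def best_changepoint(series):
    # Index k splits the series into series[:k] and series[k:].
    return min(range(1, len(series)),
               key=lambda k: sse(series[:k]) + sse(series[k:]))

# Synthetic yearly female-share values with a jump mid-series
female_share = [28.0, 28.3, 28.6, 31.8, 32.5, 33.4, 33.9, 34.0]
break_at = best_changepoint(female_share)
```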


3. Occupational Gatekeeping is Extreme — and Gendered

[Figure: Occupational Distribution]

Four fields monopolize ~98% of Wikipedia biographies: Sports, Arts & Culture, Politics & Law, and STEM & Academia. Within these fields, gender gaps vary dramatically:

Gender Gaps by Field (2025):

| Occupation | Male % | Female % | Gap | Trend |
|---|---|---|---|---|
| Military | 95% | 4% | +91 pp | Frozen (+0.05 pp/yr) |
| Sports | 90% | 8% | +82 pp | Slow progress |
| Religion | 85% | 14% | +71 pp | Frozen (+0.00002 pp/yr) |
| Business | 80% | 18% | +62 pp | Minimal (+0.30 pp/yr) |
| Politics & Law | 75% | 24% | +51 pp | Improving (+1.95 pp/yr) |
| STEM & Academia | 70% | 28% | +42 pp | Steady (+0.85 pp/yr) |
| Arts & Culture | 65% | 33% | +32 pp | Best progress (+1.20 pp/yr) |

Where Progress Happens — and Where It Doesn't

Trajectory analysis reveals a clear pattern:

  • Fields that received focused advocacy (Politics during 2018-2020 electoral cycles) show measurable improvement
  • Fields with rigid hierarchical structures (Military, Religion) remain frozen
  • Passive growth won't work — targeted intervention is required

The "Notability" Double Standard

Wikipedia's supposedly neutral "notability" criteria encode historical chauvinism:

  • Military (95% male): Combat exclusion kept women out until 2015. Wikipedia documents this male-dominated past but treats it as neutral history rather than systematic exclusion. Result: Decades of all-male military leadership become "evidence" of greater male notability.

  • Sports (90% male): Women's sports remain underfunded and undercovered by media. Wikipedia's gap mirrors media bias — if ESPN doesn't cover it, there are fewer "reliable sources" to cite.

  • Politics (75% male): Despite record numbers of women running for office, women face higher notability bars paralleling the "likability" penalties female politicians encounter in media coverage.

The common thread: Wikipedia treats the outcomes of historical gender discrimination as inputs to notability decisions. This is structural misogyny laundered through bureaucratic process.


4. Geographic Bias Exports American Standards Worldwide

[Figure: Continental Distribution]

The Stark Reality:

  • Europe + North America: ~60% of biographies
  • Asia: ~26% (but 59% of world population)
  • Africa: ~6% (but 18% of world population)

Location Quotients: Precise Statistical Measures

We calculated Location Quotients (LQ) comparing each region's share of biographies to its share of world population:

Most Over-represented (2025):

  • Oceania: LQ = 5.55 (5.5× over-represented)
  • Europe: LQ = 3.97 (4× over-represented)
  • North America: LQ = 2.81 (2.8× over-represented)

Most Under-represented (2025):

  • Asia: LQ = 0.34 (66% under-represented)
  • Africa: LQ = 0.39 (61% under-represented)

[Figure: Geographic Gaps]

These multipliers are computed directly from coverage relative to population: Western regions receive 3-6× their proportional share of biographies, while the Global Majority receives only one-third to two-fifths of theirs.
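Computing an LQ is a one-line ratio of two shares. A minimal sketch; the share values below are illustrative placeholders, not the project's exact inputs:

```python
# Location Quotient = region's share of biographies / region's share of
# world population. LQ > 1 means over-representation, LQ < 1 under-.
# Shares below are illustrative placeholders, not the project's inputs.
bio_share = {"Asia": 0.26, "Africa": 0.06, "Europe": 0.36, "North America": 0.21}
pop_share = {"Asia": 0.59, "Africa": 0.18, "Europe": 0.09, "North America": 0.075}

def location_quotient(region):
    return bio_share[region] / pop_share[region]

lq = {region: round(location_quotient(region), 2) for region in bio_share}
```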


5. Intersectional Penalties Multiply, Not Add

[Figure: Intersectional Analysis]

Logistic regression analysis reveals how geographic and gender biases compound:

The Privilege Gradient:

  • Male European (baseline): 1.0× likelihood
  • Female European military: 10.5× less likely than male counterparts
  • Female African subjects: ~20× less likely than male European subjects

This exponential penalty means a female scientist from Asia or Africa must achieve far more recognition — in Western media specifically — to meet the same notability threshold as a male European peer.

Why biases multiply:

  1. Source availability bias: Non-Western media doesn't count as "reliable sources"
  2. Language bias: Non-English sources face higher verification burdens
  3. Cultural gatekeeping: Western definitions of "importance" privilege Western institutions

The result: Women from the Global South don't just face the gender penalty OR the geographic penalty — they face both multiplied together.
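The multiplication falls directly out of the model: in a logistic regression, effects add on the log-odds scale, so the corresponding odds ratios multiply. A minimal numeric sketch; the coefficients below are illustrative, not fitted values from the project's model:

```python
import math

# In logistic regression, log-odds coefficients add, so odds ratios
# multiply. Coefficients below are illustrative, not fitted values.
def combined_odds_ratio(*log_odds_coefs):
    return math.exp(sum(log_odds_coefs))

beta_female = math.log(1 / 10.5)  # gender penalty, echoing the 10.5x figure
beta_region = math.log(1 / 2.0)   # hypothetical extra regional penalty

# Combined penalty is multiplicative: (1/10.5) * (1/2.0), i.e. about 1/21
penalty = combined_odds_ratio(beta_female, beta_region)
```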


6. More Content ≠ More Equity

[Figure: Temporal Trends]

Total new biographies rose from ~51,000 (2015) to a peak of 60,000 (2020), then declined 45% post-pandemic. Despite these wild fluctuations, relative proportions of gender and regional representation remained almost perfectly static.

Concentration Indices Prove the Point

We used the Herfindahl-Hirschman Index (HHI) to measure concentration:

Occupational Concentration:

  • 2015 HHI: 3,081
  • 2025 HHI: 2,123
  • Change: -31% (improving)

Geographic Concentration:

  • 2015 HHI: 508
  • 2025 HHI: 2,159
  • Change: +325% (worsening dramatically)

Critical insight: Geographic concentration quadrupled over the decade. Adding more biographies made geographic inequality worse, not better. Bias is baked into the system, not just a side effect of insufficient volume.
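The HHI behind these numbers is just the sum of squared percentage shares, on a 0-10,000 scale. A minimal sketch with hypothetical shares, not the project's data:

```python
# Herfindahl-Hirschman Index: sum of squared percentage shares.
# 10,000 = everything in one category; values near 0 = even spread.
def hhi(percent_shares):
    return sum(s ** 2 for s in percent_shares)

# Hypothetical occupational shares (percent), not the project's data
occupational_shares = [40, 30, 18, 8, 4]
index = hhi(occupational_shares)  # -> 2904
```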


📐 Our Methodology

Data Collection

  • Source: MediaWiki Action API + Wikidata
  • Scope: 1.1 million English Wikipedia biographies (2015-2025)
  • Attributes: Gender, country/region, occupation, birth year, creation timestamp

Statistical Analysis Methods

1. Interrupted Time Series Analysis

  • Tests whether cultural events (#MeToo, backlash) caused significant trend changes
  • Finding: Pre-#MeToo trend of +3.2 pp/year (p=0.033) proves Wikipedia was already responsive to feminist momentum

2. Changepoint Detection

  • Algorithmically identifies structural breaks in time series
  • Finding: Independent confirmation of 2017 and 2023 as inflection points

3. Location Quotients (LQ)

  • Quantifies regional over/under-representation relative to population
  • Finding: Precise multipliers showing Western regions 3-6× over-represented

4. Concentration Indices (HHI)

  • Measures inequality in occupational and geographic coverage
  • Finding: Geographic concentration quadrupled while occupational concentration improved slightly

5. Logistic Regression (Intersectional Analysis)

  • Predicts biography presence based on gender × occupation × region
  • Finding: Female European military subjects 10.5× less likely than male counterparts

6. Birth Cohort Analysis

  • Compares gender gaps across generational cohorts
  • Finding: 1990s cohort gap (47.4pp) = 1970s cohort gap (47.2pp), disproving "pipeline problem"
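For intuition on method 1, an interrupted time series can be approximated by fitting separate linear trends before and after a candidate intervention year and comparing slopes. This is a toy sketch on synthetic data; the real analysis uses a full segmented-regression model with significance tests:

```python
# Toy interrupted time series: compare linear trend slopes before and
# after an intervention year. Data below are synthetic, not the project's.
def slope(ts, ys):
    """Ordinary least-squares slope of ys against ts."""
    n = len(ts)
    mt, my = sum(ts) / n, sum(ys) / n
    num = sum((t - mt) * (y - my) for t, y in zip(ts, ys))
    den = sum((t - mt) ** 2 for t in ts)
    return num / den

years = list(range(2015, 2026))
share = [28.0, 28.5, 29.5, 31.0, 32.0, 33.0, 33.5, 33.7, 33.8, 33.9, 34.0]
cut = years.index(2020)  # candidate intervention year
pre_slope = slope(years[:cut], share[:cut])    # trend before 2020
post_slope = slope(years[cut:], share[cut:])   # trend from 2020 on
```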

💡 What This Means for Wikipedia

The Myth of Neutrality

Wikipedia's most insidious bias isn't overt sexism — it's the claim of objectivity. By treating historical male dominance as neutral fact rather than the product of systematic exclusion, Wikipedia naturalizes gender inequality.

When notability criteria favor fields women were barred from entering, that's not neutral — that's laundering misogyny through bureaucratic process.

American Exceptionalism Exports Bias

English Wikipedia's scale makes US cultural biases about whose lives matter into global defaults. America's unfinished reckoning with gender inequality doesn't just shape domestic coverage — it exports a template of chauvinism that marginalizes women worldwide.

If The New York Times or BBC don't cover someone, they likely won't meet notability criteria — regardless of their impact in their own country. This is cultural imperialism compounding gender bias.


🎬 Recommendations

For Wikipedia Editors & Community

1. Reform Notability Policies

  • Recognize non-Western media as reliable sources
  • Create exceptions for underrepresented regions
  • Challenge the assumption that "historically male-dominated = inherently notable"

2. Target Frozen Fields

  • Religion, Military, Business won't improve without active intervention
  • Organize edit-a-thons specifically for these domains
  • Challenge gatekeeping in WikiProjects

3. Address Intersectional Compounding

  • Prioritize women from underrepresented regions
  • Create mentorship programs for Global South editors
  • Translate notable achievements from non-English sources

4. Reject the "Pipeline" Excuse

  • When the youngest cohort shows the same 47pp gap as their parents' generation, the problem is current policy, not historical legacy
  • Stop waiting for demographic change to solve editorial decisions

For Hack for LA

Immediate Actions:

  1. Host monthly Wikipedia edit-a-thons focused on underrepresented groups
  2. Create editor recruitment campaigns targeting women and Global South communities
  3. Build tools for bias detection in new articles
  4. Partner with universities to integrate Wikipedia editing into coursework

Medium-term Projects:

  1. Develop automated bias alerts that flag articles lacking diverse sourcing
  2. Create translation pipelines for non-English reliable sources
  3. Build dashboards tracking representation metrics in real-time
  4. Advocate to Wikimedia Foundation for policy changes based on this data

Long-term Vision:

  • Expand analysis to other language editions
  • Track AI training data sourced from Wikipedia
  • Document how bias propagates from Wikipedia → AI systems → public knowledge
  • Build coalition with other civic tech organizations addressing algorithmic bias

📚 Technical Resources

Repository

GitHub: hackforla/wikipedia-representation-gaps

Key Files:

  • representation_gaps.md — Complete analytical report
  • README.md — Project documentation and setup
  • notebooks/ — Jupyter notebooks for analysis pipeline
  • outputs/visualizations/ — All charts and graphics

Data Pipeline

| Notebook | Purpose |
|---|---|
| 00_project_setup.ipynb | Project initialization |
| 01_api_seed.ipynb | Pull biography page list from Wikipedia |
| 02_enrich_and_normalize.ipynb | Map to Wikidata, fetch attributes |
| 03_aggregate_and_qc.ipynb | Build monthly aggregates, quality checks |
| 04_visualization.ipynb | Create core visualizations |
| 05_statistical_analysis.ipynb | ITS, changepoints, LQ, HHI |
| 06_intersectional_analysis.ipynb | Logistic regression, odds ratios, cohorts |
| 07_dashboard.ipynb | Interactive dashboard |

APIs & Data Sources

  • MediaWiki Action API — Page metadata and timestamps
  • Wikidata API — Structured biographical data
  • UN World Population Prospects — Population baselines for Location Quotients
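A minimal sketch of how a MediaWiki Action API request might be constructed. The endpoint and the `list=categorymembers` parameters are standard MediaWiki API features; the category name is illustrative, and no network request is sent here:

```python
from urllib.parse import urlencode

# Build a MediaWiki Action API query URL for listing category members.
# Endpoint and parameter names are standard; the category is illustrative.
API_ENDPOINT = "https://en.wikipedia.org/w/api.php"

def build_category_query(category, limit=500):
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": category,
        "cmlimit": limit,
        "format": "json",
    }
    return API_ENDPOINT + "?" + urlencode(params)

url = build_category_query("Category:Living people")
```

In practice such queries are paginated with the API's continuation tokens until the full page list is collected.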

⚠️ Limitations & Caveats

Data Quality

  • Metadata gaps: Gender and occupation tags incomplete, especially for non-Western subjects
  • English-only scope: Analysis limited to English Wikipedia; other language editions may show different patterns
  • Timestamp definition: "Creation year" refers to Wikidata item creation, usually aligned with article publication

Statistical Methods

  • ITS analysis: Could not definitively prove magnitude of #MeToo effect (p > 0.05 for slope changes), though changepoint detection confirmed 2017 break
  • LQ and HHI: Descriptive measures, do not establish causation
  • Intersectional analysis: Focuses on gender × occupation × region; does not capture race, sexuality, disability
  • Cohort analysis: Limited to 715,000 subjects with reliable birth year data (~66% of dataset)

Interpretation

  • Dashboard reflects coverage, not reality: Wikipedia data shows what is written, not the real world
  • Population baselines: Continental shares are approximations; do not adjust for internet access, literacy, age demographics

🙏 Acknowledgments

Project Team: Hack for LA Wikipedia Representation Gaps Initiative

Methods Inspiration:

  • Interrupted time series analysis adapted from public health intervention studies
  • Location Quotient methodology from economic geography literature
  • Intersectional analysis frameworks from critical data studies

Data Sources:

  • Wikimedia Foundation APIs (MediaWiki, Wikidata)
  • UN World Population Prospects

📖 Citation

When citing this work, please use:

Hack for LA. (2025). Wikipedia Representation Gaps Analysis (2015-2025): 
Quantifying Systemic Bias in English Wikipedia Biographies. 
GitHub: 

🔄 Project Status

  • Current Phase: Active development and monthly data refreshes
  • Last Updated: November 2025
  • Next Milestones:
    • Launch interactive public dashboard
    • Publish findings in academic journals
    • Present to Wikimedia Foundation
    • Expand to multilingual analysis

"The future depends on what we do in the present." — Mahatma Gandhi

Let's make Wikipedia reflect the world, not just the privileged parts of it.


This is a living document. Last updated: November 2025
