Wikipedia Representation Gaps
A Hack for LA civic technology project quantifying systemic bias in English Wikipedia biographies

What We Found:
- 68.6% of Wikipedia biographies are about men
- 60% of biographies cover Western subjects (Europe + North America)
- 10.5× penalty for women even in the most favorable conditions
- 47pp gender gap unchanged across 40 years of birth cohorts
- Geographic concentration quadrupled from 2015 to 2025
What It Means:
Wikipedia doesn't just document history — it actively shapes whose stories get told. And right now, those stories overwhelmingly favor Western men in traditionally male-dominated fields.
This project analyzes 1.1 million English Wikipedia biographies created between 2015 and 2025, using data from the MediaWiki API and Wikidata. We applied rigorous statistical methods to move beyond simple observation to quantitative evidence of systemic bias.
Wikipedia is the 5th most-visited website globally and the primary training data for AI systems. When Wikipedia's coverage is biased, that bias:
- Shapes public knowledge about who matters in history
- Trains AI models that perpetuate inequality
- Influences education through ubiquitous citation in schools
- Defines "notability" in ways that systematically exclude marginalized groups

The Numbers:
- Male: 68.6%
- Female: 30.8%
- Other/Non-binary: 0.3%
Despite modest gains since 2015, the fundamental 2:1 male dominance remains entrenched. But the story gets more troubling when we dig deeper.

A common defense claims gaps will naturally close as younger, more gender-balanced generations enter the historical record. Our birth cohort analysis of 715,000 biographies definitively disproves this:
| Birth Cohort | Male % | Female % | Gender Gap |
|---|---|---|---|
| Born 1940s-1950s | 72.9% | 26.0% | +46.9 pp |
| Born 1960s-1970s | 74.0% | 25.1% | +48.9 pp |
| Born 1970s-1980s | 73.6% | 26.4% | +47.2 pp |
| Born 1990s-2000s | 73.7% | 26.3% | +47.4 pp |
The gap for people born in the 1990s-2000s — who came of age during #MeToo — is statistically unchanged from those born 40 years earlier.
This proves bias is ongoing, not historical. Generational replacement won't fix the problem because each new cohort replicates the same structural inequality.
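The cohort comparison above reduces to a per-cohort share calculation. A minimal sketch in Python, using illustrative counts rather than the project's actual data:

```python
# Birth-cohort gender-gap sketch. Counts below are illustrative
# round numbers, not the project's real per-cohort totals.
cohort_counts = {
    "1970s-1980s": {"male": 73600, "female": 26400},
    "1990s-2000s": {"male": 73700, "female": 26300},
}

def gender_gap_pp(counts):
    """Return (male %, female %, gap in percentage points)."""
    total = counts["male"] + counts["female"]
    male_pct = 100 * counts["male"] / total
    female_pct = 100 * counts["female"] / total
    return round(male_pct, 1), round(female_pct, 1), round(male_pct - female_pct, 1)

for cohort, counts in cohort_counts.items():
    print(cohort, gender_gap_pp(counts))
```

If the youngest cohort's gap matches its parents' generation to within a fraction of a point, as the table above reports, the "pipeline" explanation fails.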
Our statistical changepoint detection algorithms identified 2017 and 2023 as years when representation trends fundamentally shifted — mathematical confirmation that these aren't just narratives but detectable structural breaks in the data.
The Timeline:
- 2015-2016 (Pre-#MeToo): Female representation improving at +3.2 pp/year (p=0.033)
- 2017-2019 (#MeToo Peak): Accelerated progress, female share rose from 28% → 32%
- 2020-2025 (Backlash Era): Progress stalled, female share plateaued at ~34%
Even historic "firsts" like Kamala Harris becoming the first female, Black, and South Asian Vice President didn't reverse the trend. Wikipedia representation is reactive to cultural pressure, not independent of it.

Four fields monopolize ~98% of Wikipedia biographies: Sports, Arts & Culture, Politics & Law, and STEM & Academia. Within these fields, gender gaps vary dramatically:
Gender Gaps by Field (2025):
| Occupation | Male % | Female % | Gap | Trend |
|---|---|---|---|---|
| Military | 95% | 4% | +91 pp | Frozen (+0.05 pp/yr) |
| Sports | 90% | 8% | +82 pp | Slow progress |
| Religion | 85% | 14% | +71 pp | Frozen (+0.00002 pp/yr) |
| Business | 80% | 18% | +62 pp | Minimal (+0.30 pp/yr) |
| Politics & Law | 75% | 24% | +51 pp | Improving (+1.95 pp/yr) |
| STEM & Academia | 70% | 28% | +42 pp | Steady (+0.85 pp/yr) |
| Arts & Culture | 65% | 33% | +32 pp | Best progress (+1.20 pp/yr) |
Trajectory analysis reveals a clear pattern:
- Fields that received focused advocacy (Politics during 2018-2020 electoral cycles) show measurable improvement
- Fields with rigid hierarchical structures (Military, Religion) remain frozen
- Passive growth won't work — targeted intervention is required
Wikipedia's supposedly neutral "notability" criteria encode historical chauvinism:
- Military (95% male): Combat exclusion kept women out until 2015. Wikipedia documents this male-dominated past but treats it as neutral history rather than systematic exclusion. Result: Decades of all-male military leadership become "evidence" of greater male notability.
- Sports (90% male): Women's sports remain underfunded and undercovered by media. Wikipedia's gap mirrors media bias — if ESPN doesn't cover it, there are fewer "reliable sources" to cite.
- Politics (75% male): Despite record numbers of women running for office, women face higher notability bars paralleling the "likability" penalties female politicians encounter in media coverage.
The common thread: Wikipedia treats the outcomes of historical gender discrimination as inputs to notability decisions. This is structural misogyny laundered through bureaucratic process.

The Stark Reality:
- Europe + North America: ~60% of biographies
- Asia: ~26% (but 59% of world population)
- Africa: ~6% (but 18% of world population)
We calculated Location Quotients (LQ) comparing each region's share of biographies to its share of world population:
Most Over-represented (2025):
- Oceania: LQ = 5.55 (5.5× over-represented)
- Europe: LQ = 3.97 (4× over-represented)
- North America: LQ = 2.81 (2.8× over-represented)
Most Under-represented (2025):
- Asia: LQ = 0.34 (66% under-represented)
- Africa: LQ = 0.39 (61% under-represented)

These are not rough impressions: the Location Quotients measure precisely how far each region departs from population parity, with Western regions receiving 3-6× their proportional share while the Global Majority receives only one-third to two-fifths of theirs.
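The Location Quotient is simply a ratio of shares. A minimal sketch, using the illustrative round figures from the summary above rather than the exact underlying counts:

```python
# Location Quotient sketch: LQ = region's share of biographies
# divided by its share of world population. LQ > 1 means
# over-represented; LQ < 1 means under-represented.
def location_quotient(bio_share, pop_share):
    return bio_share / pop_share

# e.g. Asia: ~26% of biographies vs ~59% of world population
# (rounded figures, so the result differs slightly from the
# report's exact LQ of 0.34)
print(round(location_quotient(0.26, 0.59), 2))
```

The report's values are computed from the full dataset against UN population baselines; the rounded shares here only demonstrate the formula.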

Logistic regression analysis reveals how geographic and gender biases compound:
The Privilege Gradient:
- Male European (baseline): 1.0× likelihood
- Female European military: 10.5× less likely than male counterparts
- Female African subjects: ~20× less likely than male European subjects
This multiplicative penalty means a female scientist from Asia or Africa must achieve far more recognition, in Western media specifically, to meet the same notability threshold as a male European peer.
Why biases multiply:
- Source availability bias: Non-Western media doesn't count as "reliable sources"
- Language bias: Non-English sources face higher verification burdens
- Cultural gatekeeping: Western definitions of "importance" privilege Western institutions
The result: Women from the Global South don't just face the gender penalty OR the geographic penalty — they face both multiplied together.
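The multipliers in the privilege gradient come from logistic regression odds ratios. For a single binary predictor, exp(coefficient) equals the sample odds ratio of a 2×2 table, which can be sketched directly (the counts below are purely hypothetical, chosen only to produce a penalty of the same order as the reported one):

```python
def odds_ratio(a, b, c, d):
    """Sample odds ratio for a 2x2 table:
        a = group 1 with a biography, b = group 1 without
        c = group 2 with a biography, d = group 2 without
    In a one-predictor logistic regression, exp(coefficient)
    equals this same quantity."""
    return (a * d) / (b * c)

# Hypothetical counts: men vs women within one field
print(round(odds_ratio(900, 100, 45, 55), 1))
```

The project's model is multivariate (gender × occupation × region), so its odds ratios are adjusted for the other factors; this sketch shows only the quantity being estimated.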

Total new biographies rose from ~51,000 (2015) to a peak of 60,000 (2020), then declined 45% post-pandemic. Despite these wild fluctuations, relative proportions of gender and regional representation remained almost perfectly static.
We used Herfindahl-Hirschman Index (HHI) to measure concentration:
Occupational Concentration:
- 2015 HHI: 3,081
- 2025 HHI: 2,123
- Change: -31% (improving)
Geographic Concentration:
- 2015 HHI: 508
- 2025 HHI: 2,159
- Change: +325% (worsening dramatically)
Critical insight: Geographic concentration quadrupled over the decade. Adding more biographies made geographic inequality worse, not better. Bias is baked into the system, not just a side effect of insufficient volume.
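The HHI figures above follow the standard definition: the sum of squared percentage shares across categories, ranging from near 0 (fully dispersed) to 10,000 (a single category). A minimal sketch with hypothetical occupational shares:

```python
def hhi(shares_pct):
    """Herfindahl-Hirschman Index: sum of squared percentage shares.
    ~0 = fully dispersed; 10,000 = everything in one category."""
    return sum(s * s for s in shares_pct)

# Hypothetical shares (percent) for five occupational fields,
# not the project's real distribution
print(hhi([45, 25, 15, 13, 2]))
```

Because squaring weights large shares disproportionately, a handful of dominant regions or fields drives the index up quickly, which is why adding volume without diversifying sources raised the geographic HHI.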
- Source: MediaWiki Action API + Wikidata
- Scope: 1.1 million English Wikipedia biographies (2015-2025)
- Attributes: Gender, country/region, occupation, birth year, creation timestamp
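Creation timestamps of this kind can be pulled with a single documented `prop=revisions` query against the MediaWiki Action API. A minimal sketch that builds the request URL with the standard library (the project's actual notebook pipeline may construct its queries differently):

```python
from urllib.parse import urlencode

API = "https://en.wikipedia.org/w/api.php"

def first_revision_url(title):
    """Build an Action API request for a page's earliest revision,
    i.e. its creation timestamp. All parameters are standard
    prop=revisions options documented by MediaWiki."""
    params = {
        "action": "query",
        "prop": "revisions",
        "titles": title,
        "rvlimit": 1,
        "rvdir": "newer",       # oldest revision first
        "rvprop": "timestamp",
        "format": "json",
    }
    return API + "?" + urlencode(params)

url = first_revision_url("Ada Lovelace")
# import json, urllib.request
# data = json.load(urllib.request.urlopen(url))  # uncomment to fetch
print(url)
```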
1. Interrupted Time Series Analysis
- Tests whether cultural events (#MeToo, backlash) caused significant trend changes
- Finding: Pre-#MeToo trend of +3.2 pp/year (p=0.033) indicates Wikipedia was already responsive to feminist momentum
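Interrupted time series analysis fits a segmented regression with pre-trend, level-change, and slope-change terms around the intervention date. A minimal sketch on synthetic data, assuming NumPy and a single known intervention point (the project's actual model specification may differ):

```python
import numpy as np

# Synthetic monthly female-share series with a slope change at t0;
# coefficients here are invented for illustration only
rng = np.random.default_rng(0)
t = np.arange(60)                       # 60 months of data
t0 = 30                                 # hypothetical intervention month
post = (t >= t0).astype(float)
y = 28 + 0.10 * t + 0.15 * (t - t0) * post + rng.normal(0, 0.2, t.size)

# Design matrix: intercept, pre-trend, level change, slope change
X = np.column_stack([np.ones(t.size), t, post, (t - t0) * post])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(f"pre-slope={beta[1]:.3f}, slope change={beta[3]:.3f}")
```

A statistically significant slope-change coefficient is what distinguishes "the trend bent at the intervention" from "the pre-existing trend simply continued."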
2. Changepoint Detection
- Algorithmically identifies structural breaks in time series
- Finding: Independent confirmation of 2017 and 2023 as inflection points
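The report does not specify which changepoint algorithm was used; a minimal single-break version scans every candidate split point and keeps the one that minimizes total within-segment squared error. A sketch on synthetic data:

```python
import numpy as np

def single_changepoint(y):
    """Return the index that best splits y into two constant-mean
    segments (minimum total within-segment squared error)."""
    best_k, best_cost = None, np.inf
    for k in range(2, len(y) - 2):        # require 2+ points per side
        left, right = y[:k], y[k:]
        cost = ((left - left.mean())**2).sum() + ((right - right.mean())**2).sum()
        if cost < best_cost:
            best_k, best_cost = k, cost
    return best_k

# Synthetic series with a level shift at index 40
rng = np.random.default_rng(1)
y = np.concatenate([rng.normal(30, 0.5, 40), rng.normal(34, 0.5, 30)])
print(single_changepoint(y))  # lands at or near index 40
```

Production libraries generalize this idea to multiple breaks with dynamic programming or pruning; the principle of finding structural breaks by cost minimization is the same.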
3. Location Quotients (LQ)
- Quantifies regional over/under-representation relative to population
- Finding: Precise multipliers showing Western regions 3-6× over-represented
4. Concentration Indices (HHI)
- Measures inequality in occupational and geographic coverage
- Finding: Geographic concentration quadrupled while occupational concentration improved slightly
5. Logistic Regression (Intersectional Analysis)
- Predicts biography presence based on gender × occupation × region
- Finding: Female European military subjects 10.5× less likely than male counterparts
6. Birth Cohort Analysis
- Compares gender gaps across generational cohorts
- Finding: 1990s cohort gap (47.4pp) = 1970s cohort gap (47.2pp), disproving "pipeline problem"
Wikipedia's most insidious bias isn't overt sexism — it's the claim of objectivity. By treating historical male dominance as neutral fact rather than the product of systematic exclusion, Wikipedia naturalizes gender inequality.
When notability criteria favor fields women were barred from entering, that's not neutral — that's laundering misogyny through bureaucratic process.
English Wikipedia's scale makes US cultural biases about whose lives matter into global defaults. America's unfinished reckoning with gender inequality doesn't just shape domestic coverage — it exports a template of chauvinism that marginalizes women worldwide.
If The New York Times or BBC don't cover someone, they likely won't meet notability criteria — regardless of their impact in their own country. This is cultural imperialism compounding gender bias.
1. Reform Notability Policies
- Recognize non-Western media as reliable sources
- Create exceptions for underrepresented regions
- Challenge the assumption that "historically male-dominated = inherently notable"
2. Target Frozen Fields
- Religion, Military, Business won't improve without active intervention
- Organize edit-a-thons specifically for these domains
- Challenge gatekeeping in WikiProjects
3. Address Intersectional Compounding
- Prioritize women from underrepresented regions
- Create mentorship programs for Global South editors
- Translate notable achievements from non-English sources
4. Reject the "Pipeline" Excuse
- When the youngest cohort shows the same 47pp gap as their parents' generation, the problem is current policy, not historical legacy
- Stop waiting for demographic change to solve editorial decisions
Immediate Actions:
- Host monthly Wikipedia edit-a-thons focused on underrepresented groups
- Create editor recruitment campaigns targeting women and Global South communities
- Build tools for bias detection in new articles
- Partner with universities to integrate Wikipedia editing into coursework
Medium-term Projects:
- Develop automated bias alerts that flag articles lacking diverse sourcing
- Create translation pipelines for non-English reliable sources
- Build dashboards tracking representation metrics in real-time
- Advocate to Wikimedia Foundation for policy changes based on this data
Long-term Vision:
- Expand analysis to other language editions
- Track AI training data sourced from Wikipedia
- Document how bias propagates from Wikipedia → AI systems → public knowledge
- Build coalition with other civic tech organizations addressing algorithmic bias
GitHub: hackforla/wikipedia-representation-gaps
Key Files:
- representation_gaps.md — Complete analytical report
- README.md — Project documentation and setup
- notebooks/ — Jupyter notebooks for analysis pipeline
- outputs/visualizations/ — All charts and graphics
| Notebook | Purpose |
|---|---|
| 00_project_setup.ipynb | Project initialization |
| 01_api_seed.ipynb | Pull biography page list from Wikipedia |
| 02_enrich_and_normalize.ipynb | Map to Wikidata, fetch attributes |
| 03_aggregate_and_qc.ipynb | Build monthly aggregates, quality checks |
| 04_visualization.ipynb | Create core visualizations |
| 05_statistical_analysis.ipynb | ITS, changepoints, LQ, HHI |
| 06_intersectional_analysis.ipynb | Logistic regression, odds ratios, cohorts |
| 07_dashboard.ipynb | Interactive dashboard |
- MediaWiki Action API — Page metadata and timestamps
- Wikidata API — Structured biographical data
- UN World Population Prospects — Population baselines for Location Quotients
- Metadata gaps: Gender and occupation tags incomplete, especially for non-Western subjects
- English-only scope: Analysis limited to English Wikipedia; other language editions may show different patterns
- Timestamp definition: "Creation year" refers to Wikidata item creation, usually aligned with article publication
- ITS analysis: Could not definitively prove magnitude of #MeToo effect (p > 0.05 for slope changes), though changepoint detection confirmed 2017 break
- LQ and HHI: Descriptive measures, do not establish causation
- Intersectional analysis: Focuses on gender × occupation × region; does not capture race, sexuality, disability
- Cohort analysis: Limited to 715,000 subjects with reliable birth year data (~66% of dataset)
- Dashboard reflects coverage, not reality: Wikipedia data shows what is written, not the real world
- Population baselines: Continental shares are approximations; do not adjust for internet access, literacy, age demographics
Project Team: Hack for LA Wikipedia Representation Gaps Initiative
Methods Inspiration:
- Interrupted time series analysis adapted from public health intervention studies
- Location Quotient methodology from economic geography literature
- Intersectional analysis frameworks from critical data studies
Data Sources:
- Wikimedia Foundation APIs (MediaWiki, Wikidata)
- UN World Population Prospects
When citing this work, please use:
Hack for LA. (2025). Wikipedia Representation Gaps Analysis (2015-2025):
Quantifying Systemic Bias in English Wikipedia Biographies.
GitHub: hackforla/wikipedia-representation-gaps
- Current Phase: Active development and monthly data refreshes
- Last Updated: November 2025
Next Milestones:
- Launch interactive public dashboard
- Publish findings in academic journals
- Present to Wikimedia Foundation
- Expand to multilingual analysis
"The future depends on what we do in the present." — Mahatma Gandhi
Let's make Wikipedia reflect the world, not just the privileged parts of it.
This is a living document. Last updated: November 2025