This repository contains data, analytic code, and findings that support portions of the BuzzFeed News article, "We Got Government Data On 20 Years Of Workplace Sexual Harassment Claims. These Charts Break It Down," published Dec. 5, 2017. Please read that article, which contains important context and details, before proceeding.
Anonymized data on sexual harassment charges filed with the U.S. Equal Employment Opportunity Commission (EEOC) were provided by a spokesperson from the commission.
The data include the following columns:
CHARGE_FILING_DATE
CP_SEX
CP_NATIONAL_ORIGIN
CP_DOB
HISPANIC_CP
CP_RACE_STRING
R_NAICS_CODE
R_NAICS_DESCRIPTION
R_NUMBER_OF_EMPLOYEES
R_TYPE
Regarding the data, the following notes were provided by an EEOC spokesperson:
CP_National_Origin: We greatly expanded the national origin options in 2008. Prior to that, this field will most likely be blank or “Other National Origin”.
Hispanic_CP: This field is populated with “Y” if the Charging Party has identified themselves as Hispanic but, like the National Origin, this field was added in 2008.
CP_Race_String: Charging Parties may select multiple races that they identify with. This string includes all selected races as a string of codes. See below for code decryption.
R_Type: This is the basic type of respondent (Private, State/Local Agency, School, etc.)
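Because the Hispanic field did not exist before 2008, a blank value in an older charge is better read as "unknown" than as "not Hispanic." The following is a minimal sketch of that distinction, not code from this repository. It assumes `CHARGE_FILING_DATE` is an ISO-formatted string (`YYYY-MM-DD`), which the source does not confirm, and the helper name `hispanic_flag` is hypothetical. The 2008 cutoff is applied by calendar year, which is an approximation since the notes do not say when in 2008 the field was introduced.

```python
from datetime import date

FIELD_ADDED_YEAR = 2008  # per the EEOC notes above


def hispanic_flag(row):
    """Return True/False for charges filed in 2008 or later, None before then."""
    # Assumption: the filing date is stored as an ISO string, e.g. "2012-07-02".
    filed = date.fromisoformat(row["CHARGE_FILING_DATE"])
    if filed.year < FIELD_ADDED_YEAR:
        return None  # the field did not exist yet, so a blank carries no information
    # From 2008 on, "Y" means the Charging Party identified as Hispanic.
    return row["HISPANIC_CP"].strip().upper() == "Y"


# Illustrative rows only, not real EEOC records.
rows = [
    {"CHARGE_FILING_DATE": "2005-03-14", "HISPANIC_CP": ""},
    {"CHARGE_FILING_DATE": "2012-07-02", "HISPANIC_CP": "Y"},
    {"CHARGE_FILING_DATE": "2012-07-02", "HISPANIC_CP": ""},
]
flags = [hispanic_flag(r) for r in rows]
print(flags)
```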
Race Codes:
- A — Asian or Pacific Islander - Obsolete
- B — Black or African American
- H — Native Hawaiian or Other Pacific Islander
- I — American Indian or Alaska Native
- N — Unable to Obtain Information from Charging Party
- O — Other Race - Obsolete
- S — Asian
- W — White
- Z — Charging Party Declined to Provide
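As an illustration of how the race codes above might be applied to `CP_RACE_STRING`, here is a small sketch. It is not taken from the repository's notebooks. It assumes each code is a single letter and simply skips any character that is not a known code, so it should tolerate delimiters such as commas or spaces if the raw data uses them; the function name `decode_races` is hypothetical.

```python
# Code-to-label mapping copied from the list above.
RACE_CODES = {
    "A": "Asian or Pacific Islander - Obsolete",
    "B": "Black or African American",
    "H": "Native Hawaiian or Other Pacific Islander",
    "I": "American Indian or Alaska Native",
    "N": "Unable to Obtain Information from Charging Party",
    "O": "Other Race - Obsolete",
    "S": "Asian",
    "W": "White",
    "Z": "Charging Party Declined to Provide",
}


def decode_races(race_string):
    """Translate a string of race codes into a list of labels."""
    # Treat a missing value as an empty string; ignore unknown characters.
    return [RACE_CODES[c] for c in (race_string or "").upper() if c in RACE_CODES]


print(decode_races("BW"))
```

A Charging Party who selected two races yields a two-item list, and an empty or missing string yields an empty list.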
The industry and sector metrics — on the total workforce, female workforce, and average hourly earnings — use seasonally-adjusted summary data from the Bureau of Labor Statistics (BLS).
No single BLS dataset contains those metrics for every industry and sector. The numbers were chiefly sourced from the Current Employment Statistics survey and Occupational Employment Statistics program, in that order of preference.
The Current Employment Statistics data can be accessed through the BLS data portal. The Occupational Employment Statistics data come from the "National industry-specific and by ownership" download, specifically natsector_M2016_dl.xlsx.
For one sector (agriculture), workforce gender was sourced from the Current Population Survey.
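The "order of preference" described above amounts to taking the first source that has a value for a given sector. The sketch below shows one way to express that fallback; it is an assumption about the logic, not the repository's actual code. The function name `first_available`, the sector keys, and all numbers are made-up placeholders rather than real BLS figures.

```python
def first_available(sector, sources):
    """Return (value, source_name) from the first source that covers the sector."""
    for name, table in sources:
        value = table.get(sector)
        if value is not None:
            return value, name
    return None, None  # no source had this sector


# Placeholder tables keyed by sector name; values are invented for illustration.
ces = {"Manufacturing": 100}
oes = {"Manufacturing": 90, "Utilities": 40}

# Listed in order of preference: CES first, then OES.
sources = [("CES", ces), ("OES", oes)]

print(first_available("Manufacturing", sources))
print(first_available("Utilities", sources))
```

Returning the source name alongside the value keeps a record of where each number came from, which is useful when no single dataset covers every sector.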
NAICS sector descriptions came from the Census Bureau, and were supplemented by this guide to "NAICS Supersectors" for the CES data.
This repository uses Python code to process the data. That code can be found in the following two notebooks:
- Merges the economic data described above, and combines it with the NAICS descriptions
- Merges that data with additional, manually curated information needed for the graphics
- Combines the three separate spreadsheets supplied by the EEOC into one data table
- Aggregates the claims by industry and sector
- Merges the industry aggregates with the economic data described above
- Calculates the gender distribution of EEOC claims, to support this passage: "Overall, 83% of the claims were filed by women, and 15% by men. The remainder did not specify a gender."
- Generates summary data for the article's graphics: d3_claims_by_industry.csv and d3_claims_by_sector.csv
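The claim-processing steps listed above (combining the EEOC tables, aggregating by sector, and computing the gender split) can be sketched roughly as follows. This is a standard-library illustration under stated assumptions, not the notebooks' implementation: the function names are hypothetical, `CP_SEX` is assumed to hold values such as "F" and "M" with blanks for unspecified, and the sample rows are invented. The sector is approximated by the first two digits of `R_NAICS_CODE`.

```python
from collections import Counter


def combine(tables):
    """Concatenate several lists of row dicts into a single list."""
    return [row for table in tables for row in table]


def claims_by_sector(rows):
    """Count claims per two-digit NAICS prefix; blank codes count as 'unknown'."""
    return Counter(str(row.get("R_NAICS_CODE") or "")[:2] or "unknown" for row in rows)


def gender_shares(rows):
    """Return each CP_SEX value's share of all claims, as a rounded percentage."""
    counts = Counter(row.get("CP_SEX") or "unspecified" for row in rows)
    total = sum(counts.values())
    return {sex: round(100 * n / total) for sex, n in counts.items()}


# Three invented tables standing in for the three EEOC spreadsheets.
part1 = [{"CP_SEX": "F", "R_NAICS_CODE": "722511"},
         {"CP_SEX": "F", "R_NAICS_CODE": "722513"}]
part2 = [{"CP_SEX": "M", "R_NAICS_CODE": "445110"}]
part3 = [{"CP_SEX": "", "R_NAICS_CODE": ""}]

rows = combine([part1, part2, part3])
sector_counts = claims_by_sector(rows)
shares = gender_shares(rows)
print(sector_counts)
print(shares)
```

Note that some NAICS sectors span more than one two-digit prefix (31-33 for manufacturing, 44-45 for retail trade, 48-49 for transportation and warehousing), so a real aggregation would need to map those prefixes to a single sector.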
Contact Lam Thuy Vo at lam.vo@buzzfeed.com.
Looking for more from BuzzFeed News? Click here for a list of our open-sourced projects, data, and code.