This repository curates quantitative transparency disclosures about the online sexual exploitation of minors, i.e., people under the age of eighteen, in machine-readable form. It also includes a 4,400-line Python library for validating and tidying the data and Python as well as R notebooks with the analysis for the corresponding report Putting the Count Back Into Accountability: An Audit of Social Media Transparency Disclosures, Focusing on Sexual Exploitation of Minors.
Please cite as: Robert Grimm. Diaphanous: Transparency Disclosures About the Sexual Exploitation of Minors. v0.1, Zenodo, 7 Oct. 2024, .
To run the code in this repository, you'll need the following tools:
- According to vermin, the minimum required Python version is 3.11.
- The analysis/platform.ipynb notebook is written in Python and R. The necessary bindings are provided by the rpy2 Python package. The package is installed like other Python packages, as described in the next bullet point, but it does require a working R installation (e.g., brew install r).
- Required Python packages are listed in the repository's pyproject.toml. The simplest way of installing the project's dependencies is to create a local clone of this repository and then install it as follows:

  ```
  $ python -m venv .venv        # Create virtual environment
  $ . .venv/bin/activate        # Activate virtual environment
  $ pip install -e .            # Install diaphanous as editable
  ```

  Thanks to the -e option, pip install creates a so-called editable install, i.e., it makes the Python code in the diaphanous package executable without copying it. It also installs all necessary dependencies.
Building the report requires additional tools, i.e., a working LaTeX installation, though the necessary incantations are scripted.
While a few CSV files contain tidy data, others are decidedly untidy with, for example, individual columns combining two variables. The organization of a dataset usually reflects that of the original disclosure, which helps ensure the correctness of data transcription. The Python package includes several examples of how to tidy up such data.
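As a minimal sketch of what such tidying can look like, the following example splits one combined column into two variables. The table, its column names, and its values are made up for illustration; they are not taken from this repository's CSV files, and the code is not the diaphanous package's own implementation.

```python
import csv
import io

# Hypothetical untidy input: the "attachment" column combines two variables,
# the kind of attachment and its uniqueness classification.
UNTIDY = """year,attachment,count
2022,photos/unique,10
2022,photos/similar,4
2022,videos/unique,7
"""


def tidy(text: str) -> list[dict[str, object]]:
    """Split the combined column so that every variable has its own column."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        kind, uniqueness = row["attachment"].split("/", maxsplit=1)
        rows.append(
            {
                "year": int(row["year"]),
                "kind": kind,
                "uniqueness": uniqueness,
                "count": int(row["count"]),
            }
        )
    return rows


tidy_rows = tidy(UNTIDY)
```

Keeping the untidy file as transcribed and tidying it in code preserves the one-to-one correspondence with the original disclosure.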
The CyberTipline reports per year dataset captures the number of reports NCMEC received on its CyberTipline since its inception in March 1998, largely based on the table included in Appendix A of its 2022 and 2023 transparency reports to the Office of Juvenile Justice and Delinquency Prevention at the Department of Justice.
Simon Kemp's Digital 2024: Global Overview Report includes statistics on the global number of social media user identities. They are an effective denominator for normalizing the CyberTipline reports per year.
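The normalization itself is simple arithmetic. The sketch below expresses it as reports per 1,000 user identities; the function name and the scaling factor are assumptions made for illustration, and the numbers are placeholders rather than figures from NCMEC or the Digital 2024 report.

```python
def reports_per_thousand_users(reports: int, user_identities: int) -> float:
    """Return CyberTipline reports per 1,000 social media user identities."""
    if user_identities <= 0:
        raise ValueError("user_identities must be positive")
    return 1000 * reports / user_identities


# Placeholder values only; substitute the yearly counts from the datasets.
example_rate = reports_per_thousand_users(reports=2_000, user_identities=4_000_000)
```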
The CyberTipline report contents and recipients dataset breaks down the reports NCMEC received by:
- the category of sexual exploitation, e.g., whether a report concerns child pornography, misleading words/images, online enticement, child sex trafficking, obscene material sent to a child, misleading domain names, child sexual molestation, or child sex tourism;
- the kind of attachments, e.g., photos, videos, or other;
- the uniqueness of attachments as determined by a precise hash (MD5) and a perceptual hash (PhotoDNA, Videntifier);
- their level of detail, i.e., whether they are actionable or only informational;
- their recipients in dedicated units, local, federal, or international law enforcement.
Labels for the uniqueness classification use "unique" for precisely hashed attachments and "similar" for perceptually hashed ones. The dataset combines several tables from NCMEC's 2022 and 2023 transparency reports to the Office of Juvenile Justice and Delinquency Prevention at the Department of Justice.
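To illustrate what uniqueness under a precise hash means, the following sketch counts distinct MD5 digests with Python's standard hashlib module. The helper names and the byte strings are hypothetical. A perceptual hash such as PhotoDNA is designed to also match visually similar files; that part is not reproduced here.

```python
import hashlib


def md5_digest(data: bytes) -> str:
    """Return the hexadecimal MD5 digest of the given bytes."""
    # usedforsecurity=False marks this as a non-cryptographic use of MD5.
    return hashlib.md5(data, usedforsecurity=False).hexdigest()


def count_unique(attachments: list[bytes]) -> int:
    """Count attachments that differ in at least one byte."""
    return len({md5_digest(attachment) for attachment in attachments})


original = b"example attachment bytes"
exact_copy = bytes(original)   # Byte-for-byte identical, hence same digest
altered = original + b"\x00"   # A single extra byte changes the digest

unique_count = count_unique([original, exact_copy, altered])
```

Because any change to the bytes yields a different digest, a precise hash only identifies exact duplicates, which is why the dataset distinguishes "unique" from "similar."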
The CyberTipline reports per platform dataset is the project's main dataset. It collects:
- disclosures about child sexual exploitation by major non-Chinese social networks and other large service providers;
- NCMEC's corresponding disclosures about service providers' reporting.
The above linked JSON format is automatically generated from a Python module. Both formats have the same structure and contain the same information.
The dataset incorporates information about these platforms:
- Amazon (owns Twitch)
- Apple
- Automattic (owns Tumblr and Wordpress)
- Aylo (née MindGeek)
- Discord
- Facebook (Meta)
- GitHub (Microsoft)
- Google (owns YouTube)
- Instagram (Meta)
- LinkedIn (Microsoft)
- Meta (owns Facebook, Instagram, and WhatsApp)
- Microsoft (owns GitHub and LinkedIn)
- MindGeek (now Aylo)
- Omegle
- Pornhub (Aylo)
- Quora
- Snap
- Telegram
- TikTok
- Tumblr (Automattic)
- Twitch (Amazon)
- Twitter (now X)
- WhatsApp (Meta)
- Wikimedia
- Wordpress (Automattic)
- X (née Twitter)
- YouTube (Google)
Surveyed organizations fall into at least one of the following categories:
- Social media based on Buffer's list of top social media sites,
- Popular platforms based on the European Commission's list of very large online platforms,
- Platforms with considerable reported child sexual exploitation activity based on NCMEC's transparency disclosures.
A separate codebook documents the JSON and Python formats. Basically, they consist of a top-level object that maps organization names to an object with the data about that organization. Since platforms vary widely in what metrics they disclose, the format necessarily is rather generic and collects all of a platform's quantitative disclosures within one table:
- Since platforms make transparency disclosures for quarter, half, and full years, each table also organizes metrics into time periods with the same granularity.
- To faithfully capture disclosures, time periods may vary within a table. They may also overlap, both to capture several partial disclosures and to capture several redundant disclosures. A flag clearly marks the latter entries.
- Where possible, the table uses standard labels for equivalent metrics:
  - reports tallies CyberTipline reports to NCMEC;
  - pieces tallies instances of CSAM such as pictures and videos;
  - accounts tallies user registrations implicated and terminated for CSAM.

  Instead of "account termination," many platforms use a euphemism such as "permanent suspension." User registrations affected in this way are included under accounts. However, temporarily affected registrations are not.
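Working with quarter, half, and full-year periods that may overlap can be sketched as follows. The label syntax used here ("2022", "2022 H1", "2022 Q3") is an assumption made for illustration; the codebook defines the actual encoding, and these helper functions are hypothetical rather than part of the diaphanous package.

```python
from datetime import date, timedelta

# Number of months covered by each hypothetical period kind.
MONTHS_PER_KIND = {"Q": 3, "H": 6}


def period_bounds(label: str) -> tuple[date, date]:
    """Convert a label such as "2022", "2022 H1", or "2022 Q3" to first and last day."""
    parts = label.split()
    year = int(parts[0])
    if len(parts) == 1:
        first_month, months = 1, 12
    else:
        kind, index = parts[1][0], int(parts[1][1:])
        months = MONTHS_PER_KIND[kind]
        first_month = (index - 1) * months + 1
    start = date(year, first_month, 1)
    # The period ends the day before the first day of the following period.
    next_month = first_month + months
    if next_month > 12:
        end_exclusive = date(year + 1, 1, 1)
    else:
        end_exclusive = date(year, next_month, 1)
    return start, end_exclusive - timedelta(days=1)


def overlaps(a: str, b: str) -> bool:
    """Check whether two periods share at least one day."""
    a_start, a_end = period_bounds(a)
    b_start, b_end = period_bounds(b)
    return a_start <= b_end and b_start <= a_end
```

For example, `overlaps("2022 H1", "2022 Q2")` holds whereas `overlaps("2022 Q1", "2022 Q2")` does not, so a sum over a table's rows needs to skip entries that cover the same days twice.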
Comparable CyberTipline report counts and per-provider comparable CyberTipline report counts are materialized views onto the same data. Both views are in long format and only include rows for counts that were disclosed by both the electronic service provider and NCMEC.
The latter, more precise view has year, observer, count, and topic columns, with the topic column enabling the grouping of rows with service provider and NCMEC as observers. The former, simplified view has only id, observer, and count columns, with the ID column effectively combining the other view's year and topic columns and the observer column only distinguishing between a generic ServiceProvider and NCMEC.
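The relationship between the two views can be sketched as a simple projection. The exact format of the id column is not specified above, so joining topic and year with a space is an assumption, as are the example rows, whose names and counts are placeholders.

```python
def simplify(rows: list[dict[str, object]]) -> list[dict[str, object]]:
    """Project (year, observer, count, topic) rows onto (id, observer, count) rows."""
    simplified = []
    for row in rows:
        # Every observer other than NCMEC becomes the generic ServiceProvider.
        observer = "NCMEC" if row["observer"] == "NCMEC" else "ServiceProvider"
        simplified.append(
            {
                # Assumed id format: topic and year joined by a space.
                "id": f'{row["topic"]} {row["year"]}',
                "observer": observer,
                "count": row["count"],
            }
        )
    return simplified


# Placeholder rows for a made-up provider; the counts are not real data.
precise = [
    {"year": 2022, "observer": "ExampleNet", "count": 120, "topic": "ExampleNet"},
    {"year": 2022, "observer": "NCMEC", "count": 118, "topic": "ExampleNet"},
]
simple = simplify(precise)
```

Rows sharing the same id then form a pair of counts for the same provider and year, one per observer.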
CyberTipline reports per country collects NCMEC's per-country breakdown of CyberTipline reports for 2019, 2020, 2021, 2022, and 2023 in machine-readable form. The CSV table is mostly straightforward: its first two columns contain the country name and ISO three-letter code, followed by one column per year from 2019 through 2023.
To preserve all information from NCMEC's disclosures, the table includes rows for the Netherlands Antilles (ANT), "Europe" (EEE), Bouvet Island (BVT), and "No Country Listed" (no code). NCMEC does not explain its inclusion of Europe in addition to individual European countries, nor that of the Netherlands Antilles in addition to its 2010 successors Bonaire, Sint Eustatius, and Saba (BES), Curaçao (CUW), and Sint Maarten (SXM). Nor does it explain the inclusion of Bouvet Island; the subantarctic dependency of Norway is an uninhabited nature reserve and hence rather unlikely to serve as the actual location of internet users.
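Analyses that aggregate by country may want to set these rows aside. The following sketch partitions rows accordingly; the "iso3" key, the row layout, and the counts are hypothetical and do not reflect the CSV file's actual header or contents.

```python
# Codes for the entries singled out above; an empty code marks "No Country Listed".
SPECIAL_CODES = {"ANT", "EEE", "BVT"}


def is_special(row: dict[str, str]) -> bool:
    """Identify rows that do not correspond to a regular, current country."""
    code = row["iso3"].strip()
    return code == "" or code in SPECIAL_CODES


def partition(rows: list[dict[str, str]]) -> tuple[list[dict[str, str]], list[dict[str, str]]]:
    """Split rows into regular countries and special entries."""
    regular = [row for row in rows if not is_special(row)]
    special = [row for row in rows if is_special(row)]
    return regular, special


# Placeholder rows; the counts are not real data.
sample = [
    {"country": "Norway", "iso3": "NOR", "2022": "1"},
    {"country": "Europe", "iso3": "EEE", "2022": "2"},
    {"country": "No Country Listed", "iso3": "", "2022": "3"},
]
regular_rows, special_rows = partition(sample)
```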
This repository's Python package includes code that enriches this dataset with population counts, geometries, and region/continent information. It leverages the following data:
- Per-country population counts by the United Nations Population Division;
- Per-country internet user counts prepared by Our World in Data from statistics released by the International Telecommunication Union via the World Bank as well as the United Nations;
- Administrative boundaries for countries by Natural Earth, version 5.1.1;
- Per-country ISO 3166 Alpha-2 and Alpha-3 codes scraped from ISO's website and corresponding region names based on Luke Duncalfe's ISO-3166 dataset.
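At its core, the enrichment with population counts amounts to a join on the ISO Alpha-3 code. The sketch below shows the idea with plain dictionaries; it is not the package's implementation, and the codes and numbers are placeholders rather than real data.

```python
def reports_per_capita(
    reports: dict[str, int], population: dict[str, int]
) -> dict[str, float]:
    """Join report counts with population counts on the ISO Alpha-3 code."""
    result = {}
    for code, count in reports.items():
        people = population.get(code)
        if not people:
            # Skip codes without a population entry or with zero population.
            continue
        result[code] = count / people
    return result


# Placeholder codes and counts for illustration only.
example_reports = {"XXA": 50, "XXB": 10, "XXC": 5}
example_population = {"XXA": 1000, "XXB": 0}
per_capita = reports_per_capita(example_reports, example_population)
```

Skipping rows without a usable population count avoids division by zero for entries such as uninhabited territories or rows that are not countries.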
The following choropleths using the Equal Earth projection visualize CyberTipline reports per year per country per capita:
Discord, Meta, Microsoft, and TikTok have released (some) data in machine-readable form. This dataset contains the corresponding files. Discord's and Meta's data is in CSV format, Microsoft's in Excel format, and TikTok's first in Excel and later in CSV format. Meta's and TikTok's files include historical data whereas Discord's and Microsoft's do not. Since Meta re-uses the same URL every quarter, files released before Q2 2022 were retrieved from the Internet Archive's snapshots.
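For readers who want to locate such snapshots themselves, the Wayback Machine addresses a capture by prefixing the original URL with a timestamp and redirects to the closest available capture. The sketch below only builds such an address; the example URL is a placeholder, not the location of Meta's file, and this is not necessarily how the files in this dataset were retrieved.

```python
WAYBACK_PREFIX = "https://web.archive.org/web"


def wayback_url(original_url: str, timestamp: str) -> str:
    """Build a Wayback Machine address for a capture near the given timestamp.

    The timestamp has the form YYYYMMDDhhmmss; a shorter prefix such as
    YYYYMMDD is accepted as well.
    """
    if not timestamp.isdigit() or not 4 <= len(timestamp) <= 14:
        raise ValueError("timestamp must consist of 4 to 14 digits")
    return f"{WAYBACK_PREFIX}/{timestamp}/{original_url}"


# Placeholder URL for illustration only.
snapshot = wayback_url("https://example.com/data.csv", "20220401")
```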
The CSAM pieces by relationship to victim dataset captures the relationship between suspected offenders and victims as determined by law enforcement agencies and tabulated by NCMEC. It is included in NCMEC's 2022 and 2023 transparency reports to the Office of Juvenile Justice and Delinquency Prevention at the Department of Justice.
Since the number of victims in NCMEC's database seems to be very small, I pulled in two more datasets that also characterize relationships. The first stems from OJJDP's Statistical Briefing Book and covers the years 2018 and 2019. The data was originally extracted from the FBI's National Incident-Based Reporting System Master Files. Note that all counts are relative to "typical 1,000 sexual assaults." The second stems from LEARCAT and covers the year 2016. It also draws on the FBI's National Incident-Based Reporting System. While the Briefing Book data is indeed helpful, the choice of relationship bins for the LEARCAT data renders it close to useless in this context.
The data directory contains a few more tables, including one with global population sizes also provided by the UN Population Division and one with Meta's daily and monthly active people, which captures the number of users who logged into Facebook, Instagram, Messenger, or WhatsApp at least once over a day or month. Both tables are used to calculate Meta's daily and monthly active people as a fraction of the world population.
In addition to the data, this repository also contains the Python code for analyzing it as well as resulting figures. In particular:
- The analysis directory contains notebooks with the high-level analysis code. The index.ipynb notebook includes almost all other notebooks.
- The diaphanous directory contains the Python library code used by the notebooks.
  - The remaining code in diaphanous.main should be refactored into notebooks.
  - The show() function in diaphanous.show is more generally useful. Most of this functionality should be up-streamed to Pandas because it significantly improves on the default table format.
- The figure directory contains SVG figures.
- The stubs directory contains typing stubs.
- The report directory contains the LaTeX sources for the article discussing the work.
- CSAM: Child Sexual Abuse Material
- CSE: Child Sexual Exploitation
- NCMEC: National Center for Missing and Exploited Children
- OCSE: Online Child Sexual Exploitation
- OJJDP: Office of Juvenile Justice and Delinquency Prevention (at the US Department of Justice)
The code in this repository is ©️ 2023–2024 by Robert Grimm and has been released under the Apache 2.0 open source license. The datasets in this repository combine disclosures by electronic service providers as well as the National Center for Missing and Exploited Children (NCMEC) and make this data more easily accessible in machine-readable form. The datasets have been released under the CC BY 4.0 license.