download every wikipedia edit
This script downloads every revision to wikipedia. It can output the revisions into a sqlalchemy-supported database or into a single bz2-zipped csv file.
Revisions are output with the following fields:
id
: the revision id, numeric

parent_id
: parent revision id (if it exists), numeric

page_id
: id of page, numeric

page_title
: name of page

page_ns
: page namespace, numeric

timestamp
: timestamp with timezone

contributor_id
: id of contributor, numeric

contributor_name
: screen name of contributor

contributor_ip
: ip address of contributor

text
: complete article text after the revision is applied, string in wikipedia markdown

comment
: comment
The id, timestamp, page_id, page_title, and page_ns cannot be null. All other fields may be null.
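For reference, one revision record could be modeled in Python roughly as follows. This is only an illustrative sketch (the package does not export such a class); the field names and nullability follow the list above.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class Revision:
    """One revision record, mirroring the fields listed above."""

    id: int              # revision id, never null
    timestamp: datetime  # timestamp with timezone, never null
    page_id: int         # never null
    page_title: str      # never null
    page_ns: int         # page namespace, never null
    parent_id: Optional[int] = None         # parent revision id, if any
    contributor_id: Optional[int] = None
    contributor_name: Optional[str] = None
    contributor_ip: Optional[str] = None
    text: Optional[str] = None              # full article text in wikipedia markup
    comment: Optional[str] = None           # edit comment
```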
System requirements:
- 2gb memory (more is better; requirement varies widely based on configuration)
- python 3 & pip pre-installed
- large storage (requirement varies widely based on configuration)
The storage needs for the output vary by format. Writing to a postgres database will require tens of terabytes of storage, while writing to a single .csv.bz2 will (at the time of writing) require less than 10 terabytes.
I wrote a blog post on some of the project goals and technical choices.
Installation requires pip, the python package manager. Installation works with either CPython or PyPy.
CPython is the standard Python distribution.
python3 -m pip install git+https://github.com/DominicBurkart/wikipedia-revisions.git
PyPy is an alternate, faster Python distribution.
pypy3 -m pip install git+https://github.com/DominicBurkart/wikipedia-revisions.git
Use `--help` to see the available options:
python3 -m wikipedia_revisions.download --help
Note: if using PyPy, you can just substitute `pypy3` for `python3` in any of these commands, for example:
pypy3 -m wikipedia_revisions.download --help
Output all revisions into a giant bz2-zipped csv, using the dump from a specific date:
python3 -u -m wikipedia_revisions.download --date 20200101
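Once the run finishes, the archive can be streamed without decompressing it to disk. A minimal sketch for inspecting the first few rows; the filename `revisions.csv.bz2` is a placeholder (use whatever file the run produced), and the column layout is assumed to follow the field list at the top of this README:

```python
import bz2
import csv
import itertools

# Article texts can exceed the csv module's default field size limit.
csv.field_size_limit(1 << 30)

# "revisions.csv.bz2" is a placeholder; point this at the archive the run
# produced. bz2.open streams the file, so nothing is decompressed to disk
# and memory use stays small.
with bz2.open("revisions.csv.bz2", mode="rt", newline="") as f:
    for row in itertools.islice(csv.reader(f), 5):
        print(row)  # each row is one revision
```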
Output to a series of named pipes (posix-based systems only):
python3 -u -m wikipedia_revisions.download --date 20200101 --pipe-dir /path/to/dir
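Each pipe can then be consumed by a separate reader process. The sketch below is illustrative only: the pipe path is a placeholder, and it assumes the pipes carry the same bz2-zipped CSV as the single-file output; drop the bz2 layer if your pipes turn out to carry plain CSV.

```python
import bz2
import csv

csv.field_size_limit(1 << 30)  # revision texts can be very large

# Placeholder path: use one of the FIFOs created under the directory passed
# to --pipe-dir. Opening a FIFO for reading blocks until the writer opens
# its end, so start the download command first.
pipe_path = "/path/to/dir/some_pipe"

with bz2.open(pipe_path, mode="rt", newline="") as stream:
    for row in csv.reader(stream):
        ...  # handle one revision at a time as it arrives
```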
Output to a postgres database named "wikipedia_revisions" waiting at localhost port 5432:
python3 -u -m wikipedia_revisions.download --date 20200101 --database
To set the database url:
python3 -u -m wikipedia_revisions.download --date 20200101 --database --database-url postgres://postgres@localhost:5432/wikipedia_revisions
Note: If using PyPy to write to a database, currently only postgres is supported. With PyPy, any custom database url must point to a postgres database and start with `postgresql+psycopg2cffi`, as in `postgresql+psycopg2cffi:///wikipedia_revisions`.
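After the import finishes, the output database can be examined with SQLAlchemy. A minimal sketch, assuming the default connection settings from the example above; note that recent SQLAlchemy versions expect the `postgresql://` scheme rather than `postgres://`. The sketch just lists whatever tables the run created rather than assuming their names:

```python
from sqlalchemy import create_engine, inspect

# Connection settings follow the --database-url example above; adjust them
# to your own setup.
engine = create_engine("postgresql://postgres@localhost:5432/wikipedia_revisions")

# List the tables the run created and their columns.
inspector = inspect(engine)
for table_name in inspector.get_table_names():
    columns = [column["name"] for column in inspector.get_columns(table_name)]
    print(table_name, columns)
```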
The above information is sufficient for you to run the program. The information below is useful for tuning performance.
- if you're using an SSD, set `--num-subprocesses` to a higher number (e.g. the number of CPU cores).
- this program is I/O heavy and relies on the OS's page cache. Having a few gigabytes of free memory for the cache to use will improve I/O throughput.
- using an SSD provides substantial benefits for this program, by increasing I/O speed and eliminating needle-moving cost.
- if writing to a database stored on an external drive, run the program in a directory on a different drive than the database (and ideally the OS). The wikidump files are downloaded into the current directory, so putting them on a different disk than the output database avoids throughput and needle-moving issues.