download every wikipedia edit
This script downloads every revision to wikipedia. It can output the revisions into a sqlalchemy-supported database or into a single bz2-zipped csv file.
Revisions are output with the following fields:
id
: the revision id, numeric

parent_id
: parent revision id (if it exists), numeric

page_id
: id of page, numeric

page_title
: name of page

page_ns
: page namespace, numeric

timestamp
: timestamp with timezone

contributor_id
: id of contributor, numeric

contributor_name
: screen name of contributor

contributor_ip
: ip address of contributor

text
: complete article text after the revision is applied, string in wikipedia markdown

comment
: comment
The id, timestamp, page_id, page_title, and page_ns cannot be null. All other fields may be null.
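For reference, one revision record could be modeled in Python roughly as follows. This is only an illustrative sketch (the package does not export such a class); the field names and nullability follow the list above.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class Revision:
    """One revision record, mirroring the fields listed above."""

    id: int              # revision id, never null
    timestamp: datetime  # timestamp with timezone, never null
    page_id: int         # never null
    page_title: str      # never null
    page_ns: int         # page namespace, never null
    parent_id: Optional[int] = None         # parent revision id, if any
    contributor_id: Optional[int] = None
    contributor_name: Optional[str] = None
    contributor_ip: Optional[str] = None
    text: Optional[str] = None              # full article text in wikipedia markup
    comment: Optional[str] = None           # edit comment
```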
System requirements:
- 2gb memory (more is better; requirement varies widely based on configuration)
- python 3 & pip pre-installed
- large storage (requirement varies widely based on configuration)
The storage needs for the output vary by format. Writing to a postgres database will require tens of terabytes of storage, while writing to a single .csv.bz2 will (at the time of writing) require less than 10 terabytes.
I wrote a blog post on some of the project goals and technical choices.
Installation requires pip, the python package manager. Installation works with either CPython or PyPy.
CPython is the standard Python distribution.
python3 -m pip install git+https://github.com/DominicBurkart/wikipedia-revisions.git
PyPy is an alternate, faster Python distribution.
pypy3 -m pip install git+https://github.com/DominicBurkart/wikipedia-revisions.git
Use `--help` to see the available options:
python3 -m wikipedia_revisions.download --help
Note: if using PyPy, you can just substitute `pypy3` for `python3` in any of these commands, for example:
pypy3 -m wikipedia_revisions.download --help
Output all revisions into a giant bz2-zipped csv, using the dump from a specific date:
python3 -u -m wikipedia_revisions.download --date 20200101
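Once the run finishes, the archive can be streamed without decompressing it to disk. A minimal sketch for inspecting the first few rows; the filename `revisions.csv.bz2` is a placeholder (use whatever file the run produced), and the column layout is assumed to follow the field list at the top of this README:

```python
import bz2
import csv
import itertools

# Article texts can exceed the csv module's default field size limit.
csv.field_size_limit(1 << 30)

# "revisions.csv.bz2" is a placeholder; point this at the archive the run
# produced. bz2.open streams the file, so nothing is decompressed to disk
# and memory use stays small.
with bz2.open("revisions.csv.bz2", mode="rt", newline="") as f:
    for row in itertools.islice(csv.reader(f), 5):
        print(row)  # each row is one revision
```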
Output to a series of named pipes (posix-based systems only):
python3 -u -m wikipedia_revisions.download --date 20200101 --pipe-dir /path/to/dir
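Each pipe can then be consumed by a separate reader process. The sketch below is illustrative only: the pipe path is a placeholder, and it assumes the pipes carry the same bz2-zipped CSV as the single-file output; drop the bz2 layer if your pipes turn out to carry plain CSV.

```python
import bz2
import csv

csv.field_size_limit(1 << 30)  # revision texts can be very large

# Placeholder path: use one of the FIFOs created under the directory passed
# to --pipe-dir. Opening a FIFO for reading blocks until the writer opens
# its end, so start the download command first.
pipe_path = "/path/to/dir/some_pipe"

with bz2.open(pipe_path, mode="rt", newline="") as stream:
    for row in csv.reader(stream):
        ...  # handle one revision at a time as it arrives
```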
Output to a postgres database named "wikipedia_revisions" waiting at localhost port 5432:
python3 -u -m wikipedia_revisions.download --date 20200101 --database
To set the database url:
python3 -u -m wikipedia_revisions.download --date 20200101 --database --database-url postgres://postgres@localhost:5432/wikipedia_revisions
Note: If using PyPy to write to a database, currently only postgres is supported. With PyPy, any custom database url must point to a postgres database and start with `postgresql+psycopg2cffi`, as in `postgresql+psycopg2cffi:///wikipedia_revisions`.
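After the import finishes, the output database can be examined with SQLAlchemy. A minimal sketch, assuming the default connection settings from the example above; note that recent SQLAlchemy versions expect the `postgresql://` scheme rather than `postgres://`. The sketch just lists whatever tables the run created rather than assuming their names:

```python
from sqlalchemy import create_engine, inspect

# Connection settings follow the --database-url example above; adjust them
# to your own setup.
engine = create_engine("postgresql://postgres@localhost:5432/wikipedia_revisions")

# List the tables the run created and their columns.
inspector = inspect(engine)
for table_name in inspector.get_table_names():
    columns = [column["name"] for column in inspector.get_columns(table_name)]
    print(table_name, columns)
```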
The above information is sufficient for you to run the program. The information below is useful for tuning performance.
- if you're using an SSD, set `--num-subprocesses` to a higher number (e.g. the number of CPU cores).
- this program is I/O heavy and relies on the OS's page cache. Having a few gigabytes of free memory for the cache to use will improve I/O throughput.
- using an SSD provides substantial benefits for this program, by increasing I/O speed and eliminating needle-moving cost.
- if writing to a database stored on an external drive, run the program in a directory on a different drive than the database (and ideally the OS). The wikidump files are downloaded into the current directory, so putting them on a different disk than the output database avoids throughput and needle-moving issues.