DominicBurkart/wikipedia-revisions

wikipedia-revisions

download every wikipedia edit

This script downloads every revision to Wikipedia. It can output the revisions into a SQLAlchemy-supported database or into a single bz2-zipped csv file.

Revisions are output with the following fields:

  • id: the revision id, numeric
  • parent_id: parent revision id (if it exists), numeric
  • page_id: id of page, numeric
  • page_title: name of page
  • page_ns: page namespace, numeric
  • timestamp: timestamp with timezone
  • contributor_id: id of contributor, numeric
  • contributor_name: screen name of contributor
  • contributor_ip: ip address of contributor
  • text: complete article text after the revision is applied, string in wikipedia markup (wikitext)
  • comment: comment

The id, timestamp, page_id, page_title, and page_ns cannot be null. All other fields may be null.
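The field list and null rules above can be sketched as a small Python record type. This is only an illustration: the class name Revision and the validation in __post_init__ are hypothetical and are not taken from the project's code.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class Revision:
    """Hypothetical record type mirroring the documented output fields."""

    # Fields documented as never null.
    id: int
    page_id: int
    page_title: str
    page_ns: int
    timestamp: datetime
    # Fields documented as nullable.
    parent_id: Optional[int] = None
    contributor_id: Optional[int] = None
    contributor_name: Optional[str] = None
    contributor_ip: Optional[str] = None
    text: Optional[str] = None
    comment: Optional[str] = None

    def __post_init__(self) -> None:
        # Enforce the documented rule that these five fields cannot be null.
        for name in ("id", "page_id", "page_title", "page_ns", "timestamp"):
            if getattr(self, name) is None:
                raise ValueError(f"{name} cannot be null")


# Example values are made up for illustration.
example = Revision(
    id=1,
    page_id=2,
    page_title="Example",
    page_ns=0,
    timestamp=datetime(2020, 1, 1, tzinfo=timezone.utc),
)
```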

System requirements:

  • 2gb memory (more is better; requirement varies widely based on configuration)
  • python 3 & pip pre-installed
  • large storage (requirement varies widely based on configuration)

The storage needs for the output vary by format. Writing to a postgres database will require tens of terabytes of storage, while writing to a single .csv.bzip2 will (at the time of writing) require less than 10 terabytes.

I wrote a blog post on some of the project goals and technical choices.

Install

Installation requires pip, the python package manager. Installation works with either CPython or PyPy.

CPython

CPython is the standard Python distribution.

python3 -m pip install git+https://github.com/DominicBurkart/wikipedia-revisions.git

PyPy

PyPy is an alternate, faster Python distribution.

pypy3 -m pip install git+https://github.com/DominicBurkart/wikipedia-revisions.git

Use

Use --help to see the available options:

python3 -m wikipedia_revisions.download --help

Note: if using PyPy, you can just substitute pypy3 for python3 for any of these commands, for example:

pypy3 -m wikipedia_revisions.download --help

Output all revisions into a giant bz2-zipped csv, using the dump from a specific date:

python3 -u -m wikipedia_revisions.download --date 20200101
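
Because the compressed csv is far too large to load at once, it is best read as a stream. The following is a minimal sketch using only the standard library. It assumes the file is UTF-8 encoded and starts with a header row naming the fields listed above; neither is confirmed here, and the function name iter_revisions and any path passed to it are hypothetical.

```python
import bz2
import csv

# The text field can hold a whole article, which exceeds the csv module's
# default per-field size limit, so raise the limit before reading.
csv.field_size_limit(2**31 - 1)


def iter_revisions(path):
    """Yield one dict per revision without decompressing the file to disk."""
    # newline="" lets the csv module handle newlines embedded in quoted fields.
    with bz2.open(path, mode="rt", encoding="utf-8", newline="") as f:
        # Assumption: the first row is a header with the field names.
        yield from csv.DictReader(f)
```

All values come back as strings (or empty strings), so numeric fields such as id and page_id would need converting by the caller.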

Output to a series of named pipes (posix-based systems only):

python3 -u -m wikipedia_revisions.download --date 20200101 --pipe-dir /path/to/dir

Output to postgres database named "wikipedia_revisions" waiting at localhost port 5432:

python3 -u -m wikipedia_revisions.download --date 20200101 --database

To set the database url:

python3 -u -m wikipedia_revisions.download --date 20200101 --database --database-url postgres://postgres@localhost:5432/wikipedia_revisions

Note: If using PyPy to write to a database, currently only postgres is supported. With PyPy, any custom database url must point to a postgres database and start with postgresql+psycopg2cffi, as in postgresql+psycopg2cffi:///wikipedia_revisions.
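
As an illustration of the PyPy rule above, a postgres url can be rewritten to the required prefix with plain string handling. The helper name to_pypy_url is hypothetical and is not part of this project.

```python
PYPY_SCHEME = "postgresql+psycopg2cffi"


def to_pypy_url(url: str) -> str:
    """Return the url with the driver prefix that PyPy requires."""
    scheme, sep, rest = url.partition("://")
    if not sep:
        raise ValueError(f"not a database url: {url!r}")
    # Ignore any existing "+driver" suffix when checking the dialect.
    dialect = scheme.split("+", 1)[0]
    if dialect not in ("postgres", "postgresql"):
        raise ValueError("with PyPy, only postgres databases are supported")
    return f"{PYPY_SCHEME}://{rest}"


print(to_pypy_url("postgres://postgres@localhost:5432/wikipedia_revisions"))
```

The result can then be passed to --database-url when running under pypy3.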

Configuration Notes

The above information is sufficient for you to run the program. The information below is useful for tuning performance.

  • if you're using an SSD, set --num-subprocesses to a higher number (e.g. the number of CPU cores).
  • this program is I/O heavy and relies on the OS's page cache. Having a few gigabytes of free memory for the cache to use will improve I/O throughput.
  • using an SSD provides substantial benefits for this program, by increasing I/O speed and eliminating the seek (needle-moving) cost of a spinning disk.
  • if writing to a database stored on an external drive, run the program in a directory on a different drive than the database (and ideally the OS). The wikidump is downloaded into the current directory, so putting it on a different disk than the output database avoids throughput and needle-moving issues.
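
To follow the SSD suggestion above, the subprocess count can be derived from the machine's core count. This is a rough sketch: the helper build_command is hypothetical, and it assumes --num-subprocesses takes a single integer value; check --help for the actual form.

```python
import os
import sys


def build_command(date, num_subprocesses=None):
    """Build the download command, defaulting to one subprocess per CPU core."""
    if num_subprocesses is None:
        # os.cpu_count() can return None, so fall back to 1.
        num_subprocesses = os.cpu_count() or 1
    return [
        sys.executable, "-u", "-m", "wikipedia_revisions.download",
        "--date", date,
        "--num-subprocesses", str(num_subprocesses),
    ]


print(" ".join(build_command("20200101")))
```

The returned list can be handed to subprocess.run, with the working directory chosen according to the disk-placement advice above.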
