Auto merge of #1800 - smarnach:dump-db, r=carols10cents
Prototype: Public database dumps

This is an unfinished prototype implementation of the design I proposed to implement #630 (see #630 (comment) and #630 (comment)). I am submitting this for review to gather feedback on the basic approach before spending more time on it.

This PR adds a background task to create a database dump. The task can be triggered with the `enqueue-job` binary, so it is easy to schedule in production using Heroku Scheduler.

### Testing instructions

To create a dump:

1. Start the background worker:

   ```
   cargo run --bin background-worker
   ```

2. Trigger a database dump:

   ```
   cargo run --bin enqueue-job dump_db
   ```

The resulting tarball can be found in `./local_uploads/db-dump.tar.gz`.

To re-import the dump:

1. Unpack the tarball:

   ```
   tar xzf local_uploads/db-dump.tar.gz
   ```

2. Create a new database:

   ```
   createdb test_import_dump
   ```

3. Run the Diesel migrations for the new DB:

   ```
   diesel migration run --database-url=postgres:///test_import_dump
   ```

4. Import the dump:

   ```
   cd DUMP_DIRECTORY
   psql test_import_dump < import.sql
   ```

(Depending on your local PostgreSQL setup, in particular the permissions for your user account, you may need different commands and URIs than given above.)

### Author's notes

* The background task executes `psql` in a subprocess to actually create the dump. One reason for this approach is its simplicity: the `\copy` convenience command issues a suitable `COPY TO STDOUT` SQL command and streams the result directly to a local file. Another reason is that I couldn't figure out how to do this at all in Rust with a Diesel `PgConnection`; there doesn't seem to be a way to run raw SQL with full access to the result. (A rough sketch of the subprocess approach follows the checklist below.)
* The unit test that verifies the column visibility information in `dump_db.toml` is up to date compares the information in that file to the current schema of the test database. Diesel does not provide any schema reflection functionality, so we query the actual database instead. (A sketch of such an introspection query also follows below.) This test may spuriously fail or succeed locally if you still have migrations from unmerged branches applied to your test database. On Travis this shouldn't be a problem, since I believe we always start with a fresh database there. (My preferred solution for this problem would be for Diesel to provide some way to introspect the information in `schema.rs`.)

### Remaining work

* [x] Address TODOs in the source code. The most significant one is to update the `Uploader` interface to accept streamed data instead of a `Vec<u8>`; currently the whole database dump needs to be loaded into memory at once.
* ~~Record the URLs of uploaded tarballs in the database, and provide an API endpoint to download them.~~ Decided to only store the latest dump at a known URL.
* [x] Devise a scheme for cleaning up old dumps from S3. The easiest option is to only keep the latest dump.
* [x] Somewhere in the tar file, note the date and time the dump was generated.
* [x] Verify that `dump-db.toml` is correct, i.e. that we don't leak any data we don't want to leak. Done via manual inspection. ~~One idea is to reconstruct dumps from the information available via the API and compare them to a test dump in the staging environment. This way we could verify what additional information will be made public.~~
* [x] The code needs some form of integration test. Idea from #1629: export some data, then try to re-import it into a clean database.
* [x] Implement and document a way of re-importing the dumps to the database, e.g. to allow local testing of crates.io with realistic data.
* [x] Rebase and remove commits containing the first implementation.
* [x] Document the existence of this dump, how often it is regenerated, and that only the most recent dump is available (maybe in the crawler policy/crawler-blocked error message?).
* [x] Include the commit hash of the crates.io version that created the dump in the tarball.
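For illustration, here is a minimal sketch of the `psql`-subprocess approach described in the notes above. The `dump_table` helper, its signature, and the CSV options are hypothetical and not part of this PR; the actual task derives its export script from the visibility configuration, and whether it feeds `psql` via `-c` or a script file is an implementation detail. The sketch only assumes `psql` is on the `PATH`.

```rust
use std::process::Command;

/// Hypothetical helper: export one table to a CSV file by running a `\copy`
/// meta-command in a `psql` subprocess. `\copy` issues `COPY ... TO STDOUT`
/// on the server and streams the rows into the local file, so the dump never
/// has to fit into this process's memory.
fn dump_table(
    database_url: &str,
    table: &str,
    columns: &[&str],
    out_path: &str,
) -> std::io::Result<()> {
    let copy_cmd = format!(
        "\\copy (SELECT {} FROM {}) TO '{}' WITH CSV HEADER",
        columns.join(", "),
        table,
        out_path,
    );
    let status = Command::new("psql")
        .arg(database_url)
        .arg("-c")
        .arg(&copy_cmd)
        .status()?;
    if !status.success() {
        return Err(std::io::Error::new(
            std::io::ErrorKind::Other,
            format!("psql exited with status {}", status),
        ));
    }
    Ok(())
}
```

Called as, say, `dump_table("postgres:///cargo_registry", "crates", &["id", "name", "description"], "data/crates.csv")` (table and column names here are only illustrative), it writes one CSV file per invocation; the real task would build the column list per table from the visibility settings in the TOML file.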
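As a rough illustration of the schema check mentioned in the notes, the test can list every column of the live `public` schema via `information_schema` and compare that set against the entries in the TOML file. The sketch below uses Diesel's `sql_query`; the struct and function names are made up, and the derive attribute syntax shown is the Diesel 1.x form.

```rust
use diesel::prelude::*;
use diesel::sql_types::Text;

/// One row of the introspection query below.
#[derive(QueryableByName)]
struct LiveColumn {
    #[sql_type = "Text"]
    table_name: String,
    #[sql_type = "Text"]
    column_name: String,
}

/// Hypothetical helper for the schema test: list all columns in the `public`
/// schema so they can be compared against the tables and columns declared in
/// the visibility configuration.
fn live_columns(conn: &PgConnection) -> QueryResult<Vec<LiveColumn>> {
    diesel::sql_query(
        "SELECT table_name::text, column_name::text \
         FROM information_schema.columns \
         WHERE table_schema = 'public' \
         ORDER BY table_name, ordinal_position",
    )
    .load(conn)
}
```

The test would then assert that every `(table, column)` pair returned by this query has an explicit visibility entry in the TOML file, so a new migration cannot silently add a column that is exported without review.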