Auto merge of #1800 - smarnach:dump-db, r=carols10cents
Prototype: Public database dumps

This is an unfinished prototype implementation of the design I proposed to implement #630 (see #630 (comment) and #630 (comment)). I am submitting this for review to gather some feedback on the basic approach before spending more time on this.

This PR adds a background task to create a database dump. The task can be triggered with the `enqueue-job` binary, so it is easy to schedule in production using Heroku Scheduler.

### Testing instructions

To create a dump:

1. Start the background worker:

        cargo run --bin background-worker

1. Trigger a database dump:

        cargo run --bin enqueue-job dump_db

    The resulting tarball can be found in `./local_uploads/db-dump.tar.gz`.

To re-import the dump:

1. Unpack the tarball:

        tar xzf local_uploads/db-dump.tar.gz

1. Create a new database:

        createdb test_import_dump

1. Run the Diesel migrations for the new DB:

        diesel migration run --database-url=postgres:///test_import_dump

1. Import the dump:

        cd DUMP_DIRECTORY
        psql test_import_dump < import.sql

(Depending on your local PostgreSQL setup, in particular the permissions for your user account, you may need different commands and URIs than given above.)
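For illustration, the four re-import steps can also be scripted. The sketch below is a hypothetical helper, not part of this PR: it only builds the command lines, assuming `createdb`, `diesel`, and `psql` are on `PATH` (passing `-f` to `psql` is equivalent to the shell redirection above). `reimport_commands`, `db_name`, and `dump_dir` are invented names for this sketch.

```rust
/// Build the commands for re-importing a dump into a fresh database, as
/// (program, arguments) pairs. Hypothetical helper mirroring the manual
/// steps above; `db_name` and `dump_dir` are illustrative parameters.
fn reimport_commands(db_name: &str, dump_dir: &str) -> Vec<(String, Vec<String>)> {
    vec![
        // Step 2: create a new database.
        ("createdb".to_string(), vec![db_name.to_string()]),
        // Step 3: run the Diesel migrations against it.
        (
            "diesel".to_string(),
            vec![
                "migration".to_string(),
                "run".to_string(),
                format!("--database-url=postgres:///{}", db_name),
            ],
        ),
        // Step 4: import the dump from the unpacked tarball.
        (
            "psql".to_string(),
            vec![
                db_name.to_string(),
                "-f".to_string(),
                format!("{}/import.sql", dump_dir),
            ],
        ),
    ]
}

fn main() {
    for (program, args) in reimport_commands("test_import_dump", ".") {
        // Print instead of spawning, so the sketch runs without PostgreSQL;
        // a real script would use std::process::Command here.
        println!("{} {}", program, args.join(" "));
    }
}
```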

### Author's notes

* The background task executes `psql` in a subprocess to actually create the dump. One reason for this approach is its simplicity – the `\copy` convenience command issues a suitable `COPY TO STDOUT` SQL command and streams the result directly to a local file. Another reason is that I couldn't figure out how to do this at all in Rust with a Diesel `PgConnection`. There doesn't seem to be a way to run raw SQL with full access to the result.
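    As a rough illustration of the approach, here is a minimal sketch of how such a psql script could be assembled. The function name, table names, and CSV output format are assumptions for this sketch; the actual script generation lives in the `dump_db` task and may differ in detail.

    ```rust
    /// Build a psql script with one client-side \copy per exported table.
    /// Each \copy issues a COPY ... TO STDOUT under the hood and streams the
    /// rows into a local file. Hypothetical sketch; the table and column
    /// names passed in are examples, not the real dump configuration.
    fn copy_script(tables: &[(&str, &[&str])]) -> String {
        let mut script = String::new();
        for (table, columns) in tables {
            let cols = columns
                .iter()
                .map(|c| format!("\"{}\"", c))
                .collect::<Vec<_>>()
                .join(", ");
            script.push_str(&format!(
                "\\copy \"{table}\" ({cols}) TO '{table}.csv' WITH CSV HEADER\n",
                table = table,
                cols = cols
            ));
        }
        script
    }

    fn main() {
        // The resulting script would be piped to `psql $DATABASE_URL` through
        // std::process::Command with a piped stdin.
        let script = copy_script(&[("crates", &["id", "name"][..])]);
        print!("{}", script);
    }
    ```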

* The unit test to verify that the column visibility information in `dump_db.toml` is up to date compares the information in that file to the current schema of the test database. Diesel does not provide any schema reflection functionality, so we query the actual database instead. This test may spuriously fail or succeed locally if you still have some migrations from unmerged branches applied to your test database. On Travis this shouldn't be a problem, since I believe we always start with a fresh database there. (My preferred solution for this problem would be for Diesel to provide some way to introspect the information in `schema.rs`.)
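    The comparison itself boils down to a set difference over (table, column) pairs. A self-contained sketch, with the caveat that in the real test the schema side comes from querying the database (e.g. `information_schema.columns`) rather than from literals:

    ```rust
    use std::collections::BTreeSet;

    /// Compare the (table, column) pairs found in the live schema against
    /// the pairs listed in dump_db.toml. Returns (missing_from_config,
    /// stale_in_config). Sketch only; both sides are plain string pairs here.
    fn diff_columns<'a>(
        schema: &[(&'a str, &'a str)],
        config: &[(&'a str, &'a str)],
    ) -> (Vec<(&'a str, &'a str)>, Vec<(&'a str, &'a str)>) {
        let s: BTreeSet<_> = schema.iter().cloned().collect();
        let c: BTreeSet<_> = config.iter().cloned().collect();
        (
            s.difference(&c).cloned().collect(), // in schema, not in config
            c.difference(&s).cloned().collect(), // in config, gone from schema
        )
    }

    fn main() {
        let schema = [("crates", "id"), ("crates", "secret")];
        let config = [("crates", "id")];
        let (missing, stale) = diff_columns(&schema, &config);
        // "secret" exists in the schema but is not covered by the config,
        // so the test would fail and force an explicit visibility decision.
        println!("missing from config: {:?}, stale: {:?}", missing, stale);
    }
    ```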

### Remaining work

* [x] Address TODOs in the source code. The most significant one is to update the `Uploader` interface to accept streamed data instead of a `Vec<u8>`. Currently the whole database dump needs to be loaded into memory at once.

* ~~Record the URLs of uploaded tarballs in the database, and provide an API endpoint to download them.~~ Decided to only store the latest dump at a known URL.

* [x] Devise a scheme for cleaning up old dumps from S3. The easiest option is to only keep the latest dump.

* [x] Somewhere in the tar file, note the date and time the dump was generated

* [x] Verify that `dump-db.toml` is correct, i.e. that we don't leak any data we don't want to leak. Done via manual inspection. ~~One idea to do so is to reconstruct dumps from the information available via the API and compare to information in a test dump in the staging environment. This way we could verify what additional information will be made public.~~

* [x] The code needs some form of integration test. Idea from #1629: exporting some data, then trying to re-import it in a clean database.

* [x] Implement and document a way of re-importing the dumps to the database, e.g. to allow local testing of crates.io with realistic data.

* [x] Rebase and remove commits containing the first implementation

* [x] Document the existence of this dump, how often it's regenerated, and that only the most recent dump is available (maybe in the crawler policy/crawler blocked error message?)

* [x] Include the commit hash of the crates.io version that created the dump in the tarball
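On the streaming point from the first checklist item, here is a minimal sketch of what a `Read`-based upload entry point could look like. This is a hypothetical interface, not the actual crates.io `Uploader`; `upload_streamed` and its parameters are invented for illustration.

```rust
use std::io::Read;

/// Consume any `Read` source in fixed-size chunks instead of requiring the
/// whole dump as a `Vec<u8>`. Hypothetical sketch of a streaming upload
/// path; a real implementation would forward each chunk to S3.
fn upload_streamed<R: Read>(mut source: R, chunk_size: usize) -> std::io::Result<u64> {
    let mut buf = vec![0u8; chunk_size];
    let mut total = 0u64;
    loop {
        let n = source.read(&mut buf)?;
        if n == 0 {
            break; // end of stream
        }
        // Here a real uploader would send buf[..n] as one multipart part.
        total += n as u64;
    }
    Ok(total)
}

fn main() {
    let dump = b"pretend this is a tarball";
    let bytes = upload_streamed(&dump[..], 8).expect("in-memory read cannot fail");
    println!("uploaded {} bytes", bytes);
}
```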
bors committed Sep 30, 2019
2 parents 028e033 + 01a4e98 commit 02d2533
Showing 17 changed files with 920 additions and 24 deletions.
57 changes: 57 additions & 0 deletions Cargo.lock

Some generated files are not rendered by default.

2 changes: 2 additions & 0 deletions Cargo.toml

    @@ -83,12 +83,14 @@
     tokio = "0.1"
     hyper = "0.12"
     ctrlc = { version = "3.0", features = ["termination"] }
     indexmap = "1.0.2"
    +handlebars = "2.0.1"
    
     [dev-dependencies]
     conduit-test = "0.8"
     hyper-tls = "0.3"
     lazy_static = "1.0"
     tokio-core = "0.1"
    +diesel_migrations = { version = "1.3.0", features = ["postgres"] }
    
     [build-dependencies]
     dotenv = "0.11"
1 change: 1 addition & 0 deletions app/router.js

    @@ -46,6 +46,7 @@ Router.map(function() {
       this.route('category-slugs', { path: 'category_slugs' });
       this.route('team', { path: '/teams/:team_id' });
       this.route('policies');
    +  this.route('data-access');
       this.route('confirm', { path: '/confirm/:email_token' });
    
       this.route('catch-all', { path: '*path' });
34 changes: 34 additions & 0 deletions app/templates/data-access.hbs

    @@ -0,0 +1,34 @@
    +<div id='crates-heading'>
    +{{svg-jar 'circle-with-i'}}
    +<h1>Accessing the Crates.io Data</h1>
    +</div>
    +
    +<p>
    +There are several ways of accessing the Crates.io data. You should try the
    +options in the order listed.
    +</p>
    +
    +<ol>
    +<li>
    +<b>
    +The <a href='https://github.com/rust-lang/crates.io-index'>crates.io index</a>.
    +</b>
    +This git repository is updated by crates.io, and it is used
    +by Cargo to speed up local dependency resolution. It contains the majority
    +of the data exposed by crates.io and is cheap to clone and get updates.
    +</li>
    +<li>
    +<b>The database dumps (experimental).</b> The dump contains all information
    +exposed by the API in a single download. It is updated every 24 hours.
    +The latest dump is available at the address
    +<a href='https://static.crates.io/db-dump.tar.gz'>https://static.crates.io/db-dump.tar.gz</a>.
    +Information on using the dump is contained in the tarball.
    +</li>
    +<li>
    +<b>Crawl the crates.io API.</b> This should be used as a last resort, and
    +doing so is subject to our {{#link-to 'policies'}}crawling policy{{/link-to}}.
    +If the index and the database dumps do not satisfy your needs, we're happy to
    +discuss solutions that don't require you to crawl the registry.
    +You can email us at <a href="mailto:help@crates.io">help@crates.io</a>.
    +</li>
    +</ol>
11 changes: 2 additions & 9 deletions app/templates/policies.hbs

    @@ -112,15 +112,8 @@
     <h2 id='crawlers'><a href='#crawlers'>Crawlers</a></h2>
    
     <p>
    -Before resorting to crawling crates.io, you should first see if you are able to
    -gather the information you need from the
    -<a href='https://github.com/rust-lang/crates.io-index'>crates.io index</a>,
    -which is a public git repository containing the majority
    -of the information availble through our API.
    -
    -If the index does not have the information you need, we're also happy to
    -discuss solutions to your needs that don't require you to crawl the registry.
    -You can email us at <a href="mailto:help@crates.io">help@crates.io</a>.
    +Before resorting to crawling crates.io, please read
    +{{#link-to 'data-access'}}Accessing the Crates.io Data{{/link-to}}.
     </p>
    
     <p>
3 changes: 1 addition & 2 deletions migrations/2017-10-08-193512_category_trees/up.sql

    @@ -1,5 +1,4 @@
    --- Your SQL goes here
    -CREATE EXTENSION ltree;
    +CREATE EXTENSION IF NOT EXISTS ltree;
    
     -- Create the new column which will represent our category tree.
     -- Fill it with values from `slug` column and then set to non-null
    @@ -1,2 +1,2 @@
    -CREATE EXTENSION pg_trgm;
    +CREATE EXTENSION IF NOT EXISTS pg_trgm;
     CREATE INDEX index_crates_name_tgrm ON crates USING gin (canon_crate_name(name) gin_trgm_ops);
36 changes: 24 additions & 12 deletions src/bin/enqueue-job.rs

    @@ -1,17 +1,29 @@
    -use cargo_registry::util::{CargoError, CargoResult};
    -use cargo_registry::{db, tasks};
    -use std::env::args;
    -use swirl::Job;
    +use cargo_registry::util::{human, CargoError, CargoResult};
    +use cargo_registry::{db, env, tasks};
    +use diesel::PgConnection;
    
     fn main() -> CargoResult<()> {
         let conn = db::connect_now()?;
    -
    -    match &*args().nth(1).unwrap_or_default() {
    -        "update_downloads" => tasks::update_downloads()
    -            .enqueue(&conn)
    -            .map_err(|e| CargoError::from_std_error(e))?,
    -        other => panic!("Unrecognized job type `{}`", other),
    -    };
    -
    -    Ok(())
    +    let mut args = std::env::args().skip(1);
    +    match &*args.next().unwrap_or_default() {
    +        "update_downloads" => tasks::update_downloads().enqueue(&conn),
    +        "dump_db" => {
    +            let database_url = args.next().unwrap_or_else(|| env("DATABASE_URL"));
    +            let target_name = args
    +                .next()
    +                .unwrap_or_else(|| String::from("db-dump.tar.gz"));
    +            tasks::dump_db(database_url, target_name).enqueue(&conn)
    +        }
    +        other => Err(human(&format!("Unrecognized job type `{}`", other))),
    +    }
     }
    +
    +/// Helper to map the `PerformError` returned by `swirl::Job::enqueue()` to a
    +/// `CargoError`. Can be removed once `map_err()` isn't needed any more.
    +trait Enqueue: swirl::Job {
    +    fn enqueue(self, conn: &PgConnection) -> CargoResult<()> {
    +        <Self as swirl::Job>::enqueue(self, conn).map_err(|e| CargoError::from_std_error(e))
    +    }
    +}
    +
    +impl<J: swirl::Job> Enqueue for J {}
2 changes: 2 additions & 0 deletions src/tasks.rs

    @@ -1,3 +1,5 @@
    +pub mod dump_db;
     mod update_downloads;
    
    +pub use dump_db::dump_db;
     pub use update_downloads::update_downloads;