Auto merge of #1800 - smarnach:dump-db, r=carols10cents
Prototype: Public database dumps

This is an unfinished prototype implementation of the design I proposed to implement #630 (see #630 (comment) and #630 (comment)). I am submitting this for review to gather some feedback on the basic approach before spending more time on this.

This PR adds a background task to create a database dump. The task can be triggered with the `enqueue-job` binary, so it is easy to schedule in production using Heroku Scheduler.

### Testing instructions

To create a dump:

1. Start the background worker:

        cargo run --bin background-worker

1. Trigger a database dump:

        cargo run --bin enqueue-job dump_db

    The resulting tarball can be found in `./local_uploads/db-dump.tar.gz`.

To re-import the dump:

1. Unpack the tarball:

        tar xzf local_uploads/db-dump.tar.gz

1. Create a new database:

        createdb test_import_dump

1. Run the Diesel migrations for the new DB:

        diesel migration run --database-url=postgres:///test_import_dump

1. Import the dump:

        cd DUMP_DIRECTORY
        psql test_import_dump < import.sql

(Depending on your local PostgreSQL setup, in particular the permissions for your user account, you may need different commands and URIs than given above.)
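For illustration, the four re-import steps can also be scripted. The sketch below is a hypothetical helper, not part of this PR: it only builds the command lines, assuming `createdb`, `diesel`, and `psql` are on `PATH` (passing `-f` to `psql` is equivalent to the shell redirection above). `reimport_commands`, `db_name`, and `dump_dir` are invented names for this sketch.

```rust
/// Build the commands for re-importing a dump into a fresh database, as
/// (program, arguments) pairs. Hypothetical helper mirroring the manual
/// steps above; `db_name` and `dump_dir` are illustrative parameters.
fn reimport_commands(db_name: &str, dump_dir: &str) -> Vec<(String, Vec<String>)> {
    vec![
        // Step 2: create a new database.
        ("createdb".to_string(), vec![db_name.to_string()]),
        // Step 3: run the Diesel migrations against it.
        (
            "diesel".to_string(),
            vec![
                "migration".to_string(),
                "run".to_string(),
                format!("--database-url=postgres:///{}", db_name),
            ],
        ),
        // Step 4: import the dump from the unpacked tarball.
        (
            "psql".to_string(),
            vec![
                db_name.to_string(),
                "-f".to_string(),
                format!("{}/import.sql", dump_dir),
            ],
        ),
    ]
}

fn main() {
    for (program, args) in reimport_commands("test_import_dump", ".") {
        // Print instead of spawning, so the sketch runs without PostgreSQL;
        // a real script would use std::process::Command here.
        println!("{} {}", program, args.join(" "));
    }
}
```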

### Author's notes

* The background task executes `psql` in a subprocess to actually create the dump. One reason for this approach is its simplicity – the `\copy` convenience command issues a suitable `COPY TO STDOUT` SQL command and streams the result directly to a local file. Another reason is that I couldn't figure out how to do this at all in Rust with a Diesel `PgConnection`. There doesn't seem to be a way to run raw SQL with full access to the result.
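    As a rough illustration of the approach, here is a minimal sketch of how such a psql script could be assembled. The function name, table names, and CSV output format are assumptions for this sketch; the actual script generation lives in the `dump_db` task and may differ in detail.

    ```rust
    /// Build a psql script with one client-side \copy per exported table.
    /// Each \copy issues a COPY ... TO STDOUT under the hood and streams the
    /// rows into a local file. Hypothetical sketch; the table and column
    /// names passed in are examples, not the real dump configuration.
    fn copy_script(tables: &[(&str, &[&str])]) -> String {
        let mut script = String::new();
        for (table, columns) in tables {
            let cols = columns
                .iter()
                .map(|c| format!("\"{}\"", c))
                .collect::<Vec<_>>()
                .join(", ");
            script.push_str(&format!(
                "\\copy \"{table}\" ({cols}) TO '{table}.csv' WITH CSV HEADER\n",
                table = table,
                cols = cols
            ));
        }
        script
    }

    fn main() {
        // The resulting script would be piped to `psql $DATABASE_URL` through
        // std::process::Command with a piped stdin.
        let script = copy_script(&[("crates", &["id", "name"][..])]);
        print!("{}", script);
    }
    ```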

* The unit test to verify that the column visibility information in `dump_db.toml` is up to date compares the information in that file to the current schema of the test database. Diesel does not provide any schema reflection functionality, so we query the actual database instead. This test may spuriously fail or succeed locally if you still have some migrations from unmerged branches applied to your test database. On Travis this shouldn't be a problem, since I believe we always start with a fresh database there. (My preferred solution for this problem would be for Diesel to provide some way to introspect the information in `schema.rs`.)
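    The comparison itself boils down to a set difference over (table, column) pairs. A self-contained sketch, with the caveat that in the real test the schema side comes from querying the database (e.g. `information_schema.columns`) rather than from literals:

    ```rust
    use std::collections::BTreeSet;

    /// Compare the (table, column) pairs found in the live schema against
    /// the pairs listed in dump_db.toml. Returns (missing_from_config,
    /// stale_in_config). Sketch only; both sides are plain string pairs here.
    fn diff_columns<'a>(
        schema: &[(&'a str, &'a str)],
        config: &[(&'a str, &'a str)],
    ) -> (Vec<(&'a str, &'a str)>, Vec<(&'a str, &'a str)>) {
        let s: BTreeSet<_> = schema.iter().cloned().collect();
        let c: BTreeSet<_> = config.iter().cloned().collect();
        (
            s.difference(&c).cloned().collect(), // in schema, not in config
            c.difference(&s).cloned().collect(), // in config, gone from schema
        )
    }

    fn main() {
        let schema = [("crates", "id"), ("crates", "secret")];
        let config = [("crates", "id")];
        let (missing, stale) = diff_columns(&schema, &config);
        // "secret" exists in the schema but is not covered by the config,
        // so the test would fail and force an explicit visibility decision.
        println!("missing from config: {:?}, stale: {:?}", missing, stale);
    }
    ```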

### Remaining work

* [x] Address TODOs in the source code. The most significant one is to update the `Uploader` interface to accept streamed data instead of a `Vec<u8>`. Currently the whole database dump needs to be loaded into memory at once.

* ~~Record the URLs of uploaded tarballs in the database, and provide an API endpoint to download them.~~ Decided to only store the latest dump at a known URL.

* [x] Devise a scheme for cleaning up old dumps from S3. The easiest option is to only keep the latest dump.

* [x] Somewhere in the tar file, note the date and time the dump was generated

* [x] Verify that `dump-db.toml` is correct, i.e. that we don't leak any data we don't want to leak. Done via manual inspection. ~~One idea to do so is to reconstruct dumps from the information available via the API and compare to information in a test dump in the staging environment. This way we could verify what additional information will be made public.~~

* [x] The code needs some form of integration test. Idea from #1629: exporting some data, then trying to re-import it in a clean database.

* [x] Implement and document a way of re-importing the dumps to the database, e.g. to allow local testing of crates.io with realistic data.

* [x] Rebase and remove commits containing the first implementation

* [x] Document the existence of this dump, how often it's regenerated, and that only the most recent dump is available (maybe in the crawler policy/crawler blocked error message?)

* [x] Include the commit hash of the crates.io version that created the dump in the tarball
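On the streaming point from the first checklist item, here is a minimal sketch of what a `Read`-based upload entry point could look like. This is a hypothetical interface, not the actual crates.io `Uploader`; `upload_streamed` and its parameters are invented for illustration.

```rust
use std::io::Read;

/// Consume any `Read` source in fixed-size chunks instead of requiring the
/// whole dump as a `Vec<u8>`. Hypothetical sketch of a streaming upload
/// path; a real implementation would forward each chunk to S3.
fn upload_streamed<R: Read>(mut source: R, chunk_size: usize) -> std::io::Result<u64> {
    let mut buf = vec![0u8; chunk_size];
    let mut total = 0u64;
    loop {
        let n = source.read(&mut buf)?;
        if n == 0 {
            break; // end of stream
        }
        // Here a real uploader would send buf[..n] as one multipart part.
        total += n as u64;
    }
    Ok(total)
}

fn main() {
    let dump = b"pretend this is a tarball";
    let bytes = upload_streamed(&dump[..], 8).expect("in-memory read cannot fail");
    println!("uploaded {} bytes", bytes);
}
```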
bors committed Sep 30, 2019
2 parents 028e033 + 01a4e98 commit 02d2533
Showing 17 changed files with 920 additions and 24 deletions.
57 changes: 57 additions & 0 deletions Cargo.lock

Some generated files are not rendered by default.

2 changes: 2 additions & 0 deletions Cargo.toml

    @@ -83,12 +83,14 @@
     tokio = "0.1"
     hyper = "0.12"
     ctrlc = { version = "3.0", features = ["termination"] }
     indexmap = "1.0.2"
    +handlebars = "2.0.1"
    
     [dev-dependencies]
     conduit-test = "0.8"
     hyper-tls = "0.3"
     lazy_static = "1.0"
     tokio-core = "0.1"
    +diesel_migrations = { version = "1.3.0", features = ["postgres"] }
    
     [build-dependencies]
     dotenv = "0.11"
1 change: 1 addition & 0 deletions app/router.js

    @@ -46,6 +46,7 @@ Router.map(function() {
       this.route('category-slugs', { path: 'category_slugs' });
       this.route('team', { path: '/teams/:team_id' });
       this.route('policies');
    +  this.route('data-access');
       this.route('confirm', { path: '/confirm/:email_token' });
    
       this.route('catch-all', { path: '*path' });
34 changes: 34 additions & 0 deletions app/templates/data-access.hbs

    @@ -0,0 +1,34 @@
    +<div id='crates-heading'>
    +{{svg-jar 'circle-with-i'}}
    +<h1>Accessing the Crates.io Data</h1>
    +</div>
    +
    +<p>
    +There are several ways of accessing the Crates.io data. You should try the
    +options in the order listed.
    +</p>
    +
    +<ol>
    +<li>
    +<b>
    +The <a href='https://github.com/rust-lang/crates.io-index'>crates.io index</a>.
    +</b>
    +This git repository is updated by crates.io, and it is used
    +by Cargo to speed up local dependency resolution. It contains the majority
    +of the data exposed by crates.io and is cheap to clone and get updates.
    +</li>
    +<li>
    +<b>The database dumps (experimental).</b> The dump contains all information
    +exposed by the API in a single download. It is updated every 24 hours.
    +The latest dump is available at the address
    +<a href='https://static.crates.io/db-dump.tar.gz'>https://static.crates.io/db-dump.tar.gz</a>.
    +Information on using the dump is contained in the tarball.
    +</li>
    +<li>
    +<b>Crawl the crates.io API.</b> This should be used as a last resort, and
    +doing so is subject to our {{#link-to 'policies'}}crawling policy{{/link-to}}.
    +If the index and the database dumps do not satisfy your needs, we're happy to
    +discuss solutions that don't require you to crawl the registry.
    +You can email us at <a href="mailto:help@crates.io">help@crates.io</a>.
    +</li>
    +</ol>
11 changes: 2 additions & 9 deletions app/templates/policies.hbs

    @@ -112,15 +112,8 @@
     <h2 id='crawlers'><a href='#crawlers'>Crawlers</a></h2>
    
     <p>
    -Before resorting to crawling crates.io, you should first see if you are able to
    -gather the information you need from the
    -<a href='https://github.com/rust-lang/crates.io-index'>crates.io index</a>,
    -which is a public git repository containing the majority
    -of the information availble through our API.
    -
    -If the index does not have the information you need, we're also happy to
    -discuss solutions to your needs that don't require you to crawl the registry.
    -You can email us at <a href="mailto:help@crates.io">help@crates.io</a>.
    +Before resorting to crawling crates.io, please read
    +{{#link-to 'data-access'}}Accessing the Crates.io Data{{/link-to}}.
     </p>
    
     <p>
3 changes: 1 addition & 2 deletions migrations/2017-10-08-193512_category_trees/up.sql

    @@ -1,5 +1,4 @@
    --- Your SQL goes here
    -CREATE EXTENSION ltree;
    +CREATE EXTENSION IF NOT EXISTS ltree;
    
     -- Create the new column which will represent our category tree.
     -- Fill it with values from `slug` column and then set to non-null
    @@ -1,2 +1,2 @@
    -CREATE EXTENSION pg_trgm;
    +CREATE EXTENSION IF NOT EXISTS pg_trgm;
     CREATE INDEX index_crates_name_tgrm ON crates USING gin (canon_crate_name(name) gin_trgm_ops);
36 changes: 24 additions & 12 deletions src/bin/enqueue-job.rs

    @@ -1,17 +1,29 @@
    -use cargo_registry::util::{CargoError, CargoResult};
    -use cargo_registry::{db, tasks};
    -use std::env::args;
    -use swirl::Job;
    +use cargo_registry::util::{human, CargoError, CargoResult};
    +use cargo_registry::{db, env, tasks};
    +use diesel::PgConnection;
    
     fn main() -> CargoResult<()> {
         let conn = db::connect_now()?;
    -
    -    match &*args().nth(1).unwrap_or_default() {
    -        "update_downloads" => tasks::update_downloads()
    -            .enqueue(&conn)
    -            .map_err(|e| CargoError::from_std_error(e))?,
    -        other => panic!("Unrecognized job type `{}`", other),
    -    };
    -
    -    Ok(())
    +    let mut args = std::env::args().skip(1);
    +    match &*args.next().unwrap_or_default() {
    +        "update_downloads" => tasks::update_downloads().enqueue(&conn),
    +        "dump_db" => {
    +            let database_url = args.next().unwrap_or_else(|| env("DATABASE_URL"));
    +            let target_name = args
    +                .next()
    +                .unwrap_or_else(|| String::from("db-dump.tar.gz"));
    +            tasks::dump_db(database_url, target_name).enqueue(&conn)
    +        }
    +        other => Err(human(&format!("Unrecognized job type `{}`", other))),
    +    }
     }
    +
    +/// Helper to map the `PerformError` returned by `swirl::Job::enqueue()` to a
    +/// `CargoError`. Can be removed once `map_err()` isn't needed any more.
    +trait Enqueue: swirl::Job {
    +    fn enqueue(self, conn: &PgConnection) -> CargoResult<()> {
    +        <Self as swirl::Job>::enqueue(self, conn).map_err(|e| CargoError::from_std_error(e))
    +    }
    +}
    +
    +impl<J: swirl::Job> Enqueue for J {}
2 changes: 2 additions & 0 deletions src/tasks.rs

    @@ -1,3 +1,5 @@
    +pub mod dump_db;
     mod update_downloads;
    
    +pub use dump_db::dump_db;
     pub use update_downloads::update_downloads;