Prototype: Public database dumps #1800
Conversation
(rust_highfive has picked a reviewer for you, use r? to override)

r? @sgrif
Thanks for working on this! I have a few more thoughts on the points listed in your PR description:
Yeah, Diesel doesn't support
Seems fine, I don't think our test suite is expected to pass with random migrations applied,
I'm not sure I'm understanding this item properly, but if your point is to ensure all columns are listed in the TOML file, I think we can do this one of two ways:
I think the easiest option is the best one here, and removes the need for an API endpoint at all -- we can just point people to static.crates.io/database.dump and be done with it.
I am curious about the decision to go with CSV output for this though. My initial reaction is that a PG dump (either in SQL or binary format) would be more convenient for all parties, but now I'm wondering if that's actually true.
We can't use `pg_dump`, since it can only dump full tables. See #630 (comment) for more details.
I'm not sure I'm understanding this item properly, but if your point is to ensure all columns are listed in the TOML file, I think we can do this one of two ways
I mean that we need to verify that this implementation does not leak any data that we don't want to leak. We need to be very careful about that. My suggestion is to reconstruct a CSV dump from information obtained by scraping the API, and then compare it to the CSV generated by the dump code. This way, we can see exactly what additional information we are about to release.
A test that the schema in the TOML file matched the database schema is already in the code, so I know how to do that. :)
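For readers following along, here is a rough sketch of how such a check can gather the live schema for comparison with the TOML file. This is not the actual test from the repository; the struct and function names and the explicit `::text` casts are illustrative, and it assumes diesel 1.x-style `sql_query` usage with the derive macros in scope (e.g. via `#[macro_use] extern crate diesel;`):

```rust
use diesel::pg::PgConnection;
use diesel::prelude::*;
use diesel::sql_types::Text;
use std::collections::BTreeSet;

// Row type for the raw information_schema query below (diesel 1.x derive style).
#[derive(QueryableByName)]
struct SchemaColumn {
    #[sql_type = "Text"]
    table_name: String,
    #[sql_type = "Text"]
    column_name: String,
}

/// Collect every (table, column) pair in the `public` schema, so a test can
/// assert that each pair is explicitly listed in dump-db.toml (and vice versa).
fn live_schema_columns(conn: &PgConnection) -> QueryResult<BTreeSet<(String, String)>> {
    let rows: Vec<SchemaColumn> = diesel::sql_query(
        "SELECT table_name::text AS table_name, column_name::text AS column_name \
         FROM information_schema.columns \
         WHERE table_schema = 'public'",
    )
    .load(conn)?;
    Ok(rows
        .into_iter()
        .map(|row| (row.table_name, row.column_name))
        .collect())
}
```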
Pretty much all the responses to the concerns I raised reference your comments; I don't want to step on your toes.
My initial thought was copying into tables in a separate schema and dumping that, but that probably wouldn't make things any simpler.
Yeah, I considered that in the comments to #630, but overall I believe it would make things much more difficult. Creating a consistent dump will become more difficult since we won't be able to use
Based on the feedback, I cleaned up the design a bit. I removed the hacky use of row-level security in favour of SQL row filter expressions in the TOML file. This way, all the information about what is included in the database dump is now in a single place, and we essentially don't need any changes to the existing code.
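To make that concrete, here is a rough sketch (not the actual implementation) of how a table's public columns and its optional row filter from the TOML file could be turned into the psql `\copy` command that exports it. The function name and parameters are illustrative:

```rust
/// Build the psql `\copy` command exporting one table, given the columns
/// marked "public" in dump-db.toml and the table's optional SQL row filter.
fn copy_statement(table: &str, public_columns: &[&str], filter: Option<&str>) -> String {
    let cols = public_columns.join(", ");
    match filter {
        // With a filter, only the matching rows of the public columns are exported.
        Some(expr) => format!(
            "\\copy (SELECT {cols} FROM {table} WHERE {expr}) TO '{table}.csv' WITH CSV HEADER\n",
            cols = cols,
            table = table,
            expr = expr,
        ),
        // Without a filter, all rows of the public columns are exported.
        None => format!(
            "\\copy (SELECT {cols} FROM {table}) TO '{table}.csv' WITH CSV HEADER\n",
            cols = cols,
            table = table,
        ),
    }
}
```

Concatenating one such line per configured table yields the script that the background task pipes into `psql`.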
Looking good so far! I like the new design a lot better. I downloaded a production snapshot and ran the task locally; it ran in 2 min 10 seconds. That feels like we should be able to schedule it for once a day, whenever our traffic is usually lowest, without impacting production query performance. If it does end up being noticeable, we could create a follower DB in Heroku and query that, and if that isn't viable, we can look into the strategy of using the backups. Querying the live production DB seems like the least complex solution, though.
The uploaded tar file contained what I expected:
- no files for entirely private tables like `api_tokens` and `background_jobs`
- files containing all data for entirely public tables like `badges` and `categories`
- files containing only data that matches the filter for tables like `crate_owners`
- files containing only the public columns for tables where some columns are public and some are private, like `version_downloads`
I have not tried re-importing the data; I'm happy to try that to test instructions or a tool for doing so once one of those exists.
Verify that dump-db.toml is correct, i.e. that we don't leak any data we don't want to leak. One idea to do so is to reconstruct dumps from the information available via the API and compare to information in a test dump in the staging environment. This way we could verify what additional information will be made public.
I haven't done this, but I have manually looked through the private/public/filter settings and left comments for anything I think should be changed.
I think the easiest option is the best one here, and removes the need for an API endpoint at all -- we can just point people to static.crates.io/database.dump and be done with it.
That sounds great; we should have the date+time the dump was generated in the tar somewhere since it won't be in the filename, so that people know how long ago it was generated (and can bug us if the dumps stop getting generated for some reason)
I also edited the remaining work section in the PR description to add a few items; let me know if you have questions/concerns about anything!
FWIW, having a hot replica for failovers is something I've been wanting to do anyway, so it could be worth having this go to a follower regardless.
On Wed, Aug 21, 2019, Carol (Nichols || Goulding) requested changes on this pull request with the following inline comments:
In src/tasks/dump_db.rs:
> @@ -37,38 +35,97 @@ enum ColumnVisibility {
Public,
}
-/// For each table, map column names to their respective visibility.
-type VisibilityConfig = BTreeMap<String, BTreeMap<String, ColumnVisibility>>;
+#[derive(Clone, Debug, Deserialize)]
+struct TableConfig {
TODO: Add some documentation to this struct and its fields please -- the documentation you have in dump-db.toml looks great and would be great to have here too!
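For context, the struct under discussion corresponds roughly to the sketch below. The field names, serde attributes, and doc comments are assumptions based on the design described in this PR (a per-column visibility map plus an optional SQL row filter), not the actual source:

```rust
use serde::Deserialize;
use std::collections::BTreeMap;

/// Whether a column's contents are included in the public dump.
#[derive(Clone, Copy, Debug, Deserialize)]
#[serde(rename_all = "lowercase")]
enum ColumnVisibility {
    Private,
    Public,
}

/// Visibility rules for a single table, as declared in `dump-db.toml`.
#[derive(Clone, Debug, Deserialize)]
struct TableConfig {
    /// Optional SQL expression used as a row filter; only rows matching the
    /// expression are exported (e.g. restricting which `crate_owners` rows
    /// end up in the dump).
    #[serde(default)]
    filter: Option<String>,
    /// Maps every column of the table to its visibility; private columns are
    /// omitted from the generated export statement.
    columns: BTreeMap<String, ColumnVisibility>,
}
```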
------------------------------
In src/tasks/dump_db.rs:
> +pub fn dump_db(env: &Environment) -> Result<(), PerformError> {
+ // TODO make path configurable
+ const EXPORT_DIR_TEMPLATE: &str = "/tmp/dump-db/%Y-%m-%d-%H%M%S";
+ let export_dir = PathBuf::from(chrono::Utc::now().format(EXPORT_DIR_TEMPLATE).to_string());
+ std::fs::create_dir_all(&export_dir)?;
+ let visibility_config = toml::from_str(include_str!("dump-db.toml")).unwrap();
+ let database_url = if cfg!(test) {
+ crate::env("TEST_DATABASE_URL")
+ } else {
+ crate::env("DATABASE_URL")
+ };
+ run_psql(&visibility_config, &database_url, &export_dir)?;
+ let tarball = create_tarball(&export_dir)?;
+ let target_name = format!("dumps/{}", tarball.file_name().unwrap().to_str().unwrap());
+ upload_tarball(&tarball, &target_name, &env.uploader)?;
+ // TODO: more robust cleanup
What did you have in mind for more robust cleanup?
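Not to answer for the author, but one common pattern is an RAII guard so the temporary export directory is removed even when an earlier step fails. Everything below, names included, is just a sketch of that idea:

```rust
use std::path::{Path, PathBuf};

/// Deletes the export directory when dropped, so cleanup also happens when an
/// earlier step returns an error and the function exits early via `?`.
struct DeleteDirOnDrop(PathBuf);

impl Drop for DeleteDirOnDrop {
    fn drop(&mut self) {
        // Ignore errors here: failing to clean /tmp shouldn't fail the job.
        let _ = std::fs::remove_dir_all(&self.0);
    }
}

fn with_cleanup(export_dir: &Path) -> std::io::Result<()> {
    std::fs::create_dir_all(export_dir)?;
    let _guard = DeleteDirOnDrop(export_dir.to_path_buf());
    // ... run psql, build the tarball, upload it ...
    Ok(())
} // `_guard` is dropped here, removing the directory on both the Ok and Err paths.
```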
------------------------------
In src/tasks/dump-db.toml:
> +owner_kind = "public"
+
+[crates.columns]
+id = "public"
+name = "public"
+updated_at = "public"
+created_at = "public"
+downloads = "public"
+description = "public"
+homepage = "public"
+documentation = "public"
+readme = "public"
+textsearchable_index_col = "public"
+license = "public"
+repository = "public"
+max_upload_size = "public"
max_upload_size isn't currently visible through the API; it's people who have asked for an exception to the 10MB upload limit, so I can see how this could be private info. Although if you download .crate files that are >10MB, that's a pretty good indicator.... 🤷‍♀️
------------------------------
In src/tasks/dump-db.toml:
> +crates_cnt = "public"
+created_at = "public"
+
+[metadata.columns]
+total_downloads = "public"
+
+[publish_limit_buckets.columns]
+user_id = "private"
+tokens = "private"
+last_refill = "private"
+
+[publish_rate_overrides.columns]
+user_id = "private"
+burst = "private"
+
+[readme_renderings.columns]
The readme_renderings table's data isn't currently exposed anywhere in the public API. It's not really *sensitive*, but I can't see how it would be particularly *useful* to anyone... (it's the time at which we last generated the rendered readme HTML stored in S3.) I think we should exclude all its columns because including it will probably be either useless or confusing.
------------------------------
In src/tasks/dump-db.toml:
> +
+[publish_limit_buckets.columns]
+user_id = "private"
+tokens = "private"
+last_refill = "private"
+
+[publish_rate_overrides.columns]
+user_id = "private"
+burst = "private"
+
+[readme_renderings.columns]
+version_id = "public"
+rendered_at = "public"
+
+[reserved_crate_names.columns]
+name = "public"
The list of reserved_crate_names isn't currently accessible via the API, but it is accessible in GitHub (set 1: https://github.com/rust-lang/crates.io/blob/a27c704faa2982ddd75a3dc564da85c0217b950e/migrations/20170305095748_create_reserved_crate_names/up.sql#L6-L12, set 2: https://github.com/rust-lang/crates.io/blob/a27c704faa2982ddd75a3dc564da85c0217b950e/migrations/20170430202433_reserve_windows_crate_names/up.sql#L2-L4), so this is fine.
@carols10cents Thanks for the review!
What was the size of the resulting tarball?
I don't expect too much of an impact given that the database is relatively small – we don't blow out all the disk caches with the dump, so it should hardly be noticeable. (I don't know the exact size of the database, nor what hardware we are using, nor the exact access patterns, but this is my gut feeling.)
The code that generates the export psql script can be easily modified to also generate an import script, which we can simply include in the tarball.
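Sketching what that could look like (the names and exact statements below are illustrative, not the real generator): the same per-table loop that writes the `\copy ... TO` export lines can also emit matching `\copy ... FROM` lines into an `import.sql` that ships inside the tarball. A real script would additionally need to load tables in an order that satisfies the foreign key constraints, or defer them.

```rust
use std::fmt::Write;

/// Generate a psql script that re-imports the CSV files produced by the dump.
/// `tables` pairs each table name with its exported (public) columns.
fn import_script(tables: &[(&str, Vec<&str>)]) -> String {
    let mut script = String::from("BEGIN;\n");
    for (table, columns) in tables {
        // Wipe existing rows first so re-running the import is destructive but
        // repeatable, then load the corresponding CSV file.
        writeln!(script, "TRUNCATE {} CASCADE;", table).unwrap();
        writeln!(
            script,
            "\\copy {} ({}) FROM '{}.csv' WITH CSV HEADER",
            table,
            columns.join(", "),
            table
        )
        .unwrap();
    }
    script.push_str("COMMIT;\n");
    script
}
```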
The tarball contains a directory named after the timestamp, so this should already be covered.
138M. The dump of the entire production database snapshot I started with is 151M.
We use heroku standard 1x dynos that have 512MB RAM (for where the task will run), and our database plan is heroku's standard 0, 4GB RAM.
Oh right! That'll do :)
I went through all the database columns again and tried fetching the information via the API. Here is a list of all columns currently marked as "public" that I could not retrieve via the API:
The columns
I just hit a rather unexpected rabbit hole. When calling
However, now I need to simultaneously write to stdin and read from stderr, and there doesn't seem to be a standard way of doing this in Rust. Options I can think of:
Any advice on this? I would personally go with approach 2 (spawn a thread).
This is my preference. There's no specific reason a background job wouldn't be allowed to spawn additional threads (both in general across any lang, and also with swirl specifically). In theory this should probably spawn those threads on the job runner's thread pool, but there's no public API for that in swirl right now, and that would really only be a concern if this were high enough volume that it could overwhelm the worker process. That isn't the case here.
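A minimal sketch of approach 2, for reference: a helper thread feeds the generated script to psql's stdin while the parent thread drains stderr, which avoids deadlocking when either pipe's buffer fills up. The function name and error handling below are illustrative rather than the actual implementation:

```rust
use std::io::{Read, Write};
use std::process::{Command, Stdio};
use std::thread;

/// Run `psql`, feeding `script` to its stdin from a separate thread while the
/// current thread reads everything psql prints on stderr.
fn run_psql(script: &str, database_url: &str) -> Result<String, Box<dyn std::error::Error>> {
    let mut child = Command::new("psql")
        .arg(database_url)
        .stdin(Stdio::piped())
        .stderr(Stdio::piped())
        .spawn()?;

    // Writer thread: owns the child's stdin and drops it when done, so psql
    // sees EOF and can terminate.
    let mut stdin = child.stdin.take().unwrap();
    let script = script.to_owned();
    let writer = thread::spawn(move || stdin.write_all(script.as_bytes()));

    // Meanwhile, drain stderr here; reading concurrently prevents a deadlock
    // if psql produces a lot of output while we are still writing.
    let mut stderr = String::new();
    child.stderr.take().unwrap().read_to_string(&mut stderr)?;

    writer.join().expect("writer thread panicked")?;
    let status = child.wait()?;
    if !status.success() {
        return Err(format!("psql exited with {}: {}", status, stderr).into());
    }
    Ok(stderr)
}
```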
src/tasks/dump_db.rs (outdated):

    timestamp: &self.timestamp,
    crates_io_commit: dotenv::var("HEROKU_SLUG_COMMIT")
        .unwrap_or_else(|_| "unknown".to_owned()),
    format_version: "0.1",
I'm assuming you're envisioning this version number to increase its minor version with any backwards compatible schema change and increase its major version with any breaking schema change? So to give some information about compatibility with previous database dumps without needing to look up changes between git SHAs?
I'm a little concerned about the usefulness of this version number because I can very easily see us forgetting to update this with schema changes :-/ If it's not incremented meaningfully, then it's not going to be particularly useful, and people using the database dumps will need to look up what changed in the git history anyway.
Given that we recommend using the import script, which does a destructive import anyway, rather than, say, providing diffs between dumps, is this going to provide enough value for people using the dumps to justify our maintenance of it?
You are right, this would need a bit more specification to be useful.
I did not have in mind to increase this on every schema change, but rather every time we change something fundamental about the format of the tarball, like adding an entirely new kind of file.
The main goal was to allow us to make backwards-incompatible changes and indicate these in the version number. However, this doesn't really work. If we change the dump in a way that breaks compatibility, people will notice without checking a version number, since their tools will break. And if they can't go back to the old version, this version number does not really help.
So the only way to break compatibility is to provide dumps in the new format at a different URL alongside the dumps in the old format. And that does not require a version number, so I'll remove it again.
To be able to proceed, I would need feedback on the following points:
I'm sorry this review is so scattered; I get small bits of time to work on it. I do see your latest commit and your request for specific feedback, and I will get to that next -- what I'm working on now is testing out the latest functionality.

Using the code at 7ef85f5 (which includes the schema dump and the README), I ran an export from a local DB containing a production snapshot and it took 1 min 58 seconds. The resulting tarball was 136M. Restoring the schema took under a second. Restoring the data took 32 min. I suspect disabling indexes/constraints during import would improve the performance, but I don't think we need to go down the route of dropping all the indexes/constraints and then recreating them unless people complain (and not all users of this will be importing into PostgreSQL anyway).

I'm currently working on deploying and testing this on our staging instance to check the Heroku-specific parts.
Ok, I deployed 620bb51 to staging and enqueued a dump-db job; the resulting tar is available at https://alexcrichton-test.s3.amazonaws.com/db-dump.tar.gz if you want to check it out. (There's a small amount of data on https://staging-crates-io.herokuapp.com/ that has been used to test various things, so nothing production-sized, but also nothing important that we can't reset if we've leaked something.)

The git commit is in the metadata, uploading to S3 went fine, and I actually ran the dump-db job twice and the file was overwritten correctly. I haven't tested passing arguments when enqueueing the dump-db task (but I'm not really expecting that to have problems). I have started the creation of a follower DB on production so that we can use the follower once this is deployed.
I think the way the columns are configured currently is fine as well. I also don't think trying to reconstruct dumps from the information available via the API and comparing them to the information in a test dump in the staging environment (the idea in this PR's description) is worthwhile. Between the two of us, I think we've given each column careful consideration.
(Bringing in the questions about integration tests from your linked comment:)
Yes, I think 1.3 s is worth it.
I think it's much more likely that we'll mess up the schema such that we break the import rather than something about the data; I think it'd be ok to merge this as-is for now and add tests as we break this feature and people complain 🤷♀ That'll ensure there's value in the tests.
I can't think of a scenario where I'd want to run the same test multiple times in parallel; I'm also fine categorizing this as "merge as-is and fix it if folks complain". The remaining concern I have before merging this is documentation! I could potentially be convinced that we should merge this without documentation and do a "soft launch" in production, but my current concern is that we'll forget to come back to document the existence of these database dumps and they won't get used, or people will complain that we haven't documented this when they find out it exists.
@carols10cents Thanks a lot for your reviews! No worries about the scattered reviews – I'm glad you managed to find the time for such a big review at all. I only summarized the answers I'm still waiting for because it was easy to lose track of what had already been answered, not to put pressure on you for quicker replies. I would like to include documentation in this PR, but I first wanted to have everything else finished, to avoid having to rewrite the documentation all the time. Glad we are close to finalizing this. :)
@carols10cents I've added some documentation now. The link is not very prominent, and I marked the database dumps as "experimental" for the first launch. I'd appreciate some language improvements to the docs from a native speaker.
Add missing foreign key constraint for column badges.crate_id.

Since we may already have entries in the `badges` table referring to crates that don't exist, we need to delete these entries before creating the foreign key constraint. This is a destructive operation in the sense that it cannot be reverted, but it should only delete rows that should have been deleted together with the crates they belong to.

We have a test to verify that all columns called `crate_id` are actually foreign keys referring to `crates.id`. However, the query for the relevant columns contains the filter `contype = 'f'`, effectively limiting the result to columns that already have foreign key constraints. This change fixes the query to also allow `contype IS NULL`. In addition, I modified the query to only verify tables in the schema `public`. This is useful for an integration test for the database dumps in #1800.
☔ The latest upstream changes (presumably #1840) made this pull request unmergeable. Please resolve the merge conflicts.
@bors r+

I think we're ready to go with this!!! Will work on deploying and setting up the scheduled job to enqueue the dump task once bors merges. I changed the docs to say 24 hours, and I'm going to set up the task for once per day; I think that's a good starting place.
📌 Commit 01a4e98 has been approved by carols10cents
Prototype: Public database dumps

This is an unfinished prototype implementation of the design I proposed to implement #630 (see #630 (comment) and #630 (comment)). I am submitting this for review to gather some feedback on the basic approach before spending more time on this.

This PR adds a background task to create a database dump. The task can be triggered with the `enqueue-job` binary, so it is easy to schedule in production using Heroku Scheduler.

### Testing instructions

To create a dump:

1. Start the background worker: `cargo run --bin background-worker`
1. Trigger a database dump: `cargo run --bin enqueue-job dump_db`

The resulting tarball can be found in `./local_uploads/db-dump.tar.gz`.

To re-import the dump:

1. Unpack the tarball: `tar xzf local_uploads/db-dump.tar.gz`
1. Create a new database: `createdb test_import_dump`
1. Run the Diesel migrations for the new DB: `diesel migration run --database-url=postgres:///test_import_dump`
1. Import the dump: `cd DUMP_DIRECTORY && psql test_import_dump < import.sql`

(Depending on your local PostgreSQL setup, in particular the permissions for your user account, you may need different commands and URIs than given above.)

### Author's notes

* The background task executes `psql` in a subprocess to actually create the dump. One reason for this approach is its simplicity – the `\copy` convenience command issues a suitable `COPY TO STDOUT` SQL command and streams the result directly to a local file. Another reason is that I couldn't figure out how to do this at all in Rust with a Diesel `PgConnection`. There doesn't seem to be a way to run raw SQL with full access to the result.
* The unit test to verify that the column visibility information in `dump_db.toml` is up to date compares the information in that file to the current schema of the test database. Diesel does not provide any schema reflection functionality, so we query the actual database instead. This test may spuriously fail or succeed locally if you still have some migrations from unmerged branches applied to your test database. On Travis this shouldn't be a problem, since I believe we always start with a fresh database there. (My preferred solution for this problem would be for Diesel to provide some way to introspect the information in `schema.rs`.)

### Remaining work

* [x] Address TODOs in the source code. The most significant one is to update the `Uploader` interface to accept streamed data instead of a `Vec<u8>`. Currently the whole database dump needs to be loaded into memory at once.
* ~~Record the URLs of uploaded tarballs in the database, and provide an API endpoint to download them.~~ Decided to only store the latest dump at a known URL.
* [x] Devise a scheme for cleaning up old dumps from S3. The easiest option is to only keep the latest dump.
* [x] Somewhere in the tar file, note the date and time the dump was generated.
* [x] Verify that `dump-db.toml` is correct, i.e. that we don't leak any data we don't want to leak. Done via manual inspection. ~~One idea to do so is to reconstruct dumps from the information available via the API and compare to information in a test dump in the staging environment. This way we could verify what additional information will be made public.~~
* [x] The code needs some form of integration test. Idea from #1629: exporting some data, then trying to re-import it in a clean database.
* [x] Implement and document a way of re-importing the dumps to the database, e.g. to allow local testing of crates.io with realistic data.
* [x] Rebase and remove commits containing the first implementation.
* [x] Document the existence of this dump, how often it's regenerated, and that only the most recent dump is available (maybe in the crawler policy/crawler blocked error message?)
* [x] Include the commit hash of the crates.io version that created the dump in the tarball.
@carols10cents Oh, right, I meant to comment about the backup frequency, but forgot. Of course I'm fine with starting with once a day. I suggested doing it every six hours because we are using a follower DB for the backups, so it shouldn't have any impact on performance, but it might increase the CloudFront bill if people are downloading the dumps more frequently. Thanks a lot for guiding me through this process – I'm glad we got this done!
☀️ Test successful - checks-travis