Prototype: Public database dumps #1800
Conversation
(rust_highfive has picked a reviewer for you, use r? to override)

r? @sgrif
Thanks for working on this! I have a few more thoughts on the points listed in your PR description:
Yeah, Diesel doesn't support
Seems fine, I don't think our test suite is expected to pass with random migrations applied,
I'm not sure I'm understanding this item properly, but if your point is to ensure all columns are listed in the TOML file, I think we can do this one of two ways:
I think the easiest option is the best one here, and removes the need for an API endpoint at all -- we can just point people to static.crates.io/database.dump and be done with it.
I am curious about the decision to go with CSV output for this though. My initial reaction is that a PG dump (either in SQL or binary format) would be more convenient for all parties, but now I'm wondering if that's actually true.
We can't use `pg_dump`, since it can only dump full tables. See #630 (comment) for more details.
I'm not sure I'm understanding this item properly, but if your point is to ensure all columns are listed in the TOML file, I think we can do this one of two ways
I mean that we need to verify that this implementation does not leak any data that we don't want to leak. We need to be very careful about that. My suggestion is to reconstruct a CSV dump from information obtained by scraping the API, and then compare it to the CSV generated by the dump code. This way, we can see exactly what additional information we are about to release.
A test that the schema in the TOML file matched the database schema is already in the code, so I know how to do that. :)
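For readers following along, here is a rough sketch of how such a check can gather the live schema for comparison with the TOML file. This is not the actual test from the repository; the struct and function names and the explicit `::text` casts are illustrative, and it assumes diesel 1.x-style `sql_query` usage with the derive macros in scope (e.g. via `#[macro_use] extern crate diesel;`):

```rust
use diesel::pg::PgConnection;
use diesel::prelude::*;
use diesel::sql_types::Text;
use std::collections::BTreeSet;

// Row type for the raw information_schema query below (diesel 1.x derive style).
#[derive(QueryableByName)]
struct SchemaColumn {
    #[sql_type = "Text"]
    table_name: String,
    #[sql_type = "Text"]
    column_name: String,
}

/// Collect every (table, column) pair in the `public` schema, so a test can
/// assert that each pair is explicitly listed in dump-db.toml (and vice versa).
fn live_schema_columns(conn: &PgConnection) -> QueryResult<BTreeSet<(String, String)>> {
    let rows: Vec<SchemaColumn> = diesel::sql_query(
        "SELECT table_name::text AS table_name, column_name::text AS column_name \
         FROM information_schema.columns \
         WHERE table_schema = 'public'",
    )
    .load(conn)?;
    Ok(rows
        .into_iter()
        .map(|row| (row.table_name, row.column_name))
        .collect())
}
```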
Pretty much all the responses to the concerns I raised reference your comments; I don't want to step on your toes.
My initial thought was copying into tables in a separate schema and dumping that, but that probably wouldn't make things any simpler.
Yeah, I considered that in the comments to #630, but overall I believe it would make things much more difficult. Creating a consistent dump will become more difficult since we won't be able to use
Based on the feedback, I cleaned up the design a bit. I removed the hacky use of row-level security in favour of SQL row filter expressions in the TOML file. This way, all the information about what is included in the database dump is now in a single place, and we essentially don't need any changes to the existing code.
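To make that concrete, here is a rough sketch (not the actual implementation) of how a table's public columns and its optional row filter from the TOML file could be turned into the psql `\copy` command that exports it. The function name and parameters are illustrative:

```rust
/// Build the psql `\copy` command exporting one table, given the columns
/// marked "public" in dump-db.toml and the table's optional SQL row filter.
fn copy_statement(table: &str, public_columns: &[&str], filter: Option<&str>) -> String {
    let cols = public_columns.join(", ");
    match filter {
        // With a filter, only the matching rows of the public columns are exported.
        Some(expr) => format!(
            "\\copy (SELECT {cols} FROM {table} WHERE {expr}) TO '{table}.csv' WITH CSV HEADER\n",
            cols = cols,
            table = table,
            expr = expr,
        ),
        // Without a filter, all rows of the public columns are exported.
        None => format!(
            "\\copy (SELECT {cols} FROM {table}) TO '{table}.csv' WITH CSV HEADER\n",
            cols = cols,
            table = table,
        ),
    }
}
```

Concatenating one such line per configured table yields the script that the background task pipes into `psql`.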
Looking good so far! I like the new design a lot better. I downloaded a production snapshot and ran the task locally; it ran in 2 min 10 seconds. That feels like we should be able to schedule it for once a day, whenever our traffic is usually lowest, without impacting production query performance. If it does end up being noticeable, we could create a follower DB in Heroku and query that, and if that isn't viable, we can look into the strategy of using the backups. Querying the live production DB seems like the least complex solution, though.
The uploaded tar file contained what I expected:
- no files for entirely private tables like `api_tokens` and `background_jobs`
- files containing all data for entirely public tables like `badges` and `categories`
- files containing only data that matches the filter for tables like `crate_owners`
- files containing only the public columns for tables where some columns are public and some are private, like `version_downloads`
I have not tried re-importing the data; I'm happy to try that to test instructions or a tool for doing so once one of those exists.
Verify that dump-db.toml is correct, i.e. that we don't leak any data we don't want to leak. One idea to do so is to reconstruct dumps from the information available via the API and compare to information in a test dump in the staging environment. This way we could verify what additional information will be made public.
I haven't done this, but I have manually looked through the private/public/filter settings and left comments for anything I think should be changed.
I think the easiest option is the best one here, and removes the need for an API endpoint at all -- we can just point people to static.crates.io/database.dump and be done with it.
That sounds great; we should have the date+time the dump was generated in the tar somewhere since it won't be in the filename, so that people know how long ago it was generated (and can bug us if the dumps stop getting generated for some reason)
I also edited the remaining work section in the PR description to add a few items; let me know if you have questions/concerns about anything!
FWIW, having a hot replica for failovers is something I've been wanting to do anyway, so it could be worth having this go to a follower regardless.
On Wed, Aug 21, 2019, Carol (Nichols || Goulding) requested changes on this pull request with the following inline comments:
In src/tasks/dump_db.rs:
> @@ -37,38 +35,97 @@ enum ColumnVisibility {
Public,
}
-/// For each table, map column names to their respective visibility.
-type VisibilityConfig = BTreeMap<String, BTreeMap<String, ColumnVisibility>>;
+#[derive(Clone, Debug, Deserialize)]
+struct TableConfig {
TODO: Add some documentation to this struct and its fields please -- the documentation you have in dump-db.toml looks great and would be great to have here too!
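For context, the struct under discussion corresponds roughly to the sketch below. The field names, serde attributes, and doc comments are assumptions based on the design described in this PR (a per-column visibility map plus an optional SQL row filter), not the actual source:

```rust
use serde::Deserialize;
use std::collections::BTreeMap;

/// Whether a column's contents are included in the public dump.
#[derive(Clone, Copy, Debug, Deserialize)]
#[serde(rename_all = "lowercase")]
enum ColumnVisibility {
    Private,
    Public,
}

/// Visibility rules for a single table, as declared in `dump-db.toml`.
#[derive(Clone, Debug, Deserialize)]
struct TableConfig {
    /// Optional SQL expression used as a row filter; only rows matching the
    /// expression are exported (e.g. restricting which `crate_owners` rows
    /// end up in the dump).
    #[serde(default)]
    filter: Option<String>,
    /// Maps every column of the table to its visibility; private columns are
    /// omitted from the generated export statement.
    columns: BTreeMap<String, ColumnVisibility>,
}
```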
------------------------------
In src/tasks/dump_db.rs:
> +pub fn dump_db(env: &Environment) -> Result<(), PerformError> {
+ // TODO make path configurable
+ const EXPORT_DIR_TEMPLATE: &str = "/tmp/dump-db/%Y-%m-%d-%H%M%S";
+ let export_dir = PathBuf::from(chrono::Utc::now().format(EXPORT_DIR_TEMPLATE).to_string());
+ std::fs::create_dir_all(&export_dir)?;
+ let visibility_config = toml::from_str(include_str!("dump-db.toml")).unwrap();
+ let database_url = if cfg!(test) {
+ crate::env("TEST_DATABASE_URL")
+ } else {
+ crate::env("DATABASE_URL")
+ };
+ run_psql(&visibility_config, &database_url, &export_dir)?;
+ let tarball = create_tarball(&export_dir)?;
+ let target_name = format!("dumps/{}", tarball.file_name().unwrap().to_str().unwrap());
+ upload_tarball(&tarball, &target_name, &env.uploader)?;
+ // TODO: more robust cleanup
What did you have in mind for more robust cleanup?
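Not to answer for the author, but one common pattern is an RAII guard so the temporary export directory is removed even when an earlier step fails. Everything below, names included, is just a sketch of that idea:

```rust
use std::path::{Path, PathBuf};

/// Deletes the export directory when dropped, so cleanup also happens when an
/// earlier step returns an error and the function exits early via `?`.
struct DeleteDirOnDrop(PathBuf);

impl Drop for DeleteDirOnDrop {
    fn drop(&mut self) {
        // Ignore errors here: failing to clean /tmp shouldn't fail the job.
        let _ = std::fs::remove_dir_all(&self.0);
    }
}

fn with_cleanup(export_dir: &Path) -> std::io::Result<()> {
    std::fs::create_dir_all(export_dir)?;
    let _guard = DeleteDirOnDrop(export_dir.to_path_buf());
    // ... run psql, build the tarball, upload it ...
    Ok(())
} // `_guard` is dropped here, removing the directory on both the Ok and Err paths.
```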
------------------------------
In src/tasks/dump-db.toml:
> +owner_kind = "public"
+
+[crates.columns]
+id = "public"
+name = "public"
+updated_at = "public"
+created_at = "public"
+downloads = "public"
+description = "public"
+homepage = "public"
+documentation = "public"
+readme = "public"
+textsearchable_index_col = "public"
+license = "public"
+repository = "public"
+max_upload_size = "public"
max_upload_size isn't currently visible through the API; it's people who have asked for an exception to the 10MB upload limit, so I can see how this could be private info. Although if you download .crate files that are >10MB, that's a pretty good indicator.... 🤷‍♀️
------------------------------
In src/tasks/dump-db.toml:
> +crates_cnt = "public"
+created_at = "public"
+
+[metadata.columns]
+total_downloads = "public"
+
+[publish_limit_buckets.columns]
+user_id = "private"
+tokens = "private"
+last_refill = "private"
+
+[publish_rate_overrides.columns]
+user_id = "private"
+burst = "private"
+
+[readme_renderings.columns]
The readme_renderings table's data isn't currently exposed anywhere in the public API. It's not really *sensitive*, but I can't see how it would be particularly *useful* to anyone... (it's the time at which we last generated the rendered readme HTML stored in S3.) I think we should exclude all its columns because including it will probably be either useless or confusing.
------------------------------
In src/tasks/dump-db.toml:
> +
+[publish_limit_buckets.columns]
+user_id = "private"
+tokens = "private"
+last_refill = "private"
+
+[publish_rate_overrides.columns]
+user_id = "private"
+burst = "private"
+
+[readme_renderings.columns]
+version_id = "public"
+rendered_at = "public"
+
+[reserved_crate_names.columns]
+name = "public"
The list of reserved_crate_names isn't currently accessible via the API, but it is accessible in GitHub (set 1: https://github.com/rust-lang/crates.io/blob/a27c704faa2982ddd75a3dc564da85c0217b950e/migrations/20170305095748_create_reserved_crate_names/up.sql#L6-L12, set 2: https://github.com/rust-lang/crates.io/blob/a27c704faa2982ddd75a3dc564da85c0217b950e/migrations/20170430202433_reserve_windows_crate_names/up.sql#L2-L4), so this is fine.
@carols10cents Thanks for the review!
What was the size of the resulting tarball?
I don't expect too much of an impact given that the database is relatively small – we don't blow out all the disk caches with the dump, so it should hardly be noticeable. (I don't know the exact size of the database, nor what hardware we are using, nor the exact access patterns, but this is my gut feeling.)
The code that generates the export psql script can be easily modified to also generate an import script, which we can simply include in the tarball.
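Sketching what that could look like (the names and exact statements below are illustrative, not the real generator): the same per-table loop that writes the `\copy ... TO` export lines can also emit matching `\copy ... FROM` lines into an `import.sql` that ships inside the tarball. A real script would additionally need to load tables in an order that satisfies the foreign key constraints, or defer them.

```rust
use std::fmt::Write;

/// Generate a psql script that re-imports the CSV files produced by the dump.
/// `tables` pairs each table name with its exported (public) columns.
fn import_script(tables: &[(&str, Vec<&str>)]) -> String {
    let mut script = String::from("BEGIN;\n");
    for (table, columns) in tables {
        // Wipe existing rows first so re-running the import is destructive but
        // repeatable, then load the corresponding CSV file.
        writeln!(script, "TRUNCATE {} CASCADE;", table).unwrap();
        writeln!(
            script,
            "\\copy {} ({}) FROM '{}.csv' WITH CSV HEADER",
            table,
            columns.join(", "),
            table
        )
        .unwrap();
    }
    script.push_str("COMMIT;\n");
    script
}
```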
The tarball contains a directory named after the timestamp, so this should already be covered.
138M. The dump of the entire production database snapshot I started with is 151M.
We use heroku standard 1x dynos that have 512MB RAM (for where the task will run), and our database plan is heroku's standard 0, 4GB RAM.
Oh right! That'll do :)
I went through all the database columns again and tried fetching the information via the API. Here is a list of all columns currently marked as "public" that I could not retrieve via the API:
The columns
I just hit a rather unexpected rabbit hole. When calling
However, now I need to simultaneously write to stdin and read from stderr, and there doesn't seem to be a standard way of doing this in Rust. Options I can think of:
Any advice on this? I would personally go with approach 2 (spawn a thread).
This is my preference. There's no specific reason a background job wouldn't be allowed to spawn additional threads (both in general across any lang, and also with swirl specifically). In theory this should probably spawn those threads on the job runner's thread pool, but there's no public API for that in swirl right now, and that would really only be a concern if this were high enough volume that it could overwhelm the worker process. That isn't the case here.
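A minimal sketch of approach 2, for reference: a helper thread feeds the generated script to psql's stdin while the parent thread drains stderr, which avoids deadlocking when either pipe's buffer fills up. The function name and error handling below are illustrative rather than the actual implementation:

```rust
use std::io::{Read, Write};
use std::process::{Command, Stdio};
use std::thread;

/// Run `psql`, feeding `script` to its stdin from a separate thread while the
/// current thread reads everything psql prints on stderr.
fn run_psql(script: &str, database_url: &str) -> Result<String, Box<dyn std::error::Error>> {
    let mut child = Command::new("psql")
        .arg(database_url)
        .stdin(Stdio::piped())
        .stderr(Stdio::piped())
        .spawn()?;

    // Writer thread: owns the child's stdin and drops it when done, so psql
    // sees EOF and can terminate.
    let mut stdin = child.stdin.take().unwrap();
    let script = script.to_owned();
    let writer = thread::spawn(move || stdin.write_all(script.as_bytes()));

    // Meanwhile, drain stderr here; reading concurrently prevents a deadlock
    // if psql produces a lot of output while we are still writing.
    let mut stderr = String::new();
    child.stderr.take().unwrap().read_to_string(&mut stderr)?;

    writer.join().expect("writer thread panicked")?;
    let status = child.wait()?;
    if !status.success() {
        return Err(format!("psql exited with {}: {}", status, stderr).into());
    }
    Ok(stderr)
}
```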
src/tasks/dump_db.rs (outdated):

    timestamp: &self.timestamp,
    crates_io_commit: dotenv::var("HEROKU_SLUG_COMMIT")
        .unwrap_or_else(|_| "unknown".to_owned()),
    format_version: "0.1",
I'm assuming you're envisioning this version number to increase its minor version with any backwards compatible schema change and increase its major version with any breaking schema change? So to give some information about compatibility with previous database dumps without needing to look up changes between git SHAs?
I'm a little concerned about the usefulness of this version number because I can very easily see us forgetting to update this with schema changes :-/ If it's not incremented meaningfully, then it's not going to be particularly useful, and people using the database dumps will need to look up what changed in the git history anyway.
Given that we recommend using the import script, which does a destructive import anyway, rather than, say, providing diffs between dumps, is this going to provide enough value for people using the dumps to justify our maintenance of it?
You are right, this would need a bit more specification to be useful.
I did not have in mind to increase this on every schema change, but rather every time we change something fundamental about the format of the tarball, like adding an entirely new kind of file.
The main goal was to allow us to make backwards-incompatible changes and indicate these in the version number. However, this doesn't really work. If we change the dump in a way that breaks compatibility, people will notice without checking a version number, since their tools will break. And if they can't go back to the old version, this version number does not really help.
So the only way to break compatibility is to provide dumps in the new format at a different URL alongside the dumps in the old format. And that does not require a version number, so I'll remove it again.
To be able to proceed, I would need feedback on the following points:
I'm sorry this review is so scattered; I get small bits of time to work on it. I do see your latest commit and your request for specific feedback, and I will get to that next -- what I'm working on now is testing out the latest functionality.

Using the code at 7ef85f5 (which includes the schema dump and the README), I ran an export from a local DB containing a production snapshot and it took 1 min 58 seconds. The resulting tarball was 136M. Restoring the schema took under a second. Restoring the data took 32 min. I suspect disabling indexes/constraints during import would improve the performance, but I don't think we need to go down the route of dropping all the indexes/constraints and then recreating them unless people complain (and not all users of this will be importing into PostgreSQL anyway).

I'm currently working on deploying and testing this on our staging instance to check the Heroku-specific parts.
Ok, I deployed 620bb51 to staging and enqueued a dump-db job; the resulting tar is available at https://alexcrichton-test.s3.amazonaws.com/db-dump.tar.gz if you want to check it out. (There's a small amount of data on https://staging-crates-io.herokuapp.com/ that has been used to test various things, so nothing production-sized, but also nothing important that we can't reset if we've leaked something.)

The git commit is in the metadata, uploading to S3 went fine, and I actually ran the dump-db job twice and the file was overwritten correctly. I haven't tested passing arguments when enqueueing the dump-db task (but I'm not really expecting that to have problems). I have started the creation of a follower DB on production so that we can use the follower once this is deployed.
I think the way the columns are configured currently is fine as well. I also don't think trying to reconstruct dumps from the information available via the API and comparing them to the information in a test dump in the staging environment (the idea in this PR's description) is worthwhile. Between the two of us, I think we've given each column careful consideration.
(Bringing in the questions about integration tests from your linked comment:)
Yes, I think 1.3 s is worth it.
I think it's much more likely that we'll mess up the schema such that we break the import rather than something about the data; I think it'd be ok to merge this as-is for now and add tests as we break this feature and people complain 🤷♀ That'll ensure there's value in the tests.
I can't think of a scenario where I'd want to run the same test multiple times in parallel; I'm also fine categorizing this as "merge as-is and fix it if folks complain". The remaining concern I have before merging this is documentation! I could potentially be convinced that we should merge this without documentation and do a "soft launch" in production, but my current concern is that we'll forget to come back to document the existence of these database dumps and they won't get used, or people will complain that we haven't documented this when they find out it exists.
@carols10cents Thanks a lot for your reviews! No worries about the scattered reviews – I'm glad you managed to find the time for such a big review at all. I only summarized the answers I'm still waiting for because it was easy to lose track of what had already been answered, not to put pressure on you for quicker replies. I would like to include documentation in this PR, but I first wanted to have everything else finished, to avoid having to rewrite the documentation all the time. Glad we are close to finalizing this. :)
@carols10cents I've added some documentation now. The link is not very prominent, and I marked the database dumps as "experimental" for the first launch. I'd appreciate some language improvements to the docs from a native speaker.
Add missing foreign key constraint for column badges.crate_id.

Since we may already have entries in the `badges` table referring to crates that don't exist, we need to delete these entries before creating the foreign key constraint. This is a destructive operation in the sense that it cannot be reverted, but it should only delete rows that should have been deleted together with the crates they belong to.

We have a test to verify that all columns called `crate_id` are actually foreign keys referring to `crates.id`. However, the query for the relevant columns contains the filter `contype = 'f'`, effectively limiting the result to columns that already have foreign key constraints. This change fixes the query to also allow `contype IS NULL`. In addition, I modified the query to only verify tables in the schema `public`. This is useful for an integration test for the database dumps in #1800.
☔ The latest upstream changes (presumably #1840) made this pull request unmergeable. Please resolve the merge conflicts.
@bors r+

I think we're ready to go with this!!! Will work on deploying and setting up the scheduled job to enqueue the dump task once bors merges. I changed the docs to say 24 hours, and I'm going to set up the task for once per day; I think that's a good starting place.
📌 Commit 01a4e98 has been approved by carols10cents
Prototype: Public database dumps

This is an unfinished prototype implementation of the design I proposed to implement #630 (see #630 (comment) and #630 (comment)). I am submitting this for review to gather some feedback on the basic approach before spending more time on this.

This PR adds a background task to create a database dump. The task can be triggered with the `enqueue-job` binary, so it is easy to schedule in production using Heroku Scheduler.

### Testing instructions

To create a dump:

1. Start the background worker: `cargo run --bin background-worker`
1. Trigger a database dump: `cargo run --bin enqueue-job dump_db`

The resulting tarball can be found in `./local_uploads/db-dump.tar.gz`.

To re-import the dump:

1. Unpack the tarball: `tar xzf local_uploads/db-dump.tar.gz`
1. Create a new database: `createdb test_import_dump`
1. Run the Diesel migrations for the new DB: `diesel migration run --database-url=postgres:///test_import_dump`
1. Import the dump: `cd DUMP_DIRECTORY && psql test_import_dump < import.sql`

(Depending on your local PostgreSQL setup, in particular the permissions for your user account, you may need different commands and URIs than given above.)

### Author's notes

* The background task executes `psql` in a subprocess to actually create the dump. One reason for this approach is its simplicity – the `\copy` convenience command issues a suitable `COPY TO STDOUT` SQL command and streams the result directly to a local file. Another reason is that I couldn't figure out how to do this at all in Rust with a Diesel `PgConnection`. There doesn't seem to be a way to run raw SQL with full access to the result.
* The unit test to verify that the column visibility information in `dump_db.toml` is up to date compares the information in that file to the current schema of the test database. Diesel does not provide any schema reflection functionality, so we query the actual database instead. This test may spuriously fail or succeed locally if you still have some migrations from unmerged branches applied to your test database. On Travis this shouldn't be a problem, since I believe we always start with a fresh database there. (My preferred solution for this problem would be for Diesel to provide some way to introspect the information in `schema.rs`.)

### Remaining work

* [x] Address TODOs in the source code. The most significant one is to update the `Uploader` interface to accept streamed data instead of a `Vec<u8>`. Currently the whole database dump needs to be loaded into memory at once.
* ~~Record the URLs of uploaded tarballs in the database, and provide an API endpoint to download them.~~ Decided to only store the latest dump at a known URL.
* [x] Devise a scheme for cleaning up old dumps from S3. The easiest option is to only keep the latest dump.
* [x] Somewhere in the tar file, note the date and time the dump was generated.
* [x] Verify that `dump-db.toml` is correct, i.e. that we don't leak any data we don't want to leak. Done via manual inspection. ~~One idea to do so is to reconstruct dumps from the information available via the API and compare to information in a test dump in the staging environment. This way we could verify what additional information will be made public.~~
* [x] The code needs some form of integration test. Idea from #1629: exporting some data, then trying to re-import it in a clean database.
* [x] Implement and document a way of re-importing the dumps to the database, e.g. to allow local testing of crates.io with realistic data.
* [x] Rebase and remove commits containing the first implementation.
* [x] Document the existence of this dump, how often it's regenerated, and that only the most recent dump is available (maybe in the crawler policy/crawler blocked error message?)
* [x] Include the commit hash of the crates.io version that created the dump in the tarball.
@carols10cents Oh, right, I meant to comment about the backup frequency, but forgot. Of course I'm fine with starting with once a day. I suggested doing it every six hours because we are using a follower DB for the backups, so it shouldn't have any impact on performance, but it might increase the CloudFront bill if people are downloading the dumps more frequently. Thanks a lot for guiding me through this process – I'm glad we got this done!
☀️ Test successful - checks-travis