Skip to content

Refactor storage #643

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 38 commits into from
May 1, 2020
Merged

Refactor storage #643

merged 38 commits into from
May 1, 2020

Conversation

jyn514
Copy link
Member

@jyn514 jyn514 commented Mar 17, 2020

This could definitely use some more work, but I thought I'd get an initial draft up to see if this is the right approach. Thanks to @pietroalbini for doing the initial work :)

  • Separate storage backends into S3Backend and DatabaseBackend instead of selecting each backend at runtime for each file.
  • Split up add_path_into_database into 3 functions:
    1. add_path_into_database so I don't have to change a bunch of unrelated code. This just delegates to store_all and then turns the returned HashMap into JSON to be stored in the DB.
    2. store_all, which takes the folder to upload and what path to upload it to. store_all reads all the files in the folder into memory, then divides them into batches of size MAX_CONCURRENT_UPLOADS, and forwards those batches on to store_batch
    3. store_batch, which differs based on whether we're uploading to S3 or storing in the database.
  • Added tests for storing and retrieving files. This tests S3 using min.io and Postgres using the same temporary tables we've been using
  • Remove unused move_to_s3 function
  • Don't create a new tokio::Runtime for each batch of files

I tried to reuse code wherever possible so @Mark-Simulacrum 's hard work on simultaneous uploads is still here. However, I don't yet have tests for simultaneous uploads.

@jyn514 jyn514 force-pushed the refactor-storage branch 3 times, most recently from 86ace11 to f8312f1 Compare April 13, 2020 18:07
}

pub(super) fn store_batch(&self, batch: &[Blob]) -> Result<(), Error> {
let mut rt = tokio::runtime::Runtime::new().unwrap();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is creating a new tokio runtime every time we store files the best way to do things? Tokio already uses a fairly global state, so that could just be reused

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Without creating the runtime, how would I use block_on? Or would I use await instead?

@jyn514
Copy link
Member Author

jyn514 commented Apr 22, 2020

Status update: this has most of the code changes I want but is missing any tests for S3, which kind of defeats the point. It sounds like rusoto_mock is not going to be flexible enough for our needs so the next best thing is minio. We have minio set up with docker-compose already, but it needs to be tied into the test suite somehow, probably by having test::wrapper start a new instance of the docker images.

Given the scale of the task, I plan to make a separate PR so it's not lost in this details of the storage refactor. This should also help with testing on other platforms since developers will no longer need a local postgres database.

@jyn514 jyn514 force-pushed the refactor-storage branch 3 times, most recently from 4d97861 to ee6829f Compare April 25, 2020 06:43
@jyn514 jyn514 changed the title [WIP] Refactor storage Refactor storage Apr 26, 2020
@jyn514 jyn514 changed the title Refactor storage [WIP] Refactor storage Apr 26, 2020
@jyn514
Copy link
Member Author

jyn514 commented Apr 26, 2020

Status update to the status update: it ended up being really easy to add min.io, so I just did it here.

Note that this will require contributors to have min.io running before running tests. They can do this the same way as on CI: docker-compose up -d s3. This is documented in the README.

@jyn514 jyn514 force-pushed the refactor-storage branch from 521adf3 to cfc873c Compare April 26, 2020 02:05
@jyn514 jyn514 changed the title [WIP] Refactor storage Refactor storage Apr 26, 2020
@jyn514
Copy link
Member Author

jyn514 commented Apr 26, 2020

r? @pietroalbini

Copy link
Member

@pietroalbini pietroalbini left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The general structure of the code looks good, and I left some small feedback on how to organize the code.

This PR is already too big, but in a followup I'd love to see how the storage is initialized changed: instead of having a Storage::new() that picks on its own between S3 and the database, there should be individual constructors for database storage and S3 storage, and the choice of which to instantiate should be moved to the CLI. Then, each function that interacts with the storage should receive its own &Storage, and tests should call env.storage() or env.storage_s3() (preferring the former) to get the prepopulated instance.

jyn514 and others added 21 commits April 28, 2020 13:53
The only difference was between `pub(crate)` and private, so probably
not worth it.
Co-Authored-By: Chase Wilson <buckshot1233@gmail.com>
Co-Authored-By: Chase Wilson <buckshot1233@gmail.com>
This was only used for a one-time migration and is no longer necessary.
`join_all` polls every future when any wakes up
`FuturesOrdered` has fancier tracking of which one triggered the wakeup so it only polls what it needs to
- remove unused imports
- remove unused rusoto_mock dependency
@jyn514 jyn514 force-pushed the refactor-storage branch from a24cf06 to 8268445 Compare April 28, 2020 17:56
@jyn514 jyn514 merged commit 3dd32ec into rust-lang:master May 1, 2020
@jyn514 jyn514 deleted the refactor-storage branch May 1, 2020 18:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants