Refactor storage #643
Conversation
src/storage/s3.rs (Outdated)

```rust
pub(super) fn store_batch(&self, batch: &[Blob]) -> Result<(), Error> {
    let mut rt = tokio::runtime::Runtime::new().unwrap();
```
Is creating a new tokio runtime every time we store files the best way to do things? Tokio already relies on fairly global state, so an existing runtime could just be reused.
Without creating the runtime, how would I use `block_on`? Or would I use `await` instead?
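For context, one way to follow the reviewer's suggestion is to create a single runtime lazily and share it across calls. This is only a sketch, assuming a `once_cell` dependency and a tokio version whose `Runtime::block_on` takes `&self` (tokio 1.x; older versions required `&mut self`); the types below are stand-ins, not the PR's actual code:

```rust
use once_cell::sync::Lazy;
use tokio::runtime::Runtime;

// One runtime shared by every storage call, created lazily on first use.
static RUNTIME: Lazy<Runtime> = Lazy::new(|| {
    Runtime::new().expect("failed to create tokio runtime")
});

// Stand-ins for the PR's types.
struct Blob;
struct Error;
struct S3Backend;

impl S3Backend {
    fn store_batch(&self, batch: &[Blob]) -> Result<(), Error> {
        // Reuse the shared runtime instead of constructing a new one per call.
        RUNTIME.block_on(async {
            let _ = batch; // the real code would drive the uploads here
            Ok(())
        })
    }
}
```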
Status update: this has most of the code changes I want, but it's missing any tests for S3, which kind of defeats the point. It sounds like rusoto_mock is not going to be flexible enough for our needs, so the next best thing is minio. We have minio set up with docker-compose already, but it needs to be tied into the test suite somehow, probably by having […]. Given the scale of that task, I plan to make it a separate PR so it's not lost in the details of the storage refactor. This should also help with testing on other platforms, since developers will no longer need a local postgres database.
Status update to the status update: it ended up being really easy to add min.io, so I just did it here. Note that this will require contributors to have min.io running before running tests; they can do this the same way as on CI: […]
The general structure of the code looks good, and I left some small feedback on how to organize it.
This PR is already too big, but in a followup I'd love to see a change to how the storage is initialized: instead of having a `Storage::new()` that picks on its own between S3 and the database, there should be individual constructors for database storage and S3 storage, and the choice of which to instantiate should be moved to the CLI. Then, each function that interacts with the storage should receive its own `&Storage`, and tests should call `env.storage()` or `env.storage_s3()` (preferring the former) to get the prepopulated instance.
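As a rough illustration of that suggestion (all names and shapes here are hypothetical, not from the PR):

```rust
// Hypothetical sketch of per-backend constructors with the choice made by
// the caller; the real crate's types will differ.
pub struct DatabaseBackend { /* connection pool, etc. */ }
pub struct S3Backend { /* client and bucket, etc. */ }

pub enum Storage {
    Database(DatabaseBackend),
    S3(S3Backend),
}

impl Storage {
    // One explicit constructor per backend; neither picks on its own.
    pub fn new_database(backend: DatabaseBackend) -> Self {
        Storage::Database(backend)
    }

    pub fn new_s3(backend: S3Backend) -> Self {
        Storage::S3(backend)
    }
}

// The choice between backends moves out to the CLI / startup code:
fn build_storage(use_s3: bool) -> Storage {
    if use_s3 {
        Storage::new_s3(S3Backend {})
    } else {
        Storage::new_database(DatabaseBackend {})
    }
}
```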
The only difference was between `pub(crate)` and private, so probably not worth it.
Co-Authored-By: Chase Wilson <buckshot1233@gmail.com>
This was only used for a one-time migration and is no longer necessary.
`join_all` polls every future when any of them wakes up; `FuturesOrdered` has fancier tracking of which one triggered the wakeup, so it only polls what it needs to.
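A minimal sketch of using `FuturesOrdered` from the `futures` crate for this; `Blob`, `Error`, and `upload_one` are illustrative stand-ins, not the PR's code:

```rust
use futures::stream::{FuturesOrdered, StreamExt};

// Stand-ins for the PR's real types.
struct Blob;
struct Error;

// Hypothetical per-blob upload; stands in for the real S3 PutObject call.
async fn upload_one(_blob: Blob) -> Result<(), Error> {
    Ok(())
}

async fn upload_batch(batch: Vec<Blob>) -> Result<(), Error> {
    // Unlike join_all, which re-polls every future whenever any of them is
    // woken, FuturesOrdered polls only the future that triggered the wakeup.
    let mut uploads: FuturesOrdered<_> = batch.into_iter().map(upload_one).collect();
    while let Some(result) = uploads.next().await {
        result?;
    }
    Ok(())
}
```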
This could definitely use some more work, but I thought I'd get an initial draft up to see if this is the right approach. Thanks to @pietroalbini for doing the initial work :)

- `S3Backend` and `DatabaseBackend` instead of selecting each backend at runtime for each file.
- Split `add_path_into_database` into 3 functions (sketched after this list):
  - `add_path_into_database`, so I don't have to change a bunch of unrelated code. This just delegates to `store_all` and then turns the returned HashMap into JSON to be stored in the DB.
  - `store_all`, which takes the folder to upload and what path to upload it to. `store_all` reads all the files in the folder into memory, then divides them into batches of size `MAX_CONCURRENT_UPLOADS` and forwards those batches on to `store_batch`.
  - `store_batch`, which differs based on whether we're uploading to S3 or storing in the database.
- `min.io` and Postgres using the same temporary tables we've been using.
- `move_to_s3` function.
- `tokio::Runtime` for each batch of files.

I tried to reuse code wherever possible, so @Mark-Simulacrum's hard work on simultaneous uploads is still here.
However, I don't yet have tests for simultaneous uploads.
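A rough sketch of that three-function split; every type, signature, helper, and the batch-size value below is an illustrative stand-in, not the PR's exact code:

```rust
use std::path::{Path, PathBuf};

// Stand-ins for the PR's real types.
struct Blob { path: PathBuf, content: Vec<u8> }
struct Error;
enum Storage { S3, Database }

const MAX_CONCURRENT_UPLOADS: usize = 1000; // illustrative value

/// Thin wrapper kept so unrelated call sites don't change: delegate to
/// store_all, then serialize the returned file list into JSON for the DB.
fn add_path_into_database(storage: &Storage, prefix: &str, path: &Path) -> Result<String, Error> {
    let files = storage.store_all(prefix, path)?;
    Ok(file_list_to_json(files))
}

impl Storage {
    /// Read every file under `root` into memory, then forward them to
    /// store_batch in chunks of MAX_CONCURRENT_UPLOADS.
    fn store_all(&self, prefix: &str, root: &Path) -> Result<Vec<PathBuf>, Error> {
        let blobs = read_all_files(prefix, root)?; // hypothetical helper
        for batch in blobs.chunks(MAX_CONCURRENT_UPLOADS) {
            self.store_batch(batch)?;
        }
        Ok(blobs.into_iter().map(|b| b.path).collect())
    }

    /// Backend-specific: upload to S3 or insert into Postgres.
    fn store_batch(&self, batch: &[Blob]) -> Result<(), Error> {
        match self {
            Storage::S3 => { /* concurrent PutObject calls */ let _ = batch; Ok(()) }
            Storage::Database => { /* INSERT INTO files ... */ let _ = batch; Ok(()) }
        }
    }
}

fn read_all_files(_prefix: &str, _root: &Path) -> Result<Vec<Blob>, Error> {
    Ok(Vec::new()) // directory walk omitted
}

fn file_list_to_json(files: Vec<PathBuf>) -> String {
    format!("{:?}", files) // real code would use serde_json
}
```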