This repository has been archived by the owner on Jul 27, 2022. It is now read-only.

AsyncWrite over multi-part upload #9

Closed
wjones127 opened this issue May 20, 2022 · 3 comments

Comments

@wjones127
Contributor

One idea I was exploring in datafusion-contrib/datafusion-objectstore-s3#54 was implementing the AsyncWrite trait as an abstraction over multi-part upload. Does that seem like an agreeable addition to this crate?

Multi-part uploads are helpful when uploading large files. For example, you can write Parquet files one row group at a time, uploading each row group's data as a part (though more likely there is some buffering in between to get good part sizes). This is the approach taken in the Arrow C++ S3 FileSystem. In fact, we could even upload parts in parallel for better throughput in some scenarios (something AWS recommends).
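To make the buffering idea concrete, here is a minimal sketch of the chunking logic such an `AsyncWrite` implementation would need underneath: accumulate incoming bytes, emit a part whenever the minimum part size is reached, and flush the final short part on close. All names here (`MultipartBuffer`, `write`, `finish`) are hypothetical and stand in for the real upload/complete requests; this is not the crate's actual API.

```rust
// Hypothetical sketch of multi-part buffering. In a real AsyncWrite impl,
// `write` would back poll_write and `finish` would back poll_shutdown,
// issuing UploadPart / CompleteMultipartUpload requests respectively.
struct MultipartBuffer {
    min_part_size: usize,
    buffer: Vec<u8>,
    parts: Vec<Vec<u8>>, // stands in for parts already uploaded
}

impl MultipartBuffer {
    fn new(min_part_size: usize) -> Self {
        Self { min_part_size, buffer: Vec::new(), parts: Vec::new() }
    }

    // Accept arbitrarily sized writes; emit a part each time the
    // buffer reaches the minimum part size.
    fn write(&mut self, data: &[u8]) {
        self.buffer.extend_from_slice(data);
        while self.buffer.len() >= self.min_part_size {
            let part: Vec<u8> = self.buffer.drain(..self.min_part_size).collect();
            self.parts.push(part); // real impl: upload this part
        }
    }

    // Flush the final (possibly undersized) part and return all parts,
    // analogous to completing the multi-part upload.
    fn finish(mut self) -> Vec<Vec<u8>> {
        if !self.buffer.is_empty() {
            self.parts.push(std::mem::take(&mut self.buffer));
        }
        self.parts
    }
}

fn main() {
    // S3 requires parts of at least 5 MiB (except the last); we use a
    // tiny size here purely for illustration.
    let mut w = MultipartBuffer::new(5);
    w.write(b"hello world"); // 11 bytes -> two full parts + 1 byte buffered
    let parts = w.finish();
    assert_eq!(parts.len(), 3);
    assert_eq!(parts[0], b"hello".to_vec());
    assert_eq!(parts[2], b"d".to_vec());
    println!("{} parts", parts.len());
}
```

Uploading the buffered parts concurrently (rather than one at a time as sketched here) is where the parallel-throughput win mentioned above would come from.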

It seems that GCS supports this through their S3-compatible API (docs) and Azure Blob store has some notion of "block blobs" that might be applicable (docs).

@domodwyer
Contributor

I'm also hoping the object store can support streaming uploads, for exactly the same use case you describe @wjones127 👍

I am hoping to dedicate some time to implementing some form of streaming writes in the future, but truthfully it is not high on my TODO list - I'd be happy if someone beats me to it!

@alamb
Contributor

alamb commented May 20, 2022

@alamb
Contributor

alamb commented Jul 26, 2022

I think this was done in apache/arrow-rs#2147 @wjones127

Let me know if there is something this issue is still tracking.

@alamb alamb closed this as completed Jul 26, 2022