One idea I was exploring in datafusion-contrib/datafusion-objectstore-s3#54 was implementing the `AsyncWrite` trait as an abstraction over multi-part upload. Does that seem like an agreeable addition to this crate?
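Roughly the shape I'm imagining — a minimal sketch, not a concrete proposal for this crate's API. `upload_part`, `MultipartWriter`, and the buffering policy are all hypothetical stand-ins here (the 5 MiB `PART_SIZE` reflects S3's minimum size for every part except the last):

```rust
use std::future::Future;
use std::io;
use std::pin::Pin;
use std::task::{Context, Poll};

use futures::future::BoxFuture;
use futures::{ready, FutureExt};
use tokio::io::AsyncWrite;

/// S3 requires at least 5 MiB for every part except the last.
const PART_SIZE: usize = 5 * 1024 * 1024;

/// Hypothetical stand-in for a real UploadPart request.
async fn upload_part(part_number: usize, data: Vec<u8>) -> io::Result<()> {
    println!("uploading part {} ({} bytes)", part_number, data.len());
    Ok(())
}

/// Buffers incoming writes and flushes them as multipart-upload parts.
struct MultipartWriter {
    buffer: Vec<u8>,
    next_part: usize,
    in_flight: Option<BoxFuture<'static, io::Result<()>>>,
}

impl MultipartWriter {
    /// Drive any in-flight part upload to completion.
    fn poll_in_flight(&mut self, cx: &mut Context<'_>) -> Poll<io::Result<()>> {
        match self.in_flight.as_mut() {
            Some(fut) => {
                let result = ready!(fut.as_mut().poll(cx));
                self.in_flight = None;
                Poll::Ready(result)
            }
            None => Poll::Ready(Ok(())),
        }
    }

    /// Move the buffered bytes into a new part upload.
    fn start_part(&mut self) {
        let data = std::mem::take(&mut self.buffer);
        let part = self.next_part;
        self.next_part += 1;
        self.in_flight = Some(upload_part(part, data).boxed());
    }
}

impl AsyncWrite for MultipartWriter {
    fn poll_write(
        mut self: Pin<&mut Self>,
        cx: &mut Context<'_>,
        buf: &[u8],
    ) -> Poll<io::Result<usize>> {
        // Finish the previous part before accepting more data.
        ready!(self.poll_in_flight(cx))?;
        self.buffer.extend_from_slice(buf);
        if self.buffer.len() >= PART_SIZE {
            self.start_part();
        }
        Poll::Ready(Ok(buf.len()))
    }

    fn poll_flush(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<io::Result<()>> {
        self.poll_in_flight(cx)
    }

    fn poll_shutdown(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<io::Result<()>> {
        // Upload any trailing bytes as the final (possibly undersized) part;
        // a real implementation would then send CompleteMultipartUpload.
        ready!(self.poll_in_flight(cx))?;
        if !self.buffer.is_empty() {
            self.start_part();
        }
        self.poll_in_flight(cx)
    }
}
```

One thing I like about this shape is that `poll_shutdown` maps naturally onto CompleteMultipartUpload, so the caller just uses the standard `AsyncWriteExt::shutdown` to finalize the object.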
Multi-part uploads are helpful when uploading large files. For example, you can write parquet files one row group at a time, uploading each row group's data as a part (though more likely there is some buffering in between to get good part sizes). This is the approach taken in the Arrow C++ S3 FileSystem. We could even upload parts in parallel for better throughput in some scenarios, something AWS recommends (sketched below).
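For the parallel case, something like a tokio `JoinSet` could fan already-buffered parts out concurrently. Again just a sketch, with the same hypothetical `upload_part` stand-in:

```rust
use std::io;
use tokio::task::JoinSet;

/// Hypothetical stand-in for a real UploadPart request.
async fn upload_part(part_number: usize, data: Vec<u8>) -> io::Result<()> {
    println!("uploading part {} ({} bytes)", part_number, data.len());
    Ok(())
}

/// Upload several buffered parts concurrently for better throughput.
async fn upload_parts_parallel(parts: Vec<Vec<u8>>) -> io::Result<()> {
    let mut tasks = JoinSet::new();
    for (number, data) in parts.into_iter().enumerate() {
        tasks.spawn(upload_part(number, data));
    }
    while let Some(joined) = tasks.join_next().await {
        // Surface both task panics (JoinError) and upload errors.
        joined.map_err(|e| io::Error::new(io::ErrorKind::Other, e))??;
    }
    Ok(())
}
```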
It seems that GCS supports this through their S3-compatible API (docs) and Azure Blob store has some notion of "block blobs" that might be applicable (docs).
I'm also hoping the object store can support streaming uploads for exactly the same use case you describe @wjones127 👍
I am hoping to dedicate some time to implementing some form of streaming writes in the future, but truthfully it is not high on my TODO list - I'd be happy if someone beats me to it!