Use object_store:BufWriter instead of put_multipart #9614
Comments
FYI @devinjdangelo @alamb
Related ticket for cleaning up parallel parquet writer is #9493
Can I take this to get familiar with datasource related code?
That's my initial plan after investigation, hope to hear your feedback. :)
@yyy1000 good luck -- this ticket will require some API exploration / potential changes so it will likely be a bit tricky. I think your suggested plan sounds good. It will be interesting if you can also capture any experience / improvements that would make using
Your plan sounds good and should be relatively straightforward
Is your feature request related to a problem or challenge?
Currently, in many places we use put_multipart for streaming writes. When writing files smaller than 10 MiB this is wasteful: a multipart upload performs 3 requests (create the upload, upload a single part, complete the upload) when 1 put request would suffice.
Describe the solution you'd like
object_store 0.9.1 added BufWriter (https://docs.rs/object_store/latest/object_store/buffered/struct.BufWriter.html), which automatically switches between Put and PutMultipart based on the amount of data that has been written.
Describe alternatives you've considered
We could implement our own adaptive logic in the write path within DataFusion.
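To make the trade-off concrete, here is a minimal, std-only sketch of what such adaptive logic might look like. It is a simplified model, not the actual BufWriter implementation and not DataFusion code: `CountingStore`, `AdaptiveWriter`, `multipart_only`, and `adaptive` are hypothetical names, the store only counts requests instead of talking to a real object store, and the threshold stands in for the 10 MiB limit mentioned above.

```rust
/// Hypothetical stand-in for an object store client.
/// It only counts requests and keeps the bytes it was given.
#[derive(Default, Debug)]
struct CountingStore {
    requests: usize,
    stored: Vec<u8>,
}

impl CountingStore {
    /// Single-request upload.
    fn put(&mut self, data: &[u8]) {
        self.requests += 1;
        self.stored = data.to_vec();
    }
    /// First request of a multipart upload.
    fn create_multipart(&mut self) {
        self.requests += 1;
        self.stored.clear();
    }
    /// One request per uploaded part.
    fn upload_part(&mut self, part: &[u8]) {
        self.requests += 1;
        self.stored.extend_from_slice(part);
    }
    /// Final request of a multipart upload.
    fn complete_multipart(&mut self) {
        self.requests += 1;
    }
}

/// Assumed adaptive writer: buffers data and only switches to a
/// multipart upload once more than `threshold` bytes are buffered.
struct AdaptiveWriter<'a> {
    store: &'a mut CountingStore,
    threshold: usize,
    buffer: Vec<u8>,
    multipart_started: bool,
}

impl<'a> AdaptiveWriter<'a> {
    fn new(store: &'a mut CountingStore, threshold: usize) -> Self {
        assert!(threshold > 0, "threshold must be non-zero");
        Self { store, threshold, buffer: Vec::new(), multipart_started: false }
    }

    fn write(&mut self, data: &[u8]) {
        self.buffer.extend_from_slice(data);
        // Flush full parts while the buffer exceeds the threshold.
        while self.buffer.len() > self.threshold {
            if !self.multipart_started {
                self.store.create_multipart();
                self.multipart_started = true;
            }
            let part: Vec<u8> = self.buffer.drain(..self.threshold).collect();
            self.store.upload_part(&part);
        }
    }

    fn finish(mut self) {
        if self.multipart_started {
            // Upload whatever is left as the last part, then complete.
            if !self.buffer.is_empty() {
                self.store.upload_part(&self.buffer);
            }
            self.store.complete_multipart();
        } else {
            // Everything fit in the buffer: one put request is enough.
            self.store.put(&self.buffer);
        }
    }
}

/// Baseline: always use a multipart upload, regardless of size.
fn multipart_only(data: &[u8], part_size: usize) -> CountingStore {
    let mut store = CountingStore::default();
    store.create_multipart();
    for part in data.chunks(part_size) {
        store.upload_part(part);
    }
    store.complete_multipart();
    store
}

/// Adaptive path: a single put for small data, multipart otherwise.
fn adaptive(data: &[u8], threshold: usize) -> CountingStore {
    let mut store = CountingStore::default();
    let mut writer = AdaptiveWriter::new(&mut store, threshold);
    writer.write(data);
    writer.finish();
    store
}
```

In this model, calling `multipart_only` on data below the part size costs three requests, while `adaptive` with the same threshold costs one. Maintaining this kind of logic inside DataFusion would duplicate what BufWriter already provides, which is the main argument against this alternative.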
Additional context
A future version of object_store is likely to significantly change put_multipart, and using BufWriter will limit the impact of this - apache/arrow-rs#5500