Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix AsyncPutWriter (#7991) #7992

Closed
wants to merge 1 commit into from
Closed

Conversation

tustvold
Copy link
Contributor

@tustvold tustvold commented Oct 30, 2023

Which issue does this PR close?

Closes #7991

Rationale for this change

@devinjdangelo perhaps you might be able to help out testing this, I'm honestly a little unclear on how this functionality is wired up / tested. I purely noticed it whilst trying out an object_store upgrade and touching the code

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

@github-actions github-actions bot added the core Core DataFusion crate label Oct 30, 2023
@devinjdangelo
Copy link
Contributor

AsyncPutWriter is not used in any execution plans currently (for example see) . The two other modes (append, and put multipart) are used but neither depend on AsyncPutWriter. The only reason I see to use put over multipart is as a performance optimization when writing many small files to avoid the overhead of the additional remote requests to set up the multipart upload and complete it. Writing many small files is generally a bad idea in any case with object stores, which is why I didn't invest any time in this.

Still, it would be good to ensure this method works as intended or is removed if we don't see a use case for it.

@devinjdangelo
Copy link
Contributor

The work on these abstractions was originally completed in #6526. It may be worth revisiting some of those original design discussions now that we have more write related features implemented.

@tustvold
Copy link
Contributor Author

Fair enough, I think the justification would be if you know that the file is going to be less than the 10MB minimum part size, a multipart upload is not going to do anything but add request and buffer copying overheads. That being said it might still help for the local filesystem use-case...

Ultimately I don't have a strong opinion I was just surprised to stumble across this as I couldn't see how it could work 😅

@tustvold
Copy link
Contributor Author

tustvold commented Oct 30, 2023

revisiting some of those original design discussions

Just from a cursory perusal of the code, a lot of the design complexity appears to arise from the append use-case, including things like whether to write headers or not. This also causes pain on the read side. I wonder if we could better encapsulate this streaming use-case, potentially in its own operators, so as to simplify the general design

@devinjdangelo
Copy link
Contributor

devinjdangelo commented Oct 30, 2023

Fair enough, I think the justification would be if you know that the file is going to be less than the 10MB minimum part size, a multipart upload is not going to do anything but add request and buffer copying overheads. That being said it might still help for the local filesystem use-case...

Yeah agreed. My thinking is that if you are writing a few <10MB file performance is almost certainly not an issue. If you are writing dozens or 100s of <=10MB files, you probably just shouldn't do that and instead write a few files in the 256MB-2GB range with multipart put.

Just from a cursory perusal of the code, a lot of the design complexity appears to arise from the append use-case, including things like whether to write headers or not. This also causes pain on the read side. I wonder if we could better encapsulate this streaming use-case, potentially in its own operators, so as to simplify the general design

Yes, agreed. There are currently a few hacky workarounds to allow the streaming and appending code to coexist with some of the parallelization code (such as here).

@tustvold
Copy link
Contributor Author

I've filed #7994 to track splitting out the streaming use-case, PTAL

@tustvold
Copy link
Contributor Author

Closing in favor of #8017

@tustvold tustvold closed this Nov 15, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

AsyncPutWriter::poll_shutdown_inner Incorrect
2 participants