-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[C++][FS][Azure] Optimise ObjectAppendStream::DoAppend
in the case of many small appends
#40036
Comments
Is there anybody working on this issue? It's kind of a blocking issues as it makes writing to Azure Blob quite slow. |
@OliLay Does using a Though I agree this should probably be handled transparently inside the Azure FS implementation. @felipecrv |
Thanks, I haven't tried that so far, but I guess this won't fully solve the problem as writing on the stream would still block once the buffer size is exceeded as we call write on the underlying Azure output stream. So I agree that probably a non-blocking write mode + internal buffering (as it is implemented for S3) would be the best approach. |
Agreed! |
I opened a PR which addresses this issue: #43096 |
) ### Rationale for this change See #40036. ### What changes are included in this PR? Write buffering and async writes (similar to what the S3 file system does) in the `ObjectAppendStream` for the Azure file system. With write buffering and async writes, the input scenario creation runtime in the tests (which uses the `ObjectAppendStream` against Azurite) decreased from ~25s (see [here](#40036)) to ~800ms: ``` [ RUN ] TestAzuriteFileSystem.OpenInputFileMixedReadVsReadAt [ OK ] TestAzuriteFileSystem.OpenInputFileMixedReadVsReadAt (787 ms) ``` ### Are these changes tested? Added some tests with background writes enabled and disabled (some were taken from the S3 tests). Everything changed should be covered. ### Are there any user-facing changes? `AzureOptions` now allows for `background_writes` to be set (default: true). No breaking changes. ### Notes - The code in `DoWrite` is very similar to [the code in the S3 FS](https://github.com/apache/arrow/blob/edfa343eeca008513f0300924380e1b187cc976b/cpp/src/arrow/filesystem/s3fs.cc#L1753). Maybe this could be unified? I didn't see this in the scope of the PR though. * GitHub Issue: #40036 Lead-authored-by: Oliver Layer <o.layer@celonis.de> Co-authored-by: Antoine Pitrou <antoine@python.org> Signed-off-by: Antoine Pitrou <antoine@python.org>
Issue resolved by pull request 43096 |
Describe the enhancement requested
Optimisation to #38333
Child of #18014
Currently
ObjectAppendStream::DoAppend
callsblock_blob_client_->StageBlock
synchronously meaning that the call toObjectAppendStream::DoAppend
blocks until the data has been successfully written to blob storage. This is very in-efficient for large numbers of small writes.This performance problem is actually quite obvious just in small tests against azurite. The
UploadLines
function used to create test data usesstd::accumulate
and writes the data in one call for performance reasons.With accumulate
without accumulate (4096 separate calls to
ObjectAppendStream::DoAppend
).And this is when testing against
azurite
on localhost so against real blob storage where the latency is going to be much higher the problem will be exacerbated.By comparison the GCS filesystem is able to handle the later approach without performance issues.
Some options to optimise:
block_blob_client_->StageBlock
asynchronously and await all the futures inObjectAppendStream::Flush
.block_blob_client_->StageBlock
.block_blob_client_->StageBlock
.Component(s)
C++
The text was updated successfully, but these errors were encountered: