Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(core): sets default chunk_size and sends buffer > chunk_size directly #4710

Merged
merged 11 commits into from
Jun 12, 2024

Conversation

evenyag
Copy link
Contributor

@evenyag evenyag commented Jun 7, 2024

What's changed

Currently, CompleteLayer sanitizes the chunk size when users call with_chunk() explicitly.
https://docs.rs/opendal/latest/opendal/raw/struct.OpWrite.html#method.with_chunk

If users don't provide a chunk size, the writer won't guarantee the body size is larger than the minimum multi-part upload size. It's easy to forget to set the correct chunk size for the writer.

This PR changes this behavior:

  • Uses write_multi_min_size or write_multi_align_size as the chunk size if users don't provide one.
  • The ChunkedWriter attempts to send the bs directly if it is larger than the chunk size
    • It adds a new argument exact to the ChunkedWriter. If exact is true, the writer still ensures the upload body size is exactly the chunk size.
    • Backends with write_multi_align_size will set exact to true.
    • If users provide a chunk size, exact will also be true.

There is a breaking change to ChunkedWriter::new().

Reference

GreptimeTeam/greptimedb#4098 (comment)

@evenyag
Copy link
Contributor Author

evenyag commented Jun 7, 2024

The docs CI for the main branch is broken.
https://github.com/apache/opendal/actions/workflows/docs.yml?query=branch%3Amain

I guess the failure of this check is unrelated to this PR.
https://github.com/apache/opendal/actions/runs/9419578266/job/25949653713?pr=4710

---- behavior::test_rename_nested ----
test panicked: read must succeed: Unexpected (persistent) at  => reader got too little data

Context:
   expect: 64912
   actual: 0

@evenyag evenyag marked this pull request as ready for review June 7, 2024 16:10
@evenyag evenyag requested a review from Xuanwo as a code owner June 7, 2024 16:10
@Xuanwo
Copy link
Member

Xuanwo commented Jun 7, 2024

Thanks for you PR, I will take a review this weekend 🙏

Copy link
Member

@Xuanwo Xuanwo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for your PR! This change affects our behavior of chunk in write process. Would you like to update the comments for chunk too? Like:

/// ## `chunk`
///
/// Set `chunk` for the writer.
///
/// OpenDAL is designed to write files directly without chunking by default, giving users
/// control over the exact size of their writes and helping avoid unnecessary costs.
///
/// This is not efficient for cases when users write small chunks of data. Some storage services
/// like `s3` could even return hard errors like `EntityTooSmall`. Besides, cloud storage services
/// will cost more money if we write data in small chunks.
///
/// Set a good chunk size might improve the performance, reduce the API calls and save money.
///
/// The following example will set the writer chunk to 8MiB. Only one API call will be sent at
/// `close` instead.

pub struct ChunkedWriter<W: oio::Write> {
inner: W,

/// The size for buffer, we will flush the underlying storage at the size of this buffer.
chunk_size: usize,
/// If `exact` is true, the size of the data written to the underlying storage is
/// exactly `chunk_size` bytes.
exact: bool,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice idea about introducing exact.

@@ -49,6 +55,12 @@ impl<W: oio::Write> oio::Write for ChunkedWriter<W> {
self.buffer.advance(written);
}

if !self.exact && bs.len() >= self.chunk_size && self.buffer.is_empty() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can calculate if bs.len() + self.buffer.len() >= self.chunk_size and write them at once. This could be helpful for cases like: write(1KiB), write(16MiB), write(1KiB).

@evenyag
Copy link
Contributor Author

evenyag commented Jun 11, 2024

Would you like to update the comments for chunk too

Fixed in d6a0449

@evenyag evenyag requested a review from Xuanwo June 11, 2024 12:02
@evenyag
Copy link
Contributor Author

evenyag commented Jun 11, 2024

Some tests failed. I'll take a look.

@evenyag evenyag marked this pull request as draft June 11, 2024 12:32
@evenyag evenyag marked this pull request as ready for review June 11, 2024 14:55
});
size
});
let exact = capability.write_multi_align_size.is_some();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, I just noticed that exact doesn't behave as I expected. If user has configured one or write_multi_align_size is some, we should use exact. Some storage services requires to use the exactly the same chunk for uploading, like R2

Object part sizes must be at least 5MiB but no larger than 5GiB. All parts except the last one must be the same size.

self.buffer.advance(written);
// We didn't sent `bs`.
return Ok(0);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Write returning 0 often been treated as an error WriteZero. It's better not to do this.

We can push only 1B in the buffer.

This case is extremely unusual, and almost no other service behaves this way. Those rules are just a safety measure. We can make it clear in the comments.

Copy link
Contributor Author

@evenyag evenyag Jun 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about pushing written bytes to the buffer queue? We already sent written bytes so the queue should have "written" bytes space left.

I'll push a commit later.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense to me.

Copy link
Member

@Xuanwo Xuanwo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot!

@Xuanwo Xuanwo merged commit 27c6a77 into apache:main Jun 12, 2024
214 checks passed
@evenyag evenyag deleted the feat/default-chunk branch June 13, 2024 03:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants