Source::readFully inconsistent with other read methods #139

Closed
fzhinkin opened this issue Jun 12, 2023 · 4 comments · Fixed by #136
@fzhinkin (Collaborator)

Most of Source's read* methods don't consume any data from the source if there are not enough bytes to complete an operation. For example, Source::readLong called on a source containing fewer than 8 bytes will throw an EOFException, and the bytes buffered by the source will remain untouched and available for reading.

At the same time, Source::readFully(sink: RawSink, byteCount: Long) will consume data from the source even if the operation is terminated with an exception due to the source containing less than byteCount bytes.

The latter is inconsistent with other read operations and should be fixed (or there should be a reason for it to behave in such a way).
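The contract difference can be modeled with a minimal toy class (a hypothetical stand-in, not the real kotlinx.io API — `ToyBuffer` and its methods exist only to illustrate the consumption semantics described above):

```kotlin
import java.io.EOFException

// Toy model of the two consumption contracts. Not the kotlinx.io API.
class ToyBuffer(private var data: ByteArray) {
    val size: Long get() = data.size.toLong()

    // All-or-nothing, like Source.readLong: on EOF the buffered bytes survive.
    fun readLong(): Long {
        if (size < 8) throw EOFException("need 8 bytes, have $size")
        var result = 0L
        for (i in 0 until 8) result = (result shl 8) or (data[i].toLong() and 0xff)
        data = data.copyOfRange(8, data.size)
        return result
    }

    // Like Source.readFully(RawSink, byteCount) as reported: it drains what it
    // has into the sink before throwing, so the bytes are gone afterwards.
    fun readFully(sink: MutableList<Byte>, byteCount: Long) {
        if (size < byteCount) {
            data.forEach { sink.add(it) } // "Exhaust ourselves."
            data = ByteArray(0)
            throw EOFException("source exhausted before $byteCount bytes")
        }
        for (i in 0 until byteCount.toInt()) sink.add(data[i])
        data = data.copyOfRange(byteCount.toInt(), data.size)
    }
}

fun main() {
    val a = ToyBuffer(byteArrayOf(1, 2, 3))
    runCatching { a.readLong() }          // throws EOFException
    println(a.size)                       // 3: bytes are still readable

    val b = ToyBuffer(byteArrayOf(1, 2, 3))
    val sink = mutableListOf<Byte>()
    runCatching { b.readFully(sink, 10) } // throws EOFException
    println(b.size)                       // 0: bytes were consumed
}
```

After the failed readLong the three bytes can still be read; after the failed readFully they cannot — that asymmetry is the inconsistency being reported.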

@fzhinkin fzhinkin self-assigned this Jun 12, 2023
@fzhinkin fzhinkin added this to the 0.2.0 milestone Jun 12, 2023
@fzhinkin fzhinkin mentioned this issue Jun 12, 2023
@swankjesse

I believe you’re referring to the “Exhaust ourselves.” comment here:

  if (size < byteCount) {
    sink.write(this, size) // Exhaust ourselves.
    throw EOFException()
  }

I believe readFully() is significantly different from readLong() etc. because it is safe for kotlinx.io to hold 8 bytes in memory before proceeding with an operation, and unsafe for kotlinx.io to hold an arbitrary number of bytes in memory before proceeding with an operation.

Streaming Source

Suppose I use readFully() to move 10 billion bytes from one streaming source to another, but the source file turns out to be 5 bytes shorter than required. Here are some potential outcomes:

  1. We transfer 0 bytes then throw an exception.
  2. We transfer some arbitrary number of bytes between 0 and 9,999,999,995, then throw an exception.
  3. We exhaust the source by transferring 9,999,999,995 bytes, then throw an exception.

I believe 1 is bad behavior because it requires us to load almost 10 GiB of data into memory before proceeding. The readFully() function should stream. readLong() is different: there is no point streaming its 8 bytes one at a time, because the API returns all 8 bytes as a unit.

I believe 2 is bad behavior because it’s non-deterministic. The internal buffer size should not have user-observable effects. If the internal buffer size were 11 billion bytes vs. 8 KiB, the behavior here would be quite different.

Therefore I claim that 3 is the least bad of our options. Yes, it sucks to waste effort copying data if the overall operation is ultimately going to fail. But that’s decidedly not on the happy path.
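Outcome 3 can be sketched with plain java.io streams (hypothetical function name; not the kotlinx.io implementation): copy in fixed-size chunks so at most one chunk is ever held in memory, exhaust everything readable into the sink, and only then throw.

```kotlin
import java.io.ByteArrayInputStream
import java.io.ByteArrayOutputStream
import java.io.EOFException
import java.io.InputStream
import java.io.OutputStream

// Sketch of outcome 3. The chunk size bounds memory use, and on a short
// source we exhaust it into the sink first, so the observable result does
// not depend on any internal buffer size.
fun readFullyExhausting(
    source: InputStream,
    sink: OutputStream,
    byteCount: Long,
    chunkSize: Int = 8192,
) {
    var remaining = byteCount
    val chunk = ByteArray(chunkSize)
    while (remaining > 0) {
        val toRead = minOf(remaining, chunk.size.toLong()).toInt()
        val read = source.read(chunk, 0, toRead)
        if (read == -1) {
            // Source ended early; all readable bytes are already in the sink.
            throw EOFException("$remaining bytes short of $byteCount")
        }
        sink.write(chunk, 0, read)
        remaining -= read
    }
}

fun main() {
    val source = ByteArrayInputStream(ByteArray(5)) // only 5 bytes available
    val sink = ByteArrayOutputStream()
    runCatching { readFullyExhausting(source, sink, 10) } // throws EOFException
    println(sink.size()) // 5: the source was exhausted before throwing
}
```

Whatever chunkSize is chosen, a 5-byte source asked for 10 bytes always delivers exactly 5 bytes to the sink before the exception, which is the determinism goal 2 above asks for.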

Buffered Source

If the source was instead a Buffer, we don’t have to worry about buffering too much, because we know we’ve already buffered everything. In effect, case 1 above is not a performance problem when the source is a buffer.

I believe that by the Liskov Substitution Principle (LSP) it’s better for Okio to implement the same behavior regardless of the type of the source. Honoring LSP ensures that if you use a buffer in tests and a streaming source in production, your tests remain representative of production behavior.

@fzhinkin (Collaborator, Author)

@swankjesse
Thanks for such a detailed answer! I agree that it's unsafe to leave a buffer with an arbitrary number of bytes after an error. However, there is also readUtf8(bytes), which won’t consume any data from a source’s buffer if there are not enough bytes, leaving the buffer full after throwing an exception.

Should the source be fully consumed by readUtf8 even if there is insufficient data (to stick to the same logic as with readFully)?

@swankjesse

Lemme rank some potential goals for an I/O function:

  1. It should be deterministic
  2. It should satisfy LSP
  3. It should not consume an arbitrary amount of resources
  4. It should not have side-effects when it fails

For readLong, we can satisfy all the goals.

For readUtf8, the API design itself prevents us from satisfying goal 3. This is okay! In order to produce a string of the caller’s requested length, we need to have that many bytes in memory simultaneously.

For readFully(RawSink), we can’t satisfy all four. My preference is to satisfy 1, 2, and 3. I acknowledge that this is inconsistent with readUtf8, which satisfies 1, 2, and 4. But I claim that even though readUtf8 can’t meet goal 3, it’s still an important goal to strive for in the other APIs.
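That trade-off can be sketched with a hypothetical helper (not the library's code): a readUtf8-style method must hold all byteCount bytes in memory to build the string anyway, so giving up goal 3 lets it check the precondition up front and get goal 4 for free.

```kotlin
import java.io.EOFException

// Hypothetical all-or-nothing decode in the style of readUtf8. The result
// requires byteCount bytes in memory regardless (goal 3 is unattainable), so
// checking up front costs nothing extra and leaves the buffer untouched on
// failure (goal 4), while remaining deterministic (goal 1).
fun readUtf8(buffer: ByteArray, byteCount: Int): String {
    if (buffer.size < byteCount) {
        throw EOFException("need $byteCount bytes, have ${buffer.size}")
    }
    return String(buffer, 0, byteCount, Charsets.UTF_8)
}

fun main() {
    val buf = "hello".encodeToByteArray()
    println(readUtf8(buf, 5))        // hello
    runCatching { readUtf8(buf, 9) } // throws; buf is untouched
    println(buf.size)                // 5
}
```

A streaming readFully has no such luxury: it cannot check "enough bytes" up front without buffering them all, which is exactly why goals 3 and 4 conflict there.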

@fzhinkin (Collaborator, Author)

After all, it seems fine to have slightly different consumption guarantees for different methods; it just needs to be explicitly described in the documentation.

@fzhinkin fzhinkin linked a pull request Jun 23, 2023 that will close this issue