Source::readFully inconsistent with other read methods #139

Closed
fzhinkin opened this issue Jun 12, 2023 · 4 comments · Fixed by #136
@fzhinkin (Collaborator)

Most of Source's read* methods don't consume any data from the source if there are not enough bytes to complete an operation. For example, Source::readLong called on a source containing fewer than 8 bytes will throw an EOFException, and the bytes buffered by the source will remain untouched and available for reading.

At the same time, Source::readFully(sink: RawSink, byteCount: Long) will consume data from the source even if the operation is terminated with an exception due to the source containing less than byteCount bytes.

The latter is inconsistent with other read operations and should be fixed (or there should be a reason for it to behave in such a way).
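The contract difference can be modeled with a minimal toy class (a hypothetical stand-in, not the real kotlinx.io API — `ToyBuffer` and its methods exist only to illustrate the consumption semantics described above):

```kotlin
import java.io.EOFException

// Toy model of the two consumption contracts. Not the kotlinx.io API.
class ToyBuffer(private var data: ByteArray) {
    val size: Long get() = data.size.toLong()

    // All-or-nothing, like Source.readLong: on EOF the buffered bytes survive.
    fun readLong(): Long {
        if (size < 8) throw EOFException("need 8 bytes, have $size")
        var result = 0L
        for (i in 0 until 8) result = (result shl 8) or (data[i].toLong() and 0xff)
        data = data.copyOfRange(8, data.size)
        return result
    }

    // Like Source.readFully(RawSink, byteCount) as reported: it drains what it
    // has into the sink before throwing, so the bytes are gone afterwards.
    fun readFully(sink: MutableList<Byte>, byteCount: Long) {
        if (size < byteCount) {
            data.forEach { sink.add(it) } // "Exhaust ourselves."
            data = ByteArray(0)
            throw EOFException("source exhausted before $byteCount bytes")
        }
        for (i in 0 until byteCount.toInt()) sink.add(data[i])
        data = data.copyOfRange(byteCount.toInt(), data.size)
    }
}

fun main() {
    val a = ToyBuffer(byteArrayOf(1, 2, 3))
    runCatching { a.readLong() }          // throws EOFException
    println(a.size)                       // 3: bytes are still readable

    val b = ToyBuffer(byteArrayOf(1, 2, 3))
    val sink = mutableListOf<Byte>()
    runCatching { b.readFully(sink, 10) } // throws EOFException
    println(b.size)                       // 0: bytes were consumed
}
```

After the failed readLong the three bytes can still be read; after the failed readFully they cannot — that asymmetry is the inconsistency being reported.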

@fzhinkin fzhinkin self-assigned this Jun 12, 2023
@fzhinkin fzhinkin added this to the 0.2.0 milestone Jun 12, 2023
@fzhinkin fzhinkin mentioned this issue Jun 12, 2023
@swankjesse

I believe you’re referring to the “Exhaust ourselves.” comment here:

  if (size < byteCount) {
    sink.write(this, size) // Exhaust ourselves.
    throw EOFException()
  }

I believe readFully() is significantly different from readLong() etc. because it is safe for kotlinx.io to hold 8 bytes in memory before proceeding with an operation, and unsafe for kotlinx.io to hold an arbitrary number of bytes in memory before proceeding with an operation.

Streaming Source

Suppose I use readFully() to move 10 billion bytes from one streaming source to another, but the source file turns out to be 5 bytes shorter than required. Here are some potential outcomes:

  1. We transfer 0 bytes then throw an exception.
  2. We transfer some arbitrary number of bytes between 0 and 9,999,999,995, then throw an exception.
  3. We exhaust the source by transferring 9,999,999,995 bytes, then throw an exception.

I believe 1 is bad behavior because it requires us to load almost 10 GiB of data into memory before proceeding. The readFully() function should stream. readLong() is different: there is no point streaming its 8 bytes one at a time, because the API returns all 8 bytes as a unit.

I believe 2 is bad behavior because it’s non-deterministic. The internal buffer size should not have user-observable effects. If the internal buffer size were 11 billion bytes vs. 8 KiB, the behavior here would be quite different.

Therefore I claim that 3 is the least bad of our options. Yes, it sucks to waste effort copying data if the overall operation is ultimately going to fail. But that’s decidedly not on the happy path.
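Outcome 3 can be sketched with plain java.io streams (hypothetical function name; not the kotlinx.io implementation): copy in fixed-size chunks so at most one chunk is ever held in memory, exhaust everything readable into the sink, and only then throw.

```kotlin
import java.io.ByteArrayInputStream
import java.io.ByteArrayOutputStream
import java.io.EOFException
import java.io.InputStream
import java.io.OutputStream

// Sketch of outcome 3. The chunk size bounds memory use, and on a short
// source we exhaust it into the sink first, so the observable result does
// not depend on any internal buffer size.
fun readFullyExhausting(
    source: InputStream,
    sink: OutputStream,
    byteCount: Long,
    chunkSize: Int = 8192,
) {
    var remaining = byteCount
    val chunk = ByteArray(chunkSize)
    while (remaining > 0) {
        val toRead = minOf(remaining, chunk.size.toLong()).toInt()
        val read = source.read(chunk, 0, toRead)
        if (read == -1) {
            // Source ended early; all readable bytes are already in the sink.
            throw EOFException("$remaining bytes short of $byteCount")
        }
        sink.write(chunk, 0, read)
        remaining -= read
    }
}

fun main() {
    val source = ByteArrayInputStream(ByteArray(5)) // only 5 bytes available
    val sink = ByteArrayOutputStream()
    runCatching { readFullyExhausting(source, sink, 10) } // throws EOFException
    println(sink.size()) // 5: the source was exhausted before throwing
}
```

Whatever chunkSize is chosen, a 5-byte source asked for 10 bytes always delivers exactly 5 bytes to the sink before the exception, which is the determinism goal 2 above asks for.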

Buffered Source

If the source was instead a Buffer, we don’t have to worry about buffering too much, because we know we’ve already buffered everything. In effect, case 1 above is not a performance problem when the source is a buffer.

I believe that by the Liskov Substitution Principle (LSP) it’s better for Okio to implement the same behavior regardless of the type of the source. Honoring LSP ensures that if you use a buffer in tests and a streaming source in production, your tests remain representative of production behavior.

@fzhinkin (Collaborator, Author)

@swankjesse
Thanks for such a detailed answer! I agree that it's unsafe to leave a buffer with an arbitrary number of bytes after an error. However, there is also readUtf8(bytes), which won’t consume any data from a source’s buffer if there are not enough bytes, leaving the buffer full after throwing an exception.

Should the source be fully consumed by readUtf8 even if there is insufficient data (to stick to the same logic as with readFully)?

@swankjesse

Lemme rank some potential goals for an I/O function:

  1. It should be deterministic
  2. It should satisfy LSP
  3. It should not consume an arbitrary amount of resources
  4. It should not have side-effects when it fails

For readLong, we can satisfy all the goals.

For readUtf8, the API design itself prevents us from satisfying goal 3. This is okay! In order to produce a string of the caller’s requested length, we need to have that many bytes in memory simultaneously.

For readFully(RawSink), we can’t satisfy all four. My preference is to satisfy 1, 2, and 3. I acknowledge that this is inconsistent with readUtf8, which satisfies 1, 2, and 4. But I claim that even though readUtf8 can’t meet goal 3, it’s still an important goal to strive for in the other APIs.
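That trade-off can be sketched with a hypothetical helper (not the library's code): a readUtf8-style method must hold all byteCount bytes in memory to build the string anyway, so giving up goal 3 lets it check the precondition up front and get goal 4 for free.

```kotlin
import java.io.EOFException

// Hypothetical all-or-nothing decode in the style of readUtf8. The result
// requires byteCount bytes in memory regardless (goal 3 is unattainable), so
// checking up front costs nothing extra and leaves the buffer untouched on
// failure (goal 4), while remaining deterministic (goal 1).
fun readUtf8(buffer: ByteArray, byteCount: Int): String {
    if (buffer.size < byteCount) {
        throw EOFException("need $byteCount bytes, have ${buffer.size}")
    }
    return String(buffer, 0, byteCount, Charsets.UTF_8)
}

fun main() {
    val buf = "hello".encodeToByteArray()
    println(readUtf8(buf, 5))        // hello
    runCatching { readUtf8(buf, 9) } // throws; buf is untouched
    println(buf.size)                // 5
}
```

A streaming readFully has no such luxury: it cannot check "enough bytes" up front without buffering them all, which is exactly why goals 3 and 4 conflict there.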

@fzhinkin (Collaborator, Author)

After all, it seems fine to have slightly different consumption guarantees for different methods; it just needs to be explicitly described in the documentation.

@fzhinkin fzhinkin linked a pull request Jun 23, 2023 that will close this issue