Add Reader.read_at_least() #13127

lilyball · 2014-03-25T06:25:48Z

Reader.read_at_least() ensures that at least a given number of bytes
have been read. The most common use-case for this is ensuring at least 1
byte has been read. If the reader returns 0 enough times in a row, a new
error kind NoProgress will be returned instead of looping infinitely.

This change is necessary in order to properly support Readers that
repeatedly return 0, either because they're broken, or because they're
attempting to do a non-blocking read on some resource that never becomes
available.
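
A minimal sketch of the loop described above, written against modern Rust's `std::io::Read` rather than the 2014 `Reader` trait. The free-function form, the `ZERO_READ_LIMIT` constant and its value, the `Stuck` reader, and the use of `ErrorKind::Other` as a stand-in for the proposed `NoProgress` kind are all assumptions made for illustration. Modern `Read` conventionally uses a 0 return to signal end of input, whereas this PR treats it as "no bytes this time", so this illustrates the idea and is not a drop-in helper.

```rust
use std::io::{self, Read};

// Hypothetical limit: the PR description only says "enough times in a row".
const ZERO_READ_LIMIT: usize = 1000;

/// Reads until at least `min` bytes are in `buf`, giving up after too many
/// consecutive zero-length reads instead of looping forever.
fn read_at_least<R: Read>(r: &mut R, buf: &mut [u8], min: usize) -> io::Result<usize> {
    assert!(min <= buf.len());
    let mut total = 0;
    let mut zero_reads = 0;
    while total < min {
        match r.read(&mut buf[total..])? {
            0 => {
                zero_reads += 1;
                if zero_reads >= ZERO_READ_LIMIT {
                    // Stand-in for the proposed `NoProgress` error kind.
                    return Err(io::Error::new(io::ErrorKind::Other, "no progress"));
                }
            }
            n => {
                total += n;
                // Any progress resets the consecutive-zero counter.
                zero_reads = 0;
            }
        }
    }
    Ok(total)
}

/// A reader that never produces data, like a resource that never becomes ready.
struct Stuck;

impl Read for Stuck {
    fn read(&mut self, _buf: &mut [u8]) -> io::Result<usize> {
        Ok(0)
    }
}
```

Calling `read_at_least(&mut Stuck, &mut buf, 1)` returns an error after `ZERO_READ_LIMIT` attempts, which is the case the description calls out.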

Also add .push() and .push_at_least() methods. push() is like read() but
the results are appended to the passed Vec.

Remove Reader.fill() and Reader.push_exact() as they end up being thin
wrappers around read_at_least() and push_at_least().

[breaking-change]

sfackler · 2014-03-25T07:05:52Z

src/libstd/io/mod.rs

@@ -320,8 +319,12 @@ pub enum IoErrorKind {
    ResourceUnavailable,
    IoUnavailable,
    InvalidInput,
+    /// The Reader returned 0 from `read()` too many times.


I'd clarify this to "0 bytes"

alexcrichton · 2014-03-25T17:05:32Z

I'm a little uncomfortable about the litany of methods that are getting added to the Reader trait. I was myself uncomfortable adding the fill method. I'd like to take some time to think about whether we can reduce the surface area here to something more reasonable.

Approaching Reader for the first time, you'll be bombarded with read, read_at_least, fill, etc. I feel like we can do better to solve the constraints in play here without as many methods.

lilyball · 2014-03-25T17:15:41Z

@alexcrichton fill turns out to be trivially expressible using read_at_least(), and in fact I reimplemented it in one line using that. So maybe we can just remove fill(). I think read_at_least() is valuable, though, because it's a single central place to handle 0-length reads correctly. Similarly push_exact() is expressed in terms of push_at_least() and can be removed, and read_exact() also seems rather superfluous as it's already just a combination of slice::with_capacity() and push_exact().

I'm also open to the suggestion of renaming it to read_nonzero() and dropping the len param, but then fill() can't be expressed using it and we would need to keep it around.

lilyball · 2014-03-26T05:04:10Z

@alexcrichton I removed fill() and push_exact(), because they're easy to reproduce with read_at_least()/push_at_least(). I tried removing read_exact() too but the necessary replacement ended up being annoying enough that I think it's useful to keep.

huonw · 2014-03-26T05:35:47Z

src/libstd/io/mod.rs

+    /// Fails if `len` is greater than the length of `buf`.
+    fn read_at_least(&mut self, buf: &mut [u8], len: uint) -> IoResult<uint> {
+        assert!(len <= buf.len());
+        // always read at least once in case len == 0


Why do we need to read at least once?

lilyball · 2014-03-26

Because it seems exceedingly odd to me to call read_at_least(buf, n) and have it read zero times.

alexcrichton · 2014-03-26T17:57:15Z

Hm, I'm going to try to put some thinking down into words, this is all just thinking out loud.

Before this change, we have this list of methods for dealing with a partial amount of bytes on a Reader. I'm omitting read because it's obvious that this will stay forever.

fill a &mut [u8]
push_exact on &mut ~[u8]
read_exact returning ~[u8]

After this change, we have this list of methods

read_at_least on &mut [u8]
push on &mut ~[u8]
push_at_least on &mut ~[u8]
read_exact returning ~[u8]

This still seems a little sprawling to me, especially when I look at the types that everything operates on. For example, read_at_least fills a &mut [u8], but the very similarly named read_exact returns a ~[u8] (fill had the same mismatch before).

Some thoughts:

Do we want to continue to support the "push some bytes on this vector" pattern?
Do we want to continue to have methods that return an owned vector? (read_to_end I believe is an exception to this)

Depending on the answers, we may be able to pare down the API to

fn read(&mut self, buf: &mut [u8]) -> IoResult<uint>;
fn read_at_least(&mut self, buf: &mut [u8], amt: uint) -> IoResult<uint>;
fn read_exact(&mut self, buf: &mut [u8]) -> IoResult<()>;

I like that this only operates on &mut [u8].

A little bit of a ramble, but I'm thinking that these utility functions on a Reader may need to be reconsidered. I think I'm more than comfortable removing read_exact returning ~[u8], but I'm less sure about removing methods that operate on &mut ~[u8].

Thoughts on this? @kballard, @brson

lilyball · 2014-03-26T18:08:04Z

@alexcrichton I would like to remove read_exact(). I left it because the following code:

let bytes = try!(file.read_exact(names_bytes as uint - 1));

became the following:

let bytes = {
    let len = names_bytes as uint - 1;
    let mut bytes = slice::with_capacity(len);
    try!(file.push_at_least(&mut bytes, len, len));
    bytes
};

and that felt just awkward enough that I decided not to remove it at this time. But I only count 5 calls to read_exact() in our source, so maybe we should go ahead and remove it.

As for push()/push_at_least(), I would like to be able to remove them, but I think they're valuable. The alternative implementation is to create a vector with the desired length, pre-filled with zeroes, read into it, and then truncate back to the actual read amount. That's awkward, and slightly less performant due to the need to memset the vector (push() uses unsafe and uninitialized memory, which can be valuable when e.g. read_to_end() is calling it with 64k as the length).

I could see removing push(), which I added mostly to balance out read(), but could be folded into push_at_least(). In fact, I think I'll go ahead and remove that, because anyone who wants it can use push_at_least(buf, 0, len).
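
For comparison, the safe alternative described above (zero-fill, read, truncate) might look roughly like this in modern Rust. `push_zero_filled` is a hypothetical name, and `std::io::Read` stands in for the 2014 `Reader` trait; this is a sketch of the trade-off, not code from the PR.

```rust
use std::io::{self, Read};

/// Appends up to `len` bytes from `r` to `out` using only safe code.
/// Hypothetical helper for illustration only.
fn push_zero_filled<R: Read>(r: &mut R, out: &mut Vec<u8>, len: usize) -> io::Result<usize> {
    let start = out.len();
    // Pay for a memset so the reader only ever sees initialized memory.
    out.resize(start + len, 0);
    match r.read(&mut out[start..]) {
        Ok(n) => {
            // Truncate back to the number of bytes actually read.
            out.truncate(start + n);
            Ok(n)
        }
        Err(e) => {
            // Leave the vector exactly as it was before the call.
            out.truncate(start);
            Err(e)
        }
    }
}
```

The `resize` call is the extra cost being discussed: for a 64k request it zeroes 64k bytes even if the reader only supplies a handful.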

lilyball · 2014-03-26T18:09:49Z

Also, looking at your suggested API again, fn read_exact(&mut self, buf: &mut [u8]) -> IoResult<()>; can be removed entirely as that's just read_at_least(buf, buf.len()).map(|_| ()).

lilyball · 2014-03-26T18:23:42Z

Looking at the API again, I think that read_to_end() and read_to_str() are very useful to keep around. Similarly, all of the helpers such as read_byte(), read_le_uint_n(), etc are useful. And I suspect bytes() may be useful too, but offhand I don't know how much it's actually being used.

I'm wondering if there's any utility to splitting Reader into two separate traits, with Reader containing read() and read_at_least(), and ReaderHelpers containing everything else (implemented in terms of Reader). I know that default trait methods have made us move away from this style of traits, but it would serve as a conceptual simplification, between the core Reader methods and all of the helpers that are implemented in terms of it. Although one downside is this would prevent any individual Reader implementation from overriding the implementation of any of the helper methods (e.g. a string-based reader could theoretically override read_to_str() to skip the utf-8 checks).

In any case, I'm open to removing read_exact(), but I won't do so unless one of you thinks it's a good idea. I will remove push(), but push_at_least() is also useful, primarily because it wraps the unsafety necessary to make reading into a large newly-allocated buffer more efficient. I think everything else can stay.
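
The two-trait split floated above could be sketched like this in modern Rust. The trait and method names follow the comment, but the bodies, the zero-read limit of 1000, and the `SliceReader` test type are assumptions for illustration. The sketch also shows `read_exact` written as `read_at_least(buf, buf.len()).map(|_| ())`, as suggested earlier in the thread.

```rust
use std::io;

/// Core trait: the primitives every implementation is built around.
trait Reader {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize>;

    /// Default loop; fails instead of spinning if no bytes ever arrive.
    fn read_at_least(&mut self, buf: &mut [u8], min: usize) -> io::Result<usize> {
        if min > buf.len() {
            return Err(io::Error::new(io::ErrorKind::InvalidInput, "min exceeds buffer length"));
        }
        let mut total = 0usize;
        let mut zero_reads = 0u32;
        while total < min {
            match self.read(&mut buf[total..])? {
                0 => {
                    zero_reads += 1;
                    // 1000 is a hypothetical limit chosen for this sketch.
                    if zero_reads >= 1000 {
                        return Err(io::Error::new(io::ErrorKind::Other, "no progress"));
                    }
                }
                n => {
                    total += n;
                    zero_reads = 0;
                }
            }
        }
        Ok(total)
    }
}

/// Helpers implemented only in terms of `Reader`.
trait ReaderHelpers: Reader {
    fn read_exact(&mut self, buf: &mut [u8]) -> io::Result<()> {
        let len = buf.len();
        self.read_at_least(buf, len).map(|_| ())
    }

    fn read_byte(&mut self) -> io::Result<u8> {
        let mut byte = [0u8; 1];
        self.read_exact(&mut byte)?;
        Ok(byte[0])
    }
}

// Blanket impl: every Reader gets the helpers, but none can override them.
impl<R: Reader> ReaderHelpers for R {}

/// Toy in-memory reader used to exercise the traits.
struct SliceReader<'a>(&'a [u8]);

impl<'a> Reader for SliceReader<'a> {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        let n = buf.len().min(self.0.len());
        buf[..n].copy_from_slice(&self.0[..n]);
        self.0 = &self.0[n..];
        Ok(n)
    }
}
```

The blanket `impl` is what produces the downside mentioned in the comment: a string-backed reader could not supply its own faster helper, because a second `ReaderHelpers` impl would overlap with the blanket one.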

DaGenix · 2014-03-27T04:05:42Z

What about doing something like how the gen() method is implemented on Rng? Then, there could be methods like Reader.next() to read a fixed-sized object like a uint and Reader.next_bytes(uint) to read a non-fixed sized object like a Vec.

lilyball · 2014-03-27T05:23:55Z

@DaGenix Reader.next() doesn't make much sense because the only reasonable fixed-sized object to read is a u8, and with most readers you don't want to read a single byte at a time if you can help it.

As for doing something like Rand, where each type knows how to generate itself from an Rng, the problem is that the only types we're trying to read are either buffers, which require a length and don't benefit from a Rand-style interface, or various encodings of integers, which cannot use a Rand-like interface unless you define an enum to represent the desired encoding and pass that as an argument. I think that's a more confusing API than what we have today.

DaGenix · 2014-03-29T04:32:07Z

I was thinking that .next() could read whatever type is appropriate depending upon the value being assigned to: byte, uint, float, etc. However, I failed to consider big-endian vs. little-endian issues. So, yeah, never mind this idea.

lilyball · 2014-04-01T06:53:51Z

@alexcrichton, @brson: Any more thoughts on this PR?

lilyball · 2014-04-09T21:24:40Z

I just rebased on top of the recent ~[T]->Vec<T> work that went into std::io.

alexcrichton · 2014-04-10T15:44:07Z

I'm sorry it took a while to get back on this, but can you elaborate on what the use case for this is? I can't really think of a case where I want to read N bytes, but if I read some number greater than N I can easily deal with it.

pongad · 2014-04-12T17:10:18Z

@alexcrichton I think the rationale for this is to provide a convenience method that deals with zero-length reads. I'm not sure if @kballard has other use cases in mind. Though, now that I think about it, for solving zero-length reads, maybe read_any (same as read_at_least(1)) might be more appropriate.
IMO, it is difficult to think about what read_at_least should do if we want to read 2 bytes, but there's only 1 byte left in the Reader. We probably want to return an error since we can't read as many bytes as we want, but then we need to include how many bytes we read in the error, etc. read_any would sidestep this problem since the user won't be able to tell read_any how many bytes to read. It will either just return the number of bytes read (guaranteed to be greater than 0) or error if it cannot read anything or encountered some other error along the way. Go doesn't have a problem with this since ReadAtLeast returns both the bytes read and the error at the same time.
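
To make the "2 bytes wanted, 1 byte left" case concrete, here is a sketch of a Go-style return shape in modern Rust, where the byte count is reported alongside any error. `read_at_least_partial` is a hypothetical helper, not something proposed in this PR, and it relies on modern `std::io::Read`, where `Ok(0)` means end of input.

```rust
use std::io::{self, Read};

/// Go-style result: the byte count is reported even when an error occurs.
/// Hypothetical helper for illustration only.
fn read_at_least_partial<R: Read>(
    r: &mut R,
    buf: &mut [u8],
    min: usize,
) -> (usize, Option<io::Error>) {
    if min > buf.len() {
        let err = io::Error::new(io::ErrorKind::InvalidInput, "buffer shorter than min");
        return (0, Some(err));
    }
    let mut total = 0;
    while total < min {
        match r.read(&mut buf[total..]) {
            // Modern `Read` signals end of input with Ok(0).
            Ok(0) => return (total, Some(io::Error::from(io::ErrorKind::UnexpectedEof))),
            Ok(n) => total += n,
            Err(e) => return (total, Some(e)),
        }
    }
    (total, None)
}
```

With a one-byte source and `min == 2`, the caller gets back a count of 1 together with the error, so the partial data in `buf[..1]` remains usable. Returning only `IoResult<uint>` loses that count, which is the complication being pointed out.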

lilyball · 2014-04-12T17:15:33Z

The short answer is we need a read_at_least_one() and having read_at_least() instead allows us to remove fill() (as that's just a special case of read_at_least()). Similarly push_at_least() removes the need for push_exact(). This avoids ballooning the API.

lilyball · 2014-04-25T02:47:12Z

Rebased on top of master. I still think this is something we should do.

@pongad: I don't think that not returning the number of bytes read is a problem. It's hard to think of a use case where those bytes are actually useful. The only real case I can think of where I'd want to be able to use those bytes is if I'm trying to report an error that includes the truncated bytes, and that seems niche enough that it's not worth complicating the API.

lilyball · 2014-05-06T04:13:47Z

@alexcrichton et al, any more feedback? I need to rebase it at this point, but I'd like to know that we can move forward with this.

alexcrichton · 2014-05-06T15:36:46Z

src/libstd/vec.rs

+            data: self.as_ptr().offset(start as int),
+            len: (end - start)
+        })
+    }


Can we avoid extending Vec's api further?

lilyball · 2014-05-06

So you prefer the ugly code that slice_capacity() is replacing? To be clear, that's calling set_len() to pretend uninitialized memory is valid, slicing that, then using try_finally() to call set_len() back to the correct value when done.

alexcrichton · 2014-05-06

No, I prefer to not expand Vec's api on a whim. If you want to add this as a private helper in std::io, that's fine.

alexcrichton · 2014-05-06T15:41:35Z

I'm ok with merging this pending comments. It will need a re-worded commit message to reflect the breaking change as well.

lilyball · 2014-05-06T17:19:48Z

@alexcrichton Thanks. I need to rebase it, and I'll reword the commit message. Based on my reply to your comments, do you still believe I need to change the semantics of read_at_least()/push_at_least() to avoid the read when min == 0, or are you ok with leaving it as-is?

alexcrichton · 2014-05-06T17:21:56Z

It looks like this is kinda based off what Go is doing, and they don't do the read when min == 0. I'm not too worried about doing the read per se, but I'm a little worried about documenting explicitly that a read is always performed.

Additionally, perhaps an error should be returned rather than failing in these new methods?

lilyball · 2014-05-06T17:32:16Z

Hmm, I didn't check to see what Go does. I'm actually vaguely surprised to see that ReadAtLeast(r, buf, 0) returns (0, err). But that is indeed what it appears to do.

I documented that the read is always performed because I wanted clients to be able to rely on it. If you don't want it documented, then I think it's better to not do the read when min == 0. The reason is I don't want clients observing that it always reads when min == 0, relying on that assumption, then breaking later if behavior changes (and if behavior never changes, then why omit it from the documentation?)

Given this precedent, I'll go ahead and change the behavior to skip the read.

> perhaps an error should be returned rather than failing in these new methods?

The failure happens in response to a logic error, i.e. the client passing a min that's too high. Handling this as an error instead requires adding yet another IoErrorKind to indicate the buffer was too short. This is an error that seems like it will only apply to read_at_least() and to no other function. Do you think it's better to have this one-off error value than it is to simply fail on logic error?

alexcrichton · 2014-05-06T17:35:13Z

Our current spirit is to not fail as much as possible, and this seems like an easy case to not fail (you're already returning an error), and I figured that the InvalidInput error kind would suffice in this case. You can fill out the detail with some extra information about what just happened.

lilyball · 2014-05-06T17:39:42Z

I didn't realize that's what InvalidInput meant. I suppose it is usable for this. I'll make that change.
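
The two behaviours agreed on here (no read when min == 0, and an InvalidInput error instead of a failure when min exceeds the buffer) can be sketched in modern Rust as follows. The `Counting` wrapper is a hypothetical test aid, the no-progress guard is left out for brevity, and a zero-length read is treated as end of input per modern `std::io::Read` conventions.

```rust
use std::io::{self, Read};

fn read_at_least<R: Read>(r: &mut R, buf: &mut [u8], min: usize) -> io::Result<usize> {
    if min > buf.len() {
        // Caller error is reported as a value instead of a task failure.
        return Err(io::Error::new(
            io::ErrorKind::InvalidInput,
            "min is larger than the buffer",
        ));
    }
    let mut total = 0;
    // With min == 0 the loop body never runs, so no read is issued.
    while total < min {
        match r.read(&mut buf[total..])? {
            0 => return Err(io::Error::from(io::ErrorKind::UnexpectedEof)),
            n => total += n,
        }
    }
    Ok(total)
}

/// Counts calls so the "no read when min == 0" behaviour is observable.
struct Counting<R> {
    inner: R,
    calls: usize,
}

impl<R: Read> Read for Counting<R> {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        self.calls += 1;
        self.inner.read(buf)
    }
}
```

Wrapping a source in `Counting` and calling `read_at_least` with `min == 0`, or with a `min` larger than the buffer, leaves `calls` at 0 in both cases.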

lilyball · 2014-05-07T06:11:39Z

r? @alexcrichton I've made the requested changes, and also squashed it down to one commit.

sfackler reviewed Mar 25, 2014

huonw reviewed Mar 26, 2014

alexcrichton reviewed May 6, 2014

bors merged commit 972f2e5 into rust-lang:master May 14, 2014

lilyball deleted the read_at_least branch May 14, 2014 04:34

alexcrichton mentioned this pull request Oct 7, 2014

Reader should handle 0-length reads better #13119

Closed
