Commit 894ab34

committed Jul 17, 2015
Merge pull request #2 from cesarb/read_all
Make this RFC be again about a single method
2 parents ddf9eff + 4732352 commit 894ab34

1 file changed

+221
-59
lines changed

‎text/0000-read-all.md

@@ -1,99 +1,261 @@
1-
- Feature Name: read_exact and read_full
1+
- Feature Name: read_exact and ErrorKind::UnexpectedEOF
22
- Start Date: 2015-03-15
33
- RFC PR: (leave this empty)
44
- Rust Issue: (leave this empty)
55

66
# Summary
77

8-
Rust's `Write` trait has `write_all`, which is a convenience method that calls
9-
`write` repeatedly to write an entire buffer. This proposal adds two similar
10-
convenience methods to the `Read` trait: `read_full` and `read_exact`.
11-
`read_full` calls `read` repeatedly until the buffer has been filled, EOF has
12-
been reached, or an error other than `Interrupted` occurs. `read_exact` is
13-
similar to `read_full`, except that reaching EOF before filling the buffer is
14-
considered an error.
8+
Rust's `Write` trait has the `write_all` method, which is a convenience
9+
method that writes a whole buffer, failing with `ErrorKind::WriteZero`
10+
if the buffer cannot be written in full.
11+
12+
This RFC proposes adding its `Read` counterpart: a method (here called
13+
`read_exact`) that reads a whole buffer, failing with an error (here
14+
called `ErrorKind::UnexpectedEOF`) if the buffer cannot be read in full.
1515

1616
# Motivation
1717

18-
The `read` method may return fewer bytes than requested, and may fail with an
19-
`Interrupted` error if a signal is received during the call. This requires
20-
programs wishing to fill a buffer to call `read` repeatedly in a loop. This is
21-
a very common need, and it would be nice if this functionality were provided in
22-
the standard library. Many C and Rust programs have the same need, and solve it
23-
in the same way. For example, Git has [`read_in_full`][git], which behaves like
24-
the proposed `read_full`, and the Rust byteorder crate has
25-
[`read_full`][byteorder], which behaves like the proposed `read_exact`.
26-
[git]: https://github.com/git/git/blob/16da57c7c6c1fe92b32645202dd19657a89dd67d/wrapper.c#L246
27-
[byteorder]: https://github.com/BurntSushi/byteorder/blob/2358ace61332e59f596c9006e1344c97295fdf72/src/new.rs#L184
18+
When dealing with serialization formats with fixed-length fields,
19+
reading or writing less than the field's size is an error. For the
20+
`Write` side, the `write_all` method does the job; for the `Read` side,
21+
however, one has to call `read` in a loop until the buffer is completely
22+
filled, or until a premature EOF is reached.
23+
24+
This leads to a profusion of similar helper functions. For instance, the
25+
`byteorder` crate has a `read_full` function, and the `postgres` crate
26+
has a `read_all` function. However, their handling of the premature EOF
27+
condition differs: the `byteorder` crate has its own `Error` enum, with
28+
`UnexpectedEOF` and `Io` variants, while the `postgres` crate uses an
29+
`io::Error` with an `io::ErrorKind::Other`.
30+
31+
That can make it unnecessarily hard to mix uses of these helper
32+
functions; for instance, if one wants to read a 20-byte tag (using one's
33+
own helper function) followed by a big-endian integer, either the helper
34+
function has to be written to use `byteorder::Error`, or the calling
35+
code has to deal with two different ways to represent a premature EOF,
36+
depending on which field encountered the EOF condition.
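
As a sketch of the mixed-field case described above, the following uses `read_exact` as it later landed in `std::io::Read`, where the error kind is spelled `ErrorKind::UnexpectedEof` rather than this RFC's `UnexpectedEOF`. The `read_record` helper and its record layout (a 20-byte tag followed by a big-endian `u32`) are hypothetical and chosen only for illustration.

```rust
use std::io::{self, Read};

/// Hypothetical record layout: a 20-byte tag followed by a big-endian u32.
fn read_record<R: Read>(reader: &mut R) -> io::Result<([u8; 20], u32)> {
    // Both fields go through the same method, so a premature EOF in
    // either one surfaces as the same `io::Error` kind.
    let mut tag = [0u8; 20];
    reader.read_exact(&mut tag)?;

    let mut value = [0u8; 4];
    reader.read_exact(&mut value)?;

    Ok((tag, u32::from_be_bytes(value)))
}
```

With a single error representation, the caller does not need to know which field hit the premature EOF.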
37+
38+
Additionally, when reading from an in-memory buffer, looping is not
39+
necessary; it can be replaced by a size comparison followed by a
40+
`copy_memory` (similar to `write_all` for `&mut [u8]`). If this
41+
non-looping implementation is `#[inline]`, and the buffer size is known
42+
(for instance, it's a fixed-size buffer in the stack, or there was an
43+
earlier check of the buffer size against a larger value), the compiler
44+
could potentially turn a read from the buffer followed by an endianness
45+
conversion into the native endianness (as can happen when using the
46+
`byteorder` crate) into a single-instruction direct load from the buffer
47+
into a register.
2848

2949
# Detailed design
3050

31-
The following methods will be added to the `Read` trait:
51+
First, a new variant `UnexpectedEOF` is added to the `io::ErrorKind` enum.
52+
53+
The following method is added to the `Read` trait:
3254

3355
``` rust
34-
fn read_full(&mut self, buf: &mut [u8]) -> Result<usize>;
3556
fn read_exact(&mut self, buf: &mut [u8]) -> Result<()>;
3657
```
3758

38-
Additionally, default implementations of these methods will be provided:
59+
Additionally, a default implementation of this method is provided:
3960

4061
``` rust
41-
fn read_full(&mut self, mut buf: &mut [u8]) -> Result<usize> {
42-
    let mut read = 0;
43-
    while buf.len() > 0 {
62+
fn read_exact(&mut self, mut buf: &mut [u8]) -> Result<()> {
63+
    while !buf.is_empty() {
4464
        match self.read(buf) {
4565
            Ok(0) => break,
46-
            Ok(n) => { read += n; let tmp = buf; buf = &mut tmp[n..]; }
66+
            Ok(n) => { let tmp = buf; buf = &mut tmp[n..]; }
4767
            Err(ref e) if e.kind() == ErrorKind::Interrupted => {}
4868
            Err(e) => return Err(e),
4969
        }
5070
    }
51-
    Ok(read)
52-
}
53-
54-
fn read_exact(&mut self, buf: &mut [u8]) -> Result<()> {
55-
    if try!(self.read_full(buf)) != buf.len() {
71+
    if !buf.is_empty() {
5672
        Err(Error::new(ErrorKind::UnexpectedEOF, "failed to fill whole buffer"))
5773
    } else {
5874
        Ok(())
5975
    }
6076
}
6177
```
6278

63-
Finally, a new `ErrorKind::UnexpectedEOF` will be introduced, which will be
64-
returned by `read_exact` in the event of a premature EOF.
79+
And an optimized implementation of this method for `&[u8]` is provided:
80+
81+
```rust
82+
#[inline]
83+
fn read_exact(&mut self, buf: &mut [u8]) -> Result<()> {
84+
    if buf.len() > self.len() {
85+
        return Err(Error::new(ErrorKind::UnexpectedEOF, "failed to fill whole buffer"));
86+
    }
87+
    let (a, b) = self.split_at(buf.len());
88+
    slice::bytes::copy_memory(a, buf);
89+
    *self = b;
90+
    Ok(())
91+
}
92+
```
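
To illustrate how the `&[u8]` case behaves from the caller's side, here is a minimal sketch against today's `std::io::Read`, assuming the stabilized spelling `ErrorKind::UnexpectedEof`. The `take_array` helper is hypothetical; it only shows that a successful `read_exact` on a slice advances the slice past the bytes it copied.

```rust
use std::io::{self, Read};

/// Hypothetical helper: copies `N` bytes off the front of `input`,
/// advancing the slice on success.
fn take_array<const N: usize>(input: &mut &[u8]) -> io::Result<[u8; N]> {
    let mut out = [0u8; N];
    // For `&[u8]` this needs no loop: a length check and one copy suffice.
    input.read_exact(&mut out)?;
    Ok(out)
}
```

Because the requested size is a compile-time constant here, this is the kind of call where an inlined implementation gives the optimizer the most room.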
93+
94+
The detailed semantics of `read_exact` are as follows: `read_exact`
95+
reads exactly the number of bytes needed to completely fill its `buf`
96+
parameter. If that's not possible due to an "end of file" condition
97+
(that is, the `read` method would return 0 even when passed a buffer
98+
with at least one byte), it returns an `ErrorKind::UnexpectedEOF` error.
99+
100+
On success, the read pointer is advanced by the number of bytes read, as
101+
if the `read` method had been called repeatedly to fill the buffer. On
102+
any failure (including an `ErrorKind::UnexpectedEOF`), the read pointer
103+
might have been advanced by any number between zero and the number of
104+
bytes requested (inclusive), and the contents of its `buf` parameter
105+
should be treated as garbage (any part of it might or might not have
106+
been overwritten by unspecified data).
107+
108+
Even if the failure was an `ErrorKind::UnexpectedEOF`, the read pointer
109+
might have been advanced by a number of bytes less than the number of
110+
bytes which could be read before reaching an "end of file" condition.
111+
112+
The `read_exact` method will never return an `ErrorKind::Interrupted`
113+
error, similar to the `read_to_end` method.
114+
115+
Similar to the `read` method, no guarantees are provided about the
116+
contents of `buf` when this function is called; implementations cannot
117+
rely on any property of the contents of `buf` being true. It is
118+
recommended that implementations only write data to `buf` instead of
119+
reading its contents.
120+
121+
# About ErrorKind::Interrupted
122+
123+
Whether or not `read_exact` can return an `ErrorKind::Interrupted` error
124+
is orthogonal to its semantics. One could imagine an alternative design
125+
where `read_exact` could return an `ErrorKind::Interrupted` error.
126+
127+
The reason `read_exact` should deal with `ErrorKind::Interrupted` itself
128+
is its non-idempotence. On failure, it might have already partially
129+
advanced its read pointer an unknown number of bytes, which means it
130+
can't be easily retried after an `ErrorKind::Interrupted` error.
131+
132+
One could argue that it could return an `ErrorKind::Interrupted` error
133+
if it's interrupted before the read pointer is advanced. But that
134+
introduces a non-orthogonality in the design, where it might either
135+
return or retry depending on whether it was interrupted at the beginning
136+
or in the middle. Therefore, the cleanest semantics is to always retry.
137+
138+
There's precedent for this choice in the `read_to_end` method. Users who
139+
need finer control should use the `read` method directly.
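
The retry behaviour can be sketched with a deliberately awkward reader. `Flaky` and `read_four` below are hypothetical names; the example relies on the default `read_exact` in today's `std::io::Read`, which retries on `ErrorKind::Interrupted` and reports a premature EOF as `ErrorKind::UnexpectedEof`.

```rust
use std::io::{self, ErrorKind, Read};

/// Hypothetical reader that fails with `Interrupted` on every other call
/// and otherwise yields at most one byte per call.
struct Flaky<'a> {
    data: &'a [u8],
    interrupt_next: bool,
}

impl<'a> Read for Flaky<'a> {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        if self.interrupt_next {
            self.interrupt_next = false;
            return Err(io::Error::new(ErrorKind::Interrupted, "simulated signal"));
        }
        self.interrupt_next = true;
        // Short read: hand out a single byte at a time.
        let n = buf.len().min(1).min(self.data.len());
        buf[..n].copy_from_slice(&self.data[..n]);
        self.data = &self.data[n..];
        Ok(n)
    }
}

/// Fills a 4-byte buffer from `data` through the flaky reader.
fn read_four(data: &[u8]) -> io::Result<[u8; 4]> {
    let mut reader = Flaky { data, interrupt_next: true };
    let mut out = [0u8; 4];
    // The caller never observes `Interrupted`; the loop inside retries.
    reader.read_exact(&mut out)?;
    Ok(out)
}
```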
140+
141+
# About the read pointer
142+
143+
This RFC proposes a `read_exact` function where the read pointer
144+
(conceptually, what would be returned by `Seek::seek` if the stream was
145+
seekable) is unspecified on failure: it might not have advanced at all,
146+
have advanced in full, or advanced partially.
147+
148+
Two possible alternatives could be considered: never advance the read
149+
pointer on failure, or always advance the read pointer to the "point of
150+
error" (in the case of `ErrorKind::UnexpectedEOF`, to the end of the
151+
stream).
152+
153+
Never advancing the read pointer on failure would make it impossible to
154+
have a default implementation (which calls `read` in a loop), unless the
155+
stream was seekable. It would also impose extra costs (like creating a
156+
temporary buffer) to allow "seeking back" for non-seekable streams.
157+
158+
Always advancing the read pointer to the end on failure is possible; it
159+
happens without any extra code in the default implementation. However,
160+
it can introduce extra costs in optimized implementations. For instance,
161+
the implementation given above for `&[u8]` would need a few more
162+
instructions in the error case. Some implementations (for instance,
163+
reading from a compressed stream) might have a larger extra cost.
164+
165+
The utility of always advancing the read pointer to the end is
166+
questionable; for non-seekable streams, there's not much that can be
167+
done on an "end of file" condition, so most users would discard the
168+
stream in both an "end of file" and an `ErrorKind::UnexpectedEOF`
169+
situation. For seekable streams, it's easy to seek back, but most users
170+
would treat an `ErrorKind::UnexpectedEOF` as a "corrupted file" and
171+
discard the stream anyway.
172+
173+
Users who need finer control should use the `read` method directly, or
174+
when available use the `Seek` trait.
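
For seekable streams, the "seek back" option mentioned above might look like the following sketch. The `read_exact_or_rewind` helper is hypothetical, and the code assumes today's `std::io` names (`Seek::stream_position`, `ErrorKind::UnexpectedEof`).

```rust
use std::io::{self, ErrorKind, Read, Seek, SeekFrom};

/// Hypothetical helper: tries to fill `buf`; on a premature EOF it seeks
/// back to where the read started and returns `Ok(false)`.
fn read_exact_or_rewind<R: Read + Seek>(reader: &mut R, buf: &mut [u8]) -> io::Result<bool> {
    let start = reader.stream_position()?;
    match reader.read_exact(buf) {
        Ok(()) => Ok(true),
        Err(ref e) if e.kind() == ErrorKind::UnexpectedEof => {
            // The read pointer is unspecified after a failure, so it is
            // restored explicitly rather than assumed.
            reader.seek(SeekFrom::Start(start))?;
            Ok(false)
        }
        Err(e) => Err(e),
    }
}
```

The contents of `buf` should still be treated as garbage when this returns `Ok(false)`.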
175+
176+
# Naming
177+
178+
It's unfortunate that `write_all` used `WriteZero` for its `ErrorKind`;
179+
were it named `UnexpectedEOF` (which is a much more intuitive name), the
180+
same `ErrorKind` could be used for both functions.
181+
182+
The initial proposal for this `read_exact` method called it `read_all`,
183+
for symmetry with `write_all`. However, that name could also be
184+
interpreted as "read as many bytes as you can that fit on this buffer,
185+
and return what you could read" instead of "read enough bytes to fill
186+
this buffer, and fail if you couldn't read them all". The previous
187+
discussion led to `read_exact` for the latter meaning, and `read_full`
188+
for the former meaning.
65189

66190
# Drawbacks
67191

68-
Like `write_all`, these APIs are lossy: in the event of an error, there is no
69-
way to determine the number of bytes that were successfully read before the
70-
error. However, doing so would complicate the methods, and the caller will want
71-
to simply fail if an error occurs the vast majority of the time. Situations
72-
that require lower level control can still use `read` directly.
192+
If this method fails, the buffer contents are undefined; the
193+
`read_exact` method might have partially overwritten it. If the caller
194+
requires "all-or-nothing" semantics, it must clone the buffer. In most
195+
use cases, this is not a problem; the caller will discard or overwrite
196+
the buffer in case of failure.
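
One way to get "all-or-nothing" behaviour is to read into a scratch buffer and copy only on success, a variant of cloning the buffer. The sketch below assumes today's `std::io::Read::read_exact`; the `read_exact_all_or_nothing` helper is hypothetical.

```rust
use std::io::{self, Read};

/// Hypothetical helper: fills `dest` only if the whole read succeeds.
/// On failure `dest` keeps its previous contents, at the cost of a
/// temporary allocation and an extra copy.
fn read_exact_all_or_nothing<R: Read>(reader: &mut R, dest: &mut [u8]) -> io::Result<()> {
    let mut scratch = vec![0u8; dest.len()];
    reader.read_exact(&mut scratch)?;
    dest.copy_from_slice(&scratch);
    Ok(())
}
```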
73197

74-
# Unanswered Questions
198+
In the same way, if this method fails, there is no way to determine how
199+
many bytes were read before it determined it couldn't completely fill
200+
the buffer.
75201

76-
Naming. Is `read_full` the best name? Should `UnexpectedEOF` instead be
77-
`ShortRead` or `ReadZero`?
202+
Situations that require lower level control can still use `read`
203+
directly.
78204

79205
# Alternatives
80206

81-
Use a more complicated return type to allow callers to retrieve the number of
82-
bytes successfully read before an error occurred. As explained above, this
83-
would complicate the use of these methods for very little gain. It's worth
84-
noting that git's `read_in_full` is similarly lossy, and just returns an error
85-
even if some bytes have been read.
86-
87-
Only provide `read_exact`, but parameterize the `UnexpectedEOF` or `ShortRead`
88-
error kind with the number of bytes read to allow it to be used in place of
89-
`read_full`. This would be less convenient to use in cases where EOF is not an
90-
error.
91-
92-
Only provide `read_full`. This would cover most of the convenience (callers
93-
could avoid the read loop), but callers requiring a filled buffer would have to
94-
manually check if all of the desired bytes were read.
95-
96-
Finally, we could leave this out, and let every Rust user needing this
97-
functionality continue to write their own `read_full` or `read_exact` function,
98-
or have to track down an external crate just for one straightforward and
99-
commonly used convenience method.
207+
The first alternative is to do nothing. Every Rust user needing this
208+
functionality continues to write their own `read_full` or `read_exact`
209+
function, or has to track down an external crate just for one
210+
straightforward and commonly used convenience method. Additionally,
211+
unless everybody uses the same external crate, every reimplementation of
212+
this method will have slightly different error handling, complicating
213+
mixing uses of multiple copies of this convenience method.
214+
215+
The second alternative is to just add the `ErrorKind::UnexpectedEOF` or
216+
similar. This would lead in the long run to everybody using the same
217+
error handling for their version of this convenience method, simplifying
218+
mixing their uses. However, it's questionable to add an `ErrorKind`
219+
variant which is never used by the standard library.
220+
221+
Another alternative is to return the number of bytes read in the error
222+
case. That makes the buffer contents defined also in the error case, at
223+
the cost of increasing the size of the frequently-used `io::Error`
224+
struct, for a rarely used return value. My objections to this
225+
alternative are:
226+
227+
* If the caller has a use for the partially written buffer contents,
228+
then it's treating the "buffer partially filled" case as an
229+
alternative success case, not as a failure case. This is not a good
230+
match for the semantics of an `Err` return.
231+
* Determining that the buffer cannot be completely filled can in some
232+
cases be much faster than doing a partial copy. Many callers are not
233+
going to be interested in an incomplete read, meaning that all the
234+
work of filling the buffer is wasted.
235+
* As mentioned, it increases the size of a commonly used type in all
236+
cases, even when the code has no mention of `read_exact`.
237+
238+
The final alternative is `read_full`, which returns the number of bytes
239+
read (`Result<usize>`) instead of failing. This means that every caller
240+
has to check the return value against the size of the passed buffer, and
241+
some are going to forget (or misimplement) the check. It also prevents
242+
some optimizations (like the early return in case there will never be
243+
enough data). There are, however, valid use cases for this alternative;
244+
for instance, reading a file in fixed-size chunks, where the last chunk
245+
(and only the last chunk) can be shorter. I believe this should be
246+
discussed as a separate proposal; its pros and cons are distinct enough
247+
from this proposal to merit its own arguments.
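
For comparison, a `read_full` along the lines of the earlier revision of this RFC can be written outside the standard library in a few lines. The free-function form and the `chunks` helper below are assumptions made for illustration; they sketch the fixed-size-chunk use case and are not a proposed API.

```rust
use std::io::{self, ErrorKind, Read};

/// Sketch of the `read_full` alternative: fills as much of `buf` as it
/// can and returns the number of bytes read, which is short only at EOF.
fn read_full<R: Read>(reader: &mut R, mut buf: &mut [u8]) -> io::Result<usize> {
    let mut read = 0;
    while !buf.is_empty() {
        match reader.read(buf) {
            Ok(0) => break,
            Ok(n) => {
                read += n;
                let tmp = buf;
                buf = &mut tmp[n..];
            }
            Err(ref e) if e.kind() == ErrorKind::Interrupted => {}
            Err(e) => return Err(e),
        }
    }
    Ok(read)
}

/// Hypothetical helper: reads `reader` in chunks of `size` bytes, where
/// only the last chunk may be shorter.
fn chunks<R: Read>(reader: &mut R, size: usize) -> io::Result<Vec<Vec<u8>>> {
    let mut out = Vec::new();
    loop {
        let mut chunk = vec![0u8; size];
        let n = read_full(reader, &mut chunk)?;
        if n == 0 {
            break;
        }
        chunk.truncate(n);
        out.push(chunk);
        if n < size {
            break;
        }
    }
    Ok(out)
}
```

Note that the caller has to compare the returned count against the chunk size itself, which is the check this RFC expects some callers to forget.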
248+
249+
I believe that the case for `read_full` is weaker than `read_exact`, for
250+
the following reasons:
251+
252+
* While `read_exact` needs an extra variant in `ErrorKind`, `read_full`
253+
has no new error cases. This means that implementing it yourself is
254+
easy, and multiple implementations have no drawbacks other than code
255+
duplication.
256+
* While `read_exact` can be optimized with an early return in cases
257+
where the reader knows its total size (for instance, reading from a
258+
compressed file where the uncompressed size was given in a header),
259+
`read_full` has to always write to the output buffer, so there's not
260+
much to gain over a generic looping implementation calling `read`.
261+
