|
1 |
| -- Feature Name: read_exact and read_full |
| 1 | +- Feature Name: read_exact and ErrorKind::UnexpectedEOF |
2 | 2 | - Start Date: 2015-03-15
|
3 | 3 | - RFC PR: (leave this empty)
|
4 | 4 | - Rust Issue: (leave this empty)
|
5 | 5 |
|
6 | 6 | # Summary
|
7 | 7 |
|
8 |
| -Rust's `Write` trait has `write_all`, which is a convenience method that calls |
9 |
| -`write` repeatedly to write an entire buffer. This proposal adds two similar |
10 |
| -convenience methods to the `Read` trait: `read_full` and `read_exact`. |
11 |
| -`read_full` calls `read` repeatedly until the buffer has been filled, EOF has |
12 |
| -been reached, or an error other than `Interrupted` occurs. `read_exact` is |
13 |
| -similar to `read_full`, except that reaching EOF before filling the buffer is |
14 |
| -considered an error. |
| 8 | +Rust's `Write` trait has the `write_all` method, which is a convenience |
| 9 | +method that writes a whole buffer, failing with `ErrorKind::WriteZero` |
| 10 | +if the buffer cannot be written in full. |
| 11 | + |
| 12 | +This RFC proposes adding its `Read` counterpart: a method (here called |
| 13 | +`read_exact`) that reads a whole buffer, failing with an error (here |
| 14 | +called `ErrorKind::UnexpectedEOF`) if the buffer cannot be read in full. |
15 | 15 |
|
16 | 16 | # Motivation
|
17 | 17 |
|
18 |
| -The `read` method may return fewer bytes than requested, and may fail with an |
19 |
| -`Interrupted` error if a signal is received during the call. This requires |
20 |
| -programs wishing to fill a buffer to call `read` repeatedly in a loop. This is |
21 |
| -a very common need, and it would be nice if this functionality were provided in |
22 |
| -the standard library. Many C and Rust programs have the same need, and solve it |
23 |
| -in the same way. For example, Git has [`read_in_full`][git], which behaves like |
24 |
| -the proposed `read_full`, and the Rust byteorder crate has |
25 |
| -[`read_full`][byteorder], which behaves like the proposed `read_exact`. |
26 |
| -[git]: https://github.com/git/git/blob/16da57c7c6c1fe92b32645202dd19657a89dd67d/wrapper.c#L246 |
27 |
| -[byteorder]: https://github.com/BurntSushi/byteorder/blob/2358ace61332e59f596c9006e1344c97295fdf72/src/new.rs#L184 |
| 18 | +When dealing with serialization formats with fixed-length fields, |
| 19 | +reading or writing less than the field's size is an error. For the |
| 20 | +`Write` side, the `write_all` method does the job; for the `Read` side, |
| 21 | +however, one has to call `read` in a loop until the buffer is completely |
| 22 | +filled, or until a premature EOF is reached. |
| 23 | + |
| 24 | +This leads to a profusion of similar helper functions. For instance, the |
| 25 | +`byteorder` crate has a `read_full` function, and the `postgres` crate |
| 26 | +has a `read_all` function. However, their handling of the premature EOF |
| 27 | +condition differs: the `byteorder` crate has its own `Error` enum, with |
| 28 | +`UnexpectedEOF` and `Io` variants, while the `postgres` crate uses an |
| 29 | +`io::Error` with an `io::ErrorKind::Other`. |
| 30 | + |
| 31 | +That can make it unnecessarily hard to mix uses of these helper |
| 32 | +functions; for instance, if one wants to read a 20-byte tag (using one's |
| 33 | +own helper function) followed by a big-endian integer, either the helper |
| 34 | +function has to be written to use `byteorder::Error`, or the calling |
| 35 | +code has to deal with two different ways to represent a premature EOF, |
| 36 | +depending on which field encountered the EOF condition. |
| 37 | + |
| 38 | +Additionally, when reading from an in-memory buffer, looping is not |
| 39 | +necessary; it can be replaced by a size comparison followed by a |
| 40 | +`copy_memory` (similar to `write_all` for `&mut [u8]`). If this |
| 41 | +non-looping implementation is `#[inline]`, and the buffer size is known |
| 42 | +(for instance, it's a fixed-size buffer in the stack, or there was an |
| 43 | +earlier check of the buffer size against a larger value), the compiler |
| 44 | +could potentially turn a read from the buffer followed by an endianness |
| 45 | +conversion into the native endianness (as can happen when using the |
| 46 | +`byteorder` crate) into a single-instruction direct load from the buffer |
| 47 | +into a register. |
28 | 48 |
|
29 | 49 | # Detailed design
|
30 | 50 |
|
31 |
| -The following methods will be added to the `Read` trait: |
| 51 | +First, a new variant `UnexpectedEOF` is added to the `io::ErrorKind` enum. |
| 52 | + |
| 53 | +The following method is added to the `Read` trait: |
32 | 54 |
|
33 | 55 | ``` rust
|
34 |
| -fn read_full(&mut self, buf: &mut [u8]) -> Result<usize>; |
35 | 56 | fn read_exact(&mut self, buf: &mut [u8]) -> Result<()>;
|
36 | 57 | ```
|
37 | 58 |
|
38 |
| -Additionally, default implementations of these methods will be provided: |
| 59 | +Additionally, a default implementation of this method is provided: |
39 | 60 |
|
40 | 61 | ``` rust
|
41 |
| -fn read_full(&mut self, mut buf: &mut [u8]) -> Result<usize> { |
42 |
| - let mut read = 0; |
43 |
| - while buf.len() > 0 { |
| 62 | +fn read_exact(&mut self, mut buf: &mut [u8]) -> Result<()> { |
| 63 | + while !buf.is_empty() { |
44 | 64 | match self.read(buf) {
|
45 | 65 | Ok(0) => break,
|
46 |
| - Ok(n) => { read += n; let tmp = buf; buf = &mut tmp[n..]; } |
| 66 | + Ok(n) => { let tmp = buf; buf = &mut tmp[n..]; } |
47 | 67 | Err(ref e) if e.kind() == ErrorKind::Interrupted => {}
|
48 | 68 | Err(e) => return Err(e),
|
49 | 69 | }
|
50 | 70 | }
|
51 |
| - Ok(read) |
52 |
| -} |
53 |
| - |
54 |
| -fn read_exact(&mut self, buf: &mut [u8]) -> Result<()> { |
55 |
| - if try!(self.read_full(buf)) != buf.len() { |
| 71 | + if !buf.is_empty() { |
56 | 72 | Err(Error::new(ErrorKind::UnexpectedEOF, "failed to fill whole buffer"))
|
57 | 73 | } else {
|
58 | 74 | Ok(())
|
59 | 75 | }
|
60 | 76 | }
|
61 | 77 | ```
|
62 | 78 |
|
63 |
| -Finally, a new `ErrorKind::UnexpectedEOF` will be introduced, which will be |
64 |
| -returned by `read_exact` in the event of a premature EOF. |
| 79 | +And an optimized implementation of this method for `&[u8]` is provided: |
| 80 | + |
| 81 | +```rust |
| 82 | +#[inline] |
| 83 | +fn read_exact(&mut self, buf: &mut [u8]) -> Result<()> { |
| 84 | + if buf.len() > self.len() { |
| 85 | + return Err(Error::new(ErrorKind::UnexpectedEOF, "failed to fill whole buffer")); |
| 86 | + } |
| 87 | + let (a, b) = self.split_at(buf.len()); |
| 88 | + slice::bytes::copy_memory(a, buf); |
| 89 | + *self = b; |
| 90 | + Ok(()) |
| 91 | +} |
| 92 | +``` |
| 93 | + |
| 94 | +The detailed semantics of `read_exact` are as follows: `read_exact` |
| 95 | +reads exactly the number of bytes needed to completely fill its `buf` |
| 96 | +parameter. If that's not possible due to an "end of file" condition |
| 97 | +(that is, the `read` method would return 0 even when passed a buffer |
| 98 | +with at least one byte), it returns an `ErrorKind::UnexpectedEOF` error. |
| 99 | + |
| 100 | +On success, the read pointer is advanced by the number of bytes read, as |
| 101 | +if the `read` method had been called repeatedly to fill the buffer. On |
| 102 | +any failure (including an `ErrorKind::UnexpectedEOF`), the read pointer |
| 103 | +might have been advanced by any number between zero and the number of |
| 104 | +bytes requested (inclusive), and the contents of its `buf` parameter |
| 105 | +should be treated as garbage (any part of it might or might not have |
| 106 | +been overwritten by unspecified data). |
| 107 | + |
| 108 | +Even if the failure was an `ErrorKind::UnexpectedEOF`, the read pointer |
| 109 | +might have been advanced by a number of bytes less than the number of |
| 110 | +bytes which could be read before reaching an "end of file" condition. |
| 111 | + |
| 112 | +The `read_exact` method will never return an `ErrorKind::Interrupted` |
| 113 | +error, similar to the `read_to_end` method. |
| 114 | + |
| 115 | +Similar to the `read` method, no guarantees are provided about the |
| 116 | +contents of `buf` when this function is called; implementations cannot |
| 117 | +rely on any property of the contents of `buf` being true. It is |
| 118 | +recommended that implementations only write data to `buf` instead of |
| 119 | +reading its contents. |
| 120 | + |
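As a rough illustration of the semantics described above, the sketch below reads a fixed-length record through `read_exact` as it exists in today's standard library, where the error kind is spelled `ErrorKind::UnexpectedEof` rather than the `UnexpectedEOF` spelling used in this RFC. The helpers `read_record` and `is_truncated`, and the record layout (a 4-byte tag followed by a big-endian `u32`), are hypothetical and exist only for this example.

```rust
use std::io::{self, ErrorKind, Read};

/// Hypothetical helper: reads a 4-byte tag followed by a big-endian u32.
/// Either field fails with `UnexpectedEof` if the input ends early.
fn read_record<R: Read>(reader: &mut R) -> io::Result<([u8; 4], u32)> {
    let mut tag = [0u8; 4];
    reader.read_exact(&mut tag)?;
    let mut value = [0u8; 4];
    reader.read_exact(&mut value)?;
    Ok((tag, u32::from_be_bytes(value)))
}

/// Hypothetical helper: true when the input ended before the record was complete.
fn is_truncated(mut input: &[u8]) -> bool {
    match read_record(&mut input) {
        Err(e) => e.kind() == ErrorKind::UnexpectedEof,
        Ok(_) => false,
    }
}
```

Both fields report a premature EOF the same way, so the caller needs only one error path regardless of which field was cut short.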
| 121 | +# About ErrorKind::Interrupted |
| 122 | + |
| 123 | +Whether or not `read_exact` can return an `ErrorKind::Interrupted` error |
| 124 | +is orthogonal to its semantics. One could imagine an alternative design |
| 125 | +where `read_exact` could return an `ErrorKind::Interrupted` error. |
| 126 | + |
| 127 | +The reason `read_exact` should deal with `ErrorKind::Interrupted` itself |
| 128 | +is its non-idempotence. On failure, it might have already partially |
| 129 | +advanced its read pointer an unknown number of bytes, which means it |
| 130 | +can't be easily retried after an `ErrorKind::Interrupted` error. |
| 131 | + |
| 132 | +One could argue that it could return an `ErrorKind::Interrupted` error |
| 133 | +if it's interrupted before the read pointer is advanced. But that |
| 134 | +introduces a non-orthogonality in the design, where it might either |
| 135 | +return or retry depending on whether it was interrupted at the beginning |
| 136 | +or in the middle. Therefore, the cleanest semantics is to always retry. |
| 137 | + |
| 138 | +There's precedent for this choice in the `read_to_end` method. Users who |
| 139 | +need finer control should use the `read` method directly. |
| 140 | + |
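The retry behaviour can be observed with a deliberately awkward reader. This is only a sketch: `Flaky` and `read_four` are hypothetical names, and the code relies on the default `read_exact` of today's standard library, which retries on `ErrorKind::Interrupted` as this RFC proposes.

```rust
use std::io::{self, ErrorKind, Read};

/// Hypothetical test reader: fails with `Interrupted` on every other call
/// and otherwise hands out at most one byte at a time.
struct Flaky<'a> {
    data: &'a [u8],
    interrupt_next: bool,
}

impl<'a> Read for Flaky<'a> {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        if self.interrupt_next {
            self.interrupt_next = false;
            return Err(io::Error::new(ErrorKind::Interrupted, "simulated signal"));
        }
        self.interrupt_next = true;
        // Copy zero or one byte, depending on what is left on either side.
        let n = buf.len().min(self.data.len()).min(1);
        buf[..n].copy_from_slice(&self.data[..n]);
        self.data = &self.data[n..];
        Ok(n)
    }
}

/// Fills a 4-byte buffer through the default `read_exact` implementation.
fn read_four(data: &[u8]) -> io::Result<[u8; 4]> {
    let mut reader = Flaky { data, interrupt_next: true };
    let mut out = [0u8; 4];
    reader.read_exact(&mut out)?;
    Ok(out)
}
```

The caller never sees `Interrupted`; it sees either a filled buffer or, when the data runs out first, an unexpected-EOF error.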
| 141 | +# About the read pointer |
| 142 | + |
| 143 | +This RFC proposes a `read_exact` function where the read pointer |
| 144 | +(conceptually, what would be returned by `Seek::seek` if the stream was |
| 145 | +seekable) is unspecified on failure: it might not have advanced at all, |
| 146 | +have advanced in full, or advanced partially. |
| 147 | + |
| 148 | +Two possible alternatives could be considered: never advance the read |
| 149 | +pointer on failure, or always advance the read pointer to the "point of |
| 150 | +error" (in the case of `ErrorKind::UnexpectedEOF`, to the end of the |
| 151 | +stream). |
| 152 | + |
| 153 | +Never advancing the read pointer on failure would make it impossible to |
| 154 | +have a default implementation (which calls `read` in a loop), unless the |
| 155 | +stream was seekable. It would also impose extra costs (like creating a |
| 156 | +temporary buffer) to allow "seeking back" for non-seekable streams. |
| 157 | + |
| 158 | +Always advancing the read pointer to the end on failure is possible; it |
| 159 | +happens without any extra code in the default implementation. However, |
| 160 | +it can introduce extra costs in optimized implementations. For instance, |
| 161 | +the implementation given above for `&[u8]` would need a few more |
| 162 | +instructions in the error case. Some implementations (for instance, |
| 163 | +reading from a compressed stream) might have a larger extra cost. |
| 164 | + |
| 165 | +The utility of always advancing the read pointer to the end is |
| 166 | +questionable; for non-seekable streams, there's not much that can be |
| 167 | +done on an "end of file" condition, so most users would discard the |
| 168 | +stream in both an "end of file" and an `ErrorKind::UnexpectedEOF` |
| 169 | +situation. For seekable streams, it's easy to seek back, but most users |
| 170 | +would treat an `ErrorKind::UnexpectedEOF` as a "corrupted file" and |
| 171 | +discard the stream anyway. |
| 172 | + |
| 173 | +Users who need finer control should use the `read` method directly, or |
| 174 | +when available use the `Seek` trait. |
| 175 | + |
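For seekable streams, a caller that does care about the read pointer can restore it explicitly. The following is a minimal sketch, assuming a `Read + Seek` stream and today's standard library; `read_exact_or_rewind` is a hypothetical helper, not part of this proposal.

```rust
use std::io::{self, Read, Seek, SeekFrom};

/// Hypothetical helper: tries to fill `buf`; on any error, seeks back to
/// the position the stream had before the attempt and returns the error.
fn read_exact_or_rewind<R: Read + Seek>(reader: &mut R, buf: &mut [u8]) -> io::Result<()> {
    let start = reader.stream_position()?;
    match reader.read_exact(buf) {
        Ok(()) => Ok(()),
        Err(e) => {
            // The read pointer is unspecified after a failure, so restore it.
            reader.seek(SeekFrom::Start(start))?;
            Err(e)
        }
    }
}
```

This only restores the position. The contents of `buf` after a failure should still be treated as garbage.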
| 176 | +# Naming |
| 177 | + |
| 178 | +It's unfortunate that `write_all` used `WriteZero` for its `ErrorKind`; |
| 179 | +were it named `UnexpectedEOF` (which is a much more intuitive name), the |
| 180 | +same `ErrorKind` could be used for both functions. |
| 181 | + |
| 182 | +The initial proposal for this `read_exact` method called it `read_all`, |
| 183 | +for symmetry with `write_all`. However, that name could also be |
| 184 | +interpreted as "read as many bytes as you can that fit on this buffer, |
| 185 | +and return what you could read" instead of "read enough bytes to fill |
| 186 | +this buffer, and fail if you couldn't read them all". The previous |
| 187 | +discussion led to `read_exact` for the latter meaning, and `read_full` |
| 188 | +for the former meaning. |
65 | 189 |
|
66 | 190 | # Drawbacks
|
67 | 191 |
|
68 |
| -Like `write_all`, these APIs are lossy: in the event of an error, there is no |
69 |
| -way to determine the number of bytes that were successfully read before the |
70 |
| -error. However, doing so would complicate the methods, and the caller will want |
71 |
| -to simply fail if an error occurs the vast majority of the time. Situations |
72 |
| -that require lower level control can still use `read` directly. |
| 192 | +If this method fails, the buffer contents are undefined; the |
| 193 | +`read_exact` method might have partially overwritten it. If the caller |
| 194 | +requires "all-or-nothing" semantics, it must clone the buffer. In most |
| 195 | +use cases, this is not a problem; the caller will discard or overwrite |
| 196 | +the buffer in case of failure. |
73 | 197 |
|
74 |
| -# Unanswered Questions |
| 198 | +In the same way, if this method fails, there is no way to determine how |
| 199 | +many bytes were read before it determined it couldn't completely fill |
| 200 | +the buffer. |
75 | 201 |
|
76 |
| -Naming. Is `read_full` the best name? Should `UnexpectedEOF` instead be |
77 |
| -`ShortRead` or `ReadZero`? |
| 202 | +Situations that require lower level control can still use `read` |
| 203 | +directly. |
78 | 204 |
|
79 | 205 | # Alternatives
|
80 | 206 |
|
81 |
| -Use a more complicated return type to allow callers to retrieve the number of |
82 |
| -bytes successfully read before an error occurred. As explained above, this |
83 |
| -would complicate the use of these methods for very little gain. It's worth |
84 |
| -noting that git's `read_in_full` is similarly lossy, and just returns an error |
85 |
| -even if some bytes have been read. |
86 |
| - |
87 |
| -Only provide `read_exact`, but parameterize the `UnexpectedEOF` or `ShortRead` |
88 |
| -error kind with the number of bytes read to allow it to be used in place of |
89 |
| -`read_full`. This would be less convenient to use in cases where EOF is not an |
90 |
| -error. |
91 |
| - |
92 |
| -Only provide `read_full`. This would cover most of the convenience (callers |
93 |
| -could avoid the read loop), but callers requiring a filled buffer would have to |
94 |
| -manually check if all of the desired bytes were read. |
95 |
| - |
96 |
| -Finally, we could leave this out, and let every Rust user needing this |
97 |
| -functionality continue to write their own `read_full` or `read_exact` function, |
98 |
| -or have to track down an external crate just for one straightforward and |
99 |
| -commonly used convenience method. |
| 207 | +The first alternative is to do nothing. Every Rust user needing this |
| 208 | +functionality continues to write their own `read_full` or `read_exact` |
| 209 | +function, or has to track down an external crate just for one |
| 210 | +straightforward and commonly used convenience method. Additionally, |
| 211 | +unless everybody uses the same external crate, every reimplementation of |
| 212 | +this method will have slightly different error handling, complicating |
| 213 | +mixing uses of multiple copies of this convenience method. |
| 214 | + |
| 215 | +The second alternative is to just add the `ErrorKind::UnexpectedEOF` or |
| 216 | +similar. This would lead in the long run to everybody using the same |
| 217 | +error handling for their version of this convenience method, simplifying |
| 218 | +mixing their uses. However, it's questionable to add an `ErrorKind` |
| 219 | +variant which is never used by the standard library. |
| 220 | + |
| 221 | +Another alternative is to return the number of bytes read in the error |
| 222 | +case. That makes the buffer contents defined also in the error case, at |
| 223 | +the cost of increasing the size of the frequently-used `io::Error` |
| 224 | +struct, for a rarely used return value. My objections to this |
| 225 | +alternative are: |
| 226 | + |
| 227 | +* If the caller has a use for the partially written buffer contents, |
| 228 | + then it's treating the "buffer partially filled" case as an |
| 229 | + alternative success case, not as a failure case. This is not a good |
| 230 | + match for the semantics of an `Err` return. |
| 231 | +* Determining that the buffer cannot be completely filled can in some |
| 232 | + cases be much faster than doing a partial copy. Many callers are not |
| 233 | + going to be interested in an incomplete read, meaning that all the |
| 234 | + work of filling the buffer is wasted. |
| 235 | +* As mentioned, it increases the size of a commonly used type in all |
| 236 | + cases, even when the code has no mention of `read_exact`. |
| 237 | + |
| 238 | +The final alternative is `read_full`, which returns the number of bytes |
| 239 | +read (`Result<usize>`) instead of failing. This means that every caller |
| 240 | +has to check the return value against the size of the passed buffer, and |
| 241 | +some are going to forget (or misimplement) the check. It also prevents |
| 242 | +some optimizations (like the early return in case there will never be |
| 243 | +enough data). There are, however, valid use cases for this alternative; |
| 244 | +for instance, reading a file in fixed-size chunks, where the last chunk |
| 245 | +(and only the last chunk) can be shorter. I believe this should be |
| 246 | +discussed as a separate proposal; its pros and cons are distinct enough |
| 247 | +from this proposal to merit its own arguments. |
| 248 | + |
| 249 | +I believe that the case for `read_full` is weaker than `read_exact`, for |
| 250 | +the following reasons: |
| 251 | + |
| 252 | +* While `read_exact` needs an extra variant in `ErrorKind`, `read_full` |
| 253 | + has no new error cases. This means that implementing it yourself is |
| 254 | + easy, and multiple implementations have no drawbacks other than code |
| 255 | + duplication. |
| 256 | +* While `read_exact` can be optimized with an early return in cases |
| 257 | + where the reader knows its total size (for instance, reading from a |
| 258 | + compressed file where the uncompressed size was given in a header), |
| 259 | + `read_full` has to always write to the output buffer, so there's not |
| 260 | + much to gain over a generic looping implementation calling `read`. |
| 261 | + |
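To illustrate the point that `read_full` is easy to implement outside the standard library, here is a minimal sketch as a free function over any `Read`. The name `read_full` follows the discussion above; `chunk_lengths` is a hypothetical caller showing the fixed-size-chunk use case, where only the last chunk may be short.

```rust
use std::io::{self, ErrorKind, Read};

/// Sketch of `read_full`: reads until `buf` is full or EOF is reached,
/// retrying on `Interrupted`, and returns the number of bytes read.
fn read_full<R: Read>(reader: &mut R, buf: &mut [u8]) -> io::Result<usize> {
    let mut filled = 0;
    while filled < buf.len() {
        match reader.read(&mut buf[filled..]) {
            Ok(0) => break, // EOF: report a short count instead of an error
            Ok(n) => filled += n,
            Err(ref e) if e.kind() == ErrorKind::Interrupted => {}
            Err(e) => return Err(e),
        }
    }
    Ok(filled)
}

/// Hypothetical caller: splits the input into fixed-size chunks and
/// returns the length of each one.
fn chunk_lengths(mut input: &[u8], chunk_size: usize) -> io::Result<Vec<usize>> {
    let mut buf = vec![0u8; chunk_size];
    let mut lengths = Vec::new();
    loop {
        let n = read_full(&mut input, &mut buf)?;
        if n == 0 {
            break; // nothing left to read
        }
        lengths.push(n);
        if n < chunk_size {
            break; // a short chunk means EOF was reached
        }
    }
    Ok(lengths)
}
```

Note that every caller has to compare the returned count against the buffer size itself, which is exactly the check that `read_exact` performs on the caller's behalf.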