Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFC for read_all #980

Merged
merged 13 commits into from
Aug 7, 2015
Merged

RFC for read_all #980

merged 13 commits into from
Aug 7, 2015

Conversation

novalis
Copy link
Contributor

@novalis novalis commented Mar 15, 2015

returned. After a read call returns having successfully read some
bytes, the total number of bytes read will be updated. If that
total is equal to the size of the buffer, read will return
successfully.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the reasoning for introducing a new error kind instead of following the pattern of read()'s return value (io::Result<uint>)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Consistency with write_all.
  2. io::Result is either Ok(uint) or Error. read_all has only one kind of ok result: "yep, we read the whole thing". If instead we returned Ok(number of bytes read), then every caller would have to check that the number of bytes read matched their expectations. In other words, we are optimizing for the common case of "either this file has the data, or we're doomed". And because the error on EOF is distinct from other errors, callers who care can choose to match on it and handle it separately. But I expect that most callers will not care.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In other words, we are optimizing for the common case of "either this file has the data, or we're doomed".

Who says that that's the common case? What about: "Fill this buffer as much as possible without me having to check for EINTR"?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think one could use take + read_to_end to do something like that (of course, it allocates a new buffer instead of re-using an old one). It seems likely to me that the "eof is a temporary condition rather than an error" applies more to network streams, and re-using buffers in network stream situations is less safe: http://www.tedunangst.com/flak/post/heartbleed-in-rust

I guess my only argument for read_all being the common case is that I've seen it twice (once in the code I am now writing, and once in the link above), compared to zero for the other one.

I could put this into the rationale section, if you like.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

of course, it allocates a new buffer instead of re-using an old one

Exactly.

It seems likely to me that the "eof is a temporary condition rather than an error"

I wasn't talking about EOF. I was talking about EINTR.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it a four byte difference between i32 errno and u64 usize that we're talking here?

No that's probably manageable, I just wanted to point out that the size of io::Error is a very real worry and it can be affected by ErrorKind depending on how we're implementing it.

Making a separate error type for this method might be the most straightforward approach.

This is possible, but has drawbacks:

  • More API surface area to stabilize
  • Nonstandard as the very similar write_all API would probably also want this treatment.
  • Can hinder understanding by returning a non-normal error type.
  • Operating generically over I/O functions may be hindered as this would stick out.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think write_all can't use a separate error type. The write_all fn should never return WriteZero on a POSIX system (unless a zero-length buffer is passed); instead, it will either succeed or return some other error. So in fact we would have need to annotate all of the other io errors with the number of bytes written. I don't know how useful this would be, and it seems like an odd wart to have to have on io::Error given that it's only useful for write_all and read_all.

By contrast, read_all could just add the number of bytes written to ShortRead. That seems simpler, and I would be happy to adjust this proposal and the implementing Pull Request if that's the consensus.

I don't understand how operating generically would be hindered -- can you explain a bit? I'm not very familiar with the language.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alexcrichton, the second case (Some bytes were read, and then we saw Ok(0)) is the only case for which I would generally care about knowing how many bytes were read. For the I/O error case, the vast majority of the time I don't care about how many bytes were read, I just bail out and return the error. I'm much more likely to care in the Ok(0) case, such as if I'm reading fixed-size records from a file or the network, but the last record is allowed to be short.

Since the EOF condition doesn't exist for write_all, no changes would be needed there for consistency.

One way to implement this would be as follows:

  • Add ShortRead(usize) to Repr
  • Also add ShortRead(usize) to ErrorKind
  • Add Repr::ShortRead(n) => ErrorKind::ShortRead(n) to Error::kind (and also update description and detail)

This would not increase the size of io::Error. (io::Error already has Custom(Box<Custom>), which is the same size). It would increase the size of Custom, but that always gets boxed. It would also allow the return value to stay io::Result<()>, as desired, and would maintain usability by keeping the check for a short read out of the normal program flow and allowing a short read to be treated as a normal error value (via try!, et cetera) unless the user explicitly wants to handle it differently.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@novalis

I don't understand how operating generically would be hindered -- can you explain a bit?

This may not be too too common, but if you are operating in a context that expects io::Result<T> to be returned via something like:

fn foo<F>(f: F) -> io::Result<T> 
    where F: Fn() -> io::Result<T> {
    // ...
}

You can't naturally use read_all with a different error type as you would be able to otherwise. You can probably get around it with try! and Ok, but it's somewhat unfortunate to have to do so.


@rkjnsn

Since the EOF condition doesn't exist for write_all, no changes would be needed there for consistency.

Technically Ok(0) is indeed special as cases like Write on &mut [u8] will end up generating this once you hit the end of the buffer.

One way to implement this would be as follows:

Hm yes, that does seem plausible! I would still be a little wary of adding this error, but we do have precedent with WriteZero, so it may no be so bad.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alexcrichton
Technically Ok(0) is indeed special as cases like Write on &mut [u8] will end up generating this once you hit the end of the buffer.

Really? I would expect and strongly argue that that should generate an IO error, even for a standard write call, and not return Ok(0). In my mind, this is analogous to a pipe or a socket with the remote end closed, where reading yields end of file, while writing results in an error. As perhaps a closer analogy to the fixed buffer case, on Linux, if I have a fixed-size block device (say a small ramdisk), and I read to the end, subsequent reads will return 0 (end of file). If, however, I reach the end of the file and try to write more, I'll receive an error (ENOSPC). Similarly, writing to a file that has reached the maximum allowed size results in (EFBIG), not a successful write of zero size.

Is there any chance this can be changed?

@rkjnsn
Copy link
Contributor

rkjnsn commented Mar 19, 2015

👍 to this in its current form.


If we wanted io::Error to be a smaller type, ErrorKind::ShortRead
could be unparameterized. But this would reduce the information
available to calleres.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would note that having a parameterized ErrorKind::ShortRead wouldn't actually increase the size of io::Error, only that of ErrorKind.

@novalis
Copy link
Contributor Author

novalis commented Mar 22, 2015

Thanks for all the comments so far. Is there anything else I should be doing here? Or should I just be patient and let the process work its magic?

@aturon
Copy link
Member

aturon commented Mar 23, 2015

@novalis

Thanks for the PR! We're a little slow dealing with RFCs at the moment, since the focus is on pushing out the beta release next week.

This RFC touches on some issues that were discussed pretty extensively during IO reform:

In particular, we used to have some convenience methods like this (see read_exact and read_at_least), but the consensus was that these weren't pulling their weight.

Note that it's quite straightforward to build libraries outside of std that supplement the behavior of all readers with conveniences like the one proposed here, so it's not critical that std provide every last convenience. (Over time, we expect the set to grow as external libraries become de facto standard.)

Note also that there's an inherent asymmetry between reading and writing here, in that it's easy to get a slice of an entire existing data structure (Vec<u8>, say), whereas it's perhaps a bit less common to have a perfectly-sized slice to read into (though I can certainly imagine some cases like this).

In short, it'd help to have more discussion about the scenarios in which this convenience is a big win, to the point that it should be included in the standard library (that has just been pared down), keeping in mind that std can be extended over time.

cc @alexcrichton

@novalis
Copy link
Contributor Author

novalis commented Mar 25, 2015

Note also that there's an inherent asymmetry between reading and writing here, in that it's easy to get a slice of an entire existing data structure (Vec<u8>, say), whereas it's perhaps a bit less common to have a perfectly-sized slice to read into (though I can certainly imagine some cases like this).

That's a very good point. The use cases I'm thinking about include the ones in byteorder, where a buffer whose size is known at compile-time could be allocated on the stack, and ones in my own code, where the read size is not known at compile-time. Both are fairly common in applications I have seen. Given read_all(&mut [u8]), users can always allocate a Vec of the correct size and fill it up. But given read_all -> Result<Vec<u8>>, some allocation is always performed. So this imposes some additional cost over using the raw read function. RFC 517 says:

...the abstractions must be safe, and they should be as high-level as possible without imposing a tax...

So this definition of read_all, to me, seems like the only alternative that is safe but does not impose a tax.

That said, I didn't have to write byteorder, but I do have to write my application, so from a purely selfish perspective, a read_all returning a Vec would be simpler. But I don't think it is a good idea.

In particular, we used to have some convenience methods like this (see read_exact and read_at_least), but the consensus was that these weren't pulling their weight.

read_at_least seemed to be a complicated optimization for an uncommon use case, so I can definitely see why it was pulled. It looks like read_exact was pulled for safety reasons. But my proposed read_all is not unsafe, so that is no objection.

Here's one scenario where I would use read_all: reading git pack files: https://www.kernel.org/pub/software/scm/git/docs/technical/pack-format.txt

(deltified representation)
...
20-byte base object name
....

If I'm reading the "20-byte base object name", I want to use read_all because (a) a premature eof is an error and (b) I know exactly how many bytes I want to read (and probably have a buffer for it). Looking at the git source code, the approximately equivalent function ("read_in_full") is called a couple dozen times. A few are for read_to_end scenarios, but most are for cases where users simply don't want to have to deal with EINTR.

@nwin
Copy link

nwin commented May 2, 2015

👍 I was just about the write the same RFC.

@ArtemGr
Copy link

ArtemGr commented May 2, 2015

Given read_all(&mut [u8]), users can always allocate a Vec of the correct size and fill it up.

I'd much prefer this version to look like read_all (&mut Write, usize), in this way one can stream from Read to Write seamlessly.

@nwin
Copy link

nwin commented May 2, 2015

I'd much prefer this version to look like read_all (&mut Write, usize), in this way one can stream from Read to Write seamlessly.

That is a completely different functionality. What you are proposing is a different (admittedly better) signature for read_to_end(). read_all(&mut [u8]) is about reading exactly n bytes. It is the equivalent of Write::write_all.

@ArtemGr
Copy link

ArtemGr commented May 2, 2015

No, I'm proposing reading exactly n bytes too, hence the second argument.

Admittedly, the Write version isn't as convenient to use with slices. Perhaps there's a space for both the slice and the Write variants.

@nwin
Copy link

nwin commented May 2, 2015

Ahh I see, I was confused by the name. One should better call it read_exact then (or maybe stream?). But I think this is worth a new RFC.

write_all.

Or we could leave this out, and let every Rust user write their own
read_all function -- like savages.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is unnecessary. Cargo makes us a highly developed distributed culture. With minimal investment needed to contribute & make a difference.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is unnecessary. Cargo makes us a highly developed distributed culture.

If you mean we don't need a standard library because we have Cargo then I disagree strongly. Putting things off into crates was a strategy used to reach a stable 1.0 library in time, but the standard library should catch up eventually. If you do a thing, you should do it good. If standard library does reading and writing, then it should cover such basics as filling an entire buffer in face of interrupts.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ArtemGr Completely agree

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm disagreeing with saying we're savages if it's not in libstd. And we don't have to write it ourselves -- it's easy to reuse code using cargo, very easy. Not saying that's a replacement but an augmentation.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not savages, but babies. = )
It's 1.0 beta and there is space to grow.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand why you do this, calling people names is beside the point of the RFC and doesn't help. I'll reply with some on-topic questions in the main discussion.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Huuuuuge -1 for 'like savages' here.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

calling people names is beside the point of the RFC and doesn't help

Sorry it didn't help, @bluss. I had to try. = }

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So ignoring the last line of this RFC. As has already been pointed out there is a discrepancy between reading/writing with the std::io as it currently stands. I comments on here would also reenforce that its already catching a number of us out. Simply deferring to cargo for a single function will also create problems. Not only in the cost of tracking down the correct crate (maybe byteorder in this case?) but also the cost of pull in all of its dependencies etc.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The crate can be designed for that use case though: small and without dependencies. Using cargo is super easy IMO. Since crates may even re-export other crates, you can even ask authors to split out features to sub crates (when it's possible).

@bluss
Copy link
Member

bluss commented May 2, 2015

There are already two ways that I know of to have Rust perform the read loop to fill a buffer. One using Vec and one to a stack-allocated buffer. We employ composition (Take adaptor) here, which is commonly favoured in Rust.

I understand the I/O API will grow, but to formulate the way forward we need to understand what we already have and the problems with it. A complete RFC should certainly address it. To the code: (playpen link)

use std::io::{self, Read};

fn main() {
    const N: usize = 32;

    let data: &[u8] = &[1u8; 1024];

    // Use std::io::copy to fill a stack buffer
    // Documented to retry and continue on interrupt.
    let mut buf = [0u8; N];
    let res = io::copy(&mut data.take(N as u64), &mut &mut buf[..]);
    println!("{:?}", res);
    println!("{:?}", buf);

    // Use .read_to_end to read to a vec.
    // Documented to retry and continue on interrupt.
    let mut vec = Vec::with_capacity(N);
    let res = data.take(N as u64).read_to_end(&mut vec);
    println!("{:?}", res);
    println!("{:?}", vec);
}

@ArtemGr
Copy link

ArtemGr commented May 2, 2015

Didn't know we have copy. And the take does seem to do the job. Looks like we have it all covered.

@markuskobler
Copy link

@bluss thanks for following up with an example. But is the big downside of using copy/take over the proposed rfc is the internal allocation http://doc.rust-lang.org/1.0.0-beta.3/src/std/io/util.rs.html#31-44 that would not be needed with https://github.com/rust-lang/rust/pull/23369/files#diff-668f8f358d4a93474b396dcb3727399eR202

@novalis novalis force-pushed the novalis/read-all branch from acd01bc to c7f633b Compare May 3, 2015 02:27
@novalis
Copy link
Contributor Author

novalis commented May 3, 2015

@bluss I tried take when I was first writing this code, but it seems that File.take works differently than &[u8].take (?). I took your sample and rewrote it to use File: https://gist.github.com/novalis/160eae78e90900af0f14

When I go to compile it, I get:

test1.rs:29:15: 29:19 error: use of moved value: `file`
test1.rs:29     let res = file.take(N as u64).read_to_end(&mut vec);
                          ^~~~
test1.rs:22:29: 22:33 note: `file` moved here because it has type `std::fs::File`, which is non-copyable
test1.rs:22     let res = io::copy(&mut file.take(N as u64), &mut &mut buf[..]);
                                        ^~~~
error: aborting due to previous error

I probably just don't know rust.

@nwin
Copy link

nwin commented May 3, 2015

@bluss: Both of your examples fail to fail (link) if you cannot take N bytes from data. That is the main point of read_all.

@bluss
Copy link
Member

bluss commented May 4, 2015

@novalis You'd have to use .by_ref() with a file. I'm not advocating the specific io::copy API because it's very tricky to call.

@nwin Is it the main point? The first step is to perform the retry loop at all, I guess .read_to_end() is the best/only viable way to do that in Rust. To fail if less is read, I think the name .read_exact() is better in that case.

@markuskobler Yes, io::copy is clearly deficient. The way its API works is crummy, and it uses a big buffer, so it's not for this use case at all.

@novalis
Copy link
Contributor Author

novalis commented May 5, 2015

.by_ref().take(N) works for me. Sure, I have to check how many bytes have been read, but I would have to do that somehow anyway.

@novalis novalis closed this May 5, 2015
@nwin
Copy link

nwin commented May 5, 2015

@bluss: my main point is the asymmetry between Read and Write. If we have write_all we should also have read_all. It is equally usable.

@markuskobler
Copy link

@nwin 👍

@cesarb
Copy link
Contributor

cesarb commented Jul 16, 2015

@gsingh93 I'd prefer to merge it here so as to not split the discussion, if @novalis is okay with it. This is the first time I meddle with a Rust RFC, so I'm not sure what the correct procedure would be. @alexcrichton , @aturon , any advice?

@gsingh93
Copy link

@cesarb You can submit a PR to https://github.com/novalis/rfcs/tree/novalis/read-all, and if it gets merged it should show up here I think.

@alexcrichton
Copy link
Member

I'm fine whichever way you prefer @cesarb, merging into here would involve doing what @gsingh93 mentioned (sending a pr to @novalis), and then it's up to @novalis to merge it or not.

Make this RFC be again about a single method
@novalis
Copy link
Contributor Author

novalis commented Jul 17, 2015

Ok, I merged your PR.

@alexcrichton
Copy link
Member

Thanks @novalis and @cesarb! I think this will probably be ready to go back into FCP next week.

@cesarb
Copy link
Contributor

cesarb commented Jul 20, 2015

I added an explanation for why the contents of buf should be undefined on failure, and wrote a proposed implementation (with working tests). I will not submit a pull request for the proposed implementation until this RFC becomes "active", I wrote it only to help the FCP discussion; it should be visible in the timeline right above this comment (in case it isn't visible, it's at cesarb/rust@0a1b44a).

@cesarb
Copy link
Contributor

cesarb commented Jul 20, 2015

And, just for fun, an example of how this would be used in real life:

extern crate byteorder;

use byteorder::{BigEndian, ReadBytesExt, WriteBytesExt};
use std::io::{Read, Write, Result};

struct Header {
    tag: [u8; 16],
    number: u64,
}

impl Header {
    pub fn read<R: Read>(reader: &mut R) -> Result<Self> {
        let mut tag = [0; 16];
        try!(reader.read_exact(&mut tag));
        let number = try!(reader.read_u64::<BigEndian>());
        Ok(Header { tag: tag, number: number })
    }

    pub fn write<W: Write>(&self, writer: &mut W) -> Result<()> {
        try!(writer.write_all(&self.tag));
        try!(writer.write_u64::<BigEndian>(self.number));
        Ok(())
    }
}

@gsingh93
Copy link

@alexcrichton Can this be put back into it's FCP? I'm hoping to get this in the 1.3 beta so I don't want it to get too delayed.

@alexcrichton
Copy link
Member

Thanks for the ping @gsingh93, we were actually just in the libs triage meeting and we did indeed decide to put this back in FCP. This won't make the 1.3 window as #[stable] either way, though, unfortunately.

Anyway, this RFC is now entering its week-long final comment period.

@alexcrichton alexcrichton added the final-comment-period Will be merged/postponed/closed in ~10 calendar days unless new substational objections are raised. label Jul 29, 2015

``` rust
fn read_exact(&mut self, mut buf: &mut [u8]) -> Result<()> {
while !buf.is_empty() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like this while should just be a loop?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like this while should just be a loop?

If it were a loop, we would always do one extra call of self.read with a zero-length buf before exiting the loop. That's wasteful and could confuse poorly-implemented Read instances.

(Example: we had 10 bytes left to read, and self.read(buf) returned Ok(10). Now buf has length zero, and we go again to the top of the loop. Without the test for !buf.is_empty(), we would call again self.read(buf) with a zero-length buf.)

@rkjnsn
Copy link
Contributor

rkjnsn commented Aug 6, 2015

I'm okay, splitting these methods up for easier discussion/approval. However, as I've said before, if we were to only get one, I'd rather have read_full than read_exact as it covers most of the convenience while allowing several additional use cases. Thus, my support of this RFC in its current form is contingent on assurance that accepting it won't decrease the probability of a future read_full RFC being approved.

@aturon aturon merged commit 67491eb into rust-lang:master Aug 7, 2015
@aturon
Copy link
Member

aturon commented Aug 7, 2015

The libs team, after another round of discussion, has decided to merge this RFC as written. Thanks for the detailed writeup of the design tradeoffs in particular!

Tracking issue

cesarb added a commit to cesarb/rust that referenced this pull request Aug 24, 2015
This implements the proposed "read_exact" RFC
(rust-lang/rfcs#980).
bors added a commit to rust-lang/rust that referenced this pull request Aug 30, 2015
This implements the proposed "read_exact" RFC (rust-lang/rfcs#980).

Tracking issue: #27585
@Centril Centril added the A-input-output Proposals relating to std{in, out, err}. label Nov 23, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-input-output Proposals relating to std{in, out, err}. final-comment-period Will be merged/postponed/closed in ~10 calendar days unless new substational objections are raised. T-libs-api Relevant to the library API team, which will review and decide on the RFC.
Projects
None yet
Development

Successfully merging this pull request may close these issues.