Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

process::Command data loss when using Stdio::inherit for stdin and stdin has been read from #97855

Open
jgoerzen opened this issue Jun 8, 2022 · 6 comments
Labels
A-process Area: `std::process` and `std::env` C-bug Category: This is a bug. T-libs Relevant to the library team, which will review and decide on the PR/issue.

Comments

@jgoerzen
Copy link

jgoerzen commented Jun 8, 2022

I tried this code:

Command::new(command).
...
.stdin(Stdio::inherit())

In this case, I had already read a few bytes from stdin using stdin() and read_exact. strace showed that it read 8K from stdin, and that the command started missed the remainder of the 8K. #58326 discusses this in a roundabout way, but fundamentally it should be safe to read from stdin and then subsequently use it as input to a Command. The workaround was to use .stdin(Stdio::piped()) and copy the data across the pipe. Not ideal.

Meta

rustc --version --verbose:

rustc 1.60.0 (7737e0b5c 2022-04-04)
binary: rustc
commit-hash: 7737e0b5c4103216d6fd8cf941b7ab9bdbaace7c
commit-date: 2022-04-04
host: x86_64-unknown-linux-gnu
release: 1.60.0
LLVM version: 14.0.0
@jgoerzen jgoerzen added the C-bug Category: This is a bug. label Jun 8, 2022
@the8472
Copy link
Member

the8472 commented Jun 8, 2022

There's #78515 for switchable buffering for stdout. A similar API to switch to unbuffered IO could added for stdin.

@jgoerzen
Copy link
Author

jgoerzen commented Jun 8, 2022

That would provide a more elegant workaround for sure, but as a resolution it would be incomplete, since the default situation would still lead to data loss.

I would argue the default should be unbuffered, and a buffered stdin should be wrapped in BufReader by the user, that being a type that would be unsuitable for conversion to Stdio for a Command. That would go a long way towards preventing the user from doing something that is going to result in data loss.

@the8472
Copy link
Member

the8472 commented Jun 8, 2022

The data isn't lost, it's in the buffer. io::stdin being buffered by default is intentional because it provides a better experience for most of the use-cases which don't involve passing the input to another process.

And this is documented behavior

Each handle returned is a reference to a shared global buffer whose access is synchronized via a mutex.

@jgoerzen
Copy link
Author

jgoerzen commented Jun 8, 2022

I understand that, but being in the buffer is tantamount to being lost, because the buffer is not present in the input to the child process. I can't think of any use case where I would want to:

  1. Read some data from stdin, including up to 8K-1 bytes beyond what I was interested in (note that this size is implicit and not directly detectable or controllable by the user)
  2. Call a child process that is going to read from stdin starting beyond the point where the buffering code did the read
  3. Then process that extra data in the buffer

stdin is just another fd on Unix; I find it very odd that it is treated so differently (behind a mutex and all that) in Rust. But it's true I don't have to care about that so long as it actually works properly -- which I don't believe it does in this case.

@the8472
Copy link
Member

the8472 commented Jun 8, 2022

Currently you can access the underlying file directly this way:

let inp = stdin().lock();
let inp = ManuallyDrop::new(unsafe { File::from_raw_fd(inp.as_raw_fd()) });
inp.read_exact(&mut buf)?;
drop(inp);

// spawn command here.

stdin is just another fd on Unix; I find it very odd that it is treated so differently (behind a mutex and all that) in Rust.

The underlying syscalls are, sure. But most languages provide buffering and possibly thread-safety around stdio to avoid line-tearing and to speed up small reads/writes.

Even libc has buffering. https://man7.org/linux/man-pages/man3/setbuf.3.html

@jgoerzen
Copy link
Author

jgoerzen commented Jun 9, 2022

You are correct that this issue does exist in other languages in some cases; for instance, this from the popen(3) manpage on Linux:

       Since the standard input of a command opened for reading shares its seek offset with the process that  called  popen(),  if  the
       original  process  has  done a buffered read, the command's input position may not be as expected.  Similarly, the output from a
       command opened for writing may become intermingled with that of the original process.  The latter  can  be  avoided  by  calling
       fflush(3) before popen().

However, there are several differences to note:

  1. At least in my experience, when dealing with pipes/forks/etc, the usual way is to stick completely with the POSIX API (read/write/etc) rather than the ANSI C API (fdopen, fread, fwrite, etc) because of the oft-undesired side-effects of the C API in conjunction with these things (of which this is but one). This does not result in application-level buffering (there may be kernel-level buffering especially for terminals, but that is harmless in this case)
  2. The issue can be worked around with setbuf as you mention
  3. AFAICT, the issue is completely undocumented in Rust.

It may be too late for this wrt backwards compatibility, but if I were designing this, I wouldn't make stdin/out/err be special cases; BufReader and BufWriter could be used for them just like anything else if people want. But we are where we are.

So what to do about it?

  1. The problem should be documented at least in the stdin, Stdio, and Command locations.
  2. Providing a setting for it as you mentioned would be a workaround (a stdin version of Switchable buffering for Stdout #78515)
  3. Providing access to the raw unbuffered object (not fd, but the impl Read) would also be a workaround (as in Expose raw Stdout/err/in #58326). I think this has some potential advantage in allowing certain types to only accept unbuffered objects, forcing people to think about this... But then, this may be more challenging backwards-compatibility wise.
  4. Coincidentally with any of the above, we could also change the default buffering for stdin. AFAICT in the C API on Linux, stderr is unbuffered, stdout is line buffered, and buffering on stdin is undefined. (Of course, no userland buffering with the POSIX API). It is easy enough for stdout/stderr to be flushed before being used in a subprocess, but there's no syscall to "unread" stuff from stdin.

Fundamentally people shouldn't have to resort to using strace, as I did, to discover that read_exact didn't do what it said on the tin.

I question about your ManuallyDrop use in the example above. I assume this is because when the File::from_raw_fd is dropped, it closes the underlying fd, which may not be desired. In that case, probably the explicit drop should occur after the spawn? (And I believe that if there was an error from read_exact, there would be a memory leak there, right?)

Perhaps an alternative would be to explicitly std::mem::drop the File after the spawn?

@workingjubilee workingjubilee added the A-process Area: `std::process` and `std::env` label Jul 22, 2023
@ChrisDenton ChrisDenton added T-libs Relevant to the library team, which will review and decide on the PR/issue. and removed needs-triage-legacy labels Nov 6, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-process Area: `std::process` and `std::env` C-bug Category: This is a bug. T-libs Relevant to the library team, which will review and decide on the PR/issue.
Projects
None yet
Development

No branches or pull requests

4 participants