Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Flaky stderr of child processes with libgreen #12827

Closed
alexcrichton opened this issue Mar 11, 2014 · 1 comment · Fixed by #13619
Closed

Flaky stderr of child processes with libgreen #12827

alexcrichton opened this issue Mar 11, 2014 · 1 comment · Fixed by #13619
Labels
A-runtime Area: std's runtime and "pre-main" init for handling backtraces, unwinds, stack overflows

Comments

@alexcrichton
Copy link
Member

These two programs replicate the flakiness which I've seen on bors for the past few months. This flakiness can also be seen on travis.

This test only fails on linux, not on OSX. It also only fails once every 2000 child processes or so.

// foo.rs
use std::io::Process;
use std::str;

fn main() {
    static N: int = 10;
    let (tx, rx) = channel();
    for _ in range(0, N) {
        let tx = tx.clone();
        spawn(proc() {
            let p = Process::output("./bar", []).unwrap();
            assert!(p.status.matches_exit_status(101), "{}", p.status);
            let out = str::from_utf8(p.output).unwrap();
            if out != "" {
                println!("{}", out);
            }
            assert_eq!(str::from_utf8(p.error).unwrap(),
                       "task '<main>' failed at 'test', bar.rs:2\n");
            tx.send(());
        });
    }
    drop(tx);
    for _ in range(0, N) { rx.recv() }
}


// bar.rs
fn main() {
    fail!("test");
}

It's key that within one instance of foo many instances of bar are run in parallel. I'm not quite sure why just yet.

$ for i in {0..1000}; do echo $i; ./foo || break; done
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
task '<unnamed>' failed at 'assertion failed: `(left == right) && (right == left)` (left: `task '<main>' failed at '
test', `, right: `task '<main>' failed at 'test', bar.rs:2
`)', foo.rs:12
task '<main>' failed at 'receiving on a closed channel', /home/alex/rust/src/libstd/comm/mod.rs:507
@alexcrichton
Copy link
Member Author

Brain dump of what I know so far


The basic bug is that often when using libuv, our child subprocesses
have their stderr/stdout file descriptors closed too early, so the
output we read from them is truncated. This basically ends up in lots
of failing tests because the output was "te" instead of "test".

I haven't been able to minimize it in just libuv C code yet. Some key things I've noticed
so far

  • The child must write() multiple times to the output file descriptor
  • I've only reproduced when the parent is multithreaded
  • The child doesn't need to run libuv to expose the bug
  • This only happens on linux
  • The child is receiving an EPIPE when writing to the stderr file descriptor

After investigation with strace on linux, I've deduced the following
set of events:

  • The parent partially reads data from the child pipe
  • The parent then invokes uv__stream_eof because it thinks the pipe is dead
  • The parent then goes through the rust rigamarole to close the stream
  • The child then writes to the stderr file descriptor, receiving EPIPE

I'm not super familiar with epoll, but it looked like the parent was
receiving "POLLHUP | POLLIN" for the child pipe, causing it to trigger
this code: https://github.com/rust-lang/libuv/blob/master/src/unix/stream.c#L1139-L1151

I'm not 100% certain if that code is relevant to this, but I did find
that commenting the code out made the tests reliably past, so it seems
like it may be closing the stream prematurely sometimes.


If it helps, I was able to get this strace output:
https://gist.github.com/anonymous/9751330 and the most relevant
portions are https://gist.github.com/anonymous/9751326

There's a whole lot of stuff there. I modified the child process to
write() a few times, and exit with 101 if it succeeded in write, and
exit with 100 if it failed in write(). At
https://gist.github.com/anonymous/9751326#file--L329 is where the
child exits with 100 because the previous write returned EPIPE (errno
-32).

Things that I managed to piece together. You'll see a bunch of write()
syscalls on fd 0, but I injected those. I added them so they'd show up
in the strace output and the input pointer is the bit pattern that I
was looking for (and corresponded to various different events).

  1. The parent (pid 2235) called socketpair(), creating fds (79, 82)
  2. The child dup'd 82 into its stderr (fd 2), then closed 79
  3. The parent closed 82
  4. The parent saw EPOLLIN|EPOLLHUP on fd 79
  5. The parent read some data
  6. The parent's uv callback was invoked with the partial data
  7. The parent's uv callback was invoked with UV_EOF
  8. The parent closed fd 79
  9. The child attempted to write again to fd 79, resulting in EPIPE

[1] - https://gist.github.com/anonymous/9751326#file--L1
[2] - https://gist.github.com/anonymous/9751326#file--L26-L30
[3] - https://gist.github.com/anonymous/9751326#file--L94
[4] - https://gist.github.com/anonymous/9751326#file--L287
[5] - https://gist.github.com/anonymous/9751326#file--L311-L312
[6] - https://gist.github.com/anonymous/9751326#file--L313-L314
[7] - https://gist.github.com/anonymous/9751326#file--L315-L316
[8] - https://gist.github.com/anonymous/9751326#file--L319-L321
[9] - https://gist.github.com/anonymous/9751326#file--L326-L327

The first longer gist has a bunch of other successful children, the
smaller gist just focuses on pids 2235 and 2243 in what I thought were
the relevant areas.

bors added a commit that referenced this issue Apr 24, 2014
This update brings a few months of changes, but primarily a fix for the
following situation.

When creating a handle to stdin, libuv used to set the stdin handle to
nonblocking mode. This would end up affect this stdin handle across all
processes that shared it, which mean that stdin become nonblocking for everyone
using the same stdin. On linux, this also affected *stdout* because stdin/stdout
roughly point at the same thing.

This problem became apparent when running the test suite manually on a local
computer. The stdtest suite (running with libgreen) would set stdout to
nonblocking mode (as described above), and then the next test suite would always
fail for a printing failure (because stdout was returning EAGAIN).

This has been fixed upstream, joyent/libuv@342e8c, and this update pulls in this
fix. This also brings us in line with a recently upstreamed libuv patch.

Closes #12827
Closes #13336
Closes #13355
fasterthanlime pushed a commit to fasterthanlime/rust that referenced this issue Jul 22, 2022
internal: Construct fewer `AstIdMap`s in lowering
flip1995 pushed a commit to flip1995/rust that referenced this issue Jun 13, 2024
`lint_groups_priority`: ignore lints & groups at the same level

Fixes rust-lang#12270

changelog: none
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-runtime Area: std's runtime and "pre-main" init for handling backtraces, unwinds, stack overflows
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant