Skip to content

Conversation

@wchargin
Copy link
Contributor

@wchargin wchargin commented Dec 1, 2020

Summary:
If tensorboard is to launch RustBoard as a subprocess rather than
relying on the user to launch it concurrently, then we should try hard
not to leave around a zombie server. An easy solution is an atexit
handler in Python TensorBoard, but this doesn’t handle the case when
TensorBoard crashes or gets SIGTERMed. On Linux, we can be notified when
our parent dies by setting the parent-death signal (man 2 prctl), but
this isn’t portable. On portable Unix, the child can poll its PPID to
see when it changes to 1 or the PID of some subreaper, but this isn’t
portable to Windows, which doesn’t update PPID when the parent dies.

Windows is not my strong suit, but some web searching didn’t turn up an
easy and clean solution portable to both Windows and Unix (any Windows
internals like “job objects” are a non-starter, since I can’t test
them). The one thing that I would expect to work everywhere is “just
wait until stdin closes and then exit”. If even that doesn’t work on
Windows, well, we can burn that bridge when we come to it.

Test Plan:
Build bazel build //tensorboard/data/server, then write a simple
Python driver:

import subprocess
import time

server = "bazel-bin/tensorboard/data/server/server"
p = subprocess.Popen(
    [server, "--logdir", "/tmp/nonexistent", "-v", "--die-after-stdin"],
    stdin=subprocess.PIPE,
)
print(p.pid)
time.sleep(2)

Run it with python test.py, and note that after 2 seconds the server
prints “Stdin closed; exiting”. Run it again, and suspend (^Z) the
process before it finishes sleeping. Run kill -SIGCONT CHILD_PID with
the PID printed by the Python script to resume server execution; this is
just needed because your ^Z propagates to all processes in the group.
Note that the server continues to print logs as it starts and finishes
new load cycles. Then, run fg and wait for the sleep to complete, and
note that the server again exits, as desired.

wchargin-branch: rust-die-after-stdin

Summary:
If `tensorboard` is to launch RustBoard as a subprocess rather than
relying on the user to launch it concurrently, then we should try hard
not to leave around a zombie server. An easy solution is an `atexit`
handler in Python TensorBoard, but this doesn’t handle the case when
TensorBoard crashes or gets SIGTERMed. On Linux, we can be notified when
our parent dies by setting the parent-death signal (`man 2 prctl`), but
this isn’t portable. On portable Unix, the child can poll its PPID to
see when it changes to `1` or the PID of some subreaper, but this isn’t
portable to Windows, which doesn’t update PPID when the parent dies.

Windows is not my strong suit, but some web searching didn’t turn up an
easy and clean solution portable to both Windows and Unix (any Windows
internals like “job objects” are a non-starter, since I can’t test
them). The one thing that I would expect to work everywhere is “just
wait until stdin closes and then exit”. If even that doesn’t work on
Windows, well, we can burn that bridge when we come to it.

Test Plan:
Build `bazel build //tensorboard/data/server`, then write a simple
Python driver:

```python
import subprocess
import time

server = "bazel-bin/tensorboard/data/server/server"
p = subprocess.Popen(
    [server, "--logdir", "/tmp/nonexistent", "-v", "--die-after-stdin"],
    stdin=subprocess.PIPE,
)
print(p.pid)
time.sleep(2)
```

Run it with `python test.py`, and note that after 2 seconds the server
prints “Stdin closed; exiting”. Run it again, and suspend (`^Z`) the
process before it finishes sleeping. Run `kill -SIGCONT CHILD_PID` with
the PID printed by the Python script to resume server execution; this is
just needed because your `^Z` propagates to all processes in the group.
Note that the server continues to print logs as it starts and finishes
new load cycles. Then, run `fg` and wait for the sleep to complete, and
note that the server again exits, as desired.

wchargin-branch: rust-die-after-stdin
wchargin-source: ad925e07a2106ee37d268e32c1b997606326ae99
@wchargin wchargin added the core:rustboard //tensorboard/data/server/... label Dec 1, 2020
@google-cla google-cla bot added the cla: yes label Dec 1, 2020
@wchargin wchargin requested a review from stephanwlee December 1, 2020 02:07
Copy link
Contributor

@stephanwlee stephanwlee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM but am wondering if there is an other alternative.

fn die_after_stdin() {
let stdin = std::io::stdin();
let stdin_lock = stdin.lock();
for _ in stdin_lock.bytes() {}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My buffer of comment went away after some GitHub page navigation :(

Are you intending to use stdin: PIPE on the parent process side? It feels a bit too key logger like for my liking. Just wondering, if we manually pass a stream to the stdin and rely on it closing upon crash?

Copy link
Contributor Author

@wchargin wchargin Dec 1, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My buffer of comment went away after some GitHub page navigation :(

That sucks; sorry to hear that. :-(

Yep, stdin=PIPE. I’ve just finished a patch for this; it works great.
(edit: That patch is now in #4411.)

Not sure what you mean by keylogger-like? With stdin=PIPE, a new pipe
is created, and it’s never connected to a tty or other input source. We
just use it to send a single bit of metadata—the “close” signal.

In other words, from the subprocess’s perspective it’s stdin, but it’s
not the user’s stdin.

Is this about privacy? If TensorBoard wanted to log your keystrokes, we
could just do that, with no need for cloak and dagger.

Opening a pipe with input stream other than 0 (stdin) is harder and less
portable, which is why I’ve avoided it. Reading from it on the Rust side
requires using unsafe, since you’re taking an arbitrary int and
asserting that it corresponds to an fd over which you have exclusive
ownership, which there is no way to verify. And on the Python side you
need, like, close_fds=False, and you probably need to think more about
what Windows means by “file handle”… it seems more complicated, and
I don’t see a real benefit from it.

edit to add: FromRawFd is also Unix-only in Rust, which suggests
to me that it just doesn’t work in Windows, presumably because handles
are different enough from fds that it doesn’t make sense.

I am certainly open to alternatives. My criteria are that the solution:
(a) be portable without too much special casing, (b) not require special
testing on Windows, and (c) avoid unsafe in Rust if possible. I think
that this approach does well on (a) and (c), and does okay on (b);
ideally, I would like to test this on Windows at least once, but it’s
simple enough that I’m crossing my fingers that it just works. If you
can beat some or all of those axes, I’d love to hear it!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure what you mean by keylogger-like? With stdin=PIPE, a new pipe
is created, and it’s never connected to a tty or other input source. We
just use it to send a single bit of metadata—the “close” signal.

In other words, from the subprocess’s perspective it’s stdin, but it’s
not the user’s stdin.

Understand that but what does it get PIPEed to? In Node-ism, isn't this akin to process.stdin.pipe(subprocess.stdin)?

Is this about privacy? If TensorBoard wanted to log your keystrokes, we
could just do that, with no need for cloak and dagger.

Yes, I don't care if TensorBoard can do it today or whatever. More complex system that spans multiple process is harder to introspect to understand that and that is the problem at stake.

Others
No, I don't mean you should use file descriptor but merely create a stream object that you use to detect the closedness from the Rust process just like you have written so. Since I am not asking you to open up side stream other than the default stdin and stdout, there is no point in discussing the unsafe Rust-ism, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In Node-ism, isn't this akin to process.stdin.pipe(subprocess.stdin)?

It’s not process.stdin.pipe (i.e., piping the parent process’s stdin),
no. It’s analogous to:

const child_process = require("child_process");
p = child_process.spawn("cat"); // stdin=PIPE is default in Node
p.stdout.on("data", (data) => void console.log("from cat: " + data));
p.stdin.write("hey\n");
// prints: "from cat: hey"
p.stdin.destroy();

I’ve searched for a bit to try to find a pure-Node “create a pipe”
equivalent but have come up empty. The hypothetical Node would be
roughly equivalent to:

// Create a new anonymous pipe with a read end and a write end connected
// to each other but not to anything else.
const [read, write] = newPipe();
read.pipe(subprocess.stdin);

The C example of what’s happening (ignoring error handling) is:

int pipefd[2];
pipe(pipefd); // new anonymous pipe from `pipefd[1]` to `pipefd[0]`
pid_t child = fork();
if (child < 0) abort(); // error in fork
if (child > 0) {
    // we're the parent
    close(pipefd[0]); // don't need to read from the pipe
    write(pipefd[1], "hey\n", 4);
    close(pipefd[1]); // send eof
    waitpid(child, NULL, 0);
    puts("parent done");
} else {
    // we're the child
    close(pipefd[1]); // don't need to write to the pipe
    dup2(pipefd[0], 0); // replace child stdin with the pipe's output
    execl("/bin/sed", "sed", "s/^/in child: /", NULL);
}

The key bits are the call to pipe(pipefd), which creates a new
anonymous pipe, and dup2(pipefd[0], 0), which replaces the child’s
stdin with the other end of the pipe.

Does this help answer “what does it get PIPEed to”? I think the
simplest answer really is just “nothing”.

No, I don't mean you should use file descriptor but merely create a
stream object that you use to detect the closedness from the Rust
process just like you have written so. Since I am not asking you to
open up side stream other than the default stdin and stdout, there
is no point in discussing the unsafe Rust-ism, right?

I don’t think I understand this. What does it mean to “create a stream
object” without “opening up a side stream”? Or, perhaps, what you mean
is what’s already happening at that pipe(pipefd).

Some more references, in case they help things click:

  • The Popen constructor says: “PIPE indicates that a new pipe
    to the child should be created […] With the default settings of
    None, no redirection will occur” (emphasis mine).
  • Node’s child_process says: “'pipe': Create a pipe between
    the child process and the parent process”; this is the default,
    rather than “'inherit': Pass through the corresponding stdio
    stream to/from the parent process” (emphasis mine).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also—

Yes, I don't care if TensorBoard can do it today or whatever. More complex system that spans multiple process is harder to introspect to understand that and that is the problem at stake.

—yep, that’s fair.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Chatted offline. We think we’re talking about the same things: a Node
“stream” is a pipe with some associated machinery for handling events
and multicasting, and we’re using just the “pipe” part of that. Part of
the confusion was that subprocess.PIPE means “create new pipe (noun)”
rather than “pipe (verb) subprocess stdin to parent stdin”, which would
be “inherit” (stdin=None in Python).

wchargin-branch: rust-die-after-stdin
wchargin-source: 0be687779c1709eedf2114e402a4aebe7f35a647
@wchargin wchargin merged commit 96654d0 into master Dec 1, 2020
@wchargin wchargin deleted the wchargin-rust-die-after-stdin branch December 1, 2020 19:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cla: yes core:rustboard //tensorboard/data/server/...

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants