
Recv from channel hangs forever if other party is gone #12

Open
stepancheg opened this issue Sep 1, 2016 · 11 comments

Comments

@stepancheg

stepancheg commented Sep 1, 2016

Code:

```rust
extern crate tokio_core;
extern crate futures;

use tokio_core::Loop;

use futures::stream::Stream;

#[test]
fn recv_after_close() {
    let mut lp: Loop = Loop::new().unwrap();

    let (sender, receiver) = lp.handle().channel::<u32>();

    let receiver = lp.run(receiver).unwrap();

    drop(lp);
    drop(sender);

    receiver.wait().next();
}
```

This test hangs. I cannot tell for sure, but it seems like it worked (i.e. did not hang) two days ago.

@alexcrichton
Contributor

Aha, after looking into this I think it's relatively expected behavior. When an I/O object (like this channel) is converted to a blocking future, the only way it's guaranteed to get resolved is if the event loop is running in the background somewhere. Here, though, the event loop isn't running, so I believe it was just by luck that this resolved before.

If you change this to `lp.run(receiver.into_future())` instead of `receiver.wait()`, it works out.

@stepancheg
Author

It doesn't help:

```rust
extern crate tokio_core;
extern crate futures;

use tokio_core::Loop;

use futures::stream::Stream;

#[test]
fn recv_after_close() {
    let mut lp: Loop = Loop::new().unwrap();

    let (sender, receiver) = lp.handle().channel::<u32>();

    let receiver = lp.run(receiver).unwrap();

    drop(lp);
    drop(sender);

    let mut lp2 = Loop::new().unwrap();
    lp2.run(receiver.into_future());
}
```

Still hangs.

The problem with the current behavior (IMHO) is that if the event loop is terminated for some reason (e.g. because of an error), peers in other threads waiting for completion of some operation (e.g. a response from a server) simply hang instead of getting an error.

Because a channel is bound to an event loop, I think destruction of the event loop should invalidate all associated objects (channels), and any operation on those channels should produce an error.

@alexcrichton
Contributor

Oh sorry, to clarify: I mean that for an I/O object to work, the event loop it came from must still be alive somewhere. That is, because you drop `lp`, the receiver here won't work regardless of what happens next.

How come the event loop is dropping early for you? It's intended that event loops are very long lived (for the entire lifetime of a program), and it's only cleaned up once the entire world is being torn down.

@stepancheg
Author

stepancheg commented Sep 2, 2016

How come the event loop is dropping early for you?

I can think of two scenarios:

  • A panic, a logic error, or similar. The program should terminate properly even in that case.
  • Event loop is created to work with single client connection. When connection is closed by the server, event loop terminates, and all sessions associated with the connection should die. In my case it is grpc client, which works with single connection and serves multiple concurrent requests initiated by other threads.

@alexcrichton
Contributor

Ah ok, it looks like we may be working with different models of when event loops are created, then. The `Loop::new()` function is quite expensive in terms of resource allocation, so it's not intended to be called once per client. Instead, each client is intended to run inside the same event loop as all the others.

Additionally, it's assumed that panics in each client are isolated from all others. The `catch_unwind` combinator exists to catch a panic during a poll, and this would be used on the outer edge to make sure it doesn't take down the event loop.
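The isolation described above can be illustrated with the standard library's `std::panic::catch_unwind`, which the futures combinator builds on. This is a minimal stdlib sketch of the mechanism, not tokio-specific code:

```rust
use std::panic;

fn main() {
    // Simulates isolating one "client" task: a panic inside the closure
    // is converted into an Err instead of unwinding into the caller,
    // so the surrounding "event loop" keeps running.
    let result = panic::catch_unwind(|| {
        panic!("client task failed");
    });
    assert!(result.is_err()); // the panic was contained
    println!("event loop still running");
}
```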

@stepancheg
Author

stepancheg commented Sep 2, 2016

The `Loop::new()` function is quite expensive in terms of resource allocation, so it's not intended to be called once per client.

Suppose your client is a command-line database client. That database client has a single connection per process, so it has an event loop that serves that single connection. All connection-protocol handling logic happens inside that event loop, while high-level synchronous blocking logic (like reading from the file system) may happen in other threads, and they communicate via channels. If the event loop dies, the other threads blocking on channels need to be notified of that fact.

Additionally, it's assumed that panics in each client are isolated from all others. The `catch_unwind` combinator exists to catch a panic during a poll, and this would be used on the outer edge to make sure it doesn't take down the event loop.

Catching a panic won't help if the future itself completes with an error without panicking. Or simply completes successfully because there are no more things to do. Have a look at this example:

```rust
extern crate tokio_core;
extern crate futures;

use std::io;
use std::sync::mpsc;
use std::thread;
use std::net::ToSocketAddrs;

use futures::Future;
use futures::stream::Stream;

use tokio_core::Loop;

#[test]
fn recv_after_close() {
    let (init_sender, init_receiver) = mpsc::channel();
    thread::spawn(move || {
        let mut lp: Loop = Loop::new().unwrap();

        let (sender, receiver) = lp.handle().channel::<u32>();

        let receiver = lp.run(receiver).unwrap();

        init_sender.send(receiver);

        let listener = lp.handle().tcp_listen(&("::1", 3600).to_socket_addrs().unwrap().next().unwrap());

        lp.run(listener.and_then(move |listener| {
            sender.send(10);
            listener.incoming().for_each(|_| Ok(()))
        }));
    });

    let receiver: tokio_core::Receiver<u32> = init_receiver.recv().unwrap();

    receiver.wait().next();

    println!("server started");
}
```

(I've created two channels because it is a demo; a real example would be much larger.)

If the server fails to bind the port (e.g. if you replace 3600 with 36), the program, again, hangs forever. This problem can obviously be worked around, but the problem with the current (new) API is that it is very easy to create such bugs.

It reminds me of the problems with CompletableFuture in Java: if you forget to properly handle an exception somewhere and do not fire an error on the future, the program may hang (for example, if a connection counter isn't decremented properly).

But the problem is unsolvable in Java, because Java has no RAII. And Rust has it. Why not take advantage of it to make the API safer?

@alexcrichton
Contributor

Yes, unfortunately this is just something that fundamentally isn't designed to work. The source of "blocking" here is the event loop, as that's what the channel is bound to. If that's destroyed, then the channel has no ability to actually block.

What you'd want to use in a case like that is something like `futures::stream::channel`, which doesn't rely on an event loop for blocking.
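For illustration only (this uses std's synchronous channel, not `futures::stream::channel` itself), `std::sync::mpsc` shows the behavior being asked for in this thread: once the sending side is dropped, a receive returns an error instead of hanging:

```rust
use std::sync::mpsc;
use std::thread;

fn main() {
    let (tx, rx) = mpsc::channel::<u32>();
    thread::spawn(move || {
        tx.send(10).unwrap();
        // tx is dropped when this thread exits
    });
    assert_eq!(rx.recv(), Ok(10));
    assert!(rx.recv().is_err()); // sender gone: an error, not a hang
    println!("receiver observed disconnect");
}
```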

@stepancheg
Author

If that's destroyed, then the channel has no ability to actually block.

Channel should not block, it should immediately return an error.

What you'd want to use in a case like that is something like futures::stream::channel which doesn't rely on an event loop for blocking.

Probably. I don't yet fully understand why different channels exist.

Anyway, I simply don't understand your motivation. AFAIU, it is possible to just make the receiver not hang. It is described in #17. It adds practically no overhead and makes the API safer. Why not do it?

@ArtemGr

ArtemGr commented Mar 29, 2017

Perhaps related to this issue, I have an anecdotal example: I spent several hours debugging while tokio-core just wouldn't work. No errors, no nothing; scheduled futures just wouldn't run in production while working perfectly in a unit test. It turned out to involve initializing a `Core` from inside PHP's `get_module` function: my hypothesis is that PHP-FPM calls `get_module` in the parent process and then forks (even though, according to the Zend specs, `get_module` should be called separately in every PHP process), so the `Core`-driving thread would likely vanish with the fork.
What bugs me is that the tokio-core library was silent: no panics, no errors, everything's fine. It just refused to run the futures that we handed to it. This, from a Rust library, is surprising.
P.S. I know it might be tricky to do the necessary sanity checks there. But maybe there's a way to detect that the `Core` thread is gone, and a place for the relevant assertion in debug builds?

@alexcrichton
Contributor

@ArtemGr note that this is planned to be fixed in #17, it's just a breaking change so we're holding off until 0.2.

@ArtemGr

ArtemGr commented Mar 29, 2017

@alexcrichton Awesome!

alexcrichton added a commit to alexcrichton/tokio that referenced this issue Dec 5, 2017
This commit is targeted at solving tokio-rs/tokio-core#12 and incorporates the
solution from tokio-rs/tokio-core#17. Namely the `need_read` and `need_write`
functions on `PollEvented` now return an error when the connected reactor has
gone away and the task cannot be blocked. This will typically naturally
translate to errors being returned by various connected I/O objects and should
help tear down the world in a clean-ish fashion.
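A minimal sketch of the idea behind that fix, with hypothetical names (`Reactor` and `PollEvented` here are stand-ins, not the real tokio types): the I/O object holds only a `Weak` reference to its reactor, so readiness registration can fail with an error once the reactor is gone instead of blocking forever:

```rust
use std::io;
use std::sync::{Arc, Weak};

struct Reactor; // stand-in for the real reactor

struct PollEvented {
    reactor: Weak<Reactor>,
}

impl PollEvented {
    // Hypothetical sketch: check that the reactor is still alive before
    // registering read interest; error out if it has been dropped.
    fn need_read(&self) -> io::Result<()> {
        match self.reactor.upgrade() {
            Some(_reactor) => Ok(()), // would register the current task here
            None => Err(io::Error::new(io::ErrorKind::Other, "reactor gone")),
        }
    }
}

fn main() {
    let reactor = Arc::new(Reactor);
    let io = PollEvented { reactor: Arc::downgrade(&reactor) };
    assert!(io.need_read().is_ok());
    drop(reactor);
    assert!(io.need_read().is_err()); // errors instead of hanging forever
    println!("torn down cleanly");
}
```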