
Unknown issue between smol 0.1.12 vs 0.1.13 #177

Closed
Enrico2 opened this issue Jun 24, 2020 · 10 comments

Enrico2 commented Jun 24, 2020

Hello,

We've recently discovered a bizarre situation involving smol and a minor version update that broke stuff.

The interesting bit is that we can consistently reproduce a test error if we have smol = "=0.1.13" but if we use smol = "=0.1.12", tests pass.

The details are a bit convoluted: we don't have a direct dependency on smol; we get it via async-std. The core issue we've been seeing is that when setting up a MockServer (from wiremock) and trying to call it from an async unit test (using #[async_std::test]), the call hangs and we get a timeout.

The bizarre thing was that everything was fine, and a couple of days later without any code changes on our end, tests started failing. It seems that Cargo brought in a newer smol (until recently, we didn't check in our Cargo.lock).

If you'd like to reproduce locally (mac or linux, not sure about windows):

git clone git@github.com:apollographql/rust.git apollo-rust
cd apollo-rust/cli
cargo test reports_session # this will pass
sed -i'.original' -e 's/=0.1.12/=0.1.13/g' Cargo.toml
cargo test reports_session # this will fail

This fails up to 0.1.18 as well.

I realize this is a few versions back, if you have any insight here that can be helpful, we'd appreciate it. Thanks!


ghost commented Jun 24, 2020

I poked around a little bit and found that the tests in Apollo are using reqwest::blocking inside async code. Maybe there is blocking in some other places too. It turns out that previously the tests simply got lucky: because the async-std scheduler is multithreaded, when a single executor thread was blocked, the other threads would keep running tasks and everything was OK.

The new scheduler is a bit smarter in that it doesn't run all executor threads at full speed if there aren't massive amounts of work to do. So e.g. if there are only two tasks to run, there's no point in running all executor threads at once - most likely one thread can run both tasks. So now if the first task blocks (and it seems to block on the reqwest call), the whole system can get stalled, because the assumption is that running any task is effectively instantaneous.
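To make that concrete, here is a minimal sketch of the failure mode (my own construction, not the Apollo test; it assumes async-std as the executor and uses thread::sleep as a stand-in for the blocking reqwest call):

use std::time::Duration;

fn main() {
    async_std::task::block_on(async {
        // Stands in for the wiremock MockServer task: ready to respond instantly.
        let server = async_std::task::spawn(async {
            println!("server task ran");
        });

        // Stands in for the blocking reqwest call: it occupies the executor
        // thread instead of awaiting, so nothing else can run on that thread.
        let client = async_std::task::spawn(async {
            std::thread::sleep(Duration::from_secs(5));
            println!("client task unblocked");
        });

        client.await;
        server.await;
    });
}

Under the old eager scheduler another worker would pick up the server task almost immediately; under the lazy scheduler, if only one worker is awake, the server task may not run until the sleep ends.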

I could "fix" this issue by making the scheduler a bit more aggressive, always assuming that any executor thread could get blocked on any task. However, that's not a real fix - it's merely a bandaid for an issue that is not in smol.

The real solution would be to:

  1. Do better diagnostics and print/log when running a task takes unusually long - maybe 100 ms or something like that.
  2. Do automatic blocking detection and spawn a new executor thread when one is blocked, like in "New scheduler resilient to blocking" (async-rs/async-std#631). A rough sketch of both ideas follows below.
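
Neither exists in smol today, but as a hand-rolled sketch (not smol's internals), point 1 could look like a worker that stamps a shared clock before each poll, plus a watchdog thread that logs any poll exceeding the threshold:

use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::time::{Duration, Instant};

fn main() {
    let start = Instant::now();
    // 0 = worker is idle; otherwise, millis since `start` when the current poll began.
    let poll_began = Arc::new(AtomicU64::new(0));

    // Watchdog: wakes periodically and complains about long-running polls.
    let watched = poll_began.clone();
    std::thread::spawn(move || loop {
        std::thread::sleep(Duration::from_millis(10));
        let began = watched.load(Ordering::Relaxed);
        let now = start.elapsed().as_millis() as u64;
        if began != 0 && now - began > 100 {
            eprintln!("warning: a task has been running for {} ms", now - began);
            // Point 2 would additionally spawn a replacement worker here.
        }
    });

    // Worker: stamp the clock around each poll. Here the "poll" blocks for 250 ms.
    poll_began.store(start.elapsed().as_millis().max(1) as u64, Ordering::Relaxed);
    std::thread::sleep(Duration::from_millis(250));
    poll_began.store(0, Ordering::Relaxed);
}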

I find it amusing how that PR got so much pushback and negative publicity when apparently all async code blocks in some places, and it is the most effective solution to the problem :)

cc @dignifiedquire - I think this is now the third instance of the same kind of issue...


sehz commented Jun 24, 2020

How about making the scheduler configurable?

  • unoptimized
  • optimized

And give an option to turn on diagnostics and automatic blocking detection (maybe as a feature flag). I probably wouldn't want automatic blocking detection on by default.


Enrico2 commented Jun 24, 2020

Thanks so much for looking into this!

> The new scheduler is a bit smarter in that it doesn't run all executor threads at full speed if there aren't massive amounts of work to do. So e.g. if there are only two tasks to run, there's no point in running all executor threads at once - most likely one thread can run both tasks. So now if the first task blocks (and it seems to block on the reqwest call), the whole system can get stalled, because the assumption is that running any task is effectively instantaneous.

@stjepang I'm not sure I follow: is your hypothesis that everything is running on a single thread, stalling the system until we hit the timeout? If that were the case, removing the timeout restrictions and giving it enough time should eventually let the test pass, but that doesn't seem to happen.
The symptom I'm seeing is the code hanging and never getting a response from the MockServer until the timeout expires, even if I set the timeout to a minute. The response should be instantaneous since the MockServer is running locally.


ghost commented Jun 24, 2020

Okay, I think we should make two changes right now to alleviate this issue:

  1. Notify worker threads before running any task. This way, whenever a task is available, at least one idle thread should be notified.
  2. Remove the single-task slot in front of every worker queue. Tasks in those slots are not stealable, which means they won't get run if the worker is stalled.

After that, we should use the log crate to issue a warning/debug/trace whenever a worker is stalled for 100 ms (or some other number). @sehz, does that sound reasonable to you?

@Enrico2 There are two tasks: one is running the MockServer and one is blocking on reqwest issued by session.report().

Since there are only two tasks (and not hundreds of tasks), the executor is running only one thread and all other threads are sleeping. The executor doesn't bother spinning up more than one thread because the assumption is that running those two tasks should be trivial work for just one thread.

That single running thread then polls the future that blocks on reqwest and we wait until the timeout. The other task with MockServer never gets the chance to run because the single executor thread is blocked until reqwest times out.

What I propose we do right now is pessimistically assume that any future might block for a long time. So before we poll that future with reqwest, we should wake up at least one other executor thread to run MockServer in case the first thread gets blocked on reqwest.
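
In code, the rule is roughly this (a toy sketch, not smol's actual worker implementation; the IdleWorkers type is made up for illustration):

use std::sync::{Arc, Condvar, Mutex};
use std::time::Duration;

// A "wake one parked worker" primitive. The bool guards against lost wakeups.
struct IdleWorkers {
    pending: Mutex<bool>,
    cvar: Condvar,
}

impl IdleWorkers {
    // Wake at most one parked worker; a no-op if every worker is already busy.
    fn notify_one(&self) {
        *self.pending.lock().unwrap() = true;
        self.cvar.notify_one();
    }

    // Park the calling worker until someone notifies it.
    fn wait(&self) {
        let mut pending = self.pending.lock().unwrap();
        while !*pending {
            pending = self.cvar.wait(pending).unwrap();
        }
        *pending = false;
    }
}

fn run_task(idle: &IdleWorkers, poll: impl FnOnce()) {
    // Pessimistically assume this poll may block for a long time:
    // make sure at least one sibling is awake before we start.
    idle.notify_one();
    poll(); // stands in for polling the task's future
}

fn main() {
    let idle = Arc::new(IdleWorkers { pending: Mutex::new(false), cvar: Condvar::new() });

    // A parked sibling worker, free to pick up the MockServer task once woken.
    let sibling = {
        let idle = idle.clone();
        std::thread::spawn(move || {
            idle.wait();
            println!("sibling worker woke up");
        })
    };

    // The active worker polls a task that blocks (the reqwest call in this issue).
    run_task(&idle, || std::thread::sleep(Duration::from_millis(200)));
    sibling.join().unwrap();
}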


Enrico2 commented Jun 24, 2020

Ah! Got it. Thanks so much for the details!


sehz commented Jun 24, 2020

That sounds good as long as it is configurable. It could be expected behavior for some.


KalitaAlexey commented Jun 25, 2020

I faced this issue too. I used UdpSocket::recv_from to get data and my other task got blocked. I had to use UdpSocket::set_read_timeout. Even though it was easy to "fix" on my end, I spent some time trying to understand what was going on.

select! {
    msg = receiver.next().fuse() => match msg {
        Some(msg) => self.on_msg(msg).await?,
        _ => {}
    },
    res = async {
        let socket = self.socket.lock().await;
        // std::net::UdpSocket::recv_from: a blocking call on the executor thread
        socket.recv_from(&mut buffer)
    }.fuse() => match res {
        Ok((n, addr)) if n > 0 => self.on_data(&addr, &buffer[..n]).await?,
        // TimedOut comes from set_read_timeout; yield so other tasks get a turn
        Err(e) if e.kind() == std::io::ErrorKind::TimedOut => task::yield_now().await,
        _ => {}
    },
}
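
For reference, the same select! written against async_std::net::UdpSocket (whose recv_from is async) parks the task instead of the executor thread, and the TimedOut workaround goes away. A sketch, assuming the same surrounding types (receiver, self.socket, on_msg, on_data) as above:

select! {
    msg = receiver.next().fuse() => match msg {
        Some(msg) => self.on_msg(msg).await?,
        _ => {}
    },
    res = async {
        let socket = self.socket.lock().await;
        // async recv_from: yields to the executor instead of blocking the thread
        socket.recv_from(&mut buffer).await
    }.fuse() => match res {
        Ok((n, addr)) if n > 0 => self.on_data(&addr, &buffer[..n]).await?,
        _ => {}
    },
}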


ghost commented Jun 25, 2020

@KalitaAlexey Just to double-check I understood - the socket.recv_from(&mut buffer) operation is blocking, is that correct?


KalitaAlexey commented Jun 25, 2020

@stjepang, yes. It's in std, not in async-std. Actually there is async_std::net::UdpSocket. I'm almost certain I didn't see it yesterday.


ghost commented Jul 24, 2020

This is now fixed in smol 0.3.

This issue was closed.