
Unknown issue between smol 0.1.12 vs 0.1.13 #177

Closed
Enrico2 opened this issue Jun 24, 2020 · 10 comments

Enrico2 commented Jun 24, 2020

Hello,

We've recently discovered a bizarre situation involving smol and a minor version update that broke stuff.

The interesting bit is that we can consistently reproduce a test error if we have smol = "=0.1.13" but if we use smol = "=0.1.12", tests pass.

The details are a bit convoluted: we don't have a direct dependency on smol; we get it via async-std. The core issue we've been seeing is that when setting up a MockServer (from wiremock) and trying to call it from an async unit test (using #[async_std::test]), the call hangs and we get a timeout.

The bizarre thing was that everything was fine, and a couple of days later without any code changes on our end, tests started failing. It seems that Cargo brought in a newer smol (until recently, we didn't check in our Cargo.lock).

If you'd like to reproduce locally (mac or linux, not sure about windows):

git clone git@github.com:apollographql/rust.git apollo-rust
cd apollo-rust/cli
cargo test reports_session # this will pass
sed -i'.original' -e 's/=0.1.12/=0.1.13/g' Cargo.toml
cargo test reports_session # this will fail

This fails up to 0.1.18 as well.

I realize this is a few versions back, if you have any insight here that can be helpful, we'd appreciate it. Thanks!


ghost commented Jun 24, 2020

I poked around a little bit and found that the tests in Apollo are using reqwest::blocking inside async code. Maybe there is blocking in some other places too. It turns out that previously the tests simply got lucky: because the async-std scheduler is multithreaded, when a single executor thread was blocked, the other threads would keep running tasks and everything was OK.

The new scheduler is a bit smarter in that it doesn't run all executor threads at full speed if there aren't massive amounts of work to do. So e.g. if there are only two tasks to run, there's no point in running all executor threads at once - most likely one thread can run both tasks. So now if the first task blocks (and it seems to block on the reqwest call), the whole system can get stalled, because the assumption is that running any task is effectively instantaneous.
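To make that concrete, here is a minimal sketch of the failure mode (my own construction, not the Apollo test; it assumes async-std as the executor and uses thread::sleep as a stand-in for the blocking reqwest call):

use std::time::Duration;

fn main() {
    async_std::task::block_on(async {
        // Stands in for the wiremock MockServer task: ready to respond instantly.
        let server = async_std::task::spawn(async {
            println!("server task ran");
        });

        // Stands in for the blocking reqwest call: it occupies the executor
        // thread instead of awaiting, so nothing else can run on that thread.
        let client = async_std::task::spawn(async {
            std::thread::sleep(Duration::from_secs(5));
            println!("client task unblocked");
        });

        client.await;
        server.await;
    });
}

Under the old eager scheduler another worker would pick up the server task almost immediately; under the lazy scheduler, if only one worker is awake, the server task may not run until the sleep ends.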

I could "fix" this issue by making the scheduler a bit more aggressive, always assuming that any executor thread could get blocked on any task. However, that's not a real fix - it's merely a bandaid for an issue that is not in smol.

The real solution would be to:

  1. Do better diagnostics and print/log when running a task takes unusually long - maybe 100 ms or something like that.
  2. Do automatic blocking detection and spawn a new executor thread when one is blocked, like in "New scheduler resilient to blocking" (async-rs/async-std#631). A rough sketch of both ideas follows below.
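
Neither exists in smol today, but as a hand-rolled sketch (not smol's internals), point 1 could look like a worker that stamps a shared clock before each poll, plus a watchdog thread that logs any poll exceeding the threshold:

use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::time::{Duration, Instant};

fn main() {
    let start = Instant::now();
    // 0 = worker is idle; otherwise, millis since `start` when the current poll began.
    let poll_began = Arc::new(AtomicU64::new(0));

    // Watchdog: wakes periodically and complains about long-running polls.
    let watched = poll_began.clone();
    std::thread::spawn(move || loop {
        std::thread::sleep(Duration::from_millis(10));
        let began = watched.load(Ordering::Relaxed);
        let now = start.elapsed().as_millis() as u64;
        if began != 0 && now - began > 100 {
            eprintln!("warning: a task has been running for {} ms", now - began);
            // Point 2 would additionally spawn a replacement worker here.
        }
    });

    // Worker: stamp the clock around each poll. Here the "poll" blocks for 250 ms.
    poll_began.store(start.elapsed().as_millis().max(1) as u64, Ordering::Relaxed);
    std::thread::sleep(Duration::from_millis(250));
    poll_began.store(0, Ordering::Relaxed);
}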

I find it amusing how that PR got so much pushback and negative publicity when apparently all async code blocks in some places, and it is the most effective solution to the problem :)

cc @dignifiedquire - I think this is now the third instance of the same kind of issue...


sehz commented Jun 24, 2020

How about making the scheduler configurable?

  • unoptimized
  • optimized

And give an option to turn on diagnostics and automatic blocking detection (maybe as a feature flag). I probably wouldn't want automatic blocking detection on by default.


Enrico2 commented Jun 24, 2020

Thanks so much for looking into this!

> The new scheduler is a bit smarter in that it doesn't run all executor threads at full speed if there aren't massive amounts of work to do. So e.g. if there are only two tasks to run, there's no point in running all executor threads at once - most likely one thread can run both tasks. So now if the first task blocks (and it seems to block on the reqwest call), the whole system can get stalled, because the assumption is that running any task is effectively instantaneous.

@stjepang I'm not sure I follow: is your hypothesis that everything is running on a single thread, stalling the system until we hit the timeout? If that were the case, removing the timeout restrictions and giving it enough time should eventually let the test pass, but that doesn't seem to happen.
The symptom I'm seeing is the code hanging and never getting a response from the MockServer until the timeout expires, even if I set the timeout to a minute. The response should be instantaneous since the MockServer is running locally.


ghost commented Jun 24, 2020

Okay, I think we should make two changes right now to alleviate this issue:

  1. Notify worker threads before running any task. This way, whenever a task is available, at least one idle thread should be notified.
  2. Remove the single-task slot in front of every worker queue. Tasks in those slots are not stealable, which means they won't get run if the worker is stalled.

After that, we should use the log crate to issue a warning/debug/trace whenever a worker is stalled for 100 ms (or some other number). @sehz, does that sound reasonable to you?

@Enrico2 There are two tasks: one is running the MockServer and one is blocking on reqwest issued by session.report().

Since there are only two tasks (and not hundreds of tasks), the executor is running only one thread and all other threads are sleeping. The executor doesn't bother spinning up more than one thread because the assumption is that running those two tasks should be trivial work for just one thread.

That single running thread then polls the future that blocks on reqwest and we wait until the timeout. The other task with MockServer never gets the chance to run because the single executor thread is blocked until reqwest times out.

What I propose we do right now is pessimistically assume that any future might block for a long time. So before we poll that future with reqwest, we should wake up at least one other executor thread to run MockServer in case the first thread gets blocked on reqwest.
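
In code, the rule is roughly this (a toy sketch, not smol's actual worker implementation; the IdleWorkers type is made up for illustration):

use std::sync::{Arc, Condvar, Mutex};
use std::time::Duration;

// A "wake one parked worker" primitive. The bool guards against lost wakeups.
struct IdleWorkers {
    pending: Mutex<bool>,
    cvar: Condvar,
}

impl IdleWorkers {
    // Wake at most one parked worker; a no-op if every worker is already busy.
    fn notify_one(&self) {
        *self.pending.lock().unwrap() = true;
        self.cvar.notify_one();
    }

    // Park the calling worker until someone notifies it.
    fn wait(&self) {
        let mut pending = self.pending.lock().unwrap();
        while !*pending {
            pending = self.cvar.wait(pending).unwrap();
        }
        *pending = false;
    }
}

fn run_task(idle: &IdleWorkers, poll: impl FnOnce()) {
    // Pessimistically assume this poll may block for a long time:
    // make sure at least one sibling is awake before we start.
    idle.notify_one();
    poll(); // stands in for polling the task's future
}

fn main() {
    let idle = Arc::new(IdleWorkers { pending: Mutex::new(false), cvar: Condvar::new() });

    // A parked sibling worker, free to pick up the MockServer task once woken.
    let sibling = {
        let idle = idle.clone();
        std::thread::spawn(move || {
            idle.wait();
            println!("sibling worker woke up");
        })
    };

    // The active worker polls a task that blocks (the reqwest call in this issue).
    run_task(&idle, || std::thread::sleep(Duration::from_millis(200)));
    sibling.join().unwrap();
}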


Enrico2 commented Jun 24, 2020

Ah! Got it. Thanks so much for the details!


sehz commented Jun 24, 2020

That sounds good as long as it is configurable. It could be expected behavior for some.


KalitaAlexey commented Jun 25, 2020

I faced this issue too. I used UdpSocket::recv_from to get data and my other task got blocked. I had to use UdpSocket::set_read_timeout. Even though it was easy to "fix" on my end, I spent some time trying to understand what was going on.

select! {
    msg = receiver.next().fuse() => match msg {
        Some(msg) => self.on_msg(msg).await?,
        _ => {}
    },
    res = async {
        let socket = self.socket.lock().await;
        // std::net::UdpSocket::recv_from: a blocking call on the executor thread
        socket.recv_from(&mut buffer)
    }.fuse() => match res {
        Ok((n, addr)) if n > 0 => self.on_data(&addr, &buffer[..n]).await?,
        // TimedOut comes from set_read_timeout; yield so other tasks get a turn
        Err(e) if e.kind() == std::io::ErrorKind::TimedOut => task::yield_now().await,
        _ => {}
    },
}
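
For reference, the same select! written against async_std::net::UdpSocket (whose recv_from is async) parks the task instead of the executor thread, and the TimedOut workaround goes away. A sketch, assuming the same surrounding types (receiver, self.socket, on_msg, on_data) as above:

select! {
    msg = receiver.next().fuse() => match msg {
        Some(msg) => self.on_msg(msg).await?,
        _ => {}
    },
    res = async {
        let socket = self.socket.lock().await;
        // async recv_from: yields to the executor instead of blocking the thread
        socket.recv_from(&mut buffer).await
    }.fuse() => match res {
        Ok((n, addr)) if n > 0 => self.on_data(&addr, &buffer[..n]).await?,
        _ => {}
    },
}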


ghost commented Jun 25, 2020

@KalitaAlexey Just to double-check I understood - the socket.recv_from(&mut buffer) operation is blocking, is that correct?


KalitaAlexey commented Jun 25, 2020

@stjepang, yes. It's in std, not in async-std. Actually there is async_std::net::UdpSocket. I'm almost certain I didn't see it yesterday.


ghost commented Jul 24, 2020

This is now fixed in smol 0.3.

This issue was closed.