deadlock in tokio-1.0 #3493
Are you able to get some sort of backtrace or line number for the snippet of assembly?
I don't know how...
My problem is not related to a deadlock (I think), but it's still worth mentioning: #3500
I think I hit the same bug. Upgrading from tokio 0.2 to 1.0 went relatively smoothly; just in one part of the application, when performing a certain slightly more complex request, it started to hang with 100 % CPU on one core and never returned. Today, I finally found the reason in the cooperative scheduling code (lines 97 to 99 in cc97fb8).
This bug can only be found in sufficiently large projects that have deeply nested futures and exhaust the initial budget (a backtrace I captured had 218 stack frames). The module description mentions that “voluntary yield points should be placed after at least some work has been done”. However, that is not always obeyed, leading to tasks being polled and doing nothing for eternity. One of the places I found is tokio::sleep (tokio/tokio/src/time/driver/sleep.rs, lines 216 to 221 in cc97fb8).
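To make the mechanism concrete, here is a heavily simplified model of the failure mode (illustrative only, not tokio's actual coop code; the Budget and Leaf types are made up): a leaf that spends a budget permit before it knows whether it can make progress, and that requests an immediate re-poll once the budget is gone, can be driven in a tight loop forever without doing any work.

```rust
use std::task::{Context, Poll};

// Illustrative model only (not tokio's actual coop code): a per-task budget
// of "permits" that leaf futures draw from.
struct Budget(u32);

// A stand-in for a leaf resource such as a timer.
struct Leaf {
    ready: bool,
}

impl Leaf {
    fn poll(&mut self, budget: &mut Budget, cx: &mut Context<'_>) -> Poll<()> {
        if budget.0 == 0 {
            // Budget exhausted: yield, but ask to be polled again right away.
            // If nothing resets the budget before the next poll, the task
            // spins at 100% CPU without ever making progress.
            cx.waker().wake_by_ref();
            return Poll::Pending;
        }
        // The permit is spent *before* we know whether any work will happen,
        // which is the "yield point before work" pattern criticized above.
        budget.0 -= 1;
        if self.ready {
            Poll::Ready(())
        } else {
            Poll::Pending
        }
    }
}

fn main() {
    let waker = futures::task::noop_waker();
    let mut cx = Context::from_waker(&waker);
    let mut budget = Budget(2);
    let mut leaf = Leaf { ready: false };

    // Two polls spend the two permits without any work happening...
    assert!(leaf.poll(&mut budget, &mut cx).is_pending());
    assert!(leaf.poll(&mut budget, &mut cx).is_pending());
    // ...and from now on every poll is a budget-exhausted busy re-poll.
    assert!(leaf.poll(&mut budget, &mut cx).is_pending());
}
```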
My original code did not contain a sleep, so there must be other places too.
Thank you @Flakebi. The coop system consuming permits without doing any work certainly sounds like a problem that can cause a deadlock. Thanks for pointing it out! I have opened a new issue for tracking this specifically, a link to which you can find right above this reply. @tancehao Can you check whether your deadlock is caused by the same issue by replacing …?
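One generic way to check whether the coop budget is involved at all (a diagnostic sketch, not necessarily the replacement being suggested above; tokio::task::unconstrained was added in a later 1.x release, so it may require bumping tokio first) is to opt the suspect task out of budgeting and see whether the hang disappears:

```rust
use tokio::task;

#[tokio::main]
async fn main() {
    // `unconstrained` turns off cooperative scheduling (the budget) for the
    // wrapped future. If the hang disappears with this in place, the coop
    // budget is implicated; remember to remove it afterwards, since it also
    // removes the protection against starving other tasks.
    let handle = tokio::spawn(task::unconstrained(async {
        // ... the code path that hangs ...
    }));
    handle.await.unwrap();
}
```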
I replied over in #3502, but the indicated code in sleep doesn't look wrong to me. I suspect the root cause here is instead that there is some code in the application in question that will not return Pending. Could it be that some dependency in your stack still uses …?
Thanks for the fast answers! The behavior I see is: …
I guess I don’t understand how …
So, …
Yes, that's right, though that should result in …
Ah, no, quite the opposite. The default behavior is that dropping … You can see the relevant code here: lines 156 to 173 in 3e5a0a7.
It's the other way around. If you poll sleeps that are always ready, every time they're ready your budget gets decremented. If, on the other hand, you create a very long sleep that is basically never ready, then polling it does not consume any of your budget.
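A small sketch of that distinction, assuming the behavior described above (nothing here observes the budget directly; it only shows which polls are the ones that count against it):

```rust
use std::time::Duration;
use tokio::time::{sleep, sleep_until, Instant};

#[tokio::main]
async fn main() {
    // Deadline already in the past: each of these awaits completes
    // immediately, and per the discussion above every such "ready" poll
    // costs one unit of the task's budget, eventually forcing a yield.
    let past = Instant::now() - Duration::from_millis(1);
    for _ in 0..1_000 {
        sleep_until(past).await;
    }

    // A very long sleep is essentially never ready: its poll returns Pending
    // without draining the budget in the same way. We only create it here
    // instead of awaiting it for an hour.
    let _long = sleep(Duration::from_secs(3600));
}
```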
Ah, right, I knew I missed something. Thanks for the explanation! It seems like the …
I got a little further: in my case it is not actually a deadlock, it just takes a minute to finish and I never waited that long.
Reading the futures issue @jonhoo linked above, I think I can infer the reason. If the budget is exhausted, tokio’s leaf futures return Pending but immediately schedule the task to be woken again. According to the stacktrace I captured, there are 6 nested layers of FuturesUnordered, so the re-polling multiplies at every layer.
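To make the amplification concrete: if each layer re-polls its children up to 32 times before yielding, 6 nested layers can add up to on the order of 32^6 ≈ 10^9 wasted polls before everything finally yields, which fits "takes a minute" rather than a true deadlock. Below is a rough sketch of the nesting shape only (hypothetical; the fan-out and depth are made up, and the leaf just yields once in place of real work):

```rust
use std::future::Future;
use std::pin::Pin;
use futures::stream::{FuturesUnordered, StreamExt};

// Hypothetical shape of the problematic structure: each layer drives a
// FuturesUnordered of the layer below.
fn layer(depth: u32) -> Pin<Box<dyn Future<Output = ()> + Send>> {
    Box::pin(async move {
        if depth == 0 {
            // Stand-in for leaf work that yields instead of completing
            // immediately once the budget is gone.
            tokio::task::yield_now().await;
            return;
        }
        let mut set: FuturesUnordered<_> = (0..4).map(|_| layer(depth - 1)).collect();
        while set.next().await.is_some() {}
    })
}

#[tokio::main]
async fn main() {
    layer(6).await;
}
```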
I see. Well, … That said, I have considered whether the coop system should turn itself off automatically if the task keeps polling the future. Thoughts?
Ooof, yeah, that seems like a likely culprit @Flakebi! The longer-term solution here is that every "sub-executor" (like FuturesUnordered) …

Shorter-term, I think @Darksonn is right that we need some intermediate solution. Unfortunately, I think the proposed solution of "disable budgets if a future is polled after yielding due to coop" won't quite work, since that will be a very common case in practice. Consider the …

One thought I've had in the past is that instead of coop making a future return Pending, …
IMO …
I haven't read all of this thread and related threads, but I tend to agree with this. @jonhoo @Flakebi Is a patch like this enough? Or am I missing something? I don't have much time this week and next, but if the above patch works or someone can write a correct fix, I would accept a PR to fix this on the futures-rs side.
@taiki-e I tested your patch and it works. Also looks quite elegant to me :) |
Yeah, me too.
So, that will only kind of work. It is definitely an improvement (we'll no longer always re-poll 32 times), but the more fundamental "we keep re-polling" issue is still present. Also, depending on the ordering of the futures in the list, that might cause you to re-poll one future while not polling another future that's ready, I think?
Yeah, I agree that the patch does not solve the fundamental problem. (I think fixing the fundamental problem needs the longer-term solution that @carllerche mentioned.)
Currently, the patch is using …
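For reference, the general pattern under discussion, yielding voluntarily only after some work has been done, can also be applied in application code. A generic sketch (the function name and the every-32-items choice are made up, and this is not the futures-rs patch itself):

```rust
use futures::stream::{Stream, StreamExt};

// Generic illustration of yielding after a bounded amount of work: process
// items from a stream, but hand control back to the scheduler every 32 items
// so one busy stream cannot hog the task.
async fn drain_with_yields<S>(mut stream: S)
where
    S: Stream<Item = u64> + Unpin,
{
    let mut processed = 0u32;
    while let Some(item) = stream.next().await {
        // ... handle `item` here ...
        let _ = item;
        processed += 1;
        if processed % 32 == 0 {
            // Voluntary yield point, placed after some work has been done.
            tokio::task::yield_now().await;
        }
    }
}

#[tokio::main]
async fn main() {
    drain_with_yields(futures::stream::iter(0..100u64)).await;
}
```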
@Darksonn I'll give it a try.
The problem reported by @Flakebi (#3493 (comment)) has been fixed in … @tancehao: could you see if the problem you encountered also happens with the latest …?
@Darksonn I've upgraded my tokio library to 1.5, but surprisingly I found the problem still exists.
If the watch receiver in …

So in conclusion, if …

In general, it seems like you should be using a …
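For context, the usual non-polling way to consume a tokio::sync::watch channel (a generic sketch, not this application's code and not necessarily the change being recommended above):

```rust
use tokio::sync::watch;

#[tokio::main]
async fn main() {
    let (tx, mut rx) = watch::channel(0u64);

    let reader = tokio::spawn(async move {
        // `changed()` suspends the task until a new value is sent, instead
        // of repeatedly checking `borrow()` in a loop.
        while rx.changed().await.is_ok() {
            let value = *rx.borrow();
            println!("saw {value}");
        }
    });

    for i in 1..=3u64 {
        tx.send(i).unwrap();
        tokio::task::yield_now().await;
    }
    drop(tx); // closing the sender ends the reader loop
    reader.await.unwrap();
}
```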
Version
```text
│ │ ├── tokio v1.0.2
│ │ │ └── tokio-macros v1.0.0
│ │ ├── tokio-util v0.6.1
│ │ │ ├── tokio v1.0.2 (*)
│ │ │ └── tokio-stream v0.1.2
│ │ │ └── tokio v1.0.2 (*)
│ ├── tokio v1.0.2 (*)
├── tokio v1.0.2 (*)
├── tokio-compat-03 v0.0.0
├── tokio-stream v0.1.2 (*)
├── tokio-util v0.6.1 (*)
```
Platform
linux 4.9.0-8-amd64 #1 SMP Debian 4.9.144-3 (2019-02-02) x86_64 GNU/Linux
Description
My program pulls data from a redis server and sends it to another one.
There are many futures: some of them pull data from the source, some push to the target, and some print stats to stdout periodically (every 5 seconds). When pulling and pushing data, the program also modifies some metrics managed by prometheus-0.1.0.
The program worked well for months under tokio-0.3.0, but when I upgraded to tokio-1.0, a deadlock happens. When it is blocked, the program no longer prints stats, meaning that some of the futures are not being polled by the tokio runtime (multi-threaded mode).
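For reference, a periodic stats future of the kind described might look roughly like this (a hypothetical sketch, not the reporter's code); if the runtime stops polling the task, the ticks simply stop appearing, which matches the observed behavior:

```rust
use std::time::Duration;
use tokio::time::interval;

// Hypothetical sketch of a stats-printing future like the one described.
async fn stats_loop() {
    let mut ticker = interval(Duration::from_secs(5));
    loop {
        // If the runtime never polls this task again, this await never
        // completes and no further stats are printed.
        ticker.tick().await;
        println!("stats tick");
    }
}

#[tokio::main]
async fn main() {
    tokio::spawn(stats_loop());
    // ... the pulling and pushing futures would run alongside it ...
    tokio::time::sleep(Duration::from_secs(11)).await;
}
```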
I executed perf top, and here is the output: …
And here is what it showed when I selected the first line and looked at its annotation: …
These statistics seldom change over a long period.
So what happened in my process? What's wrong with the atomic operations?