Set default Worker TTL #6148
I was surprised to find that the default was infinity right now.
This doesn't feel like a good counter-argument to me. Most other things in worker code aren't robust to the event loop being blocked for multiple minutes, and that's not something we should have to design for. More isolation between user code and worker internals would be good someday so this can't happen, though that won't happen anytime soon. But as things stand now, I'd consider that user error. Good Python code (whether in Dask or not) shouldn't be holding the GIL for minutes at a time, and indeed pure Python code can't (the default switch interval is 5 ms). So I'd think this comes from C extensions which don't release the GIL, which indicates a very advanced use case. If this is actually the main concern, then using processes instead of threads #5319 (or just setting …)
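The 5 ms claim above can be checked directly: CPython exposes its thread switch interval, which bounds how long pure-Python bytecode can run before the interpreter asks the running thread to release the GIL (a small illustration, not part of the original thread):

```python
import sys

# CPython preemptively requests a GIL release from the running thread
# every "switch interval"; the default is 5 ms, which is why pure
# Python code cannot hold the GIL for minutes at a time -- only a
# C extension that never releases the GIL can.
print(sys.getswitchinterval())  # 0.005 by default
```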
Yeah, to be clear, I think that it's a valid use case (lots of code just links out to Fortran/C, and does so badly), but I think that those folks should increase their TTL. I don't need to be convinced that the counter-argument I pose above is a bad one; I'm simply raising it as a common counter-argument, one that I think we should override.
I'm fine with a default TTL. Users who are locking the GIL are typically aware of this since we're issuing many warnings. We may want to extend the … Of course, this is a hard breaking change. The one thing that concerns me about these hard breaking changes is that we don't have a nice way to communicate them right now. We should at least highlight this in the next changelog.
I feel a minute might be too strict. Five seems OK, but I can't provide a very solid argument for either.
I'd think about how long you, as a naive user, would be willing to wait for something like #6110 to resolve itself before thinking "this is deadlocked" and giving up. Five minutes feels like a long time to me. We should be heartbeating every second, so missing 60 of them seems like a bad enough sign to me. If normal workloads are causing workers to miss 60 heartbeats, that's probably another topic we should look into (for example #5258 (comment)).
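To make the arithmetic explicit (a back-of-the-envelope sketch using the numbers in this comment, not the actual scheduler logic):

```python
# With a 1 s heartbeat interval, a TTL is roughly a budget of
# consecutive missed heartbeats before the scheduler gives up.
heartbeat_interval = 1.0  # seconds, per the comment above

for ttl in (60, 120, 300):  # the 1, 2, and 5 minute options discussed
    missed = ttl / heartbeat_interval
    print(f"TTL of {ttl:>3d}s tolerates ~{missed:.0f} missed heartbeats")
```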
My intuition is similar to Gabe's. However, I'm also happy to introduce
this slowly. I suggest that we start with five minutes, release, and see
what happens. Then if we're happy with the result let's bring it down to
two minutes, and then eventually to one minute.
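For users who genuinely need a longer (or effectively infinite) TTL, the escape hatch would presumably be the existing config option; assuming the `distributed.scheduler.worker-ttl` key, something like this sketch:

```python
import dask

# Opt back into a longer TTL than the proposed default.
# Assumes the existing distributed.scheduler.worker-ttl config key;
# check your installed version's config reference before relying on it.
dask.config.set({"distributed.scheduler.worker-ttl": "10 minutes"})
```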
Here is a PR for the 5m TTL: #6200
Today workers heartbeat to the scheduler. If the scheduler doesn't hear back from a worker within a certain amount of time, it can ask the nanny to kill the worker, or just give up on the worker.
We have this code already, and how long the time to live (TTL) should be is configurable. Today the limit is set at infinity. We should maybe consider something more conservative.
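A minimal sketch of the TTL check described above (hypothetical code, not the actual implementation in distributed):

```python
import time

def ttl_expired(last_heartbeat, ttl, now=None):
    """Return True if a worker's last heartbeat is older than the TTL.

    Hypothetical helper for illustration: `last_heartbeat` and `now`
    are epoch seconds, `ttl` is a duration in seconds.
    """
    if now is None:
        now = time.time()
    return (now - last_heartbeat) > ttl

# A worker last heard from 400 s ago exceeds a 300 s (5 minute) TTL:
print(ttl_expired(last_heartbeat=0.0, ttl=300.0, now=400.0))  # True
print(ttl_expired(last_heartbeat=0.0, ttl=300.0, now=200.0))  # False
```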
cc @fjetter
This has been brought up many times. Some folks don't like this idea because they have computations that take a long time and hold the GIL (a valid counter-argument). We've avoided making a decision here in the past. Maybe we should set a reasonable target, such as one minute or five minutes. We would warn loudly about what's happening and point people to the config option for changing the behavior.
This was slightly inspired by (but does not fix) #6110