
Set default Worker TTL #6148

Closed · mrocklin opened this issue Apr 16, 2022 · 6 comments · Fixed by #6200
Comments

@mrocklin
Member

Today workers heartbeat to the scheduler. If the scheduler doesn't hear back from a worker within a certain amount of time, it can ask the nanny to kill the worker, or just give up on the worker.

We already have this code, and the time to live (TTL) is configurable. Today the limit is set to infinity. We should maybe consider something more conservative.
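For reference, a minimal sketch of reading and changing that knob, assuming the `distributed.scheduler.worker-ttl` config key mentioned later in this thread (the "5 minutes" value here is purely illustrative):

```python
import dask
import distributed  # importing distributed loads its config defaults

# Expected to print None today, i.e. workers are never timed out
print(dask.config.get("distributed.scheduler.worker-ttl"))

# Illustrative only: opt into a finite TTL before creating the scheduler.
# Dask accepts human-readable durations such as "5 minutes".
dask.config.set({"distributed.scheduler.worker-ttl": "5 minutes"})
```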

cc @fjetter

This has been brought up many times. Some folks don't like this idea because they have computations that take a long time and hold the GIL (this is a valid counter-argument). We've avoided making a decision here in the past. Maybe we should set a reasonable target, like a minute or five minutes or something. We would warn loudly about what's happening and point people to the config option on how to change the behavior.

This was slightly inspired by (but does not fix) #6110

@gjoseph92
Collaborator

I was surprised to find that the default is currently infinity.

they have computations that take a long time and hold the GIL

This doesn't feel like a good counter-argument to me. Most other things in worker code aren't robust to the event loop being blocked for multiple minutes, and that's not something we should have to design for. More isolation between user code and worker internals, so this can't happen at all, would be good someday, but it won't arrive anytime soon. As things stand now, I'd consider this user error. Good Python code (whether in Dask or not) shouldn't be holding the GIL for minutes at a time, and indeed pure Python code can't (the default switch interval is 5 ms). So I'd think this comes from C extensions that don't release the GIL, which indicates a very advanced use case.
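As a quick illustration of the switch-interval point (standard library only, nothing Dask-specific):

```python
import sys

# CPython asks a running thread to release the GIL after each "switch
# interval", 5 ms by default, so pure-Python code can't hold the GIL
# for minutes at a time; only C extensions that never yield can.
print(sys.getswitchinterval())  # 0.005
```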

If this is actually the main concern, then using processes instead of threads #5319 (or just setting distributed.scheduler.worker-ttl: None explicitly) would probably be a fine way to address this niche use case.
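For that niche case, a sketch of the explicit opt-out, assuming the existing `distributed.scheduler.worker-ttl` key (the YAML in the comment is the equivalent entry for a `distributed.yaml` config file):

```python
import dask

# Disable the TTL entirely so long GIL-holding extension code never trips it.
# Equivalent YAML, e.g. in ~/.config/dask/distributed.yaml:
#   distributed:
#     scheduler:
#       worker-ttl: null
dask.config.set({"distributed.scheduler.worker-ttl": None})
```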

@mrocklin
Member Author

Yeah, to be clear, I think it's a valid use case (lots of code just links out to Fortran/C, and does so badly), but I think those folks should increase their TTL.

I don't need to be convinced that the counter-argument I pose above is a bad one. I'm simply raising it as a common counter-argument, one that I think we should override.

@fjetter
Member

fjetter commented Apr 25, 2022

I'm fine with a default TTL. Users whose code holds the GIL are typically aware of it, since we already issue many warnings.

We may want to extend the "This is often caused by long-running GIL-holding …" warning with some additional information, e.g. point to a page in our docs with a few "important config knobs for GIL-holding workloads".

Of course, this is a hard breaking change. The thing that concerns me about these hard breaking changes is that we don't currently have a nice way to communicate them. We should at least highlight this in the next changelog.


like a minute or five minutes or something

I feel a minute might be too strict. Five seems OK but I can't provide a very solid argument for either.

@gjoseph92
Collaborator

I feel a minute might be too strict. Five seems OK but I can't provide a very solid argument for either.

I'd think about how long you, as a naive user, would be willing to wait for something like #6110 to resolve itself before thinking "this is deadlocked" and giving up. Five minutes feels like a long time to me.

We should be heartbeating every second, so missing 60 of them seems like a bad enough sign to me.

If normal workloads are causing workers to miss 60 heartbeats, that's probably another topic we should look into? (For example #5258 (comment).)

@mrocklin
Member Author

mrocklin commented Apr 25, 2022 via email

mrocklin added a commit to mrocklin/distributed that referenced this issue Apr 25, 2022
@mrocklin
Member Author

mrocklin commented Apr 25, 2022 via email
