
Set default Worker TTL #6148

Closed · mrocklin opened this issue Apr 16, 2022 · 6 comments · Fixed by #6200
Comments

@mrocklin
Member

Today workers heartbeat to the scheduler. If the scheduler doesn't hear back from a worker within a certain amount of time, it can ask the nanny to kill the worker, or just give up on the worker.

We already have this code, and the time to live (TTL) is configurable. Today the limit is set to infinity. We should maybe consider something more conservative.
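For reference, a minimal sketch of reading and changing that knob, assuming the `distributed.scheduler.worker-ttl` config key mentioned later in this thread (the "5 minutes" value here is purely illustrative):

```python
import dask
import distributed  # importing distributed loads its config defaults

# Expected to print None today, i.e. workers are never timed out
print(dask.config.get("distributed.scheduler.worker-ttl"))

# Illustrative only: opt into a finite TTL before creating the scheduler.
# Dask accepts human-readable durations such as "5 minutes".
dask.config.set({"distributed.scheduler.worker-ttl": "5 minutes"})
```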

cc @fjetter

This has been brought up many times. Some folks don't like this idea because they have computations that take a long time and hold the GIL (this is a valid counter-argument). We've avoided making a decision here in the past. Maybe we should set a reasonable target, like a minute or five minutes or something. We would warn loudly about what's happening and point people to the config option on how to change the behavior.

This was slightly inspired by (but does not fix) #6110

@gjoseph92
Collaborator

I was surprised to find that the default is currently infinity.

they have computations that take a long time and hold the GIL

This doesn't feel like a good counter-argument to me. Most other things in worker code aren't robust to the event loop being blocked for multiple minutes, and that's not something we should have to design for. More isolation between user code and worker internals, so this can't happen at all, would be good someday, but it won't arrive anytime soon. As things stand now, I'd consider this user error. Good Python code (whether in Dask or not) shouldn't be holding the GIL for minutes at a time, and indeed pure Python code can't (the default switch interval is 5 ms). So I'd think this comes from C extensions that don't release the GIL, which indicates a very advanced use case.
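As a quick illustration of the switch-interval point (standard library only, nothing Dask-specific):

```python
import sys

# CPython asks a running thread to release the GIL after each "switch
# interval", 5 ms by default, so pure-Python code can't hold the GIL
# for minutes at a time; only C extensions that never yield can.
print(sys.getswitchinterval())  # 0.005
```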

If this is actually the main concern, then using processes instead of threads #5319 (or just setting distributed.scheduler.worker-ttl: None explicitly) would probably be a fine way to address this niche use case.
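For that niche case, a sketch of the explicit opt-out, assuming the existing `distributed.scheduler.worker-ttl` key (the YAML in the comment is the equivalent entry for a `distributed.yaml` config file):

```python
import dask

# Disable the TTL entirely so long GIL-holding extension code never trips it.
# Equivalent YAML, e.g. in ~/.config/dask/distributed.yaml:
#   distributed:
#     scheduler:
#       worker-ttl: null
dask.config.set({"distributed.scheduler.worker-ttl": None})
```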

@mrocklin
Member Author

Yeah, to be clear, I think it's a valid use case (lots of code just links out to Fortran/C, and does so badly), but I think those folks should increase their TTL.

I don't need to be convinced that the counter-argument I pose above is a bad one. I'm simply raising it as a common counter-argument, one that I think we should override.

@fjetter
Member

fjetter commented Apr 25, 2022

I'm fine with a default TTL. Users whose code holds the GIL are typically aware of it, since we already issue many warnings.

We may want to extend the "This is often caused by long-running GIL-holding …" warning with some additional information, e.g. point to a page in our docs with a few "important config knobs for GIL-holding workloads".

Of course, this is a hard breaking change. The thing that concerns me about these hard breaking changes is that we don't currently have a nice way to communicate them. We should at least highlight this in the next changelog.


like a minute or five minutes or something

I feel a minute might be too strict. Five seems OK but I can't provide a very solid argument for either.

@gjoseph92
Collaborator

I feel a minute might be too strict. Five seems OK but I can't provide a very solid argument for either.

I'd think about how long you, as a naive user, would be willing to wait for something like #6110 to resolve itself before thinking "this is deadlocked" and giving up. Five minutes feels like a long time to me.

We should be heartbeating every second, so missing 60 of them seems like a bad enough sign to me.

If normal workloads are causing workers to miss 60 heartbeats, that's probably another topic we should look into? (For example #5258 (comment).)

@mrocklin
Member Author

mrocklin commented Apr 25, 2022 via email

mrocklin added a commit to mrocklin/distributed that referenced this issue Apr 25, 2022
@mrocklin
Member Author

mrocklin commented Apr 25, 2022 via email
