"Worker process died unexpectedly" when tasks don't release GIL #6324
Comments
One other thing I'll throw out: Prefect's Zombie Killer handles this in a more robust but obviously much heavier way: it spawns a heartbeat subprocess that pings their API periodically. There's some discussion here, but the gist is:

My understanding of all this is pretty superficial, but I'm wondering if maybe the nanny process could serve some kind of similar role in this context and avoid these issues with the GIL? cc @jcrist, I think you were involved with the zombie killer stuff as well. Don't wanna derail the main discussion above, but just wanted to mention this.
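For illustration, here is a minimal sketch of the heartbeat-subprocess idea (my own toy code, not Prefect's implementation or the nanny's): because the child process has its own interpreter and its own GIL, it keeps pinging even while every thread in the parent is stuck in GIL-holding code.

```python
# Toy sketch of a heartbeat subprocess; names and defaults are made up.
import multiprocessing
import time

def heartbeat_loop(interval: float = 15.0) -> None:
    # Runs in its own process (own interpreter, own GIL), so it keeps beating
    # even if every thread in the parent process is holding the GIL in C code.
    while True:
        # In a real system this would be an API call or a scheduler message;
        # here it is just a log line.
        print(f"heartbeat at {time.time():.0f}", flush=True)
        time.sleep(interval)

def start_heartbeat(interval: float = 15.0) -> multiprocessing.Process:
    proc = multiprocessing.Process(target=heartbeat_loop, args=(interval,), daemon=True)
    proc.start()
    return proc
```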
We could have the nanny heartbeat as well as the worker. Question: what action should we take if the worker stopped responding, but the nanny didn't?
@bnaul do your tasks actually hold the GIL continuously for >5min? That's surprising to me. Pure Python code is extremely unlikely to continuously hold the GIL for 5min, though due to the convoy effect, it can still gum up the event loop (#6325). Usually, if you're calling some C code (like a third-party numerical optimization solver), you'd just release the GIL while that's running. Unless you're intentionally using

My point being, I'd like to understand how your tasks are actually managing to prevent the worker from talking to the scheduler for 5min straight. That's an extremely long time for an event loop to be blocked. It is not reasonable to even expect the worker to function properly if that's happening, IMO.
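To make the convoy-effect point concrete, here is a small self-contained experiment (nothing Dask-specific; the function names are mine): it measures how late asyncio timers fire while a pure-Python CPU-bound thread competes for the GIL. The CPU thread never holds the GIL for minutes at a time, but every reacquisition delays the event loop a little.

```python
# Measure asyncio timer lateness with and without a GIL-hungry thread running.
import asyncio
import threading
import time

def cpu_bound(stop: threading.Event) -> None:
    x = 0
    while not stop.is_set():
        x += 1  # pure-Python work; the GIL is only released at switch points

async def worst_timer_lateness(duration: float = 3.0, interval: float = 0.01) -> float:
    worst = 0.0
    deadline = time.monotonic() + duration
    while time.monotonic() < deadline:
        start = time.monotonic()
        await asyncio.sleep(interval)
        worst = max(worst, time.monotonic() - start - interval)
    return worst

async def main() -> None:
    baseline = await worst_timer_lateness()
    stop = threading.Event()
    threading.Thread(target=cpu_bound, args=(stop,), daemon=True).start()
    contended = await worst_timer_lateness()
    stop.set()
    print(f"worst lateness alone: {baseline*1000:.1f} ms, "
          f"with CPU thread: {contended*1000:.1f} ms")

asyncio.run(main())
```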
I can try to dig in more later this week, but in this case the offending task was doing geospatial stuff with geopandas/pygeos/geos. Another common culprit is cvxpy.
@bnaul if you can run py-spy in native mode on the workers while this is happening, I'd be really interested: https://github.com/gjoseph92/scheduler-profilers. (Note that since you're running on k8s I think, you'll need to be sure to set seccomp privileges properly. If you're able to run on Coiled, we already set this; it'll work out of the box.)

My guess is the GIL isn't actually held for 5min straight, but for a reasonable amount of time, and the lack of isolation between the worker and the code it's running is hurting us due to the convoy effect. Hope to be proven wrong though.

If this is the case ("reasonable" code is blocking the event loop enough to trigger the TTL), then I think we need to consider rolling back this change to the default, or making it much longer. If your code is actually holding the GIL for 5min straight, I'd consider it an edge case, and want to add better error messages, but not necessarily change the TTL.
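For reference, a hedged sketch of invoking py-spy in native mode against a worker PID from Python; the helper name and defaults are mine, and it assumes py-spy is installed in the worker image and ptrace is permitted (the seccomp note above).

```python
# Sketch only: shell out to py-spy to record a native-mode flame graph of one
# worker process. Not part of distributed or scheduler-profilers.
import subprocess

def profile_worker_native(pid: int, seconds: int = 30,
                          out: str = "worker-profile.svg") -> None:
    subprocess.run(
        [
            "py-spy", "record",
            "--pid", str(pid),
            "--native",              # include C/C++ frames (GDAL, solvers, ...)
            "--duration", str(seconds),
            "--output", out,
        ],
        check=True,
    )
```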
Agreed. I'm disappointed you even saw

So yes, we could have the scheduler include a
Haven't tried py-spy yet (I think we did get it working when looking into a deadlock though?) but a little bit more detail: the task that was exceeding 5m of unresponsiveness was writing a very large .gpkg file with geopandas/fiona/GDAL. I can confirm that "Last Seen" in the Info tab was creeping up above 3 minutes (didn't see it get to 4m59s or whatever). @jorisvandenbossche maybe you'd have some insight into whether the following could apply to fiona+GDAL:
If you think there might also be a relevant issue to raise over there I could try to do so. |
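For context, this is roughly the shape of workload being described (assumed on my part, not the reporter's actual code): a single task that writes a large GeoPackage through geopandas/fiona/GDAL and can therefore sit inside C code for a long stretch.

```python
# Assumed workload shape, not the reporter's code.
import dask
import geopandas

@dask.delayed
def write_gpkg(gdf: geopandas.GeoDataFrame, path: str) -> str:
    gdf.to_file(path, driver="GPKG")  # fiona/GDAL does the heavy lifting here
    return path

# Usage: write_gpkg(some_gdf, "out.gpkg").compute()
```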
Early in Dask's development this sort of thing used to happen a lot. We then asked a lot of libraries to add |
In distributed==2022.5.0 we started seeing workers go down repeatedly with the message "Worker process died unexpectedly". I traced it back to #6200 and connected the dots that these tasks hold the GIL for long periods of time (one example is a third party numerical optimization solver that we call out to).

I follow the reasoning of the change, but I do think this has the potential to be a pretty disruptive change for some users, especially given the vagueness of the relevant logs:

I put this together quickly because I had been browsing the issues recently and had some inkling that this had been discussed (but didn't know a change was already merged+released), but I think for users less familiar with distributed this might be very challenging to get to the bottom of. At minimum I think adding some messaging that points to the worker-ttl variable would be helpful (admittedly the "blah has been holding the GIL for n seconds" message is an existing clue, but personally I tune those out because we see them so often).

cc @mrocklin @gjoseph92
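For anyone hitting this in the meantime, one possible workaround (my suggestion, not something prescribed in the thread) is to raise or disable the TTL via Dask's config before the scheduler starts, using the distributed.scheduler.worker-ttl key mentioned above.

```python
# Possible workaround: raise the scheduler-side worker TTL, or disable the check.
# Must be set before the scheduler/cluster is created to take effect.
import dask

dask.config.set({"distributed.scheduler.worker-ttl": "15 minutes"})
# or, to turn the check off entirely:
# dask.config.set({"distributed.scheduler.worker-ttl": None})
```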