"Worker process died unexpectedly" when tasks don't release GIL #6324

Open
bnaul opened this issue May 11, 2022 · 8 comments
Labels
discussion Discussing a topic with no specific actions yet

Comments

@bnaul
Contributor

bnaul commented May 11, 2022

In distributed==2022.5.0 we started seeing workers go down repeatedly with the message "Worker process died unexpectedly". I traced it back to #6200 and connected the dots that the affected tasks hold the GIL for long periods of time (one example is a third-party numerical optimization solver that we call out to).

I follow the reasoning behind the change, but I do think it has the potential to be pretty disruptive for some users, especially given the vagueness of the relevant logs. I only pieced this together quickly because I had been browsing the issues recently and had some inkling that this had been discussed (though I didn't know a change had already been merged and released); for users less familiar with distributed, this could be very hard to get to the bottom of. At minimum I think adding some messaging that points to the worker-ttl setting would be helpful (admittedly the "blah has been holding the GIL for n seconds" message is an existing clue, but personally I tune those out because we see them so often).
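For anyone else who hits this in the meantime, the relevant knob is the distributed.scheduler.worker-ttl config value. A minimal sketch of raising it (assuming the value is set before the cluster/scheduler is created, and with a placeholder TTL):

```python
# Sketch: relax the worker TTL so long GIL-holding tasks aren't mistaken for dead workers.
# Assumes the distributed.scheduler.worker-ttl config key; it needs to be set before the
# scheduler starts so it takes effect. "30 minutes" is just an illustrative value.
import dask
from distributed import Client, LocalCluster

dask.config.set({"distributed.scheduler.worker-ttl": "30 minutes"})

cluster = LocalCluster(n_workers=2, threads_per_worker=2)
client = Client(cluster)
```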

cc @mrocklin @gjoseph92

@bnaul
Contributor Author

bnaul commented May 11, 2022

One other thing I'll throw out: Prefect's Zombie Killer handles this in a more robust (but obviously much heavier) way by spawning a heartbeat subprocess that pings their API periodically. There's some discussion here, but the gist is:

The reasoning is that the old heartbeat thread could not fire if the user's task held onto the GIL, and might have even resulted in threading deadlocks. The new implementation avoids the GIL entirely and runs in a very lightweight Popen call.

My understanding of all this is pretty superficial, but I'm wondering if the nanny process could serve a similar role in this context and sidestep these GIL issues entirely? cc @jcrist, I think you were involved with the zombie killer stuff as well. I don't want to derail the main discussion above, just wanted to mention this.
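To make the idea concrete, here is a rough, hypothetical sketch of the heartbeat-subprocess pattern Prefect describes (the endpoint, interval, and helper names are made up; this is not what distributed or Prefect actually does). Because the heartbeat runs in its own interpreter, a task holding the parent's GIL cannot starve it:

```python
# Hypothetical sketch of a GIL-independent heartbeat: the parent spawns a tiny
# subprocess that pings a monitoring endpoint from its own interpreter, so a task
# holding the parent's GIL can't block the heartbeat.
import sys
import subprocess

HEARTBEAT_SCRIPT = """
import sys, time, urllib.request
url, interval = sys.argv[1], float(sys.argv[2])
while True:
    try:
        urllib.request.urlopen(url, timeout=5)   # "I'm alive" ping
    except Exception:
        pass                                     # the monitor decides what "dead" means
    time.sleep(interval)
"""

def start_heartbeat(url: str, interval: float = 15.0) -> subprocess.Popen:
    """Launch the heartbeat in its own process; terminate it when the task finishes."""
    return subprocess.Popen([sys.executable, "-c", HEARTBEAT_SCRIPT, url, str(interval)])

# Usage (the URL is a placeholder, not a real API):
# hb = start_heartbeat("http://monitoring.example/heartbeat/worker-123")
# ... run the GIL-holding task ...
# hb.terminate()
```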

@mrocklin
Member

We could have the nanny heartbeat as well as the worker. Question: what action should we take if the worker stopped responding but the nanny didn't?

@gjoseph92
Collaborator

@bnaul do your tasks actually hold the GIL continuously for >5min? That's surprising to me. Pure Python code is extremely unlikely to continuously hold the GIL for 5min, though due to the convoy effect, it can still gum up the event loop #6325. Usually, if you're calling some C code (like a third-party numerical optimization solver), you'd just release the GIL while that's running. Unless you're intentionally using ctypes.PyDLL, ctypes calls will release the GIL automatically. I'd imagine any C/Cython libraries that are doing heavy numerical operations would make sure to release the GIL, though I suppose they could be buggy and not do this.

My point being, I'd like to understand how your tasks are actually managing to prevent the worker from talking to the scheduler for 5min straight. That's an extremely long time for an event loop to be blocked. It is not reasonable to even expect the worker to function properly if that's happening, IMO.
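As an aside, here is a small sketch of the ctypes point above (assuming a Linux libc.so.6; not code from this thread). A long C call through ctypes.CDLL releases the GIL so other threads keep running, while the same call through ctypes.PyDLL holds the GIL and freezes every other thread in the process, the worker's event loop included:

```python
# Sketch of the GIL behaviour described above, using libc's sleep()
# (assumes Linux; the "libc.so.6" soname differs on other platforms).
import ctypes
import threading
import time

libc_release = ctypes.CDLL("libc.so.6")   # releases the GIL around each call
libc_hold = ctypes.PyDLL("libc.so.6")     # keeps the GIL held during each call

def tick(label):
    # A pure-Python "event loop" stand-in: should print roughly once per second.
    for _ in range(3):
        print(label, time.strftime("%H:%M:%S"))
        time.sleep(1)

# With CDLL, the 3-second C sleep releases the GIL, so tick() keeps printing.
t = threading.Thread(target=tick, args=("CDLL:",)); t.start()
libc_release.sleep(3)
t.join()

# With PyDLL, the same C sleep holds the GIL, so tick() is frozen until it returns.
t = threading.Thread(target=tick, args=("PyDLL:",)); t.start()
libc_hold.sleep(3)
t.join()
```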

@bnaul
Contributor Author

bnaul commented May 11, 2022

I can try to dig in more later this week, but in this case the offending task was doing geospatial stuff with geopandas/pygeos/geos. Another common culprit is cvxpy.

@gjoseph92
Collaborator

@bnaul if you can run py-spy in native mode on the workers while this is happening, I'd be really interested: https://github.com/gjoseph92/scheduler-profilers. (Note that since you're running on k8s I think, you'll need to be sure to set seccomp privileges properly. If you're able to run on Coiled, we already set this; it'll work out of the box.)

My guess is the GIL isn't actually held for 5min straight, but for a reasonable amount of time, and the lack of isolation between the worker and the code it's running is hurting us due to the convoy effect. Hope to be proven wrong though.

If this is the case ("reasonable" code is blocking the event loop enough to trigger the TTL), then I think we need to consider rolling back this change to the default, or making it much longer. If your code is actually holding the GIL for 5min straight, I'd consider it an edge case, and want to add better error messages, but not necessarily change the TTL.
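For reference, one way to grab that native-mode profile with the py-spy CLI directly is something along these lines (the PID, duration, and output path are placeholders; gjoseph92/scheduler-profilers wraps this up more conveniently):

```python
# Sketch: record a native-mode py-spy profile of a stuck worker process.
# Requires py-spy on the worker host plus the ptrace/seccomp permissions mentioned above;
# the PID, duration, and output path are placeholders.
import subprocess

worker_pid = 12345  # placeholder: the dask-worker process id
subprocess.run(
    ["py-spy", "record", "--pid", str(worker_pid), "--native",
     "--duration", "30", "-o", "worker-profile.svg"],
    check=True,
)
```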

@gjoseph92
Collaborator

At minimum I think adding some messaging that points to the worker-ttl variable would be helpful

Agreed. I'm disappointed you even saw Worker process died unexpectedly, and definitely concerned if that's the only log message you saw. (The worker shutdown should not have been unexpected to the Nanny; Worker.close asks the Nanny to close gracefully). Something to think about here is that if the worker event loop is heavily blocked by your tasks, it's possible that various 30s timeouts in Worker.close (which seem reasonable with a functioning event loop) are failing, turning a graceful close into an un-graceful close. (There's not even try-except logic for these timeouts in close, or a @fail_hard, so I don't even know what happens when they fail.)

So yes, we could have the scheduler include a reason message when it tells a worker to close, and log that on the worker, which could help make it clearer what's going on. But if Worker.close is itself not working properly because of the blocked event loop, we might have bigger things to deal with, and I wouldn't be sure the message will always make it to the worker reliably.

@bnaul
Contributor Author

bnaul commented May 12, 2022

Haven't tried py-spy yet (I think we did get it working when looking into a deadlock though?), but a little more detail: the task that was exceeding 5m of unresponsiveness was writing a very large .gpkg file with geopandas/fiona/GDAL. I can confirm that "Last Seen" in the Info tab was creeping up above 3 minutes (I didn't see it get to 4m59s or whatever).

@jorisvandenbossche maybe you'd have some insight into whether the following could apply to fiona+GDAL:

I'd imagine any C/Cython libraries that are doing heavy numerical operations would make sure to release the GIL, though I suppose they could be buggy and not do this.

If you think there might also be a relevant issue to raise over there I could try to do so.

@mrocklin
Member

Early in Dask's development this sort of thing used to happen a lot. We then asked a lot of libraries to add with nogil: blocks in Cython (or explicit release/acquire calls in C Extensions) and the world improved significantly. I'm not sure I ever tried to push fiona on this, but historically folks have been happy to make the changes.

@fjetter added the discussion label on Jun 13, 2022