Worker event loop performance hampered by GIL/convoy effect #6325
Do we have a reproducer for this effect? I encourage us to think about actual real-world impact. The theory is sound and intuition tells us that a "blocked event loop is bad", but do we see this matter in reality? E.g. a real-world impact of this would be a task stream with a lot of white space caused by workers not communicating sufficiently. Just having a blocked loop and blocked worker communication doesn't necessarily mean that we're bottlenecked. Workers are fetching data all the time, even prematurely.

FWIW: there is huge value in this issue, just from an informational POV. I've seen SSL read pop up in profiles countless times and this offers an explanation. The question is whether this is …
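For concreteness, the kind of workload I would expect to surface this in the task stream is a tall-and-skinny rechunk like the one in the Coiled shuffle blog post referenced in the issue description. A rough sketch (the cluster shape and array sizes here are illustrative, not taken from this thread):

```python
import dask.array as da
from dask.distributed import Client, LocalCluster, performance_report

if __name__ == "__main__":
    # A few multi-threaded workers, so GIL-holding tasks and inter-worker
    # transfers have to share each worker's event loop.
    client = Client(LocalCluster(n_workers=4, threads_per_worker=2))

    # Tall-and-skinny chunks -> squarer chunks: an all-to-all transfer whose
    # receiving side does a lot of np.concatenate (which holds the GIL).
    x = da.random.random((100_000, 1_000), chunks=(100_000, 10))
    y = x.rechunk((1_000, 1_000))

    # Look for white space in the task stream and event-loop health in the report.
    with performance_report(filename="convoy-report.html"):
        y.sum().compute()
```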
In my case, the scheduler isn't working. I have multiple workers that are not processing anything, and the tasks are piling up.
switching from …
This looks like a laptop that went into standby to me?
@crusaderky unfortunately that's a k8s pod running …
@ericman93 I believe what you are experiencing is something else. If your event loop is blocked for that long, it is indeed bad, but I'm very certain that this does not connect to the GIL convoy effect. Would you please open a new issue describing your problem and setup, and ideally provide some code?
That line can be slightly misleading, e.g.

```python
import time

import distributed

c = distributed.Client("localhost:8786")

async def f():
    time.sleep(5)  # coroutine tasks run on the worker's event loop, so this blocks it

c.submit(f).result()
```
You would also get the same effect with a SingleThreadExecutor (no such class exists in the distributed package AFAIK, but it would make sense for debugging - it would be basically the same as pure dask with …).
@crusaderky is the scheduler blocked for the sleep time? Does the scheduler wait for the tasks to finish if it's sync?
Apologies, I misread and didn't notice it was a scheduler log and not a worker log.
No. There is no reason within Dask why the scheduler would be blocked for multiple hours. This is unrelated to the GIL convoy effect. Please raise a different issue.
Opening this to have a clearer issue to link to in other places. In #5258 (comment), we discovered that the worker's event loop performance could be heavily impacted when GIL-holding tasks run in the worker's threadpool while the event loop is simultaneously transferring data to other workers.
For example, while running tasks involving lots of `np.concatenate`s (which doesn't release the GIL), while simultaneously trying to send data to other workers, we found that 71% of the time, the worker's event loop was blocked (not idle) by the GIL.

This is the "convoy effect", a longstanding issue in CPython: https://bugs.python.org/issue7946. Basically, the non-blocking `socket.send()` call (running in the event loop) releases the GIL and does a non-blocking write to the socket, which is nearly instantaneous. But then it needs to re-acquire the GIL, which some other thread now holds, and will hold for potentially a long time (the default thread-switch interval is 5ms). So this "non-blocking" `socket.send` is, effectively, blocking. If most of what the event loop is trying to do is send data/messages to other workers/the scheduler, then most of the time, the event loop will actually be blocked.

See the "GIL Gotchas" section in https://coiled.io/blog/better-shuffling-in-dask-a-proof-of-concept/ for an intuitive explanation of this involving toothbrushes.
This is clearly bad from a performance standpoint (it massively slows down data transfer between workers).
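The mechanism is easy to see outside of Dask. Here is a minimal sketch (not from this issue) that pits an asyncio event loop against a pure-Python thread that never voluntarily releases the GIL; on an idle loop each hop through the loop costs microseconds, while with the GIL-holding peer running each hop can cost up to the ~5 ms switch interval:

```python
import asyncio
import threading
import time

def hold_gil(seconds: float) -> None:
    # Pure-Python busy work: the GIL is only handed over at the interpreter's
    # switch interval (sys.getswitchinterval(), 5 ms by default).
    deadline = time.perf_counter() + seconds
    x = 0
    while time.perf_counter() < deadline:
        x += 1

async def worst_hop(n: int = 500) -> float:
    # Each pass through the loop briefly releases the GIL (around the selector
    # call) and must win it back, so a GIL-hogging thread delays every hop.
    worst = 0.0
    for _ in range(n):
        start = time.perf_counter()
        await asyncio.sleep(0)
        worst = max(worst, time.perf_counter() - start)
    return worst

async def main() -> None:
    idle = await worst_hop()
    print(f"idle loop:          worst hop {idle * 1000:.2f} ms")
    t = threading.Thread(target=hold_gil, args=(3.0,))
    t.start()
    contended = await worst_hop()
    print(f"GIL-holding thread: worst hop {contended * 1000:.2f} ms")
    t.join()

asyncio.run(main())
```

Inside a worker, the GIL-holding thread is a task like the `np.concatenate`-heavy ones above, and the delayed hop is the `socket.send` shipping data to a peer.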
An open question is whether it's also a stability issue. I wouldn't be surprised to see messages like `Event loop was unresponsive in Worker for 3.28s. This is often caused by long-running GIL-holding functions` in worker logs, even when your tasks aren't long-running GIL-holding functions, but just GIL-holding functions at all.

No async code will perform well if the event loop is gummed up, and the worker is no exception. I don't think the worker is, or should have to be, designed to work reliably when the event loop is highly unresponsive. Instead, I think we should focus on ways to protect the event loop from being blocked. (The main way to do this is to run tasks in subprocesses instead of threads, which I think is a good idea for all sorts of reasons.)
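Until tasks run in subprocesses, the closest existing knob is to trade threads for processes, so that each event loop shares an interpreter with as few GIL-holding task threads as possible. A sketch, assuming a LocalCluster (this reduces the contention per worker; it doesn't eliminate the convoy effect inside each process):

```python
from dask.distributed import Client, LocalCluster

# More single-threaded worker processes instead of a few multi-threaded ones:
# each worker's event loop then competes with at most one GIL-holding task
# thread rather than a whole threadpool of them.
cluster = LocalCluster(n_workers=8, threads_per_worker=1)
client = Client(cluster)
```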
I'm not sure exactly how an unresponsive event loop impacts stability. In theory, it shouldn't matter—everything should still happen, in the same order, just really, really slowly. However, any sane-seeming timeouts will go out the window if the event loop is running 1,000x slower than expected. So in practice, I'm not surprised to see a blocked event loop causing things like #6324.