Consistent deadlock with shuffle="p2p" when merging dataframes with many partitions
#6981
Comments
@wence- do you have any idea what line number that assertion error is coming from?
I don't :(, I will try and find out.
It's `distributed/shuffle/multi_file.py` line 259, at commit acf6078.
Ah, thanks. Probably another concurrency bug, I'd guess. The p2p shuffle code hasn't been touched in a while, and likely won't be touched for a while, so I don't expect anyone will try to fix this. OK if I close?
Do you mean a concurrency bug in distributed, or in "external" libraries?

I suppose this is OK, if the intention is to replace the p2p shuffle code with something else. Otherwise, if this is just "low priority, but we would in theory like this to work", I would be +epsilon on leaving it open (or I can schedule a reminder to check again in 3 months...).
This seems like a valid bug. I don't think that it makes sense to close the issue because one person or one team chooses not to work on it. Others besides Gabe and the group around him can still jump in.
Nice.

> On Mon, Oct 31, 2022 at 12:22 PM Lawrence Mitchell (@wence-) wrote:
>
> > @wence- in #7195 I fixed a couple of deadlocks that are connected to swallowed exceptions. In that PR we should see the exceptions, if that's the problem.
>
> Running on that branch I'm unable to reproduce the original error and (after a couple of repeats) have yet to see any hangs.
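For context on "deadlocks connected to swallowed exceptions": below is a minimal, generic asyncio sketch (not actual distributed code; the producer/consumer names are invented for illustration) of how an unobserved exception in a background task can turn a crash into a hang.

```python
import asyncio


async def producer(queue: asyncio.Queue) -> None:
    # Stand-in for the worker-side AssertionError seen in this issue.
    assert False, "concurrency bug"
    await queue.put("data")  # never reached


async def consumer(queue: asyncio.Queue) -> str:
    # Blocks forever if the producer died without anyone noticing.
    return await queue.get()


async def main() -> None:
    queue: asyncio.Queue = asyncio.Queue()
    # Fire-and-forget: nothing ever awaits this task, so its exception
    # is never retrieved (at best it is logged at garbage collection).
    asyncio.create_task(producer(queue))
    print(await consumer(queue))  # deadlock: the crash was swallowed


# asyncio.run(main())  # hangs rather than raising the AssertionError
```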
Should be closed after #7268. Please reopen if the issue is not resolved.
What happened:

The code below (needs `typer` in addition to the usual dask/distributed/pandas/numpy) pretty consistently hangs after a worker `AssertionError` when using the `p2p` shuffle option, if I have both many workers and many partitions per worker. In particular, on a 40-physical-core Broadwell machine with plentiful (1TB) RAM, the following execution nearly always crashes and then hangs, at which point the dashboard shows that no tasks are processing (presumably because they are waiting for these now-failed tasks); cluster dump attached below.

On the same system I could also reproduce with `--num-workers 4 --partitions-per-worker 1000`, though I was not able to on a different system (which has a faster disk and RAM).

Minimal Complete Verifiable Example:
Reproducer
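The attached reproducer script is not inlined in this thread. As a rough sketch only, not the reporter's actual code, here is a hypothetical typer CLI driving a `p2p`-shuffled merge. The `--num-workers`/`--partitions-per-worker` flag names come from the comments above; the data sizes, column names, and cluster settings are assumptions, and `shuffle="p2p"` requires a development build from the time of this issue.

```python
# Hypothetical reconstruction of the kind of reproducer described above.
import numpy as np
import pandas as pd
import typer

import dask.dataframe as dd
from distributed import Client, LocalCluster


def main(num_workers: int = 40, partitions_per_worker: int = 100) -> None:
    with LocalCluster(n_workers=num_workers, threads_per_worker=1) as cluster:
        with Client(cluster):
            npartitions = num_workers * partitions_per_worker
            # Row count per partition is an invented placeholder.
            n = npartitions * 10_000
            rng = np.random.default_rng(42)
            left = dd.from_pandas(
                pd.DataFrame({"key": rng.integers(0, n, size=n), "x": rng.random(n)}),
                npartitions=npartitions,
            )
            right = dd.from_pandas(
                pd.DataFrame({"key": rng.integers(0, n, size=n), "y": rng.random(n)}),
                npartitions=npartitions,
            )
            # Select the experimental p2p shuffle backend for the merge.
            merged = left.merge(right, on="key", shuffle="p2p")
            print(len(merged))  # triggers the computation


if __name__ == "__main__":
    typer.run(main)  # exposes --num-workers / --partitions-per-worker
```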
Environment:

- Dask version: 2022.8.1+7.g19a51474c
- Distributed version: 2022.8.1+29.ga5d68657
- Python version: 3.9.13 | packaged by conda-forge | (main, May 27 2022, 16:56:21) [GCC 10.3.0]
- Install method: conda (dask/label/dev channel)

Cluster Dump State:

cluster-dump.msgpack.gz