Worker raises CommClosedError on client shutdown #94
Comments
Actually, if I submit enough jobs to the worker (1000+), the …
Does calling …?
No, same error. The job also hangs for 30s when leaving the context, then fails with a …
These are errors I've seen before. I can't remember what is going on right now, but I'll look into it when I'm back in the office in a week. Sorry for the wait.
Thanks! Let me know if you need me to run more tests.
Hi @kmpaul, have you had a chance to look at this?
@lgarrison: I have not, yet. I'm sorry. This week got very messy. I absolutely will look into this on Monday. I'm sorry for the delay.
@lgarrison: I've verified the error on both Ubuntu and Rocky Linux. This is actually a resurgence of #88, which notes exactly the issue that you are seeing. I had thought this was fixed with #89, but it is clearly coming back.
(Note that #89 actually changed the …)
...And I can further verify that these …
Because of the problem noted in #88, where errors occurring during shutdown do not result in non-zero exit codes, the PyTest suite is not catching these errors. I think this is a significant problem and one that I will need some time to investigate and fix. In the meantime, @lgarrison, it doesn't look like the exit-time errors are resulting in incorrect results. Are you okay continuing "as-is" and just ignoring the exit-time errors?
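For context, here is a minimal, Dask-free sketch of why shutdown-time errors can slip past a test suite, assuming the tests launch the job as a subprocess and check only its return code (an assumption about the test setup, not a description of dask-mpi's actual tests): an exception raised on a background thread prints a traceback but leaves the exit code at 0.

```python
# sketch_exit_code.py -- hypothetical illustration, not Dask or dask-mpi code.
# An exception raised on a background thread is printed to stderr, but the
# process still exits with status 0, so a harness that only inspects the
# return code sees success.
import threading

def background():
    raise RuntimeError("simulated error during shutdown")

t = threading.Thread(target=background)
t.start()
t.join()  # the traceback is printed here by the default thread excepthook
print("main thread finished normally")
# $ python sketch_exit_code.py; echo $?   -> traceback on stderr, then exit code 0
```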
Yes, I can ignore the errors for now. I'm working on instructions for scientists using our local HPC resources to farm out work to the cluster using dask-mpi, and I think these errors would be a source of confusion, even if the code executes correctly. So I can continue my experimentation, but I'll probably wait until this is fixed to start teaching users about it. (Alternatively, if there's a workflow, dask-based or otherwise, that you like to use for dynamically dispatching independent, Python-based tasks to a Slurm allocation, I'd be interested to hear about it!)
Ok. I'll keep plugging away to try to get a solution as soon as possible.
Sigh. @lgarrison: I've tracked down the errors to something beyond Dask-MPI. You can test if you are seeing the same thing as me, but here is what I'm seeing.

**MVP:** With the latest versions of Dask and Distributed (on Linux), do the following.

In Terminal 1:

```
$ dask-scheduler
```

Note the address the scheduler is listening on (`ADDRESS` below).

In Terminal 2:

```
$ dask-worker ADDRESS:8786
```

where `ADDRESS` is the scheduler address noted above.

In Terminal 3:

```
$ python
Python 3.10.6 | packaged by conda-forge | (main, Aug 22 2022, 20:36:39) [GCC 10.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from distributed import Client
>>> client = Client('ADDRESS:8786')
>>> client.shutdown()
```

where `ADDRESS` is the same scheduler address.

**Results:** In Terminal 1, the scheduler shuts down appropriately without errors. In Terminal 2, the worker shuts down, but not without error: the logs of the worker after `client.shutdown()` show an error. In Terminal 3, where the `Client` was created, another error repeats, and it does not stop repeating until the Python process is exited. Interestingly, you can avoid the error that appears in Terminal 2 in the worker logs if you call … first. This happens with the latest versions of Dask and Distributed (from Conda-Forge) on Windows, too.
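For convenience, the three-terminal MVP above can be driven from a single script. The following is a rough sketch of that sequence using subprocesses; the port and sleep durations are arbitrary choices, not anything prescribed by Dask.

```python
# reproduce_shutdown_error.py -- a sketch of the MVP above as one script.
# Assumes `dask-scheduler` and `dask-worker` are on PATH; port/sleeps are arbitrary.
import subprocess
import time

from distributed import Client

ADDRESS = "tcp://127.0.0.1:8786"

# Terminal 1: start the scheduler.
scheduler = subprocess.Popen(["dask-scheduler", "--port", "8786"])
time.sleep(5)  # crude wait for the scheduler to come up

# Terminal 2: start one worker pointed at the scheduler.
worker = subprocess.Popen(["dask-worker", ADDRESS])
time.sleep(5)  # crude wait for the worker to register

# Terminal 3: connect a client and shut the cluster down.
client = Client(ADDRESS)
client.shutdown()

# Watch the worker's stderr for the shutdown-time traceback.
worker.wait(timeout=60)
scheduler.wait(timeout=60)
```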
Interesting, thanks! I was also able to reproduce this without dask-mpi following your instructions. I confirm that … So, should I open an issue in an upstream repo? Would that be …?
I'm opening an upstream issue right now. I'm seeing if I can figure out which Dask version introduced the regression. Then I'll submit the issue and report it here.
There is definitely some strangeness produced by Python's …
Ok. The Dask Distributed issue has been created (dask/distributed#7192). I'm not sure how much more I want to work on Dask-MPI until I hear back about that issue, lest I spend too much time trying to design around an upstream bug. So, I'll return to Dask-MPI if/when I hear about a solution to dask/distributed#7192.
@lgarrison: If you are following what is happening in dask/distributed#7192, then you probably know that I tried the …
(My thinking is that with a large number of workers, the …)
Describe the issue:
I'm trying out a simple hello-world style `dask-mpi` example, and the computation returns the right result, but I'm getting exceptions when the client finishes. I'm running the below script under Slurm as `srun -n3 python repro.py`, and the error is:

I thought this might be related to #87, but I'm running on Python 3.8 and there's just an exception, no hang.
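(Since the script itself is not reproduced above, here is a rough sketch of what a `dask-mpi` hello-world along these lines typically looks like; the `square` task is a placeholder, not the author's actual `repro.py`.)

```python
# repro-style sketch (hypothetical, not the author's actual repro.py).
# Run under Slurm as: srun -n3 python repro.py
# With dask_mpi.initialize(), MPI rank 0 runs the scheduler, rank 1 runs this
# client code, and the remaining ranks (here, just rank 2) become workers.
from dask_mpi import initialize
from distributed import Client


def square(x):
    return x * x


if __name__ == "__main__":
    initialize()
    with Client() as client:
        futures = client.map(square, range(10))
        print(client.gather(futures))
```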
Am I doing something wrong? It looks to me like the worker is complaining because the scheduler shuts down before the worker does. Is this expected? If I manually force the workers to shut down before the client and scheduler do, with:
then everything exits with no exceptions. But this feels like a hack... am I missing something?
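(The exact snippet referenced above isn't shown; a hedged guess at what such a forced shutdown might look like, using the public `Client.retire_workers` call, is below.)

```python
# Hypothetical sketch of the "shut workers down first" workaround, placed
# inside the `with Client() as client:` block of a script like the one above
# (not the author's actual snippet): explicitly retire all workers before the
# client and scheduler begin their own teardown.
workers = list(client.scheduler_info()["workers"])
client.retire_workers(workers=workers, close_workers=True)
# ... then let the client and scheduler shut down as usual.
```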
Minimal Complete Verifiable Example:
Full log:
Environment: