KeyError when looking at durations of executing tasks #4587
Hi, will this issue be fixed soon? I am seeing a similar issue several times. Is it possibly caused by unhealthy nodes?
This is something we're also running into. Not sure why the task seems to be dropped.
I could reproduce and fix the error in #5053. I wasn't able to actually make anything fail, i.e. the cluster continued operating as before, but this log would show up. The log itself shouldn't be harmful, and the cluster continued operations as expected after issuing it. The warning appears to be triggered by a subtle race condition when tasks are cancelled while the worker is still executing something. This is not related to the worker state machine but rather an artefact of network latencies. Cancellations may sometimes not be obvious; in the example above, the cancellation happens implicitly via garbage collection once the variables holding the futures go out of scope. If anything else is interfering with the expected behaviour, it's likely not this log but something else.
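For illustration, a minimal hypothetical sketch (not the original reproducer) of the implicit cancellation described above: futures that are garbage-collected while workers are still executing them are cancelled behind the scenes, which is the situation that can race with the scheduler's duration bookkeeping. The function name and timings are made up.

```python
# Hypothetical sketch: implicit cancellation via garbage collection.
import time
from distributed import Client

def slow(x):
    time.sleep(5)
    return x + 1

if __name__ == "__main__":
    client = Client()                      # local cluster for the demo
    futures = client.map(slow, range(10))
    time.sleep(1)                          # let the workers start executing
    del futures                            # dropping the only references releases the keys;
                                           # the scheduler cancels the still-running tasks
    time.sleep(5)
    client.close()
```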
The behavior that we seem to be experiencing is that the task we're trying to get Dask to evaluate is "lost", and the scheduler seems to try over and over to submit and evaluate the job successfully. The logs we've seen sometimes reflect that this will work and our CI will continue as normal, only to stutter a few more times down the road. The end result seems to be a drastic increase in runtime as Dask attempts, fails, and starts over to complete its work. I'd be happy to try and provide more information, but this is all occurring on GitLab external executors. I can say that the method by which we're using Dask is maybe a little unorthodox, in that we're spinning up LocalClusters as the larger job, which is using Dask to parallelize smaller components, requests them. Something like:
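Roughly, a hypothetical sketch of that pattern (the original snippet is not shown here, and the function names are made up): a larger job spins up a short-lived LocalCluster to parallelize a smaller component, then tears it down.

```python
# Hypothetical sketch: a short-lived LocalCluster created on demand by a larger job.
from distributed import Client, LocalCluster

def fit_component(params):                 # made-up work function
    return sum(params)

def run_component(param_sets):
    with LocalCluster(n_workers=4, processes=True) as cluster:
        with Client(cluster) as client:
            futures = client.map(fit_component, param_sets)
            return client.gather(futures)

results = run_component([[1, 2], [3, 4], [5, 6]])
```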
Not sure if my methodology is causing something to accidentally get gc'd before it should...
@chukarsten I get why you perceive this kind of scheduling as unorthodox, but rest assured, you're not the first one, and we have a test for this; see distributed/tests/test_client.py, lines 6744 to 6770 at 07fe11d.
I don't see how this would interfere, since the small local cluster is operating in a dedicated context and should not interfere with your global scheduler. I suspect a different issue is causing your tasks to be "lost". May I ask you to open another issue with a bit more information? For instance, is the "big cluster" or the "small cluster" losing jobs? How exactly are you scheduling tasks with the client?
This seems to be resolved in
When stopping an in-flight computation, I get the following traceback:
The actual computation I was running was the following, but I suspect that this won't be hard to reproduce.
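As a stand-in, here is a hypothetical minimal sketch of stopping an in-flight computation; it is not the original code, and the function names and timings are invented for illustration.

```python
# Hypothetical sketch: cancel a computation while workers are still executing it.
import time
from distributed import Client

def work(x):
    time.sleep(10)
    return x ** 2

if __name__ == "__main__":
    client = Client()
    futures = client.map(work, range(100))
    time.sleep(2)              # let some tasks start running on the workers
    client.cancel(futures)     # stop the in-flight computation
    client.close()
```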
cc @gforsyth @fjetter