⚠️ CI failed ⚠️ - test_deadlock fails intermittently #166
Comments
@jrbourbeau and @fjetter it looks like the deadlock might still be around? See the workflow run in the comment above.
How do I debug this? Is there any information about this run? A Coiled cluster ID, cluster dump, perf report, or anything like that?
Ok, I got it running locally after a while. Let's see if I can reproduce it.
Thank you @ntabris for pointing out that there is a log line that references the cluster 🎉
So this would be https://cloud.coiled.io/dask-engineering/clusters/32950/details
If this is the page-swapping issue again, we may need to consider adding limits on Coiled, if that is possible.
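Not quite the same thing as platform-level limits, but for reference, the worker-side memory thresholds can also be tightened so the nanny pauses or restarts a worker before the OS starts swapping. A rough sketch; the fractions below are illustrative, not recommendations:

```python
# A rough sketch, not the fix discussed above: lower the distributed worker
# memory thresholds so the nanny intervenes before the OS starts swapping.
# The fractions are illustrative only.
import dask

dask.config.set({
    "distributed.worker.memory.target": 0.50,     # start spilling to disk earlier
    "distributed.worker.memory.spill": 0.60,      # spill more aggressively
    "distributed.worker.memory.pause": 0.70,      # stop accepting new tasks
    "distributed.worker.memory.terminate": 0.85,  # nanny kills and restarts the worker
})
```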
Sorry, I don't quite follow what you mean here @fjetter. Could you elaborate a bit more, or perhaps there's a ticket somewhere with additional context? |
@fjetter While I was fixing some logic to get the cluster dump working for this particular test, I was able to get a failure and a cluster dump of it.

CI failure: https://github.com/coiled/coiled-runtime/runs/7049211634?check_suite_focus=true#step:6:94
Cluster dump S3 URI: s3://coiled-runtime-ci/test-scratch/cluster_dumps/test_deadlock-994c7153388c45e0b34a77a1b6fe0a19/

I hope this helps. For future reference, every stability test that fails will have a line like this in the CI report pointing to the S3 URI (in the AWS OSS account) that has the file.
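For anyone else digging through the dump, here is a minimal sketch of loading it for inspection, assuming the default gzipped-msgpack format written by `Client.dump_cluster_state`; the exact object key under the S3 prefix is a guess:

```python
# A minimal sketch of loading a cluster dump for inspection, assuming the
# default gzipped-msgpack format; the object key under the prefix is a guess.
import gzip

import fsspec
import msgpack

url = (
    "s3://coiled-runtime-ci/test-scratch/cluster_dumps/"
    "test_deadlock-994c7153388c45e0b34a77a1b6fe0a19/dump.msgpack.gz"  # hypothetical key
)

with fsspec.open(url, "rb") as f, gzip.open(f) as g:
    state = msgpack.unpack(g, strict_map_key=False)

# The dump holds scheduler state plus one entry per worker
print(list(state))             # e.g. "scheduler", "workers", "versions"
print(list(state["workers"]))  # worker addresses
```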
At least the cluster dump shared above does not indicate an actual deadlock but rather a "timeout too large" problem. What I can see is that worker A appears to be stuck fetching a task X from worker B. The test times out after about two minutes. The cluster dump actually includes a connection timeout error from trying to fetch the dump from worker B after 300s. This indicates that worker B is dead and our execution simply didn't continue, since worker A was waiting for the timeout to occur as well.
It appears Coiled is setting some large default values. I'll reset them in Coiled and we'll need to see if the issue persists afterwards. |
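For reference, the relevant timeouts can also be pinned explicitly on the client side. A small sketch using distributed's stock defaults rather than whatever Coiled was injecting:

```python
# A sketch of pinning the comm timeouts explicitly instead of relying on
# platform defaults; "30s" is distributed's stock default for both keys.
import dask

dask.config.set({
    "distributed.comm.timeouts.connect": "30s",
    "distributed.comm.timeouts.tcp": "30s",
})
```

In CI the same settings can be passed as environment variables, e.g. `DASK_DISTRIBUTED__COMM__TIMEOUTS__CONNECT=30s`.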
Indeed, the worker we're trying to connect to apparently died. It was restarted by the nanny, but workers were still trying to connect to the old instance.
The log message
I believe this kill request is actually issued by a … I found dask/distributed#6637, which causes the restart call to return immediately instead of blocking until the workers have cycled through. FWIW, this should be recoverable by "proper" timeouts, but it is still not great.
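Until that is fixed upstream, a minimal sketch of working around the non-blocking restart by explicitly waiting for workers to re-register; the scheduler address and timeout are placeholders:

```python
# A minimal sketch of guarding against a restart() that returns before the
# workers have cycled: wait until the expected worker count has re-registered.
# The scheduler address and timeout are placeholders.
from distributed import Client

client = Client("tcp://scheduler:8786")
n_workers = len(client.scheduler_info()["workers"])

client.restart()
client.wait_for_workers(n_workers, timeout=120)
```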
We think this should be fixed by dask/distributed#6714. Hopefully we can close this after a few days if we don't see any more failures?
Thank you @gjoseph92, I will close this one then and remove the skip to test it out.
See this #212 (comment); I'll close after the revert PR is merged.
Workflow Run URL