
⚠️ CI failed ⚠️ - test_deadlock fails intermittently #166

Open · github-actions bot opened this issue Jun 8, 2022 · 14 comments
Assignees: fjetter
Labels: ci-failure · cause: platform (Platform issue, i.e. Coiled specific) · cause: upstream (dask) (Upstream Dask issue, e.g. API breakage)

Comments

github-actions bot commented Jun 8, 2022

Workflow Run URL

ncclementi (Contributor) commented

@jrbourbeau and @fjetter it looks like the deadlock might still be around? See Workflow run in the comment above.

fjetter (Member) commented Jun 9, 2022

How do I debug this? Is there any information about this run? A coiled cluster ID, cluster dump, perf report or anything like that?

fjetter (Member) commented Jun 9, 2022

OK, I got it running locally after a while. Let's see if I can reproduce it.

fjetter (Member) commented Jun 9, 2022

Thank you @ntabris for pointing out that there is a log line that references the cluster 🎉

coiled:cluster.py:552 Creating Cluster (name: test_deadlock-6a165f70a88347c1bafca759d0fb7a3e, https://cloud.coiled.io/dask-engineering/clusters/32950/details). This might take a few minutes...

so this would be https://cloud.coiled.io/dask-engineering/clusters/32950/details
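
For anyone hitting this later, here is a small, hypothetical helper for pulling that cluster URL out of a downloaded CI log. The file name ci_log.txt and the URL pattern are assumptions for illustration, not part of the runtime.

import re

# Hypothetical helper: scan a downloaded CI log for the Coiled cluster URL
# logged when the cluster is created (see the line quoted above).
CLUSTER_URL = re.compile(r"https://cloud\.coiled\.io/[\w\-]+/clusters/\d+/details")

with open("ci_log.txt") as f:  # local copy of the CI log; name is an assumption
    for line in f:
        if "Creating Cluster" in line:
            match = CLUSTER_URL.search(line)
            if match:
                print(match.group(0))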

fjetter changed the title from "⚠️ CI failed ⚠️" to "⚠️ CI failed ⚠️ - test_deadlock fails intermittently" on Jun 9, 2022
fjetter (Member) commented Jun 10, 2022

If this is again caused by swapping of memory pages, we may need to consider adding limits on Coiled, if that is possible.

jrbourbeau (Member) commented

> If this is again caused by swapping of memory pages, we may need to consider adding limits on Coiled, if that is possible.

Sorry, I don't quite follow what you mean here @fjetter. Could you elaborate a bit more, or perhaps there's a ticket somewhere with additional context?

ncclementi (Contributor) commented

@fjetter While I was fixing some logic to get the cluster dump working for this particular test, I was able to get a failure and a cluster dump of it.

CI failure: https://github.com/coiled/coiled-runtime/runs/7049211634?check_suite_focus=true#step:6:94

Cluster dump S3 uri: s3://coiled-runtime-ci/test-scratch/cluster_dumps/test_deadlock-994c7153388c45e0b34a77a1b6fe0a19/

I hope this helps. For future reference, every stability test that fails will have a line like this in the CI report that points to the S3 URI (in the AWS OSS account) where the dump file is stored.
https://github.com/coiled/coiled-runtime/runs/7049211634?check_suite_focus=true#step:6:356
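
For reference, here is a minimal sketch of how such a dump could be inspected locally, assuming it was written in distributed's default msgpack.gz format and already downloaded from the S3 prefix above (the local file name is a guess):

import gzip

import msgpack  # cluster dumps default to msgpack + gzip

# Assumed local copy of the dump pulled from the S3 prefix above, e.g. via
# the AWS CLI; the exact file name under that prefix is an assumption.
DUMP_PATH = "test_deadlock-994c7153388c45e0b34a77a1b6fe0a19.msgpack.gz"

with gzip.open(DUMP_PATH, "rb") as f:
    dump = msgpack.unpack(f)

# Top-level sections typically include scheduler state and per-worker state.
print(list(dump))
print(list(dump.get("workers", {})))  # worker addresses captured in the dump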

@fjetter fjetter self-assigned this Jun 27, 2022
fjetter (Member) commented Jun 27, 2022

At least the cluster dump shared above does not indicate an actual deadlock but rather a "timeout too large" problem.

What I can see is that worker A appears to be stuck fetching a task X from worker B. The test times out after about two minutes.

The cluster dump actually includes a connection timeout error trying to fetch the dump from worker B after 300s. This indicates that worker B is dead and our execution simply didn't continue since worker A was waiting for the timeout to occur as well.
Bottom line, we should not artificially increase the timeout to such a large value. I have no idea why the worker died (maybe out of memory, maybe the VM died, etc.) but this might be entirely unrelated.
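
To illustrate where these timeouts live: they are plain dask config keys, so they can be inspected and pinned per run. A hedged sketch (the values shown are the usual defaults, not whatever Coiled was setting at the time):

import dask
import distributed  # loads the distributed config defaults

# Inspect the comm timeouts currently in effect.
print(dask.config.get("distributed.comm.timeouts.connect"))  # usually "30s"
print(dask.config.get("distributed.comm.timeouts.tcp"))      # usually "30s"

# Keeping these modest prevents a dead peer from looking like a deadlock,
# since callers stop waiting once the timeout expires instead of hanging.
with dask.config.set({"distributed.comm.timeouts.connect": "30s",
                      "distributed.comm.timeouts.tcp": "30s"}):
    pass  # create the Client / run the workload inside this context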

fjetter (Member) commented Jun 27, 2022

It appears Coiled is setting some large default timeout values. I'll reset them in Coiled, and we'll see whether the issue persists afterwards.

fjetter (Member) commented Jun 27, 2022

Indeed, the worker we're trying to connect to apparently died. It was restarted by the nanny, but other workers were still trying to connect to the old instance:

2022-06-24 22:10:30,703 - distributed.preloading - INFO - Creating preload: https://cloud.coiled.io/api/v2/cluster_facing/preload/worker
2022-06-24 22:10:30,704 - distributed.preloading - INFO - Downloading preload at https://cloud.coiled.io/api/v2/cluster_facing/preload/worker
2022-06-24 22:10:34,348 - distributed.preloading - INFO - Run preload setup: https://cloud.coiled.io/api/v2/cluster_facing/preload/worker
2022-06-24 22:10:34,359 - distributed.nanny - INFO -         Start Nanny at: 'tls://10.0.7.185:37777'
2022-06-24 22:10:35,001 - distributed.worker - INFO -       Start worker at:     tls://10.0.7.185:33101
2022-06-24 22:10:35,001 - distributed.worker - INFO -          Listening to:     tls://10.0.7.185:33101
2022-06-24 22:10:35,001 - distributed.worker - INFO -          dashboard at:           10.0.7.185:40769
2022-06-24 22:10:35,001 - distributed.worker - INFO - Waiting to connect to:      tls://10.0.7.148:8786
2022-06-24 22:10:35,001 - distributed.worker - INFO - -------------------------------------------------
2022-06-24 22:10:35,001 - distributed.worker - INFO -               Threads:                          2
2022-06-24 22:10:35,001 - distributed.worker - INFO -                Memory:                   3.78 GiB
2022-06-24 22:10:35,001 - distributed.worker - INFO -       Local Directory: /scratch/dask-worker-space/worker-r884wcga
2022-06-24 22:10:35,001 - distributed.worker - INFO - -------------------------------------------------
2022-06-24 22:10:35,299 - distributed.worker - INFO -         Registered to:      tls://10.0.7.148:8786
2022-06-24 22:10:35,300 - distributed.worker - INFO - -------------------------------------------------
2022-06-24 22:10:35,300 - distributed.core - INFO - Starting established connection
2022-06-24 22:10:38,177 - distributed.worker - INFO - Starting Worker plugin coiled-aws-env
2022-06-24 22:10:38,550 - distributed.nanny - INFO - Nanny asking worker to close
2022-06-24 22:10:38,554 - distributed.worker - INFO - Stopping worker at tls://10.0.7.185:33101
2022-06-24 22:10:38,557 - distributed.worker - INFO - Connection to scheduler broken. Closing without reporting. ID: Worker-84216a7c-2923-40a4-902b-b4adb36310bf Address tls://10.0.7.185:33101 Status: Status.closing
2022-06-24 22:10:38,710 - distributed.nanny - WARNING - Restarting worker
2022-06-24 22:10:39,322 - distributed.worker - INFO -       Start worker at:     tls://10.0.7.185:41633
2022-06-24 22:10:39,322 - distributed.worker - INFO -          Listening to:     tls://10.0.7.185:41633
2022-06-24 22:10:39,322 - distributed.worker - INFO -          dashboard at:           10.0.7.185:39133
2022-06-24 22:10:39,322 - distributed.worker - INFO - Waiting to connect to:      tls://10.0.7.148:8786
2022-06-24 22:10:39,322 - distributed.worker - INFO - -------------------------------------------------
2022-06-24 22:10:39,322 - distributed.worker - INFO -               Threads:                          2
2022-06-24 22:10:39,322 - distributed.worker - INFO -                Memory:                   3.78 GiB
2022-06-24 22:10:39,322 - distributed.worker - INFO -       Local Directory: /scratch/dask-worker-space/worker-nwumbvjc
2022-06-24 22:10:39,323 - distributed.worker - INFO - -------------------------------------------------
2022-06-24 22:10:39,584 - distributed.worker - INFO - Starting Worker plugin coiled-aws-env
2022-06-24 22:10:39,585 - distributed.worker - INFO -         Registered to:      tls://10.0.7.148:8786
2022-06-24 22:10:39,585 - distributed.worker - INFO - -------------------------------------------------
2022-06-24 22:10:39,586 - distributed.core - INFO - Starting established connection

The log message "Nanny asking worker to close" is triggered by a Nanny.kill. I do not know where this kill request is coming from. From what I understand, the only place in dask where we use kill instead of something else is when the memory monitor detects excessive memory usage, but that should be accompanied by appropriate warning logs.
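
For context on the memory-monitor path mentioned above, the threshold at which the nanny kills a worker is also a dask config key, so it is easy to check what was in effect. A hedged sketch; the 0.95 in the comment is the usual default, shown for illustration only:

import dask
import distributed  # loads the distributed config defaults

# Fraction of the memory limit at which the nanny terminates the worker.
print(dask.config.get("distributed.worker.memory.terminate"))  # usually 0.95

# If this path had fired, the nanny would normally log a memory-related
# warning next to "Nanny asking worker to close", which is absent here.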

fjetter (Member) commented Jun 27, 2022

I believe this kill request is actually issued by the client.restart we're doing in this test right after the cluster is started. In the above log, the first worker is started right after cluster startup and the second one is started after the restart. What left me wondering is why any worker would ever try to connect to the first worker, i.e. how this worker could ever have tasks assigned such that who_has is populated anywhere (which it clearly is, see the cluster dump).

I found dask/distributed#6637, which causes the restart call not to block until the workers have been cycled but to return immediately.
The run this cluster dump is from was on 2022.6.0, which was already affected by this bug; that explains how the cluster state got mixed up.

FWIW, this should be recoverable with "proper" timeouts, but it is still not great.
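
For illustration only, here is a minimal local sketch of the pattern described above (restart immediately after startup, then submit work). This is not the actual test_deadlock code, which runs against a Coiled cluster; LocalCluster and the array workload below are stand-ins.

import dask.array as da
from distributed import Client, LocalCluster

# Stand-in for the Coiled cluster used by the real test.
cluster = LocalCluster(n_workers=4, threads_per_worker=2)
client = Client(cluster)

# Restart right after startup. With dask/distributed#6637, restart() could
# return before the workers were fully cycled, so the scheduler might still
# reference pre-restart worker addresses (who_has) once work is submitted.
client.restart()

x = da.random.random((10_000, 10_000), chunks=(1_000, 1_000))
print(x.sum().compute())

client.close()
cluster.close()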

gjoseph92 (Contributor) commented

We think this should be fixed by dask/distributed#6714. Hopefully we can close this after a few days if we don't see any more failures?

ncclementi (Contributor) commented

Thank you @gjoseph92, I will close this one then and remove the skip to test it out.

ncclementi (Contributor) commented

See #212 (comment); I'll close this after the revert PR is merged.

fjetter added the ci-failure, cause: upstream (dask), and cause: platform labels on Aug 10, 2022