
⚠️ CI failed ⚠️ - test_deadlock fails intermittently #166

Open · github-actions bot opened this issue Jun 8, 2022 · 14 comments
Assignees: fjetter
Labels: ci-failure · cause: platform (Platform issue, i.e. Coiled specific) · cause: upstream (dask) (Upstream Dask issue, e.g. API breakage)

Comments

github-actions bot commented Jun 8, 2022

Workflow Run URL

ncclementi (Contributor) commented

@jrbourbeau and @fjetter it looks like the deadlock might still be around? See Workflow run in the comment above.

fjetter (Member) commented Jun 9, 2022

How do I debug this? Is there any information about this run? A coiled cluster ID, cluster dump, perf report or anything like that?

fjetter (Member) commented Jun 9, 2022

OK, I got it running locally after a while. Let's see if I can reproduce it.

fjetter (Member) commented Jun 9, 2022

Thank you @ntabris for pointing out that there is a log line that references the cluster 🎉

coiled:cluster.py:552 Creating Cluster (name: test_deadlock-6a165f70a88347c1bafca759d0fb7a3e, https://cloud.coiled.io/dask-engineering/clusters/32950/details). This might take a few minutes...

so this would be https://cloud.coiled.io/dask-engineering/clusters/32950/details
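
For anyone hitting this later, here is a small, hypothetical helper for pulling that cluster URL out of a downloaded CI log. The file name ci_log.txt and the URL pattern are assumptions for illustration, not part of the runtime.

import re

# Hypothetical helper: scan a downloaded CI log for the Coiled cluster URL
# logged when the cluster is created (see the line quoted above).
CLUSTER_URL = re.compile(r"https://cloud\.coiled\.io/[\w\-]+/clusters/\d+/details")

with open("ci_log.txt") as f:  # local copy of the CI log; name is an assumption
    for line in f:
        if "Creating Cluster" in line:
            match = CLUSTER_URL.search(line)
            if match:
                print(match.group(0))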

fjetter changed the title from "⚠️ CI failed ⚠️" to "⚠️ CI failed ⚠️ - test_deadlock fails intermittently" on Jun 9, 2022
fjetter (Member) commented Jun 10, 2022

If this is again caused by swapping of memory pages, we may need to consider adding limits on Coiled, if that is possible.

jrbourbeau (Member) commented

> If this is again caused by swapping of memory pages, we may need to consider adding limits on Coiled, if that is possible.

Sorry, I don't quite follow what you mean here @fjetter. Could you elaborate a bit more, or perhaps there's a ticket somewhere with additional context?

ncclementi (Contributor) commented

@fjetter While I was fixing some logic to get the cluster dump working for this particular test, I was able to get a failure and a cluster dump of it.

CI failure: https://github.com/coiled/coiled-runtime/runs/7049211634?check_suite_focus=true#step:6:94

Cluster dump S3 uri: s3://coiled-runtime-ci/test-scratch/cluster_dumps/test_deadlock-994c7153388c45e0b34a77a1b6fe0a19/

I hope this helps. For future reference, every stability test that fails will have a line like this in the CI report that points to the S3 URI (in the AWS OSS account) where the dump file is stored.
https://github.com/coiled/coiled-runtime/runs/7049211634?check_suite_focus=true#step:6:356
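
For reference, here is a minimal sketch of how such a dump could be inspected locally, assuming it was written in distributed's default msgpack.gz format and already downloaded from the S3 prefix above (the local file name is a guess):

import gzip

import msgpack  # cluster dumps default to msgpack + gzip

# Assumed local copy of the dump pulled from the S3 prefix above, e.g. via
# the AWS CLI; the exact file name under that prefix is an assumption.
DUMP_PATH = "test_deadlock-994c7153388c45e0b34a77a1b6fe0a19.msgpack.gz"

with gzip.open(DUMP_PATH, "rb") as f:
    dump = msgpack.unpack(f)

# Top-level sections typically include scheduler state and per-worker state.
print(list(dump))
print(list(dump.get("workers", {})))  # worker addresses captured in the dump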

@fjetter fjetter self-assigned this Jun 27, 2022
fjetter (Member) commented Jun 27, 2022

At least the cluster dump shared above does not indicate an actual deadlock but rather a "timeout too large" problem.

What I can see is that worker A appears to be stuck fetching a task X from worker B. The test times out after about two minutes.

The cluster dump actually includes a connection timeout error trying to fetch the dump from worker B after 300s. This indicates that worker B is dead and our execution simply didn't continue since worker A was waiting for the timeout to occur as well.
Bottom line, we should not artificially increase the timeout to such a large value. I have no idea why the worker died (maybe out of memory, maybe the VM died, etc.) but this might be entirely unrelated.
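
To illustrate where these timeouts live: they are plain dask config keys, so they can be inspected and pinned per run. A hedged sketch (the values shown are the usual defaults, not whatever Coiled was setting at the time):

import dask
import distributed  # loads the distributed config defaults

# Inspect the comm timeouts currently in effect.
print(dask.config.get("distributed.comm.timeouts.connect"))  # usually "30s"
print(dask.config.get("distributed.comm.timeouts.tcp"))      # usually "30s"

# Keeping these modest prevents a dead peer from looking like a deadlock,
# since callers stop waiting once the timeout expires instead of hanging.
with dask.config.set({"distributed.comm.timeouts.connect": "30s",
                      "distributed.comm.timeouts.tcp": "30s"}):
    pass  # create the Client / run the workload inside this context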

fjetter (Member) commented Jun 27, 2022

It appears Coiled is setting some large default timeout values. I'll reset them in Coiled, and we'll see whether the issue persists afterwards.

fjetter (Member) commented Jun 27, 2022

Indeed, the worker we're trying to connect to apparently died. It was restarted by the nanny, but other workers were still trying to connect to the old instance:

2022-06-24 22:10:30,703 - distributed.preloading - INFO - Creating preload: https://cloud.coiled.io/api/v2/cluster_facing/preload/worker
2022-06-24 22:10:30,704 - distributed.preloading - INFO - Downloading preload at https://cloud.coiled.io/api/v2/cluster_facing/preload/worker
2022-06-24 22:10:34,348 - distributed.preloading - INFO - Run preload setup: https://cloud.coiled.io/api/v2/cluster_facing/preload/worker
2022-06-24 22:10:34,359 - distributed.nanny - INFO -         Start Nanny at: 'tls://10.0.7.185:37777'
2022-06-24 22:10:35,001 - distributed.worker - INFO -       Start worker at:     tls://10.0.7.185:33101
2022-06-24 22:10:35,001 - distributed.worker - INFO -          Listening to:     tls://10.0.7.185:33101
2022-06-24 22:10:35,001 - distributed.worker - INFO -          dashboard at:           10.0.7.185:40769
2022-06-24 22:10:35,001 - distributed.worker - INFO - Waiting to connect to:      tls://10.0.7.148:8786
2022-06-24 22:10:35,001 - distributed.worker - INFO - -------------------------------------------------
2022-06-24 22:10:35,001 - distributed.worker - INFO -               Threads:                          2
2022-06-24 22:10:35,001 - distributed.worker - INFO -                Memory:                   3.78 GiB
2022-06-24 22:10:35,001 - distributed.worker - INFO -       Local Directory: /scratch/dask-worker-space/worker-r884wcga
2022-06-24 22:10:35,001 - distributed.worker - INFO - -------------------------------------------------
2022-06-24 22:10:35,299 - distributed.worker - INFO -         Registered to:      tls://10.0.7.148:8786
2022-06-24 22:10:35,300 - distributed.worker - INFO - -------------------------------------------------
2022-06-24 22:10:35,300 - distributed.core - INFO - Starting established connection
2022-06-24 22:10:38,177 - distributed.worker - INFO - Starting Worker plugin coiled-aws-env
2022-06-24 22:10:38,550 - distributed.nanny - INFO - Nanny asking worker to close
2022-06-24 22:10:38,554 - distributed.worker - INFO - Stopping worker at tls://10.0.7.185:33101
2022-06-24 22:10:38,557 - distributed.worker - INFO - Connection to scheduler broken. Closing without reporting. ID: Worker-84216a7c-2923-40a4-902b-b4adb36310bf Address tls://10.0.7.185:33101 Status: Status.closing
2022-06-24 22:10:38,710 - distributed.nanny - WARNING - Restarting worker
2022-06-24 22:10:39,322 - distributed.worker - INFO -       Start worker at:     tls://10.0.7.185:41633
2022-06-24 22:10:39,322 - distributed.worker - INFO -          Listening to:     tls://10.0.7.185:41633
2022-06-24 22:10:39,322 - distributed.worker - INFO -          dashboard at:           10.0.7.185:39133
2022-06-24 22:10:39,322 - distributed.worker - INFO - Waiting to connect to:      tls://10.0.7.148:8786
2022-06-24 22:10:39,322 - distributed.worker - INFO - -------------------------------------------------
2022-06-24 22:10:39,322 - distributed.worker - INFO -               Threads:                          2
2022-06-24 22:10:39,322 - distributed.worker - INFO -                Memory:                   3.78 GiB
2022-06-24 22:10:39,322 - distributed.worker - INFO -       Local Directory: /scratch/dask-worker-space/worker-nwumbvjc
2022-06-24 22:10:39,323 - distributed.worker - INFO - -------------------------------------------------
2022-06-24 22:10:39,584 - distributed.worker - INFO - Starting Worker plugin coiled-aws-env
2022-06-24 22:10:39,585 - distributed.worker - INFO -         Registered to:      tls://10.0.7.148:8786
2022-06-24 22:10:39,585 - distributed.worker - INFO - -------------------------------------------------
2022-06-24 22:10:39,586 - distributed.core - INFO - Starting established connection

The log message "Nanny asking worker to close" is triggered by a Nanny.kill. I do not know where this kill request is coming from. From what I understand, the only place in dask where we use kill instead of something else is when the memory monitor detects excessive memory usage, but that should be accompanied by appropriate warning logs.
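
For context on the memory-monitor path mentioned above, the threshold at which the nanny kills a worker is also a dask config key, so it is easy to check what was in effect. A hedged sketch; the 0.95 in the comment is the usual default, shown for illustration only:

import dask
import distributed  # loads the distributed config defaults

# Fraction of the memory limit at which the nanny terminates the worker.
print(dask.config.get("distributed.worker.memory.terminate"))  # usually 0.95

# If this path had fired, the nanny would normally log a memory-related
# warning next to "Nanny asking worker to close", which is absent here.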

fjetter (Member) commented Jun 27, 2022

I believe this kill request is actually issued by the client.restart we're doing in this test right after the cluster is started. In the above log, the first worker is started right after cluster startup and the second one is started after the restart. What left me wondering is why any worker would ever try to connect to the first worker, i.e. how this worker could ever have tasks assigned such that who_has is populated anywhere (which it clearly is, see the cluster dump).

I found dask/distributed#6637, which causes the restart call not to block until the workers have been cycled but to return immediately.
The run this cluster dump is from was on 2022.6.0, which was already affected by this bug; that explains how the cluster state got mixed up.

FWIW, this should be recoverable with "proper" timeouts, but it is still not great.
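
For illustration only, here is a minimal local sketch of the pattern described above (restart immediately after startup, then submit work). This is not the actual test_deadlock code, which runs against a Coiled cluster; LocalCluster and the array workload below are stand-ins.

import dask.array as da
from distributed import Client, LocalCluster

# Stand-in for the Coiled cluster used by the real test.
cluster = LocalCluster(n_workers=4, threads_per_worker=2)
client = Client(cluster)

# Restart right after startup. With dask/distributed#6637, restart() could
# return before the workers were fully cycled, so the scheduler might still
# reference pre-restart worker addresses (who_has) once work is submitted.
client.restart()

x = da.random.random((10_000, 10_000), chunks=(1_000, 1_000))
print(x.sum().compute())

client.close()
cluster.close()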

gjoseph92 (Contributor) commented

We think this should be fixed by dask/distributed#6714. Hopefully we can close this after a few days if we don't see any more failures?

ncclementi (Contributor) commented

Thank you @gjoseph92, I will close this one then and remove the skip to test it out.

ncclementi (Contributor) commented

See #212 (comment); I'll close this after the revert PR is merged.

fjetter added the ci-failure, cause: upstream (dask), and cause: platform labels on Aug 10, 2022