Do not allow closing workers to be awaited again #5910

Merged: 3 commits merged into dask:main from worker_close_deadlock on May 5, 2022

Conversation

@fjetter (Member) commented Mar 7, 2022

This fixes some of our deadlock situations causing tests to time out while closing a worker.

This is not a finished fix but rather a preliminary one to communicate the issue.

What's happening:

  • A worker is closing.
  • Worker.status is set to closing. This ensures Worker.close is idempotent and lets every other caller await finished(), see:
    if self.status in (Status.closed, Status.closing):
        await self.finished()
  • While it is closing, an incoming RPC triggers an await self.
  • The await self attempts to start the worker again, since closing is not part of Status.ANY_RUNNING.
  • The restart sets the status back to running, so a second close attempt actually runs a full close concurrently. That second call is what then deadlocks while trying to shut down the threadpool executor (see the sketch after this list).
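
A minimal, self-contained sketch of this race. ToyWorker and everything in it are invented for illustration; this is not the actual distributed Worker code, only the shape of the problem:

    import asyncio
    from enum import Enum


    class Status(Enum):
        running = "running"
        closing = "closing"
        closed = "closed"


    ANY_RUNNING = {Status.running}


    class ToyWorker:
        def __init__(self):
            self.status = Status.running

        def __await__(self):
            async def _start_if_needed():
                if self.status not in ANY_RUNNING:
                    # Problematic branch: a closing worker is "restarted"
                    # simply because something awaited it.
                    self.status = Status.running
                return self

            return _start_if_needed().__await__()

        async def close(self):
            if self.status in (Status.closed, Status.closing):
                return  # idempotent early return, defeated by the restart above
            self.status = Status.closing
            await asyncio.sleep(0)  # stand-in for the real shutdown work
            self.status = Status.closed


    async def main():
        w = ToyWorker()
        first_close = asyncio.create_task(w.close())
        await asyncio.sleep(0)  # close() has set status to "closing" and yielded
        await w                 # an incoming RPC awaits the worker: status flips back to "running"
        # A second close now misses the early return and runs a full close
        # concurrently with the first one; in the real Worker this is where
        # the ThreadPoolExecutor shutdown deadlocks.
        await w.close()
        await first_close


    asyncio.run(main())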

Open questions:

  • Is it justified that we have multiple close attempts?
  • Why did this ever work?
  • ...

cc @crusaderky @graingert

closes #5932

github-actions bot (Contributor) commented Mar 7, 2022

Unit Test Results

16 files (+4)   16 suites (+4)   7h 14m 9s ⏱️ (+1h 35m 40s)
2 744 tests (+4): 2 662 ✔️ passed (+15), 80 💤 skipped (-11), 2 failed (+1)
21 853 runs (+5 444): 20 805 ✔️ passed (+5 183), 1 046 💤 skipped (+263), 2 failed (-1)

For more details on these failures, see this check.

Results for commit 0bbb27f. ± Comparison against base commit 70e1fca.

♻️ This comment has been updated with latest results.

@@ -247,7 +236,9 @@ def set_thread_ident():
self.thread_id = threading.get_ident()

self.io_loop.add_callback(set_thread_ident)
self._startup_lock = asyncio.Lock()
self._started = asyncio.Event()
self.__status = Status.init
fjetter (Member Author) commented:

Reusing the status attribute for workers, nannies, schedulers and the core server is not easily possible without some major changes. There are many different semantics encoded, and I don't believe the server base class should react to any status change of the child. I don't mind renaming this attribute, but I believe there should be different statuses.

@fjetter changed the title from "WIP Do not allow closing workers to be awaited again" to "Do not allow closing workers to be awaited again" on Mar 10, 2022
Comment on lines -291 to -307
raise TimeoutError(
"{} failed to start in {} seconds".format(
type(self).__name__, timeout
)
)
fjetter (Member Author) commented:

Raising this as a timeout error is just wrong

@fjetter (Member Author) commented Mar 10, 2022

Working on this, I'm wondering if we should switch from an inheritance model to a composition model

@fjetter self-assigned this on Mar 10, 2022
@@ -247,7 +236,9 @@ def set_thread_ident():
self.thread_id = threading.get_ident()

self.io_loop.add_callback(set_thread_ident)
self._startup_lock = asyncio.Lock()
self._started = asyncio.Event()
A reviewer (Member) commented:

This event will be bound to an event loop too early on Python 3.8 and 3.9; you'll need to use LateBoundEvent or assign the event in async def _start().
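
A minimal sketch of the second option; ExampleServer is invented for illustration and is not the actual distributed Server class. On Python 3.8/3.9, asyncio.Event() binds to the current event loop when constructed, so the sketch defers creating it until the coroutine is running on the intended loop:

    import asyncio


    class ExampleServer:
        """Minimal sketch only; not the actual distributed Server class."""

        def __init__(self):
            # On Python 3.8/3.9, asyncio.Event() grabs the current event loop
            # at construction time, so creating it here could tie it to the
            # wrong loop. Defer creation until we run inside the right loop.
            self._started = None

        async def _start(self):
            self._started = asyncio.Event()  # bound to the running loop
            self._started.set()


    async def main():
        server = ExampleServer()
        await server._start()
        await server._started.wait()  # returns immediately; the event is set


    asyncio.run(main())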

@fjetter force-pushed the worker_close_deadlock branch from c6c68a9 to 84f9c2d on March 14, 2022 18:33
@fjetter force-pushed the worker_close_deadlock branch 2 times, most recently from 01d73c7 to ed38a9a on March 22, 2022 14:07
Comment on lines +209 to +210
exc = OSError("Unable to contact Actor's worker")
return _Error(exc)
fjetter (Member Author) commented:

For some reason test_failed_worker in test_actor.py hit this edge case now instead of the ValueError above

@fjetter force-pushed the worker_close_deadlock branch from ed149ce to c6c6137 on March 24, 2022 18:28
@fjetter (Member Author) commented Mar 24, 2022

I chose a different path:

  1. No longer await the object itself. We don't want to restart it; we merely want to check that it is already up. I added an event to track this.
  2. I kept the refactorings around start that I started with. To make things less entangled, I introduced start_unsafe, which does not need to check for concurrent startups, etc., and Server.start, which wraps it and ensures that everything is handled properly (see the sketch below this list).
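
A rough sketch of the wrapper pattern described in point 2; ExampleServer and all of its details are invented for illustration, and the real Server.start handles considerably more than this:

    import asyncio
    from enum import Enum


    class Status(Enum):
        init = "init"
        running = "running"
        closing = "closing"
        closed = "closed"


    class ExampleServer:
        """Sketch of the wrapper pattern; not the real distributed Server."""

        def __init__(self):
            self.status = Status.init
            self._startup_lock = asyncio.Lock()
            # Created lazily in start() to avoid the loop-binding issue
            # discussed above for Python 3.8/3.9.
            self._started = None

        async def start_unsafe(self):
            # Subclasses put their actual startup logic here; it never has to
            # worry about concurrent or repeated start calls.
            pass

        async def start(self):
            if self._started is None:
                self._started = asyncio.Event()
            async with self._startup_lock:
                if self.status in (Status.closing, Status.closed):
                    # A closing or closed server must never be restarted
                    # just because something awaited it.
                    raise RuntimeError(f"Cannot start server; status is {self.status.name}")
                if self._started.is_set():
                    return self  # already running, nothing to do
                await self.start_unsafe()
                self.status = Status.running
                self._started.set()
                return self


    async def main():
        server = ExampleServer()
        await server.start()
        await server.start()  # second call is a no-op
        print(server.status)  # Status.running


    asyncio.run(main())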

@fjetter force-pushed the worker_close_deadlock branch 2 times, most recently from e92456f to 74a69a6 on March 25, 2022 11:50
@mrocklin (Member) commented:
Checking in, what is the status here?

@mrocklin (Member) commented:
I hope you don't mind, but I merged in main and pushed. I'm hoping to trigger CI.

@fjetter (Member Author) commented Mar 25, 2022

> Checking in, what is the status here?

Waiting for CI. GitHub Actions is in maintenance right now (https://www.githubstatus.com/); I suppose that is why no builds are scheduled.

@fjetter (Member Author) commented Mar 25, 2022

Never mind, it's moving.

@fjetter (Member Author) commented Mar 25, 2022

test_nanny_death_timeout is failing all over the place due to a bug in ConnectionPool: the pool is improperly raising an asyncio.CancelledError as a CommClosedError. I have already started on a patch for this.
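
A hedged sketch of the pattern at issue; connect_once, open_comm, and the stand-in CommClosedError below are invented for illustration and are not the actual ConnectionPool code. The point is that a cancellation should propagate as asyncio.CancelledError rather than being rewrapped as a comm failure:

    import asyncio


    class CommClosedError(Exception):
        """Stand-in for distributed.comm.CommClosedError in this sketch."""


    async def connect_once(open_comm):
        try:
            return await open_comm()
        except asyncio.CancelledError:
            raise  # do not turn a cancellation into CommClosedError
        except OSError as exc:
            # Only genuine connection failures become CommClosedError.
            raise CommClosedError("connection failed") from exc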

@fjetter force-pushed the worker_close_deadlock branch 3 times, most recently from e4ef0fb to 4eca4bf on April 10, 2022 14:58
@fjetter force-pushed the worker_close_deadlock branch 2 times, most recently from 9d5f20d to 648d090 on April 28, 2022 13:45
mrocklin pushed a commit that referenced this pull request Apr 29, 2022
…6091)

This reinstates #5883, which was reverted in #5961 / #5932.

I could confirm the flakiness of `test_missing_data_errant_worker` after this change and am reasonably certain it is caused by #5910, which causes a closing worker to be restarted such that, even after `Worker.close` is done, the worker still appears to be partially up.

The only reason I can see why this change promotes that behaviour is that if we no longer block the event loop while the threadpool is closing, we open a much larger window for incoming requests to come in and be processed while close is running (see the sketch after this commit message).

Closes #6239
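
A small illustrative sketch of the timing difference described in the commit message above; the function names are invented and neither is the actual Worker.close implementation:

    import asyncio
    from concurrent.futures import ThreadPoolExecutor


    async def close_blocking(executor: ThreadPoolExecutor) -> None:
        # Blocks the event loop: no other coroutine (e.g. an incoming RPC
        # handler) can run until the threadpool has finished shutting down.
        executor.shutdown(wait=True)


    async def close_nonblocking(executor: ThreadPoolExecutor) -> None:
        # Yields to the event loop while the threadpool shuts down, so
        # incoming requests can be processed mid-close; this is the wider
        # race window described above.
        loop = asyncio.get_running_loop()
        await loop.run_in_executor(None, lambda: executor.shutdown(wait=True))


    async def main():
        executor = ThreadPoolExecutor(max_workers=2)
        await close_nonblocking(executor)


    asyncio.run(main())
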
@@ -290,7 +290,6 @@ async def test_failed_worker(c, s, a, b):

assert "actor" in str(info.value).lower()
A reviewer (Member) commented:

How about moving this all into the pytest.raises? E.g. with pytest.raises(ValueError, match=r"Worker holding Actor was lost\. Status: error"):

Comment on lines +2121 to +2122
if not isinstance(exc.__cause__, expected_cause):
raise exc
A reviewer (Member) commented:

Suggested change
- if not isinstance(exc.__cause__, expected_cause):
-     raise exc
+ assert isinstance(exc.__cause__, expected_cause)

fjetter (Member Author) replied:

That's a slightly different thing, isn't it? Re-raising the original exception surfaces the unexpected error and its traceback, whereas the assert would only report an assertion failure. To be honest, I'm not entirely sure which behaviour is best. I suggest keeping it as is for now.

@fjetter merged commit 2286896 into dask:main on May 5, 2022
@jacobtomlinson (Member) commented May 6, 2022

@fjetter since this was merged we are seeing CI failures in dask-kubernetes when calling client.wait_for_workers().

https://github.com/dask/dask-kubernetes/runs/6320080731?check_suite_focus=true

______________________________ test_simplecluster ______________________________

k8s_cluster = <pytest_kind.cluster.KindCluster object at 0x7fb44ad9c070>
kopf_runner = <kopf._kits.runner.KopfRunner object at 0x7fb44afdf280>
gen_cluster = <function gen_cluster.<locals>.cm at 0x7fb44afb11f0>

    @pytest.mark.timeout(180)
    @pytest.mark.asyncio
    async def test_simplecluster(k8s_cluster, kopf_runner, gen_cluster):
        with kopf_runner as runner:
            async with gen_cluster() as cluster_name:
                scheduler_pod_name = "simple-cluster-scheduler"
                worker_pod_name = "simple-cluster-default-worker-group-worker"
                while scheduler_pod_name not in k8s_cluster.kubectl("get", "pods"):
                    await asyncio.sleep(0.1)
                while cluster_name not in k8s_cluster.kubectl("get", "svc"):
                    await asyncio.sleep(0.1)
                while worker_pod_name not in k8s_cluster.kubectl("get", "pods"):
                    await asyncio.sleep(0.1)
    
                with k8s_cluster.port_forward(f"service/{cluster_name}", 8786) as port:
                    async with Client(
                        f"tcp://localhost:{port}", asynchronous=True
                    ) as client:
>                       await client.wait_for_workers(2)

dask_kubernetes/operator/tests/test_operator.py:112: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
/opt/hostedtoolcache/Python/3.9.12/x64/lib/python3.9/site-packages/distributed/client.py:1328: in _wait_for_workers
    while n_workers and running_workers(info) < n_workers:
/opt/hostedtoolcache/Python/3.9.12/x64/lib/python3.9/site-packages/distributed/client.py:1321: in running_workers
    [
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

.0 = <dict_valueiterator object at 0x7fb44886e220>

        [
            ws
            for ws in info["workers"].values()
>           if ws["status"] == Status.running.name
        ]
    )
E   KeyError: 'status'

I'm investigating now; perhaps the version of distributed on the CI and on the Kubernetes cluster is getting out of sync or something. But I wanted to raise this in case you spotted anything obvious related to this change.

@jacobtomlinson (Member) commented:
Looks like it was our fault; this error happens when workers are running 2022.5.0 but the client is using @main. Fixed our CI in dask/dask-kubernetes/pull/461 so that the workers are on @main too. Sorry for the noise.

Development

Successfully merging this pull request may close these issues.

Flaky test_missing_data_errant_worker
5 participants