Do not allow closing workers to be awaited again #5910
Conversation
Unit Test Results: 16 files +4, 16 suites +4, 7h 14m 9s ⏱️ +1h 35m 40s. For more details on these failures, see this check. Results for commit 0bbb27f. ± Comparison against base commit 70e1fca. ♻️ This comment has been updated with latest results.
distributed/core.py (outdated)

@@ -247,7 +236,9 @@ def set_thread_ident():
            self.thread_id = threading.get_ident()

        self.io_loop.add_callback(set_thread_ident)
        self._startup_lock = asyncio.Lock()
        self._started = asyncio.Event()
        self.__status = Status.init
Reusing the `status` attribute for workers, nannies, schedulers and the core server is not easily possible without some major changes. There are many different semantics encoded, and I don't believe the server base class should react to any status change of the child. I don't mind renaming this attribute, but I believe there should be different statuses.
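For illustration only, a minimal sketch of the kind of separation the `self.__status` line in the diff above hints at: the base server keeps its own name-mangled lifecycle state, so subclasses can keep their richer `status` semantics without the base class reacting to every child status change. Class and attribute names here are assumptions, not the actual implementation.

```python
import asyncio
from enum import Enum


class Status(Enum):
    init = "init"
    running = "running"
    closing = "closing"
    closed = "closed"


class Server:
    def __init__(self):
        self._startup_lock = asyncio.Lock()
        # Name-mangled to _Server__status, so a Worker/Nanny/Scheduler
        # subclass can manage its own richer status independently.
        self.__status = Status.init

    @property
    def status(self) -> Status:
        return self.__status

    @status.setter
    def status(self, value: Status) -> None:
        self.__status = value
```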
        raise TimeoutError(
            "{} failed to start in {} seconds".format(
                type(self).__name__, timeout
            )
        )
Raising this as a timeout error is just wrong
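One possible reading of this objection, sketched with assumed attributes (`_started` and `_startup_exc` are hypothetical, not the PR's actual code): only raise a timeout when the wait genuinely timed out, and re-raise a recorded startup failure otherwise.

```python
import asyncio


async def wait_until_started(server, timeout: float) -> None:
    """Wait for a hypothetical `server._started` event without masking a
    real startup failure as a TimeoutError."""
    try:
        await asyncio.wait_for(server._started.wait(), timeout)
    except asyncio.TimeoutError:
        exc = getattr(server, "_startup_exc", None)  # hypothetical attribute
        if exc is not None:
            # The server failed to start; surface the real reason.
            raise exc
        raise asyncio.TimeoutError(
            f"{type(server).__name__} failed to start in {timeout} seconds"
        ) from None
```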
Working on this, I'm wondering if we should switch from an inheritance model to a composition model.
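As a point of reference for the composition idea, a hypothetical sketch of what "Worker has a Server" could look like instead of "Worker is a Server". This is not the actual refactor, just an illustration of the shape.

```python
import asyncio


class Server:
    """Owns only the generic lifecycle: startup lock, start, close."""

    def __init__(self):
        self._startup_lock = asyncio.Lock()

    async def start(self):
        async with self._startup_lock:
            ...  # generic startup (listeners, handlers, ...)

    async def close(self):
        ...  # generic teardown


class Worker:
    """Composition: the worker holds a Server instead of inheriting from it."""

    def __init__(self):
        self.server = Server()

    async def start(self):
        await self.server.start()
        ...  # worker-specific startup

    async def close(self):
        ...  # worker-specific teardown
        await self.server.close()
```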
distributed/core.py (outdated)

@@ -247,7 +236,9 @@ def set_thread_ident():
            self.thread_id = threading.get_ident()

        self.io_loop.add_callback(set_thread_ident)
        self._startup_lock = asyncio.Lock()
        self._started = asyncio.Event()
This event will be bound to an event loop too early on Python 3.8 and 3.9; you'll need to use LateBoundEvent or assign the event in `async def _start(...)`.
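Background on that remark: before Python 3.10, `asyncio.Event()` captured an event loop at construction time, so creating it in `__init__` can tie it to a loop other than the one the server eventually runs on. A minimal sketch of the "assign the event in the async start method" option; class and method names are placeholders.

```python
import asyncio
from typing import Optional


class Server:
    def __init__(self):
        # Do not create the Event here: on Python 3.8/3.9 it would bind to
        # whichever loop happens to be current at construction time.
        self._started: Optional[asyncio.Event] = None

    async def _start(self):
        # Created inside a coroutine, the Event is bound to the running loop.
        if self._started is None:
            self._started = asyncio.Event()
        ...  # actual startup work
        self._started.set()
```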
        exc = OSError("Unable to contact Actor's worker")
        return _Error(exc)
For some reason `test_failed_worker` in `test_actor.py` hit this edge case now instead of the ValueError above.
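If the test can legitimately see either exception depending on timing, one hedged option (not necessarily what the PR ended up doing) is to let `pytest.raises` accept both types:

```python
import pytest


async def assert_actor_failure(awaitable):
    # Hypothetical helper: accept either the ValueError from the normal
    # failure path or the OSError("Unable to contact Actor's worker")
    # edge case; both messages mention the actor.
    with pytest.raises((ValueError, OSError), match="[Aa]ctor"):
        await awaitable
```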
I chose a different path.
Checking in, what is the status here?
I hope you don't mind, but I merged in main and pushed. I'm hoping to trigger CI.
Waiting for CI. GitHub Actions is in maintenance right now (https://www.githubstatus.com/); I suppose this is why there are no builds scheduled.
Never mind, it's moving.
…6091) This reinstates #5883, which was reverted in #5961 / #5932. I could confirm the flakiness of `test_missing_data_errant_worker` after this change and am reasonably certain it is caused by #5910, which causes a closing worker to be restarted such that, even after `Worker.close` is done, the worker still appears to be partially up. The only reason I can see why this change promotes this behaviour is that if we no longer block the event loop while the threadpool is closing, this opens a much larger window for incoming requests to come in and be processed while close is running. Closes #6239
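For context on "no longer block the event loop while the threadpool is closing", a generic sketch of the difference, not the actual distributed code: shutting the executor down in a worker thread keeps the loop responsive, which is exactly what widens the window for requests to be processed while close is still running.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor


async def close_executor_without_blocking(executor: ThreadPoolExecutor) -> None:
    # Blocking variant: calling executor.shutdown(wait=True) directly in a
    # coroutine stalls the event loop until every thread has finished.
    # Offloading it keeps the loop running, so other handlers can still be
    # scheduled while the shutdown is in progress.
    loop = asyncio.get_running_loop()
    await loop.run_in_executor(None, lambda: executor.shutdown(wait=True))
```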
@@ -290,7 +290,6 @@ async def test_failed_worker(c, s, a, b):

    assert "actor" in str(info.value).lower()
How about moving this all into the pytest raises? `with pytest.raises(ValueError, match=r"Worker holding Actor was lost\. Status: error"):`
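Spelled out as a self-contained sketch (the raising function is a stand-in for the code under test; the message is the one quoted above):

```python
import pytest


def lose_actor_worker():
    # Stand-in for the failure path exercised by test_failed_worker.
    raise ValueError("Worker holding Actor was lost. Status: error")


def test_failed_worker_message():
    # Folds the separate `assert "actor" in ...` check into the raises block.
    with pytest.raises(
        ValueError, match=r"Worker holding Actor was lost\. Status: error"
    ):
        lose_actor_worker()
```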
    if not isinstance(exc.__cause__, expected_cause):
        raise exc
Suggested change:
-    if not isinstance(exc.__cause__, expected_cause):
-        raise exc
+    assert isinstance(exc.__cause__, expected_cause)
That's a slightly different thing, isn't it? To be honest, I'm not entirely sure which behavior is best. I suggest keeping it as is for now.
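The difference being discussed, in a minimal hypothetical form: re-raising propagates the original exception with its traceback and cause chain, while the suggested assert converts a mismatch into an AssertionError and never raises the original at all.

```python
def check_cause_by_reraise(exc: BaseException, expected_cause: type) -> None:
    # Current behaviour: an unexpected cause lets the *original* exception
    # propagate, traceback and __cause__ chain intact.
    if not isinstance(exc.__cause__, expected_cause):
        raise exc


def check_cause_by_assert(exc: BaseException, expected_cause: type) -> None:
    # Suggested behaviour: a mismatch becomes an AssertionError about the
    # cause; the original exception itself is never raised.
    assert isinstance(exc.__cause__, expected_cause)
```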
@fjetter since this was merged we are seeing CI failures in https://github.com/dask/dask-kubernetes/runs/6320080731?check_suite_focus=true:

______________________________ test_simplecluster ______________________________
k8s_cluster = <pytest_kind.cluster.KindCluster object at 0x7fb44ad9c070>
kopf_runner = <kopf._kits.runner.KopfRunner object at 0x7fb44afdf280>
gen_cluster = <function gen_cluster.<locals>.cm at 0x7fb44afb11f0>
@pytest.mark.timeout(180)
@pytest.mark.asyncio
async def test_simplecluster(k8s_cluster, kopf_runner, gen_cluster):
with kopf_runner as runner:
async with gen_cluster() as cluster_name:
scheduler_pod_name = "simple-cluster-scheduler"
worker_pod_name = "simple-cluster-default-worker-group-worker"
while scheduler_pod_name not in k8s_cluster.kubectl("get", "pods"):
await asyncio.sleep(0.1)
while cluster_name not in k8s_cluster.kubectl("get", "svc"):
await asyncio.sleep(0.1)
while worker_pod_name not in k8s_cluster.kubectl("get", "pods"):
await asyncio.sleep(0.1)
with k8s_cluster.port_forward(f"service/{cluster_name}", 8786) as port:
async with Client(
f"tcp://localhost:{port}", asynchronous=True
) as client:
> await client.wait_for_workers(2)
dask_kubernetes/operator/tests/test_operator.py:112:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
/opt/hostedtoolcache/Python/3.9.12/x64/lib/python3.9/site-packages/distributed/client.py:1328: in _wait_for_workers
while n_workers and running_workers(info) < n_workers:
/opt/hostedtoolcache/Python/3.9.12/x64/lib/python3.9/site-packages/distributed/client.py:1321: in running_workers
[
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
.0 = <dict_valueiterator object at 0x7fb44886e220>
[
ws
for ws in info["workers"].values()
> if ws["status"] == Status.running.name
]
)
E       KeyError: 'status'

I'm investigating now; perhaps the version of distributed between the CI and the kubernetes cluster is getting out of sync or something. But I wanted to raise this in case you spotted anything obvious related to this change.
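If version skew between the client, scheduler and workers is the suspicion, `Client.get_versions(check=True)` can confirm it directly; a sketch, with a placeholder address:

```python
from distributed import Client


async def check_cluster_versions(address: str = "tcp://localhost:8786") -> None:
    async with Client(address, asynchronous=True) as client:
        # Raises if the client, scheduler and workers report mismatched
        # package versions (including distributed itself).
        await client.get_versions(check=True)
```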
Looks like it was our fault, this error happens when workers are running |
This fixes some of our deadlock situations causing tests to time out while closing a worker.
This is not a ready fix but rather a preliminary one to communicate the issue.
What's happening:

- When the worker starts closing, its status is set to closing. This will ensure Worker.close is idempotent and lets every other caller await finished, see distributed/distributed/worker.py, lines 1647 to 1648 in 39c5e88 (await self …).
- await self will attempt to start the worker again, since closing is not part of Status.ANY_RUNNING, and the status ends up back at running, such that a second close attempt will actually try to run a close concurrently. The second call is then what deadlocks while trying to shut down the threadpool executor. (A minimal sketch of the intended idempotent-close behaviour follows at the end of this description.)

Open questions:
cc @crusaderky @graingert
closes #5932
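As referenced above, a minimal sketch of the intended idempotent-close pattern: the first caller runs the teardown, later callers only wait for it, and nothing may flip the status back to running once closing has begun. Names are illustrative; this is not the actual `Worker.close`.

```python
import asyncio
from enum import Enum


class Status(Enum):
    running = "running"
    closing = "closing"
    closed = "closed"


class Server:
    def __init__(self):
        self.status = Status.running
        # Event created here only for brevity; see the loop-binding caveat
        # for Python 3.8/3.9 discussed earlier in the thread.
        self._close_finished = asyncio.Event()

    async def close(self):
        # Second and later callers do not run close again; they just wait
        # for the first call to finish.
        if self.status in (Status.closing, Status.closed):
            await self._close_finished.wait()
            return
        self.status = Status.closing
        try:
            ...  # real teardown: stop comms, shut down executors, etc.
        finally:
            self.status = Status.closed
            self._close_finished.set()
```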