Do not close worker on comm error in heartbeat #6492
Conversation
We might actually be even better off sending the heartbeats over the stream connection as well.
# Scheduler is gone. Respect distributed.comm.timeouts.connect
if "Timed out trying to connect" in str(e):
    logger.info("Timed out while trying to connect during heartbeat")
    await self.close()
This seems like it could be important. Though I would assume that in this case, the batched comm would be closed too.
The only case I can think of is if the scheduler is out of memory and the process is flailing (#6177 / #6110, but on the scheduler side instead). In that case, the batched comm would not be closed, but maybe new RPC connections to the scheduler would fail? Regardless, I don't know if that would be justification to shut down the worker. I'm okay with the batched comm being the source of truth; I prefer having one source of truth.
FWIW I also think we should consider sending heartbeats over the stream. I don't see a reason why we'd want to use pooled connections, but that's a change for another time.
The heartbeat is a request-response. Batched streams don't support responses; they're fire-and-forget. In principle, though, it maybe makes sense for the heartbeat to go over the stream (I'd want to think about this more). It doesn't seem like we're doing that much of importance with the response though: see distributed/worker.py, lines 1243 to 1255 at 6d85a85.
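To make the distinction concrete, here is a small, self-contained sketch (toy names, not the actual distributed API) of the two patterns being discussed: an RPC-style heartbeat that awaits a reply it can act on, versus a batched-stream send that has no reply at all.

import asyncio


class FakeScheduler:
    async def handle(self, msg):
        await asyncio.sleep(0)  # stand-in for the network round trip
        return {"status": "OK", "heartbeat-interval": 0.5}


async def rpc_heartbeat(scheduler):
    # Request-response: we await a reply and can react to it (or to a failure).
    return await scheduler.handle({"op": "heartbeat_worker"})


def stream_heartbeat(batched_buffer):
    # Fire-and-forget: the message is queued on the batched stream; no reply exists.
    batched_buffer.append({"op": "heartbeat_worker"})


async def main():
    print(await rpc_heartbeat(FakeScheduler()))  # a reply we could act on
    buffer = []
    stream_heartbeat(buffer)
    print(buffer)  # queued message only; nothing to await


asyncio.run(main())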
Historically we used a separate comm because of worker reconnect. If the batched comm broke we wanted an independent system to reinitialize the worker. Totally fine reversing that decision today.
Even when worker reconnect is re-added, just having one connection to negotiate seems a bit easier to me. But again, this is out of scope for this PR. I think the changes here seem reasonable. Though tests are failing, so perhaps not?
Yes, I am aware, but there is little in here that I would actually consider a "response". The response from the scheduler side is entirely independent from the data the worker submits. Most information we submit in the heartbeat (in both directions) is just in there because we're sending it frequently, not because it is actually related to the heartbeat. As I said, that's a change for another day, iff we even do this. Right now, the value is marginal.
Is there a reason this PR has not been finished?
We should not close a worker if the heartbeat fails. The heartbeat establishes a dedicated connection, which can fail for a multitude of reasons.
The Worker.batched_stream should be the single source of truth for inferring whether the scheduler is still alive. If this connection breaks, we will close the worker in Worker.handle_scheduler.
If an unexpected exception occurs, the behavior is similar to a fail_hard, which is much stricter than before but the proper behavior imo.
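For illustration, a minimal sketch of the control flow described above, using hypothetical helper names (send_heartbeat, on_unexpected_error) rather than the actual diff: comm errors during the heartbeat are logged and tolerated, the batched stream (handled in Worker.handle_scheduler) stays the only trigger for closing the worker, and unexpected errors escalate in a fail_hard-like way.

import asyncio
import logging

logger = logging.getLogger("heartbeat-sketch")


class CommClosedError(Exception):
    """Stand-in for distributed's CommClosedError."""


async def heartbeat(send_heartbeat, on_unexpected_error):
    try:
        # Dedicated RPC connection; may fail transiently for many reasons.
        await send_heartbeat()
    except (OSError, CommClosedError):
        # Do NOT close the worker here: a failed heartbeat connection is not
        # proof that the scheduler is gone. The batched stream decides that.
        logger.warning("Heartbeat to scheduler failed; will retry", exc_info=True)
    except Exception:
        # Unexpected errors are handled strictly, akin to fail_hard.
        await on_unexpected_error()
        raise


async def main():
    async def flaky_send():
        raise CommClosedError("connection reset during heartbeat")

    async def shut_down():
        print("closing worker")

    # Logs a warning; the worker stays up and the next heartbeat will retry.
    await heartbeat(flaky_send, shut_down)


asyncio.run(main())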