
Do not close worker on comm error in heartbeat #6492

Closed

Conversation

@fjetter (Member) commented Jun 2, 2022

We should not close a worker if the heartbeat fails. The heartbeat establishes a dedicated connection, which can fail for a multitude of reasons.

Worker.batched_stream should be the single source of truth for inferring whether the scheduler is still alive. If that connection breaks, we close the worker in Worker.handle_scheduler.

If an unexpected exception occurs, the behavior is similar to fail_hard, which is much stricter than before but the proper behavior imo.
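For illustration, a minimal sketch of the behavior described above, not the actual diff: the `heartbeat_worker` call and its arguments are simplified, and the fail_hard path is only approximated.

```python
import logging

from distributed.comm.core import CommClosedError

logger = logging.getLogger(__name__)


async def heartbeat(worker):
    """Send one heartbeat; tolerate comm errors, escalate anything else."""
    try:
        # Arguments elided; the real call also sends metrics, resources, etc.
        response = await worker.scheduler.heartbeat_worker(address=worker.address)
    except (CommClosedError, OSError):
        # The heartbeat opens its own connection, which can fail for many
        # reasons that say nothing about the scheduler's health. Log it and
        # let the next periodic tick retry; Worker.handle_scheduler owns the
        # decision to close once the batched stream breaks.
        logger.warning("Heartbeat to the scheduler failed; will retry")
        return None
    except Exception:
        # Anything unexpected behaves like fail_hard: log, close, re-raise.
        logger.exception("Unexpected exception in heartbeat; closing worker")
        await worker.close()
        raise
    return response
```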

@fjetter requested a review from gjoseph92 · June 2, 2022 11:41
@fjetter (Member, Author) commented Jun 2, 2022

We might actually be even better off sending the heartbeats over the stream connection as well.

@github-actions (bot) commented Jun 2, 2022

Unit Test Results

15 files ±0   15 suites ±0   6h 40m 23s ⏱️ +14m 1s
2 832 tests ±0:  2 750 passed ✔️ +4,  81 skipped 💤 ±0,  1 failed −3
20 986 runs ±0:  20 035 passed ✔️ +2,  943 skipped 💤 −3,  8 failed +2

For more details on these failures, see this check.

Results for commit a409e1d. ± Comparison against base commit 7e49d88.

Comment on lines -1257 to -1260
```python
# Scheduler is gone. Respect distributed.comm.timeouts.connect
if "Timed out trying to connect" in str(e):
    logger.info("Timed out while trying to connect during heartbeat")
    await self.close()
```
A Collaborator commented:

This seems like it could be important, though I would assume that in this case the batched comm would be closed too.

The only case I can think of is if the scheduler is out of memory and the process is flailing (#6177 / #6110, but on the scheduler side instead). In that case, the batched comm would not be closed, but maybe new RPC connections to the scheduler would fail? Regardless, I don't know if that would be justification to shut down the worker. I'm okay with the batched comm being the source of truth; I prefer having one source of truth.
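As a rough sketch of that "one source of truth" idea (a hypothetical simplification; the real Worker.handle_scheduler is more involved):

```python
from distributed.comm.core import CommClosedError


async def handle_scheduler(worker, comm):
    """Sketch: the worker shuts down only when the batched stream breaks."""
    try:
        # Runs until the stream connection to the scheduler closes.
        await worker.handle_stream(comm)
    except CommClosedError:
        pass
    finally:
        # The stream is gone: this, and only this, triggers worker shutdown.
        await worker.close(report=False)
```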

@fjetter (Member, Author) commented Jun 3, 2022

FWIW, I also think we should consider sending heartbeats over the stream. I don't see a reason why we'd want to use pooled connections, but that's a change for another time.

@gjoseph92 (Collaborator) commented:
The heartbeat is a request-response. Batched streams don't support responses; they're fire-and-forget. In principle, though, it might make sense to send the heartbeat over the stream (I'd want to think about this more). It doesn't seem like we're doing much of importance with the response, though:

```python
if response["status"] == "missing":
    # Scheduler thought we left. Reconnection is not supported, so just shut down.
    logger.error(
        f"Scheduler was unaware of this worker {self.address!r}. Shutting down."
    )
    await self.close()
    return
self.scheduler_delay = response["time"] - middle
self.periodic_callbacks["heartbeat"].callback_time = (
    response["heartbeat-interval"] * 1000
)
self.bandwidth_workers.clear()
```
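To make that distinction concrete, a simplified contrast of the two channels discussed here (names from distributed, usage abbreviated):

```python
async def illustrate_channels(worker, metrics):
    # Fire-and-forget: BatchedSend.send only queues the message; nothing
    # can come back on this channel.
    worker.batched_stream.send({"op": "heartbeat", "metrics": metrics})

    # Request-response: the pooled RPC call returns the scheduler's reply,
    # which is what the snippet above consumes.
    return await worker.scheduler.heartbeat_worker(address=worker.address)
```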

@mrocklin (Member) commented Jun 3, 2022 via email

@gjoseph92 (Collaborator) commented:
Even when worker reconnect is re-added, just having one connection to negotiate seems a bit easier to me.

But again, this is out of scope for this PR. I think the changes here seem reasonable. Though tests are failing, so perhaps not?

@fjetter (Member, Author) commented Jun 10, 2022

> The heartbeat is a request-response. Batched streams don't support responses; they're fire-and-forget. In principle, though, it might make sense to send the heartbeat over the stream (I'd want to think about this more). It doesn't seem like we're doing much of importance with the response, though:

Yes, I am aware, but there is little in here that I would actually consider a "response". The reply from the scheduler's side is entirely independent of the data the worker submits. Most of the information we send in the heartbeat (in both directions) is only there because we're sending it frequently, not because it is actually related to the heartbeat. The exceptions are the latency measurement and the missing signal.

As I said, that's a change for another day, if we even do it. Right now, the value is marginal.
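For reference, the latency measurement mentioned here works roughly as follows (a sketch assuming the `middle` convention visible in the snippet earlier in this thread):

```python
import time


async def measure_scheduler_delay(worker):
    """Estimate clock skew between worker and scheduler via one heartbeat."""
    start = time.time()
    response = await worker.scheduler.heartbeat_worker(address=worker.address)
    end = time.time()
    # Best estimate of when the scheduler produced its timestamp.
    middle = (start + end) / 2
    # response["time"] is the scheduler's clock; the difference is the skew
    # that Worker.scheduler_delay tracks.
    return response["time"] - middle
```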

@hendrikmakait (Member) commented:
Is there a reason this PR has not been finished?
