Do not close worker on comm error in heartbeat #6492
Conversation
We might actually be even better off sending the heartbeats over the stream connection as well.
# Scheduler is gone. Respect distributed.comm.timeouts.connect
if "Timed out trying to connect" in str(e):
    logger.info("Timed out while trying to connect during heartbeat")
    await self.close()
This seems like it could be important. Though I would assume that in this case, the batched comm would be closed too.
The only case I can think of is if the scheduler is out of memory and the process is flailing (#6177 / #6110, but on the scheduler side instead). In that case, the batched comm would not be closed, but maybe new RPC connections to the scheduler would fail? Regardless, I don't know if that would be justification to shut down the worker. I'm okay with the batched comm being the source of truth; I prefer having one source of truth.
FWIW I also think we should consider sending heartbeats over the stream. I don't see a reason why we'd want to use pooled connections, but that's a change for another time.
The heartbeat is a request-response. Batched streams don't support responses; they're fire-and-forget. In principle, though, it maybe makes sense for the heartbeat to go over the stream (I'd want to think about this more). It doesn't seem like we're doing that much of importance with the response though: see distributed/worker.py, lines 1243 to 1255 at 6d85a85.
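To make the distinction concrete, here is a small, self-contained sketch (toy names, not the actual distributed API) of the two patterns being discussed: an RPC-style heartbeat that awaits a reply it can act on, versus a batched-stream send that has no reply at all.

import asyncio


class FakeScheduler:
    async def handle(self, msg):
        await asyncio.sleep(0)  # stand-in for the network round trip
        return {"status": "OK", "heartbeat-interval": 0.5}


async def rpc_heartbeat(scheduler):
    # Request-response: we await a reply and can react to it (or to a failure).
    return await scheduler.handle({"op": "heartbeat_worker"})


def stream_heartbeat(batched_buffer):
    # Fire-and-forget: the message is queued on the batched stream; no reply exists.
    batched_buffer.append({"op": "heartbeat_worker"})


async def main():
    print(await rpc_heartbeat(FakeScheduler()))  # a reply we could act on
    buffer = []
    stream_heartbeat(buffer)
    print(buffer)  # queued message only; nothing to await


asyncio.run(main())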
Historically we used a separate comm because of worker reconnect. If the batched comm broke we wanted an independent system to reinitialize the worker. Totally fine reversing that decision today.
Even when worker reconnect is re-added, just having one connection to negotiate seems a bit easier to me. But again, this is out of scope for this PR. I think the changes here seem reasonable. Though tests are failing, so perhaps not?
Yes, I am aware, but there is little in here that I would actually consider a "response". The response from the scheduler side is entirely independent from the data the worker submits. Most information we submit in the heartbeat (in both directions) is just in there because we're sending it frequently, not because it is actually related to the heartbeat. As I said, that's a change for another day, iff we even do this. Right now, the value is marginal.
Is there a reason this PR has not been finished?
We should not close a worker if the heartbeat fails. The heartbeat establishes a dedicated connection, which can fail for a multitude of reasons.
The Worker.batched_stream should be the single source of truth for inferring whether the scheduler is still alive. If this connection breaks, we will close the worker in Worker.handle_scheduler.
If an unexpected exception occurs, the behavior is similar to a fail_hard, which is much stricter than before but the proper behavior imo.
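For illustration, a minimal sketch of the control flow described above, using hypothetical helper names (send_heartbeat, on_unexpected_error) rather than the actual diff: comm errors during the heartbeat are logged and tolerated, the batched stream (handled in Worker.handle_scheduler) stays the only trigger for closing the worker, and unexpected errors escalate in a fail_hard-like way.

import asyncio
import logging

logger = logging.getLogger("heartbeat-sketch")


class CommClosedError(Exception):
    """Stand-in for distributed's CommClosedError."""


async def heartbeat(send_heartbeat, on_unexpected_error):
    try:
        # Dedicated RPC connection; may fail transiently for many reasons.
        await send_heartbeat()
    except (OSError, CommClosedError):
        # Do NOT close the worker here: a failed heartbeat connection is not
        # proof that the scheduler is gone. The batched stream decides that.
        logger.warning("Heartbeat to scheduler failed; will retry", exc_info=True)
    except Exception:
        # Unexpected errors are handled strictly, akin to fail_hard.
        await on_unexpected_error()
        raise


async def main():
    async def flaky_send():
        raise CommClosedError("connection reset during heartbeat")

    async def shut_down():
        print("closing worker")

    # Logs a warning; the worker stays up and the next heartbeat will retry.
    await heartbeat(flaky_send, shut_down)


asyncio.run(main())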