Flaky test_worker_who_has_clears_after_failed_connection
#6831
Labels: flaky test
Intermittent failures on CI.
`test_worker_who_has_clears_after_failed_connection` failed in #6822, but I don't think it's related to the changes there. (Or if it is, the changes shouldn't have broken it.) I think it's indicating an actual state machine bug.

The test seems to be testing (judging by its name) that a worker's `has_what` / `who_has` entries for a peer are cleared after a failed connection to that peer.
But the way the test is written, it's possible that the fetch (or some of it) manages to complete before worker N dies. It's just a race condition between an `os._exit` and the 0.1s delay on a `SlowTransmitData`.

When I looked at the cluster dump for this test, I noticed some of the data had already been transferred by the time `os._exit` was kicked off. If some of the data is successfully fetched, then `_handle_gather_dep_success` won't be removing the keys from `self.has_what` and `ts.who_has`:

`distributed/worker_state_machine.py`, lines 2770 to 2785 in `4af2d0a`
Notice that `self.has_what[ev.worker].discard(ts.key)` only happens in the `else` branch, when the key wasn't successfully received.

@crusaderky @fjetter what's the intent of this test, and how should we update it?
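To make the failure mode concrete, here is a minimal sketch of the pattern described above. The function and variable names are hypothetical, not the actual `distributed` code: keys that arrive take the success path, which never touches `has_what`, so a partially successful fetch from a now-dead worker leaves stale entries behind.

```python
from collections import defaultdict

# Hypothetical, simplified model of the handler discussed above:
# received keys take the "success" path, which never touches has_what;
# only keys that were NOT received hit the else branch and get discarded.
def handle_gather_dep_success(has_what, who_has, worker, requested, received):
    for key in requested:
        if key in received:
            # key transitions to memory; has_what[worker] is NOT touched
            pass
        else:
            # only unreceived keys are purged from the peer's entry
            has_what[worker].discard(key)
            who_has[key].discard(worker)

has_what = defaultdict(set, {"w1": {"x", "y"}})
who_has = defaultdict(set, {"x": {"w1"}, "y": {"w1"}})

# "x" arrives before w1 dies; "y" does not
handle_gather_dep_success(has_what, who_has, "w1", {"x", "y"}, {"x"})

assert has_what["w1"] == {"x"}  # stale: w1 is dead but still listed for "x"
assert who_has["y"] == set()
```

This is exactly the asymmetry the snippet shows: the `discard` lives only in the failure branch, so whether the stale `"x"` entry ever gets cleaned up depends on some other path (e.g. `_purge_state`) running.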
Some thoughts on updating it:

- Make the test deterministic: block the transfer on an event (a `BlockedGetData` instead of a delay-based `SlowTransmitData`). Also disable stealing.
- Should something be removing the keys from `self.has_what[ev.worker]` (and `ts.who_has`?) regardless of whether they're successfully fetched? I don't think anything is doing this, and they'll just sit around in `has_what` forever right now? EDIT: `_purge_state` does, so this seems okay as-is.

Failed CI run: https://github.com/dask/distributed/runs/7666946504?check_suite_focus=true#step:11:1316
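The delay-based race is the kind of thing event-based blocking removes. A toy illustration of the idea, using a plain `threading.Event` rather than the actual `BlockedGetData` fixture: the transfer cannot proceed until the test explicitly allows it, so "the worker dies before any data arrives" stops being a timing accident and becomes a guaranteed ordering.

```python
import threading

# Toy stand-in for an event-blocked get_data: the server refuses to
# send anything until the test sets `proceed` (cf. BlockedGetData),
# instead of sleeping 0.1s and hoping the kill wins the race.
proceed = threading.Event()
in_get_data = threading.Event()
sent = []

def get_data(keys):
    in_get_data.set()   # the test can now observe the transfer has started
    proceed.wait()      # block until the test says go
    sent.extend(keys)

t = threading.Thread(target=get_data, args=(["x", "y"],))
t.start()
in_get_data.wait()      # deterministic: transfer has started...
assert sent == []       # ...but nothing has been sent yet; this is the
                        # window in which to kill the worker
proceed.set()
t.join()
assert sent == ["x", "y"]
```

With this shape, the test would kill the worker inside the guaranteed window instead of racing a sleep, and the "no data fetched" precondition holds on every run.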