Refactoring gather_dep / Remove missing data message #6544
Co-authored-by: Gabe Joseph <gjoseph92@gmail.com>
@gen_cluster(client=True)
async def test_worker_find_missing(c, s, a, b):
    fut = c.submit(inc, 1, workers=[a.address])
    await fut
    # We do not want to use proper API since it would ensure that the cluster is
    # informed properly
    del a.data[fut.key]
    del a.tasks[fut.key]

    # Actually no worker has the data; the scheduler is supposed to reschedule
    assert await c.submit(inc, fut, workers=[b.address]) == 3
This test is a bit nonsensical. I don't think we should be allowed to simply mess with the state and expect the system to recover.
assert ts in self.data_needed
assert ts.who_has
As motivated in the description, we now allow tasks to be in fetch with an empty who_has, and we rely on the transition system to pick this up and transition the task. This allows us to write simpler recommendations.
assert ts in self.data_needed is still valid, as validate_task_state is only called at the end of a transitions cycle. It's only in transition_fetch_missing, which is in the middle of a cycle, that it is not.
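The two comments above can be illustrated with a minimal toy model. This is only a sketch and not distributed's actual WorkerState: ToyTask, ToyWorkerState, and validate_calls are hypothetical names. It assumes that a task entering fetch with an empty who_has is simply recommended to missing by the transition itself, and that invariants are checked only once the whole cycle has settled.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)  # eq=False keeps identity hashing, so tasks can live in sets
class ToyTask:
    key: str
    state: str = "released"
    who_has: set = field(default_factory=set)


class ToyWorkerState:
    """Hypothetical toy model; not distributed's WorkerState."""

    def __init__(self):
        self.tasks = {}
        self.data_needed = set()
        self.validate_calls = 0

    def _transition(self, ts, finish):
        """Apply one transition and return follow-up recommendations."""
        if finish == "fetch":
            ts.state = "fetch"
            self.data_needed.add(ts)
            # Mid-cycle, a task may sit in fetch with an empty who_has.
            # The transition itself recommends "missing", so callers do not
            # have to special-case this when writing recommendations.
            return {} if ts.who_has else {ts: "missing"}
        if finish == "missing":
            self.data_needed.discard(ts)
            ts.state = "missing"
            return {}
        raise ValueError(f"unsupported toy transition to {finish!r}")

    def transitions(self, recommendations):
        """Process recommendations until none are left, then validate once."""
        remaining = dict(recommendations)
        while remaining:
            ts, finish = remaining.popitem()
            remaining.update(self._transition(ts, finish))
        self.validate()

    def validate(self):
        """Invariants only need to hold at the end of a cycle."""
        self.validate_calls += 1
        for ts in self.tasks.values():
            if ts.state == "fetch":
                assert ts in self.data_needed
                assert ts.who_has
            elif ts.state == "missing":
                assert ts not in self.data_needed
                assert not ts.who_has


state = ToyWorkerState()
x = ToyTask("x")  # no known holder
y = ToyTask("y", who_has={"tcp://worker-1"})  # hypothetical address
state.tasks = {"x": x, "y": y}
state.transitions({x: "fetch", y: "fetch"})
```

After the cycle, x has ended up in missing and y remains in fetch, even though x briefly violated the "fetch implies who_has" invariant in the middle of the cycle.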
block_get_data.set()

await wait_for_state("x", "missing", a)
# await wait_for_state("y", "missing", a)
y is not reliably transitioned to missing since it is stuck in fetch, depending on how fast the RefreshWhoHas comes in.
This could be restored if we transitioned to missing immediately, but I have already pointed out a couple of times that I don't think this is a requirement, and by not doing so we get much simpler recommendation code.
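The timing dependence described here can be sketched with a small polling helper. This is a hypothetical stand-in, not the wait_for_state helper used in the test above: poll_for_state, demo, and the states dict are assumptions made for illustration. It shows why a test can wait reliably on x while an equivalent wait on y may or may not succeed, depending on when the refresh arrives.

```python
import asyncio


async def poll_for_state(get_state, expected, *, timeout=1.0, interval=0.01):
    """Return True once get_state() == expected, or False after the timeout.

    Hypothetical helper for illustration only.
    """
    loop = asyncio.get_running_loop()
    deadline = loop.time() + timeout
    while loop.time() < deadline:
        if get_state() == expected:
            return True
        await asyncio.sleep(interval)
    return get_state() == expected


async def demo():
    # Both tasks start in fetch; only x is moved to missing by the "refresh".
    states = {"x": "fetch", "y": "fetch"}

    async def refresh_x():
        await asyncio.sleep(0.05)
        states["x"] = "missing"

    task = asyncio.create_task(refresh_x())
    x_missing = await poll_for_state(lambda: states["x"], "missing", timeout=2.0)
    # y stays in fetch here, so a hard wait on "missing" would hang or time out.
    y_missing = await poll_for_state(lambda: states["y"], "missing", timeout=0.1)
    await task
    return x_missing, y_missing
```

Under these assumptions, asserting only on x keeps the test deterministic, which matches the choice to comment out the wait on y.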
Unit Test Results
See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.
15 files ±0, 15 suites ±0, 6h 14m 12s (-13m 20s)
For more details on these failures, see this check.
Results for commit 43fc518. ± Comparison against base commit 879fb89.
del ts

for ts in self._gather_dep_done_common(ev):
    ts.who_has.discard(worker)
This will never happen due to line 3444
else:
    ts.who_has.discard(ev.worker)
    self.has_what[ev.worker].discard(ts.key)
    recommendations[ts] = "fetch"
Suggested change:
- recommendations[ts] = "fetch"
+ if self.validate:
+     assert ts.state != "fetch"
+     assert ts not in self.data_needed_per_worker[ev.worker]
+ recommendations[ts] = "fetch"
Otherwise _select_keys_for_gather will fail later.
This builds on top of #6388 with a few significant changes.
All changes are in the latest commit on top of the PR for review.
_gather_dep_done_common, which is basically solely to reset some state and set the ts.done attribute. All other shared usage of code for the result parsing of gather_dep has been a major source of inconsistencies and we should avoid it.
Scheduler.handle_missing_data? #6445, since the missing-data message is not required. I realize this is significant scope creep relative to the actual refactor. We could likely incorporate this again into the event handler upon success-but-missing, but I would not add it to the network-failure result since this can otherwise cause severe race conditions; see also Add back Worker.transition_fetch_missing #6112 (comment).