RFC Allow a loop cycle after every message in the batch #5443

Open
wants to merge 1 commit into main from async_sleep_zero_handle_stream

Conversation

fjetter
Member

@fjetter fjetter commented Oct 20, 2021

I stumbled upon this and again realized how incredibly complex our system is. For those not familiar: asyncio treats `sleep(0)` as a sentinel and yields for exactly one loop iteration.
The indentation of this sleep is interesting: as it stands on master, an event loop iteration is only allowed after an entire batch of messages has been worked off. If I move the indentation and thereby allow a loop iteration after every message, a lot of our tests break. I don't think any of our tests should be so sensitive that they react to such a subtle change.
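For readers unfamiliar with the sentinel behaviour, here is a minimal self-contained sketch (not distributed code) of what `sleep(0)` does: each `await asyncio.sleep(0)` hands control back to the event loop for exactly one iteration, so two concurrent tasks interleave step by step:

```python
import asyncio

order = []

async def worker(name, n):
    for _ in range(n):
        order.append(name)
        # asyncio special-cases sleep(0): instead of scheduling a
        # timer it bare-yields, skipping exactly one loop iteration
        await asyncio.sleep(0)

async def main():
    await asyncio.gather(worker("a", 3), worker("b", 3))

asyncio.run(main())
print(order)  # ['a', 'b', 'a', 'b', 'a', 'b'] -- the tasks interleave
```

Whether the sleep sits inside or outside the message loop therefore decides how often other coroutines get a chance to run between messages.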

I'm wondering whether looking into this would help us with overall CI stability.

That's also interesting since we treat this quite differently for coroutine functions and ordinary functions (see a few lines above).

FWIW, I tried aligning the behaviour of these two cases a while ago in #4734 but failed to stabilize the branch; back then I ran into many of our flaky tests while cleaning this up. In that branch I removed the sleep entirely.

cc @crusaderky @jcrist

@crusaderky
Collaborator

This is good, as it surfaces a lot of cruft.

e.g. one of the tests that failed, test_correct_bad_time_estimate,

    while not any(f.key in s.tasks for f in futures):
        await asyncio.sleep(0.001)
    assert not any(s.tasks[f.key] in steal.key_stealable for f in futures)

is currently waiting for at least one task to appear in the scheduler and then expecting all tasks to be there.
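That race can be reproduced in miniature without distributed at all. A hedged sketch (`register`, `arrived`, and the keys are made-up names, not the real scheduler API): polling until *all* keys have arrived, rather than *any*, is what closes the gap between the wait and the assertion:

```python
import asyncio

async def main():
    arrived = set()

    async def register(key, delay):
        # stand-in for a task key showing up in the scheduler's state
        await asyncio.sleep(delay)
        arrived.add(key)

    keys = ["x", "y", "z"]
    tasks = [
        asyncio.ensure_future(register(k, 0.005 * (i + 1)))
        for i, k in enumerate(keys)
    ]

    # waiting for *any* key (as the test does) can return while the
    # others are still missing; waiting for *all* keys closes the race
    while not all(k in arrived for k in keys):
        await asyncio.sleep(0.001)

    await asyncio.gather(*tasks)
    return arrived

arrived = asyncio.run(main())
print(sorted(arrived))  # ['x', 'y', 'z']
```

With the any-key wait, the assertion only passed on master because the batched message handling happened to deliver all tasks before the loop got a chance to run again.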

@github-actions
Contributor

github-actions bot commented Feb 4, 2022

Unit Test Results

10 files (-2) · 10 suites (-2) · ⏱️ 5h 54m 10s (-1h 21m 6s)
2 607 tests (±0): 2 524 passed (-4), 80 skipped (+1), 3 failed (+3)
12 973 runs (-2 593): 12 088 passed (-2 399), 874 skipped (-205), 11 failed (+11)

For more details on these failures, see this check.

Results for commit 8fd610c. ± Comparison against base commit b0dd9db.

♻️ This comment has been updated with latest results.

@fjetter fjetter force-pushed the async_sleep_zero_handle_stream branch from 4c078a3 to 9da4c46 on February 15, 2022 15:22
@fjetter
Member Author

fjetter commented Feb 15, 2022

The following two tests seem to consistently fail on this branch

FAILED distributed/tests/test_scheduler.py::test_rebalance_raises_missing_data3[True]
FAILED distributed/tests/test_steal.py::test_correct_bad_time_estimate - KeyE...

@fjetter fjetter force-pushed the async_sleep_zero_handle_stream branch from 9da4c46 to 8fd610c on February 18, 2022 11:37
@fjetter
Member Author

fjetter commented Feb 23, 2022

Had a look at the failure of distributed/tests/test_steal.py::test_correct_bad_time_estimate. It relies on Scheduler._reevaluate_occupancy_worker being called, and that only happens when CPU load is below 50%!

    next_time = timedelta(seconds=0.1)
    if self.proc.cpu_percent() < 50:
        workers: list = list(parent._workers.values())
        nworkers: Py_ssize_t = len(workers)
        i: Py_ssize_t
        for i in range(nworkers):
            ws: WorkerState = workers[worker_index % nworkers]
            worker_index += 1
            try:
                if ws is None or not ws._processing:
                    continue
                parent._reevaluate_occupancy_worker(ws)
            finally:
                del ws  # lose ref
            duration = time() - last
            if duration > 0.005:  # 5ms since last release
                next_time = timedelta(seconds=duration * 5)  # 25ms gap
                break
    self.loop.add_timeout(
        next_time, self.reevaluate_occupancy, worker_index=worker_index
    )

This may impact all tests related to

  • occupancy
  • idle / saturated
  • task duration
  • work stealing
  • task assignment to specific workers (in the cases where we use Scheduler.worker_objective, which sorts by occupancy)

so… basically this may impact everything.
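Condensed, the timing rule in that snippet is: tick every 100ms by default, only reevaluate when CPU is below 50%, and if a pass took more than 5ms, back off to five times its duration. A pure-function sketch of just that rule (`next_interval` is an illustrative name, not part of distributed):

```python
from typing import Optional

def next_interval(cpu_percent: float, pass_duration: Optional[float]) -> float:
    """Sketch of the rescheduling rule in the snippet above
    (illustrative, not the real Scheduler API).

    pass_duration is None when no reevaluation pass ran this tick.
    """
    if cpu_percent >= 50 or pass_duration is None:
        return 0.1  # too busy (or nothing ran): default 100ms tick
    if pass_duration > 0.005:
        return pass_duration * 5  # slow pass: widen the gap 5x
    return 0.1

print(next_interval(80, None))   # too busy to run: default tick
print(next_interval(30, 0.001))  # fast pass: default tick
print(next_interval(30, 0.01))   # slow pass: backs off to 5x the duration
```

The consequence for tests is that a loaded CI machine can silently skip reevaluation passes entirely, which is exactly why tests depending on them become flaky.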
