Scheduler.total_occupancy is significant runtime cost #7256

Closed
gjoseph92 opened this issue Nov 3, 2022 · 6 comments · Fixed by #7288

Comments

@gjoseph92
Collaborator

Profiling the scheduler during a benchmark workload on a 128-worker (2cpu each) cluster, I noticed total_occupancy taking 32% of total scheduler runtime!

Profile (go to left-heavy view):

Subjectively, the dashboard was also extremely laggy. (This is real time, not an artifact of the screen recording or network delay. Go to the end to see the dashboard becomes responsive once most tasks are done.)

amon-mean-128-no-queue-pyspy.mp4

cc @hendrikmakait @fjetter

@gjoseph92
Collaborator Author

To see how much performance impact the occupancy calculation is having, I ran some benchmarks on a 128-worker cluster comparing main to the commit prior to #7075. (Queuing was off, so the occupancy code path would be hit more.)

https://observablehq.com/@gjoseph92/snakebench?commits=3a15ac9&commits=2a505de&measure=duration&groupby=branch&show=all


There's only one repeat here (a 128-worker cluster is expensive), so not much statistical power. However, a 32% scheduler slowdown is a big effect, so if it were having an impact I would have expected it to show up across most tests even with one repeat.

It doesn't seem to me like there's an obvious slowdown across the board. My guess is that anom_mean was slower due to memory/spill-to-disk behavior or randomness, not this. More repeats would confirm that; it's also possible the increased scheduler latency worsened memory use, an effect we've seen before (#5083 (comment)).

I still think this is urgent to fix, but maybe it's not a critical blocker for #7278?

@fjetter
Member

fjetter commented Nov 10, 2022

For the sake of full transparency, the total_occupancy calculation is actually still fairly fast. In the case of test_anom_mean, we're dealing with 233 task groups.

I fetched the task groups from a live scheduler and built a fake task_groups_count_global. Timing this:

task_groups_count_global = {
    f"tg-{ix}": 5
    for ix in range(233)
}
%%timeit
# `tgs` maps group name -> TaskGroup, fetched from the live scheduler
res = 0
for group_name, count in task_groups_count_global.items():
    prefix = tgs[group_name].prefix
    assert prefix is not None
    duration = prefix.duration_average
    if duration < 0:
        if prefix.max_exec_time > 0:
            duration = 2 * prefix.max_exec_time
        else:
            duration = 0.5
    res += duration * count
occ = res + 100 / 100_000_000

I get 30.3 µs ± 101 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

Note that this assumes task_groups_count_global includes counts for every task group, which is typically not the case, so in practice we're dealing with a single-digit-µs call here.

That's of course still orders of magnitude slower than the simple attribute access it was before (which should take around 10-50ns), but it's small enough that I'm surprised it shows up anywhere.

This is called twice in every _add_to_processing and _exit_processing_common, i.e. it's called at least four times for every task in our graph.
The task graph for test_anom_mean has only 22327 tasks, so this very naively gives a CPU time of about

30µs * 4 * 22327 ≈ 2.7s

This is not great but not too dramatic. However, your profile shows something more like 70-80s, which is ~30x more (and ~100-300 times more if I don't calculate with the worst-case timing but a more realistic fraction of it).
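
As a quick sanity check of that arithmetic, here's a minimal sketch; the constants are simply the numbers quoted in this thread, not re-measured:

per_call = 30e-6        # seconds per total_occupancy call, from the %%timeit above
calls_per_task = 4      # twice in _add_to_processing, twice in _exit_processing_common
n_tasks = 22_327        # tasks in the test_anom_mean graph

estimated = per_call * calls_per_task * n_tasks
print(f"estimated: {estimated:.1f}s")                         # ~2.7s

observed = 75.0                                               # roughly the 70-80s from the profile
print(f"observed / estimated: {observed / estimated:.0f}x")   # ~28x, i.e. the ~30x gap above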

To be clear, I'm not arguing that this doesn't need fixing, and I'm not trying to defend my code. I'm just wondering if something else is going on that we're only noticing because this is a bit slower, e.g. are we transitioning tasks back and forth unnecessarily, or are we calling check_idle_saturated in a very hot loop even though we shouldn't? (My first pass over the code looks OK; this result is still confusing and slightly concerning regardless of the performance of total_occupancy.) A traditional cProfile showing call counts on these functions would be interesting.


FWIW I'm wondering if we could get rid of check_idle_saturated entirely. I need to think about this a bit more.

@fjetter
Member

fjetter commented Nov 10, 2022

Of course, maybe the graph you ran was bigger, and the single-core performance of the VM is likely slower than my M1, etc. I'm just asking whether there might be something else going on.

@fjetter
Member

fjetter commented Nov 10, 2022

Note: this was introduced in #7075 and first released in 2022.10.0.

@fjetter
Member

fjetter commented Nov 10, 2022

FWIW this issue should be independent of the cluster size. The total_occupancy calculation depends solely on the number of task groups.

I tried reproducing on a smaller cluster, but it does not show up in the profile.
My suspicion atm is that with scaled_array_shape and a much larger cluster, we are also creating a much larger graph and many, many more task groups.
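
For illustration, here's a minimal standalone sketch (not from the issue; it uses a plain dict of durations instead of real TaskGroup/TaskPrefix objects) of how the per-call cost grows with the number of task groups:

import timeit

def fake_total_occupancy(group_counts, group_durations):
    # Same shape as the loop above: sum(duration * count) over all counted groups.
    res = 0.0
    for group_name, count in group_counts.items():
        res += group_durations[group_name] * count
    return res

for n_groups in (233, 1_623, 10_000):
    counts = {f"tg-{ix}": 5 for ix in range(n_groups)}
    durations = {f"tg-{ix}": 0.5 for ix in range(n_groups)}
    per_call = timeit.timeit(
        lambda: fake_total_occupancy(counts, durations), number=1_000
    ) / 1_000
    print(f"{n_groups:>6} groups: {per_call * 1e6:.1f} µs per call")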

@fjetter
Member

fjetter commented Nov 10, 2022

Ok, on a similar cluster, this workload generates 1623 task groups with 180k tasks. I guess my math checks out, roughly.

I had different plans for these counters, but at the moment they are only used to calculate occupancy.
The durations, however, are actually tracked at the prefix level, so we can just as well maintain these counters at the prefix level too. That should give identical results, but we only have very, very few prefixes (typically single digit).
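
A rough sketch of that idea (hypothetical names, not the actual implementation in #7288): key the counters by prefix instead of by group, so the occupancy loop iterates over a handful of prefixes rather than hundreds or thousands of groups.

from collections import defaultdict

# Hypothetical prefix-level counter, mirroring the group-level
# task_groups_count_global discussed above.
task_prefixes_count_global = defaultdict(int)  # prefix name -> number of counted tasks

def estimate_total_occupancy(prefix_durations):
    # Same math as the group-level loop, but over O(#prefixes) entries,
    # which is typically single digit rather than hundreds or thousands.
    res = 0.0
    for prefix_name, count in task_prefixes_count_global.items():
        duration = prefix_durations.get(prefix_name, 0.5)  # 0.5s default, as above
        res += duration * count
    return res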
