`Scheduler.total_occupancy` is significant runtime cost #7256
To try to see how much performance impact the occupancy calculation is having, I ran some benchmarks on a 128-worker cluster comparing runs with and without the change. There's only one repeat here (a 128-worker cluster is expensive), so not much statistical power. However, a 32% scheduler slowdown is a big effect, so if it were having an impact I would have expected it to show up across most tests even with one repeat. It doesn't seem to me like there's an obvious slowdown across the board. I still think this is urgent to fix, but maybe it's not a critical blocker for #7278?
For the sake of full transparency, the `total_occupancy` calculation is actually still fairly fast. In the case of `test_anom_mean`, we're dealing with 233 task groups. I fetched the TGs from a live scheduler and emulated a fake `task_groups_count_global`:

```python
task_groups_count_global = {
    f"tg-{ix}": 5
    for ix in range(233)
}
```

```python
%%timeit
res = 0
for group_name, count in group_counts.items():
    prefix = tgs[group_name].prefix
    assert prefix is not None
    duration = prefix.duration_average
    if duration < 0:
        if prefix.max_exec_time > 0:
            duration = 2 * prefix.max_exec_time
        else:
            duration = 0.5
    res += duration * count
occ = res + 100 / 100_000_000
```
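(Aside: the snippet above relies on `tgs` and `group_counts` from a live scheduler. For reproducing the timing offline, a self-contained stand-in could look like the sketch below; `FakePrefix` and `FakeGroup` are made-up placeholders, not distributed's real classes, and the sentinel values simply force the fallback branch of the loop.)

```python
# Self-contained stand-in for timing the loop above outside a live scheduler.
# FakePrefix/FakeGroup are invented placeholders, not distributed's real classes.
from dataclasses import dataclass
import timeit


@dataclass
class FakePrefix:
    duration_average: float = -1.0  # negative = "unknown", forcing the fallback branch
    max_exec_time: float = 0.0


@dataclass
class FakeGroup:
    prefix: FakePrefix


group_counts = {f"tg-{ix}": 5 for ix in range(233)}  # same shape as task_groups_count_global
tgs = {name: FakeGroup(FakePrefix()) for name in group_counts}


def occupancy() -> float:
    res = 0.0
    for group_name, count in group_counts.items():
        prefix = tgs[group_name].prefix
        duration = prefix.duration_average
        if duration < 0:
            duration = 2 * prefix.max_exec_time if prefix.max_exec_time > 0 else 0.5
        res += duration * count
    return res + 100 / 100_000_000


print(f"{timeit.timeit(occupancy, number=10_000) / 10_000 * 1e6:.1f} µs per call")
```

Since every fake prefix here reports an unknown `duration_average`, this exercises the worst-case path of the loop.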
I get … Note that this assumes that … That's of course still two orders of magnitude slower than the simple attribute access it was before (which should happen in around 10-50ns), but it's small enough that I'm surprised it shows up anywhere. This is called twice in every …
This is not great, but not too dramatic. However, your profile shows something like 70-80s, which is 30x more (~100-300 times more if I'm not calculating with the worst-case timing but a more realistic fraction of it).

To be clear, I'm not arguing that this doesn't need fixing, and I'm not trying to defend my code. I'm just wondering if something else is going on that we're only noticing because this is a bit slower, e.g. are we transitioning tasks back and forth unnecessarily, or are we calling …?

FWIW, I'm wondering if we could get rid of …
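One possible direction, purely as a sketch (this is not how distributed implements it, and all names below are hypothetical): keep a running total that is adjusted whenever a group's count or its prefix's duration estimate changes, so reading `total_occupancy` is O(1) again instead of a loop over all task groups.

```python
# Hypothetical sketch of an O(1) total_occupancy: maintain the sum incrementally
# instead of looping over every task group on each read. Names are invented for
# illustration and do not match distributed's internals.
class OccupancyTracker:
    def __init__(self) -> None:
        self._total = 0.0
        self._contrib: dict[str, float] = {}  # group name -> its current contribution

    def update_group(self, group: str, count: int, duration: float) -> None:
        """Re-book a group's contribution when its count or duration estimate changes."""
        new = duration * count
        self._total += new - self._contrib.get(group, 0.0)
        self._contrib[group] = new

    def remove_group(self, group: str) -> None:
        self._total -= self._contrib.pop(group, 0.0)

    @property
    def total_occupancy(self) -> float:
        return self._total  # O(1), no loop over task groups
```

The trade-off is that every code path that mutates a count or a duration estimate has to remember to re-book that group's contribution, which is presumably the kind of eager bookkeeping the pre-#7075 attribute relied on.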
Of course, maybe the graph you ran was bigger, and the single-CPU perf of the VM is likely slower than my M1, etc. Just asking if there might be something else going on.
Note: this was introduced in #7075 and first released in …
FWIW this issue should be independent of the cluster size. The `total_occupancy` calculation depends solely on the number of task groups. I tried reproducing on a smaller cluster but it does not show up in the profile.
Ok, on a similar cluster, this workload generates 1623 task groups with 180k tasks. I guess my math checks out, roughly. I had different plans for these counters but at the moment they are only used to calculate occupancy.
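As a quick sanity check of that scaling argument (a sketch; the absolute per-group cost is not measured here, only the ratio of the group counts reported above):

```python
# The occupancy loop touches every task group once per call, so per-call cost should
# scale roughly linearly with the number of groups. Both counts come from this thread;
# the absolute per-group cost is deliberately left out.
groups_timed = 233       # task groups in the timed snippet above (test_anom_mean)
groups_workload = 1623   # task groups reported for the 180k-task benchmark workload

print(f"expected per-call slowdown vs. the timed snippet: ~{groups_workload / groups_timed:.1f}x")
```

With roughly 7x more task groups than in the timed snippet, each call should be about 7x slower; combined with how frequently the scheduler reads `total_occupancy`, microseconds per call can plausibly add up to the tens of seconds visible in the profile.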
Profiling the scheduler during a benchmark workload on a 128-worker (2 CPUs each) cluster, I noticed `total_occupancy` taking 32% of total scheduler runtime!

Profile (go to left-heavy view): sat-inf-anom_mean-clipped.json
Subjectively, the dashboard was also extremely laggy. (This is real time, not an artifact of the screen recording or network delay. Go to the end to see the dashboard become responsive once most tasks are done.)
amon-mean-128-no-queue-pyspy.mp4
cc @hendrikmakait @fjetter