Wall Clock (s) spikes for benchmarks/test_array.py::test_double_diff without CI failure #316
The suspicious run uses dask/dask@5f11ba9 and dask/distributed@f02d2f9. I suggest revisiting the benchmarking results on Monday to check whether this issue remains. cc @fjetter
Just ran this on 2022.6.0 and got a runtime of ~110s. This run already looks a bit suspicious, with a couple of idling workers in the middle; I bet it looks drastically different on one of the spikes. For reference: it's interesting that this workload already stresses out the scheduler.
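For readers who don't have the benchmark open, a double-diff workload is essentially an offset subtraction of two large random Dask arrays. The sketch below is only illustrative (array shapes, chunk sizes, and the local Client are assumptions, not the benchmark's actual parameters), but it shows why the task graph gets large and communication-heavy:

```python
import dask.array as da
from dask.distributed import Client, wait

# Illustrative stand-in for the benchmark; the real test runs on a Coiled
# cluster with its own array sizes and chunking.
client = Client()

a = da.random.random((10_000, 10_000), chunks=(500, 500))
b = da.random.random((10_000, 10_000), chunks=(500, 500))

# The offset slices misalign the chunk grids of `a` and `b`, so every output
# chunk depends on several input chunks, producing a large task graph with
# plenty of inter-worker transfer.
diff = a[1:, 1:] - b[:-1, :-1]

wait(diff.persist())
client.close()
```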
Cross-referencing #299. @hendrikmakait, the reason we haven't gotten an alert is that on main we check whether the last 3 runs exceed (mean + 2*std) of the latest 10 values, not including the 3 we are looking at. Since this test is super noisy, I expect this kind of behavior. That being said, I reported seeing this in #299, but there was no follow-up.
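To make the alerting logic above concrete, here is a minimal sketch of that check (the function name, default window sizes, and sample numbers are illustrative, not the actual CI code):

```python
import statistics

def is_regression(history, recent_n=3, baseline_n=10, n_std=2.0):
    """Flag a regression only if the last `recent_n` runs all exceed
    mean + n_std * std of the `baseline_n` runs preceding them."""
    if len(history) < recent_n + baseline_n:
        return False  # not enough history to compare against
    recent = history[-recent_n:]
    baseline = history[-(recent_n + baseline_n):-recent_n]
    threshold = statistics.mean(baseline) + n_std * statistics.stdev(baseline)
    return all(value > threshold for value in recent)

# A single ~200s spike surrounded by ~110-130s runs does not trip the alert,
# because the runs after the spike fall back below the threshold.
runs = [110, 115, 120, 112, 118, 125, 130, 116, 121, 119, 205, 118, 122]
print(is_regression(runs))  # False
```

This is also why a one-off spike like the one in this issue stays silent: all three of the most recent runs have to sit above the threshold at the same time.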
Update: There seem to be a few issues at the same time:
- A bug with …
- The runtime of …
FYI, you should be able to get 1s-interval CPU/memory metrics on staging now. I'm starting to look at these myself in Grafana and would be happy to do this with someone on a call.
I did 10 runs of double_diff on the same cluster with … Some runs were ~130s, some were >200s.
Logs for the scheduler and the worker blocked on IO for my 10 runs: double-diff-4bf298ef-scheduler-523f5517d5-instance-2.txt
After diving into the logs, it looks like the worker event loop gets unresponsive while spilling. For example, take a look at the last annotated run on the Grafana dashboard between 16:39:00 UTC and 16:42:00:
This is currently expected, but known to be problematic: dask/distributed#4424
Yeah, I think dask/distributed#4424 and the general fact that we're spilling so much might be the problem we see here. Disabling work-stealing might be worth a shot; I wouldn't be surprised if it were overly aggressive on some runs and causing the spilling.
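If anyone wants to try the work-stealing experiment, a minimal sketch (assuming the standard distributed.scheduler.work-stealing config key; it is read by the scheduler at startup, so on a remote cluster the environment-variable form is the practical route):

```python
import dask

# Disable work stealing. Set this before the cluster is created, or export
# DASK_DISTRIBUTED__SCHEDULER__WORK_STEALING=False in the scheduler's
# environment so the setting is in place when the scheduler starts.
dask.config.set({"distributed.scheduler.work-stealing": False})
```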
If a worker is blocked by IO while spilling/unspilling an important task that frees up parallelism, that would also explain why we're seeing gaps in the task stream, i.e. the scheduler event loop block may be yet another red herring (but also something to investigate, since it definitely could cause a similar problem; at the very least task overproduction).
I'm pretty convinced from the hardware metrics that IO blocking consistently explains the difference in test time. For example, I just ran another batch of …

Here are a couple of other individual test runs where I annotated the start of the test (single line) and the scheduler event loop being blocked (range). The annotation points aren't exact, since I have 5s granularity, but it's enough to see that the scheduler event loop is blocked near the beginning of the test, right around the time that worker CPU jumps up and the scheduler starts receiving messages. (For reference, I'm using …)

The scheduler event loop pretty consistently becomes unresponsive at the start of each …

Scheduler logs showing the event loop unresponsive:
Local client logs showing test run times:
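For context on those "event loop was unresponsive" messages: distributed emits them when a single tick exceeds the configured limit (3s by default, assuming the standard distributed.admin.tick settings), so stalls shorter than that won't show up in the logs at all. A hedged sketch of lowering the threshold while investigating:

```python
import dask

# Surface shorter event-loop stalls in the scheduler/worker logs. Like the
# work-stealing setting above, this must be in place before the processes
# start (e.g. DASK_DISTRIBUTED__ADMIN__TICK__LIMIT=1s on the cluster).
dask.config.set({"distributed.admin.tick.limit": "1s"})
```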
I'd guess that's … When profiling the scheduler, it's common to see receiving the graph take a nontrivial amount of the total runtime, nearly all of it with the event loop blocked.
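One way to approximate that kind of observation without attaching an external profiler to the scheduler process is distributed's built-in performance report, which captures the task stream and scheduler profile for a block of work. The workload here is the same illustrative double-diff stand-in as above, not the benchmark itself:

```python
import dask.array as da
from dask.distributed import Client, performance_report

client = Client()  # or the cluster used for the benchmark

# Everything computed inside the context manager is recorded into a
# standalone HTML report, including where scheduler time goes.
with performance_report(filename="double_diff_report.html"):
    a = da.random.random((10_000, 10_000), chunks=(500, 500))
    b = da.random.random((10_000, 10_000), chunks=(500, 500))
    (a[1:, 1:] - b[:-1, :-1]).sum().compute()

client.close()
```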
After the changes in #328, I have not seen any major differences in the runtime of test_double_diff.
We had a significant spike in Wall Clock (s) for benchmarks/test_array.py::test_double_diff but did not receive an alert. At the same time, Average Memory (GiB) dropped, which might suggest recent data transfer limiting to be the cause. https://coiled.github.io/coiled-runtime/coiled-upstream-py3.9.html

I am investigating further.