Wall Clock (s) spikes for benchmarks/test_array.py::test_double_diff without CI failure #316
The suspicious run uses dask/dask@5f11ba9 and dask/distributed@f02d2f9. I suggest revisiting the benchmarking results on Monday to check whether this issue remains. cc @fjetter
Just ran this on 2022.6.0 and got a runtime of ~110s. This run already looks a bit suspicious, with a couple of idling workers in the middle; I bet it looks drastically different on one of the spikes. For reference: it's interesting that this workload already stresses out the scheduler.
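For readers who don't have the benchmark open, a double-diff workload is essentially an offset subtraction of two large random Dask arrays. The sketch below is only illustrative (array shapes, chunk sizes, and the local Client are assumptions, not the benchmark's actual parameters), but it shows why the task graph gets large and communication-heavy:

```python
import dask.array as da
from dask.distributed import Client, wait

# Illustrative stand-in for the benchmark; the real test runs on a Coiled
# cluster with its own array sizes and chunking.
client = Client()

a = da.random.random((10_000, 10_000), chunks=(500, 500))
b = da.random.random((10_000, 10_000), chunks=(500, 500))

# The offset slices misalign the chunk grids of `a` and `b`, so every output
# chunk depends on several input chunks, producing a large task graph with
# plenty of inter-worker transfer.
diff = a[1:, 1:] - b[:-1, :-1]

wait(diff.persist())
client.close()
```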
Cross-referencing #299. @hendrikmakait, the reason we haven't gotten an alert is that on main we check whether the last 3 runs exceed (mean + 2*std) of the latest 10 values, not including the 3 we are looking at. Since this test is super noisy, I expect this kind of behavior. That being said, I reported seeing this in #299, but there was no follow-up.
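To make the alerting logic above concrete, here is a minimal sketch of that check (the function name, default window sizes, and sample numbers are illustrative, not the actual CI code):

```python
import statistics

def is_regression(history, recent_n=3, baseline_n=10, n_std=2.0):
    """Flag a regression only if the last `recent_n` runs all exceed
    mean + n_std * std of the `baseline_n` runs preceding them."""
    if len(history) < recent_n + baseline_n:
        return False  # not enough history to compare against
    recent = history[-recent_n:]
    baseline = history[-(recent_n + baseline_n):-recent_n]
    threshold = statistics.mean(baseline) + n_std * statistics.stdev(baseline)
    return all(value > threshold for value in recent)

# A single ~200s spike surrounded by ~110-130s runs does not trip the alert,
# because the runs after the spike fall back below the threshold.
runs = [110, 115, 120, 112, 118, 125, 130, 116, 121, 119, 205, 118, 122]
print(is_regression(runs))  # False
```

This is also why a one-off spike like the one in this issue stays silent: all three of the most recent runs have to sit above the threshold at the same time.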
Update: There seem to be a few issues at the same time:
- A bug with …
- The runtime of …
FYI, you should be able to get 1s-interval CPU/memory metrics on staging now. I'm starting to look at these myself in Grafana and would be happy to do this with someone on a call.
I did 10 runs of double_diff on the same cluster with … Some runs were ~130s, some were >200s.
Logs for the scheduler and the worker blocked on IO for my 10 runs: double-diff-4bf298ef-scheduler-523f5517d5-instance-2.txt
After diving into the logs, it looks like the worker event loop gets unresponsive while spilling. For example, take a look at the last annotated run on the Grafana dashboard between 16:39:00 UTC and 16:42:00:
This is currently expected, but known to be problematic: dask/distributed#4424
Yeah, I think dask/distributed#4424 and the general fact that we're spilling so much might be the problem we see here. Disabling work-stealing might be worth a shot; I wouldn't be surprised if it were overly aggressive on some runs and causing the spilling.
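If anyone wants to try the work-stealing experiment, a minimal sketch (assuming the standard distributed.scheduler.work-stealing config key; it is read by the scheduler at startup, so on a remote cluster the environment-variable form is the practical route):

```python
import dask

# Disable work stealing. Set this before the cluster is created, or export
# DASK_DISTRIBUTED__SCHEDULER__WORK_STEALING=False in the scheduler's
# environment so the setting is in place when the scheduler starts.
dask.config.set({"distributed.scheduler.work-stealing": False})
```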
If a worker is blocked by IO while spilling/unspilling an important task that frees up parallelism, that would also explain why we're seeing gaps in the task stream, i.e. the scheduler event loop block may be yet another red herring (but also something to investigate, since it definitely could cause a similar problem; at the very least task overproduction).
I'm pretty convinced from the hardware metrics that IO blocking consistently explains the difference in test time. For example, I just ran another batch of …

Here are a couple of other individual test runs where I annotated the start of the test (single line) and the scheduler event loop being blocked (range). The annotation points aren't exact, since I have 5s granularity, but it's enough to see that the scheduler event loop is blocked near the beginning of the test, right around the time that worker CPU jumps up and the scheduler starts receiving messages. (For reference, I'm using …)

The scheduler event loop pretty consistently becomes unresponsive at the start of each …

Scheduler logs showing the event loop unresponsive:
Local client logs showing test run times:
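For context on those "event loop was unresponsive" messages: distributed emits them when a single tick exceeds the configured limit (3s by default, assuming the standard distributed.admin.tick settings), so stalls shorter than that won't show up in the logs at all. A hedged sketch of lowering the threshold while investigating:

```python
import dask

# Surface shorter event-loop stalls in the scheduler/worker logs. Like the
# work-stealing setting above, this must be in place before the processes
# start (e.g. DASK_DISTRIBUTED__ADMIN__TICK__LIMIT=1s on the cluster).
dask.config.set({"distributed.admin.tick.limit": "1s"})
```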
I'd guess that's … When profiling the scheduler, it's common to see receiving the graph take a nontrivial amount of the total runtime, nearly all of it with the event loop blocked.
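One way to approximate that kind of observation without attaching an external profiler to the scheduler process is distributed's built-in performance report, which captures the task stream and scheduler profile for a block of work. The workload here is the same illustrative double-diff stand-in as above, not the benchmark itself:

```python
import dask.array as da
from dask.distributed import Client, performance_report

client = Client()  # or the cluster used for the benchmark

# Everything computed inside the context manager is recorded into a
# standalone HTML report, including where scheduler time goes.
with performance_report(filename="double_diff_report.html"):
    a = da.random.random((10_000, 10_000), chunks=(500, 500))
    b = da.random.random((10_000, 10_000), chunks=(500, 500))
    (a[1:, 1:] - b[:-1, :-1]).sum().compute()

client.close()
```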
After the changes in #328, I have not seen any major differences in the runtime of test_double_diff.
We had a significant spike in Wall Clock (s) for benchmarks/test_array.py::test_double_diff but did not receive an alert. At the same time, Average Memory (GiB) dropped, which might suggest recent data transfer limiting to be the cause. https://coiled.github.io/coiled-runtime/coiled-upstream-py3.9.html

I am investigating further.