Integration test that focuses on AMM, informing our decision to toggle this on by default #140

Closed · Tracked by #138
fjetter opened this issue May 25, 2022 · 10 comments

fjetter (Member) commented May 25, 2022

No description provided.

crusaderky (Contributor) commented Jul 7, 2022

At the moment of writing, the AMM consists exclusively of the ReduceReplicas and the RetireWorker policies.
RetireWorker is covered by #135, so the scope of this story is ReduceReplicas exclusively.

We should scale test_ReduceReplicas_stress up to production size:

https://github.com/dask/distributed/blob/f7f650154fea29978906c65dd0225415da56ed11/distributed/tests/test_active_memory_manager.py#L1079-L1085
https://github.com/dask/distributed/blob/f7f650154fea29978906c65dd0225415da56ed11/distributed/tests/test_active_memory_manager.py#L1111-L1130

We should also write a test with AMM completely disabled, to give us a baseline for comparing performance and stability with vs. without AMM.
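For the no-AMM comparison run, it may be worth pinning the setting explicitly rather than relying on the default; a minimal sketch of that configuration, mirroring the AMM config shown later in this thread:

distributed:
  scheduler:
    active-memory-manager:
      start: false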

crusaderky (Contributor) commented Jul 7, 2022

Between this issue and #135, there are four use cases to be implemented (a rough sketch follows the list):

  • just run tensordot_stress on production-sized data
  • tensordot_stress with AMM ReduceReplicas turned on
  • tensordot_stress with AMM turned off and retiring 90% of the cluster while the computation is running
  • tensordot_stress with AMM ReduceReplicas turned on and retiring 90% of the cluster while the computation is running
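A minimal sketch of how these four cases could be driven. This is not the final implementation: LocalCluster stands in for a production-sized Coiled cluster, the array size is a toy value, and the tensordot-style workload here is only an illustrative stand-in for tensordot_stress.

import dask
import dask.array as da
from distributed import Client, LocalCluster


def run_case(amm_reduce_replicas: bool, retire_workers: bool) -> None:
    # Toggle the AMM (whose default policy set includes ReduceReplicas) via config.
    config = {"distributed.scheduler.active-memory-manager.start": amm_reduce_replicas}
    with dask.config.set(config):
        with LocalCluster(n_workers=10) as cluster, Client(cluster) as client:
            # Toy-sized tensordot-style workload; production runs would be far larger.
            a = da.random.random((4_000, 4_000), chunks=(1_000, 1_000))
            future = client.compute((a @ a.T).sum())
            if retire_workers:
                # Retire 90% of the workers while the computation is in flight.
                workers = list(client.scheduler_info()["workers"])
                client.retire_workers(workers[: int(len(workers) * 0.9)])
            print(future.result())


# The four use cases:
for amm in (False, True):
    for retire in (False, True):
        run_case(amm, retire)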

crusaderky (Contributor) commented:

DoD / AC (Definition of Done / Acceptance Criteria)

This story is done when the integration test characterises the behaviour of distributed on Coiled as described above.
If it demonstrates a flaw, remediation of the flaw is out of scope.

fjetter (Member, Author) commented Aug 19, 2022

I agree with the above; this ticket should focus exclusively on ReduceReplicas. Questions I would like answered:

  • Does enabling this by default impact common workflows? Are any regressions/improvements detected in the existing coiled-runtime benchmarks?
  • Are there any relevant missing benchmarks we would like added to the benchmark suite?
  • There should be an upstream ticket presenting the results of the benchmarking efforts to propose enabling this by default.

Optional

  • ReduceReplicas is currently "aggressively" set to keep one replica. I'm wondering whether a less aggressive value would accelerate us even further.

crusaderky (Contributor) commented Aug 20, 2022

  • ReduceReplicas is currently "aggressively" set to keep one replica. I'm wondering whether a less aggressive value would accelerate us even further.

It's set to

  • one replica, or
  • the number of workers with waiter tasks plus the number of waiter tasks not yet assigned to workers,

whichever is higher:
https://github.com/dask/distributed/blob/a80187a23bd317136d5dbbb02068bc8f90055e4e/distributed/active_memory_manager.py#L494-L502
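Paraphrased as a standalone sketch (the function and parameter names here are illustrative, not the actual attributes used in active_memory_manager.py):

def desired_replicas(n_workers_with_waiters: int, n_unassigned_waiters: int) -> int:
    # Keep at least one replica, or as many replicas as there are workers with
    # waiter tasks plus waiter tasks not yet assigned to a worker, whichever is higher.
    return max(1, n_workers_with_waiters + n_unassigned_waiters)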

crusaderky self-assigned this on Sep 2, 2022
crusaderky (Contributor) commented Sep 16, 2022

Executive summary

  • The latest A/B replication machinery with statistical analysis (Repeat A/B tests #324) suffers from major levels of background noise, which can be mitigated by throwing money at the problem through a very high number of repeat runs (>15).
    Failing that, a lower number of repeat runs (7, in this case study) will result in very noisy graphics, which could easily mislead an untrained eye and should not be shown as-is to the general public.
  • Enabling AMM ReduceReplicas across the whole spectrum of tests caused:
    • no noticeable changes in terms of runtime
    • no noticeable regressions in terms of memory usage
    • improvements up to 30% in terms of memory usage, both average and peak, in some specific use cases.

Test setup

Baseline

  • python=3.9
  • coiled-runtime=0.1.0
  • dask=2022.9.0
  • distributed=2022.9.0
  • benchmarks only (no runtime, no stability)
  • repeat every test 7 times
  • test null hypothesis

AMM

Everything as baseline, with one difference:

distributed:
  scheduler:
    active-memory-manager:
      start: true

The above enables AMM ReduceReplicas to run every 2 seconds.
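The same toggle can also be applied programmatically and, on recent distributed versions that expose the client.amm proxy (treat this as a sketch to verify against your installed version), the client can confirm that the AMM is running:

import dask
from distributed import Client, LocalCluster

# Programmatic equivalent of the YAML override above.
with dask.config.set({"distributed.scheduler.active-memory-manager.start": True}):
    with LocalCluster(n_workers=2) as cluster, Client(cluster) as client:
        print(client.amm.running())  # True if the scheduler started the AMM
        client.amm.run_once()        # trigger one extra AMM iteration on demand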

Full output

https://github.com/coiled/coiled-runtime/actions/runs/3063841757

Observations on noise (unrelated to AMM)

Given the current volatility in the outcomes of the test suite, 7 runs are not enough to produce clear results. The null hypothesis contains several major flares that are not supposed to be there (see pictures below). Cross-comparison with the barcharts and timeseries reports shows that these flares are caused by one-off instances where one of the two runtimes took twice as much wall clock time, used twice as much memory, or both.

Examples:

Given a flare that doubles memory usage on roughly 1 to 3 out of 7 repeats, it is easy to see that a runtime where the flare happened 3 times will be reported in the A/B plots as a major regression compared to a runtime where it happened only once, and that only a much higher number of repeats would smooth this out and prevent false positives.
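To make that intuition concrete, here is a toy simulation (made-up flare probability and threshold, not the actual benchmark data) of how often two identical configurations would appear to differ by more than 15%:

import numpy as np

rng = np.random.default_rng(0)


def sample(n_repeats: int, p_flare: float = 0.25) -> np.ndarray:
    # A measurement that is normally 1.0 and doubles to 2.0 when a flare happens.
    return np.where(rng.random(n_repeats) < p_flare, 2.0, 1.0)


for n_repeats in (7, 15, 30):
    # Compare two identical configurations (the null hypothesis) many times and
    # count how often their sample means differ by more than 15%.
    diffs = np.array(
        [abs(sample(n_repeats).mean() - sample(n_repeats).mean()) for _ in range(10_000)]
    )
    print(f"{n_repeats:>2} repeats: {(diffs > 0.15).mean():.0%} of null comparisons exceed 15%")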

Everything displayed in these plots is just noise:
[three A/B comparison plots omitted]

AMM-specific observations

Net of noise, the runs with AMM enabled highlight some changes. These can generally be distinguished from noise by the length of their tail: a spread-out flare with a long tail is typical of a high-variance measure, whereas a solid-colour band (very high p-value) followed by a sudden drop denotes a low-variance change.


Enhancements

Test name | Average memory (p>90%) | Peak memory (p>90%)
h2o/test_hd2_benchmarks.py::test_q3[5 GB (parquet)] | -10% to -15% | -15% to -20%
h2o/test_hd2_benchmarks.py::test_q7[5 GB (parquet)] | -10% to -15% | -10% to -15%
h2o/test_hd2_benchmarks.py::test_q8[5 GB (parquet)] | -25% to -30% | -25% to -30%
test_array.py::test_dot_product | -10% to -15% | -10% to -15%
test_dataframe.py::test_shuffle | -5% to -10% | -5% to -10%

[A/B memory plots for the tests listed above omitted]

Noise

[A/B plots classified as noise omitted]

For test_download_throughput[pandas], refer to #339.


fjetter (Member, Author) commented Sep 16, 2022

Thank you, @crusaderky! I believe this is a pretty convincing result. Would you do us the honor of opening a PR to distributed to make this official?

We can also see some significant improvements in wall clock time for tests like test_vorticity (this one works under a lot of memory pressure and spills a lot):

[wall clock time A/B plot for test_vorticity omitted]

ntabris (Member) commented Sep 16, 2022

My understanding is that many of the benchmarks place stress on dask such that we wouldn't expect consistent performance. Is that correct? Are there cases where we would currently expect consistent performance but are seeing higher-than-expected variance / occasional flares? (Or is that not yet determined?)

fjetter (Member, Author) commented Sep 16, 2022

My understanding is that many of the benchmarks place stress on dask such that we wouldn't expect consistent performance. Is that correct?

Yes. For some of these tests we're addressing this in #338

Are there cases where we would currently expect consistent performance but are seeing higher-than-expected variance / occasional flares? (Or is that not yet determined?)

I think #339 is an example of "not yet understood" (or rather, not yet confirmed).
