Integration test that focuses on AMM, informing our decision to toggle this on by default #140

Closed · Tracked by #138
fjetter opened this issue May 25, 2022 · 10 comments

fjetter (Member) commented May 25, 2022

No description provided.

crusaderky (Contributor) commented Jul 7, 2022

At the moment of writing, the AMM consists exclusively of the ReduceReplicas and the RetireWorker policies.
RetireWorker is covered by #135, so the scope of this story is ReduceReplicas exclusively.

We should scale test_ReduceReplicas_stress up to production size:

https://github.com/dask/distributed/blob/f7f650154fea29978906c65dd0225415da56ed11/distributed/tests/test_active_memory_manager.py#L1079-L1085
https://github.com/dask/distributed/blob/f7f650154fea29978906c65dd0225415da56ed11/distributed/tests/test_active_memory_manager.py#L1111-L1130

We should also write a test with AMM completely disabled, to give us a baseline for comparing performance and stability with vs. without AMM.
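For the no-AMM comparison run, it may be worth pinning the setting explicitly rather than relying on the default; a minimal sketch of that configuration, mirroring the AMM config shown later in this thread:

distributed:
  scheduler:
    active-memory-manager:
      start: false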

crusaderky (Contributor) commented Jul 7, 2022

Between this issue and #135, there are four use cases to be implemented (a rough sketch follows the list):

  • just run tensordot_stress on production-sized data
  • tensordot_stress with AMM ReduceReplicas turned on
  • tensordot_stress with AMM turned off and retiring 90% of the cluster while the computation is running
  • tensordot_stress with AMM ReduceReplicas turned on and retiring 90% of the cluster while the computation is running
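A minimal sketch of how these four cases could be driven. This is not the final implementation: LocalCluster stands in for a production-sized Coiled cluster, the array size is a toy value, and the tensordot-style workload here is only an illustrative stand-in for tensordot_stress.

import dask
import dask.array as da
from distributed import Client, LocalCluster


def run_case(amm_reduce_replicas: bool, retire_workers: bool) -> None:
    # Toggle the AMM (whose default policy set includes ReduceReplicas) via config.
    config = {"distributed.scheduler.active-memory-manager.start": amm_reduce_replicas}
    with dask.config.set(config):
        with LocalCluster(n_workers=10) as cluster, Client(cluster) as client:
            # Toy-sized tensordot-style workload; production runs would be far larger.
            a = da.random.random((4_000, 4_000), chunks=(1_000, 1_000))
            future = client.compute((a @ a.T).sum())
            if retire_workers:
                # Retire 90% of the workers while the computation is in flight.
                workers = list(client.scheduler_info()["workers"])
                client.retire_workers(workers[: int(len(workers) * 0.9)])
            print(future.result())


# The four use cases:
for amm in (False, True):
    for retire in (False, True):
        run_case(amm, retire)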

crusaderky (Contributor) commented:

DoD / AC (Definition of Done / Acceptance Criteria)

This story is done when the integration test characterises the behaviour of distributed on Coiled as described above.
If it demonstrates a flaw, remediation of the flaw is out of scope.

fjetter (Member, Author) commented Aug 19, 2022

I agree with the above; this ticket should focus exclusively on ReduceReplicas. Questions I would like answered:

  • Does enabling this by default impact common workflows? Are any regressions/improvements detected in the existing coiled-runtime benchmarks?
  • Are there any relevant missing benchmarks we would like added to the benchmark suite?
  • There should be an upstream ticket presenting the results of the benchmarking efforts to propose enabling this by default.

Optional

  • ReduceReplicas is currently "aggressively" set to keep one replica. I'm wondering whether a less aggressive value would accelerate us even further.

crusaderky (Contributor) commented Aug 20, 2022

  • ReduceReplicas is currently "aggressively" set to keep one replica. I'm wondering whether a less aggressive value would accelerate us even further.

It's set to

  • one replica, or
  • the number of workers with waiter tasks plus the number of waiter tasks not yet assigned to workers,

whichever is higher:
https://github.com/dask/distributed/blob/a80187a23bd317136d5dbbb02068bc8f90055e4e/distributed/active_memory_manager.py#L494-L502
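Paraphrased as a standalone sketch (the function and parameter names here are illustrative, not the actual attributes used in active_memory_manager.py):

def desired_replicas(n_workers_with_waiters: int, n_unassigned_waiters: int) -> int:
    # Keep at least one replica, or as many replicas as there are workers with
    # waiter tasks plus waiter tasks not yet assigned to a worker, whichever is higher.
    return max(1, n_workers_with_waiters + n_unassigned_waiters)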

crusaderky self-assigned this on Sep 2, 2022
crusaderky (Contributor) commented Sep 16, 2022

Executive summary

  • The latest A/B replication machinery with statistical analysis (Repeat A/B tests #324) suffers from major levels of background noise, which can be mitigated by throwing money at the problem through a very high number of repeat runs (>15).
    Failing that, a lower number of repeat runs (7, in this case study) will result in very noisy graphics, which could easily mislead an untrained eye and should not be shown as-is to the general public.
  • Enabling AMM ReduceReplicas across the whole spectrum of tests caused:
    • no noticeable changes in terms of runtime
    • no noticeable regressions in terms of memory usage
    • improvements up to 30% in terms of memory usage, both average and peak, in some specific use cases.

Test setup

Baseline

  • python=3.9
  • coiled-runtime=0.1.0
  • dask=2022.9.0
  • distributed=2022.9.0
  • benchmarks only (no runtime, no stability)
  • repeat every test 7 times
  • test null hypothesis

AMM

Everything as baseline, with one difference:

distributed:
  scheduler:
    active-memory-manager:
      start: true

The above enables AMM ReduceReplicas to run every 2 seconds.
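The same toggle can also be applied programmatically and, on recent distributed versions that expose the client.amm proxy (treat this as a sketch to verify against your installed version), the client can confirm that the AMM is running:

import dask
from distributed import Client, LocalCluster

# Programmatic equivalent of the YAML override above.
with dask.config.set({"distributed.scheduler.active-memory-manager.start": True}):
    with LocalCluster(n_workers=2) as cluster, Client(cluster) as client:
        print(client.amm.running())  # True if the scheduler started the AMM
        client.amm.run_once()        # trigger one extra AMM iteration on demand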

Full output

https://github.com/coiled/coiled-runtime/actions/runs/3063841757

Observations on noise (unrelated to AMM)

Given the current volatility in the outcomes of the test suite, 7 runs are not enough to produce clear results. The null hypothesis contains several major flares that are not supposed to be there (see pictures below). Cross-comparison with the barcharts and timeseries reports shows that these flares are caused by one-off instances where one of the two runtimes took twice as much wall clock time, used twice as much memory, or both.

Examples:

Given a flare that doubles memory usage on roughly 1 to 3 out of 7 repeats, it is easy to see that a runtime where the flare happened 3 times will be reported in the A/B plots as a major regression compared to a runtime where it happened only once, and that only a much higher number of repeats would smooth this out and prevent false positives.
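To make that intuition concrete, here is a toy simulation (made-up flare probability and threshold, not the actual benchmark data) of how often two identical configurations would appear to differ by more than 15%:

import numpy as np

rng = np.random.default_rng(0)


def sample(n_repeats: int, p_flare: float = 0.25) -> np.ndarray:
    # A measurement that is normally 1.0 and doubles to 2.0 when a flare happens.
    return np.where(rng.random(n_repeats) < p_flare, 2.0, 1.0)


for n_repeats in (7, 15, 30):
    # Compare two identical configurations (the null hypothesis) many times and
    # count how often their sample means differ by more than 15%.
    diffs = np.array(
        [abs(sample(n_repeats).mean() - sample(n_repeats).mean()) for _ in range(10_000)]
    )
    print(f"{n_repeats:>2} repeats: {(diffs > 0.15).mean():.0%} of null comparisons exceed 15%")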

Everything displayed in these plots is just noise:
[three A/B comparison plots omitted]

AMM-specific observations

Net of noise, the runs with AMM enabled highlight some changes. These can generally be distinguished from noise by the length of their tail: a spread-out flare with a long tail is typical of a high-variance measure, whereas a solid-colour band (very high p-value) followed by a sudden drop denotes a low-variance change.


Enhancements

Test name | Average memory (p>90%) | Peak memory (p>90%)
h2o/test_hd2_benchmarks.py::test_q3[5 GB (parquet)] | -10% to -15% | -15% to -20%
h2o/test_hd2_benchmarks.py::test_q7[5 GB (parquet)] | -10% to -15% | -10% to -15%
h2o/test_hd2_benchmarks.py::test_q8[5 GB (parquet)] | -25% to -30% | -25% to -30%
test_array.py::test_dot_product | -10% to -15% | -10% to -15%
test_dataframe.py::test_shuffle | -5% to -10% | -5% to -10%

[A/B memory plots for the tests listed above omitted]

Noise

[A/B plots classified as noise omitted]

For test_download_throughput[pandas], refer to #339.


fjetter (Member, Author) commented Sep 16, 2022

Thank you, @crusaderky! I believe this is a pretty convincing result. Would you do us the honor of opening a PR to distributed to make this official?

We can also see some significant improvements in wall clock time for tests like test_vorticity (this one works under a lot of memory pressure and spills a lot):

[wall clock time A/B plot for test_vorticity omitted]

ntabris (Member) commented Sep 16, 2022

My understanding is that many of the benchmarks place stress on dask such that we wouldn't expect consistent performance. Is that correct? Are there cases where we would currently expect consistent performance but are seeing higher-than-expected variance / occasional flares? (Or is that not yet determined?)

fjetter (Member, Author) commented Sep 16, 2022

My understanding is that many of the benchmarks place stress on dask such that we wouldn't expect consistent performance. Is that correct?

Yes. For some of these tests we're addressing this in #338

Are there cases where we would currently expect consistent performance but are seeing higher-than-expected variance / occasional flares? (Or is that not yet determined?)

I think #339 is an example of "not yet understood" (or rather, not yet confirmed).
