Integration test that focuses on AMM, informing our decision to toggle it on by default #140
At the time of writing, the AMM consists exclusively of the ReduceReplicas and RetireWorker policies. We should scale up the test at https://github.com/dask/distributed/blob/f7f650154fea29978906c65dd0225415da56ed11/distributed/tests/test_active_memory_manager.py#L1079-L1085

We should also write a test with AMM completely disabled, to give us a term of comparison for performance and stability with vs. without AMM.
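As a minimal sketch, the AMM-disabled run could look like the following. The `distributed.scheduler.active-memory-manager.start` key is the real distributed config setting (disabled is also the current default); the cluster parameters and workload are placeholders.

```python
import dask.config
from distributed import Client, LocalCluster

# Explicitly switch the Active Memory Manager off, so this run can serve
# as the no-AMM term of comparison for performance and stability.
with dask.config.set({"distributed.scheduler.active-memory-manager.start": False}):
    # n_workers=4 is an arbitrary placeholder for the benchmark cluster.
    with LocalCluster(n_workers=4) as cluster, Client(cluster) as client:
        ...  # run the benchmark workload here
```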
Between this issue and #135, there are four use cases to be implemented:
DOD / AC

This story is done when the integration test portrays the behaviour of distributed on Coiled as described above.
I agree with the above; this ticket should exclusively focus on ReduceReplicas (a config sketch follows at the end of this comment).

Questions I would like to be answered

Optional
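Regarding the ReduceReplicas-only focus: a sketch of the scheduler config that enables the AMM with just that policy. The class path and structure mirror the defaults shipped in distributed, but treat this as illustrative rather than the final test setup.

```yaml
distributed:
  scheduler:
    active-memory-manager:
      start: true
      interval: 2s
      policies:
        - class: distributed.active_memory_manager.ReduceReplicas
```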
It's set to whatever is higher:
Executive summary
Test setup

Baseline
AMM

Everything as baseline, with one difference:

```yaml
distributed:
  scheduler:
    active-memory-manager:
      start: true
```

The above enables AMM ReduceReplicas to run every 2 seconds.
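As a side note, one can verify at runtime that the AMM is actually active through the `client.amm` proxy exposed by recent versions of distributed; a small sketch (the scheduler address is a placeholder):

```python
from distributed import Client

client = Client("tcp://scheduler-address:8786")  # placeholder address
# The AMM can be inspected (and toggled) at runtime through this proxy.
assert client.amm.running()
```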
Full output

https://github.com/coiled/coiled-runtime/actions/runs/3063841757

Observations on noise (unrelated to AMM)

Given the current volatility in the outcomes of the test suite, 7 runs are not enough to produce obviously clear results. The null hypothesis contains several major flares that are not supposed to be there (see pictures below). Cross-comparison with the bar-chart and timeseries reports shows that these flares are caused by one-off instances where either runtime took twice as much wall clock time, twice as much memory, or both. Examples:
Given a flare that doubles memory usage and occurs 1 to 3 times over 7 repeats, it's easy to see that a runtime where the flare happened 3 times will be reported in the A/B plots as a major regression compared to a runtime where it happened only once, and that only a much higher number of repeats would smooth it out and prevent false positives. Everything displayed in these plots is just noise:

AMM-specific observations

Net of noise, the runs with AMM enabled highlight some changes, which can generally be distinguished from noise by the length of their tail: a spread-out flare with a long tail is typical of a high-variance measure, whereas measures in solid color (very high p-value) followed by a sudden drop denote a low-variance change.

Enhancements
Noise

For test_download_throughput[pandas], refer to #339.
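To make the noise argument above concrete, here is a hypothetical sketch (not the actual coiled-runtime analysis code) of the kind of significance test behind an A/B plot: with only 7 noisy repeats, a single extra flare can swing the reported outcome.

```python
import numpy as np
from scipy.stats import mannwhitneyu

# Illustrative wall-clock samples in seconds; all values are made up.
baseline = np.array([112.0, 108.5, 115.2, 230.1, 110.9, 109.3, 113.7])   # one flare
candidate = np.array([109.8, 111.4, 228.4, 112.6, 107.9, 110.2, 235.0])  # two flares

# A two-sided rank test: a low p-value here would be reported as a
# regression, even though the only real difference is one extra flare.
stat, pvalue = mannwhitneyu(baseline, candidate, alternative="two-sided")
print(f"p-value: {pvalue:.3f}")
```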
Thank you, @crusaderky! I believe this is a pretty convincing result. Would you do us the honor of opening a PR to distributed to make this official? We can also see some significant improvements in wall clock time for something like test_vorticity (this one works under a lot of memory pressure and spills a lot).
My understanding is that many of the benchmarks place stress on dask such that we wouldn't expect consistent performance. Is that correct? Are there cases where we would currently expect to see consistent performance but are seeing higher-than-expected variance / occasional flares? (Or is that not yet determined?)
Yes. For some of these tests we're addressing this in #338.

I think #339 is an example of "not yet understood" (or rather, not yet confirmed).