
Memory may not shrink fast enough #5840

@crusaderky

Description


This is a follow-up from #5813.

Problem

The spill and pause thresholds, the Active Memory Manager, and rebalance() all rely on process memory shrinking after PyFree is called.

This does not reliably happen on Windows or macOS: process memory remains allocated and is simply reused at the next PyMalloc call.

The situation on Linux was substantially improved in the past by setting the MALLOC_TRIM_THRESHOLD_ environment variable (see https://distributed.dask.org/en/stable/worker.html#memory-not-released-back-to-the-os).
This does not completely remove the issue, particularly for highly fragmented memory, as flakiness in the unit tests demonstrates (see #5848).
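
The underlying behavior can be observed directly. Below is a minimal, hypothetical reproducer (not from the issue; it assumes numpy and psutil are installed): on glibc/Linux with a small MALLOC_TRIM_THRESHOLD_ the RSS typically drops back after the `del`, while on Windows and macOS it tends to stay high.

```python
import numpy as np
import psutil

proc = psutil.Process()

def rss_mib() -> float:
    """Resident set size of this process, in MiB."""
    return proc.memory_info().rss / 2**20

print(f"baseline:   {rss_mib():6.0f} MiB")
x = np.ones(2**27)  # ~1 GiB of float64
print(f"allocated:  {rss_mib():6.0f} MiB")
del x  # Python frees the buffer...
print(f"after free: {rss_mib():6.0f} MiB")  # ...but RSS may not shrink
```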

Production impact

  • Workers may never unpause
  • When a worker hits the spill threshold, it normally spills until it is back below the target threshold. Due to this issue, however, it may instead flush everything to disk.
  • The previous point, in turn, may cause heavy data duplication (see #3756, "Spill to disk may cause data duplication")
  • The Active Memory Manager may misbehave, erroneously targeting workers whose allocated memory is in fact unused
  • rebalance() may misbehave in the same way as the AMM
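
For context, the target, spill, and pause thresholds above are fractions of the worker's memory limit, set through the regular Dask config. The keys and defaults below reflect recent distributed versions and are shown only to make the terminology concrete:

```python
import dask

dask.config.set({
    "distributed.worker.memory.target": 0.60,     # start spilling managed data
    "distributed.worker.memory.spill": 0.70,      # spill based on process (RSS) memory
    "distributed.worker.memory.pause": 0.80,      # stop executing new tasks
    "distributed.worker.memory.terminate": 0.95,  # nanny kills and restarts the worker
})
```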

Possible solutions

  • Find a way to make memory shrink down faster (jemalloc?), or
  • Find a better measure of actually used process memory
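
To illustrate the first option: on glibc-based Linux one can explicitly ask the allocator to return freed arenas to the OS. This is a speculative sketch, not something this issue settled on, and it is a no-op on Windows and macOS:

```python
import ctypes
import ctypes.util

libc_path = ctypes.util.find_library("c")
if libc_path:
    libc = ctypes.CDLL(libc_path)
    if hasattr(libc, "malloc_trim"):  # glibc only
        libc.malloc_trim(0)  # release free heap memory back to the OS
```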

Impacted tests

  • test_worker.py::test_spill_spill_threshold
  • test_worker.py::test_spill_hysteresis (xfails on macOS for this reason)
  • test_worker.py::test_pause_executor (seems stable now with a 400MB slab of unmanaged memory; it was flaky with 250MB)
  • test_scheduler.py::test_memory
  • All tests around rebalance() that don't force the memory measure to managed (see the config sketch below)
  • Most Active Memory Manager tests

The tests are stable at the moment of writing, but they required a lot of effort and stress testing to get there.
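
Forcing the memory measure to managed means telling rebalance() to trust only the Dask-tracked (managed) memory rather than process RSS. Assuming the config key from recent distributed versions, that looks like:

```python
import dask

# Valid measures include "process", "optimistic", "managed", and
# "managed_in_memory"; "managed" ignores unmanaged process memory
# entirely, which makes the rebalance() tests deterministic.
dask.config.set({"distributed.worker.memory.rebalance.measure": "managed"})
```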

The issue is mitigated in the tests by:

  1. using extremely large individual chunks of memory (at least 100MB, but flakiness has been observed even with 250MB)
  2. making the pickled output of the 100MB+ test data smaller than a kB, so that disk write speed cannot have an impact (a sketch of such an object follows this list)
  3. using nannies, to avoid the highly unpredictable memory situation in the main process, which has already run all the other tests
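
Point 2 can be achieved with an object that owns a large in-memory buffer but serializes to just its size. The class below is a hypothetical sketch (the name and details are invented here, not taken from the test suite):

```python
import numpy as np

class BigButCheapToPickle:
    """~100 MiB resident in memory, but pickles to a few bytes."""

    def __init__(self, nbytes: int = 100 * 2**20):
        self.nbytes = nbytes
        # A random payload prevents the OS/allocator from sharing pages
        self.data = np.random.random(nbytes // 8)

    def __reduce__(self):
        # Serialize only the size; the payload is rebuilt on unpickle,
        # so spilling this object to disk is practically instantaneous
        return (type(self), (self.nbytes,))
```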
