Description
This is a follow-up from #5813.
Problem
The spill and pause thresholds, the Active Memory Manager, and rebalance() all rely on process memory shrinking after PyFree is called.
This does not reliably happen on Windows and macOS: the freed memory remains allocated to the process and is reused at the next PyMalloc call.
The situation on Linux was substantially improved in the past by setting the MALLOC_TRIM_THRESHOLD_ environment variable (see https://distributed.dask.org/en/stable/worker.html#memory-not-released-back-to-the-os).
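For reference, a minimal sketch of applying this on a glibc-based Linux system. glibc reads `MALLOC_TRIM_THRESHOLD_` once at process startup, so it must be in the environment before the worker processes are spawned; the `65536` value and the `LocalCluster` usage here are illustrative, not prescriptive:

```python
import os

# glibc reads MALLOC_TRIM_THRESHOLD_ (note the trailing underscore) at
# process startup; setting it here does not retune the current process,
# but worker subprocesses spawned afterwards inherit it.
os.environ["MALLOC_TRIM_THRESHOLD_"] = "65536"

from dask.distributed import LocalCluster

cluster = LocalCluster(n_workers=2, processes=True)
```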
This does not completely remove the issue, particularly for highly fragmented memory, as flakiness in the unit tests demonstrates (see #5848).
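The underlying behaviour can be reproduced outside of Distributed. A minimal sketch (assuming Linux with glibc, and `psutil` installed): RSS often stays high after a heavily fragmented allocation is freed, and only an explicit `malloc_trim` hands it back to the OS:

```python
import ctypes
import psutil

proc = psutil.Process()

def rss_mib() -> float:
    return proc.memory_info().rss / 2**20

print(f"baseline:          {rss_mib():.0f} MiB")

# Allocate ~400 MiB as many small objects, so that freed memory is
# fragmented and whole arenas cannot be returned to the OS.
data = [bytearray(2**12) for _ in range(100_000)]
print(f"allocated:         {rss_mib():.0f} MiB")

del data
print(f"after free:        {rss_mib():.0f} MiB")  # typically still high

# glibc-specific: explicitly release free memory back to the OS.
ctypes.CDLL("libc.so.6").malloc_trim(0)
print(f"after malloc_trim: {rss_mib():.0f} MiB")
```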
Production impact
- Workers may never unpause
- When a worker hits the spill threshold, it normally spills until it is back below the target threshold (see the configuration sketch after this list). Due to this issue, however, it may instead flush everything to disk.
- The previous point, in turn, may cause heavy data duplication (#3756, "Spill to disk may cause data duplication")
- The Active Memory Manager may misbehave, erroneously trying to free up workers whose allocated memory is actually unused
- rebalance() may misbehave analogously to the AMM
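For context, these are the knobs involved, expressed as fractions of the worker memory limit; the values below are the documented defaults at the time of writing:

```python
import dask

dask.config.set({
    "distributed.worker.memory.target": 0.60,     # start spilling keys to disk
    "distributed.worker.memory.spill": 0.70,      # spill based on measured process memory
    "distributed.worker.memory.pause": 0.80,      # pause execution of new tasks
    "distributed.worker.memory.terminate": 0.95,  # the nanny restarts the worker
})
```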
Possible solutions
- Find a way to make memory shrink down faster (jemalloc?), or
- Find a better measure of actually used process memory (see the sketch after this list)
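On the second point, a sketch of what a better measure could look like on Linux with glibc >= 2.33, using `mallinfo2` to separate memory actually in use from memory that is free but still held by the allocator (and therefore still counted in RSS). Note that glibc documents this as covering the main arena only:

```python
import ctypes

class Mallinfo2(ctypes.Structure):
    # Field layout of glibc's struct mallinfo2 (glibc >= 2.33).
    _fields_ = [(name, ctypes.c_size_t) for name in (
        "arena", "ordblks", "smblks", "hblks", "hblkhd",
        "usmblks", "fsmblks", "uordblks", "fordblks", "keepcost",
    )]

libc = ctypes.CDLL("libc.so.6")
libc.mallinfo2.restype = Mallinfo2

info = libc.mallinfo2()
# uordblks: bytes actually in use by the application.
# fordblks: free bytes still held by the allocator, which inflate RSS.
print(f"in use: {info.uordblks / 2**20:.0f} MiB; "
      f"held but free: {info.fordblks / 2**20:.0f} MiB")
```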
Impacted tests
- test_worker.py::test_spill_spill_threshold
- test_worker.py::test_spill_hysteresis (xfails on MacOS for this reason)
- test_worker.py::test_pause_executor (seems stable now with a 400MB slab of unmanaged memory; it was flaky with 250MB)
- test_scheduler.py::test_memory
- All tests around rebalance() that don't force the memory measure to managed
- Most Active Memory Manager tests
The tests are stable at the moment of writing, but making them so required a lot of effort and stress testing.
The issue is mitigated in the tests by:
- using extremely large individual chunks of memory (at least 100MB, but flakiness has been observed with 250MB too)
- making it so the pickled output of the 100MB+ test data is less than a kB, in order to prevent disk write speed from having an impact (see the sketch after this list)
- using nannies, to avoid the highly unpredictable memory situation in the main process, which has been running all the other tests
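For illustration, a hypothetical helper in the spirit of the second point: an object occupying 100MB+ of process memory whose pickled form is a handful of bytes, so spilling it cannot be bottlenecked by disk writes:

```python
import pickle

class LargeButCheapToPickle:
    """Holds `nbytes` of real process memory, but pickles to <1 kB."""

    def __init__(self, nbytes: int = 100 * 2**20):
        self.nbytes = nbytes
        self.data = bytearray(nbytes)  # the actual allocation

    def __reduce__(self):
        # Serialize only the size; the buffer is recreated on unpickle.
        return (type(self), (self.nbytes,))

    def __sizeof__(self):
        return self.nbytes

obj = LargeButCheapToPickle()
assert len(pickle.dumps(obj)) < 1024
```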