
Memory may not shrink fast enough #5840

@crusaderky

Description


This is a follow-up from #5813.

Problem

The spill and pause thresholds, the Active Memory Manager, and rebalance() all rely on process memory shrinking after PyFree is called.

This does not reliably happen on Windows or macOS: process memory remains allocated and is simply reused at the next PyMalloc call.

The situation on Linux was substantially improved in the past by setting the MALLOC_TRIM_THRESHOLD_ environment variable (see https://distributed.dask.org/en/stable/worker.html#memory-not-released-back-to-the-os).
This does not completely remove the issue, particularly for highly fragmented memory, as flakiness in the unit tests demonstrates (see #5848).
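
The underlying behavior can be observed directly. Below is a minimal, hypothetical reproducer (not from the issue; it assumes numpy and psutil are installed): on glibc/Linux with a small MALLOC_TRIM_THRESHOLD_ the RSS typically drops back after the `del`, while on Windows and macOS it tends to stay high.

```python
import numpy as np
import psutil

proc = psutil.Process()

def rss_mib() -> float:
    """Resident set size of this process, in MiB."""
    return proc.memory_info().rss / 2**20

print(f"baseline:   {rss_mib():6.0f} MiB")
x = np.ones(2**27)  # ~1 GiB of float64
print(f"allocated:  {rss_mib():6.0f} MiB")
del x  # Python frees the buffer...
print(f"after free: {rss_mib():6.0f} MiB")  # ...but RSS may not shrink
```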

Production impact

  • Workers may never unpause
  • When a worker hits the spill threshold, it normally spills until it is back below the target threshold. Due to this issue, however, it may instead flush everything to disk.
  • The previous point, in turn, may cause heavy data duplication (see #3756, "Spill to disk may cause data duplication")
  • The Active Memory Manager may misbehave, erroneously targeting workers whose allocated memory is in fact unused
  • rebalance() may misbehave in the same way as the AMM
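
For context, the target, spill, and pause thresholds above are fractions of the worker's memory limit, set through the regular Dask config. The keys and defaults below reflect recent distributed versions and are shown only to make the terminology concrete:

```python
import dask

dask.config.set({
    "distributed.worker.memory.target": 0.60,     # start spilling managed data
    "distributed.worker.memory.spill": 0.70,      # spill based on process (RSS) memory
    "distributed.worker.memory.pause": 0.80,      # stop executing new tasks
    "distributed.worker.memory.terminate": 0.95,  # nanny kills and restarts the worker
})
```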

Possible solutions

  • Find a way to make memory shrink down faster (jemalloc?), or
  • Find a better measure of actually used process memory
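
To illustrate the first option: on glibc-based Linux one can explicitly ask the allocator to return freed arenas to the OS. This is a speculative sketch, not something this issue settled on, and it is a no-op on Windows and macOS:

```python
import ctypes
import ctypes.util

libc_path = ctypes.util.find_library("c")
if libc_path:
    libc = ctypes.CDLL(libc_path)
    if hasattr(libc, "malloc_trim"):  # glibc only
        libc.malloc_trim(0)  # release free heap memory back to the OS
```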

Impacted tests

  • test_worker.py::test_spill_spill_threshold
  • test_worker.py::test_spill_hysteresis (xfails on macOS for this reason)
  • test_worker.py::test_pause_executor (seems stable now with a 400MB slab of unmanaged memory; it was flaky with 250MB)
  • test_scheduler.py::test_memory
  • All tests around rebalance() that don't force the memory measure to managed (see the config sketch below)
  • Most Active Memory Manager tests

The tests are stable at the moment of writing, but they required a lot of effort and stress testing to get there.
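
Forcing the memory measure to managed means telling rebalance() to trust only the Dask-tracked (managed) memory rather than process RSS. Assuming the config key from recent distributed versions, that looks like:

```python
import dask

# Valid measures include "process", "optimistic", "managed", and
# "managed_in_memory"; "managed" ignores unmanaged process memory
# entirely, which makes the rebalance() tests deterministic.
dask.config.set({"distributed.worker.memory.rebalance.measure": "managed"})
```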

The issue is mitigated in the tests by:

  1. using extremely large individual chunks of memory (at least 100MB, but flakiness has been observed even with 250MB)
  2. making the pickled output of the 100MB+ test data smaller than a kB, so that disk write speed cannot have an impact (a sketch of such an object follows this list)
  3. using nannies, to avoid the highly unpredictable memory situation in the main process, which has already run all the other tests
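
Point 2 can be achieved with an object that owns a large in-memory buffer but serializes to just its size. The class below is a hypothetical sketch (the name and details are invented here, not taken from the test suite):

```python
import numpy as np

class BigButCheapToPickle:
    """~100 MiB resident in memory, but pickles to a few bytes."""

    def __init__(self, nbytes: int = 100 * 2**20):
        self.nbytes = nbytes
        # A random payload prevents the OS/allocator from sharing pages
        self.data = np.random.random(nbytes // 8)

    def __reduce__(self):
        # Serialize only the size; the payload is rebuilt on unpickle,
        # so spilling this object to disk is practically instantaneous
        return (type(self), (self.nbytes,))
```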
