Ensure cleanup of many GBs of spilled data on terminate #6280
Conversation
Unit Test Results
16 files ±0   16 suites ±0   7h 24m 0s ⏱️ −21m 36s
For more details on these failures, see this check.
Results for commit 6bede03. ± Comparison against base commit 8411c2d.
@ncclementi @hendrikmakait now it's green; ready for review
LGTM, love the tests. Thanks for fixing this!
```diff
 if memory / self.memory_limit > self.memory_terminate_fraction:
     logger.warning(
-        "Worker exceeded %d%% memory budget. Restarting",
-        100 * self.memory_terminate_fraction,
+        f"Worker {nanny.worker_address} (pid={process.pid}) exceeded "
```
Do we have any strong opinions on the use of f-strings vs. letting the logger handle the formatting?
Whatever is more readable.
For very frequent debug statements, the logger way may be slightly more performant, since it could avoid converting the arguments to strings when the level is disabled. Last time I checked, though, it doesn't have this kind of optimization: everything is converted to a string first, and only then does the logger figure out whether the record goes anywhere.
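A quick way to check the claim above with stdlib `logging`: %-style arguments are only formatted if the record is actually emitted, whereas an f-string is built eagerly regardless of the level. A minimal sketch (the `Expensive` class is purely illustrative):

```python
import logging

class Expensive:
    """Object whose string conversion is costly; we count conversions."""
    def __init__(self):
        self.conversions = 0
    def __str__(self):
        self.conversions += 1
        return "<expensive repr>"

logging.basicConfig(level=logging.WARNING)  # DEBUG records are disabled
logger = logging.getLogger("demo")
obj = Expensive()

# %-style: the logger checks the level first and never formats the argument.
logger.debug("value is %s", obj)
assert obj.conversions == 0

# f-string: the message is built eagerly, before the level check.
logger.debug(f"value is {obj}")
assert obj.conversions == 1
```

For a `warning` that fires at most once per restart, as in this diff, the difference is negligible either way.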
When the worker passes 95% memory, the nanny sends SIGTERM to it. This in turn triggers the atexit callbacks, `__del__` methods, and signal handlers on the worker, which delete all spilled data. If the spilled data is many tens of GBs, however, this cleanup may take more than 200ms.
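A hypothetical sketch of the cleanup pattern described above (not the worker's actual code): a spill directory whose removal is registered with `atexit`, plus a SIGTERM handler that exits via `sys.exit` so the atexit callback actually runs — by default, SIGTERM would terminate the process without running atexit callbacks.

```python
import atexit
import os
import shutil
import signal
import sys
import tempfile

# A stand-in for the worker's on-disk spill directory.
spill_dir = tempfile.mkdtemp(prefix="spill-")

def cleanup():
    """Delete all spilled data on process exit."""
    shutil.rmtree(spill_dir, ignore_errors=True)

atexit.register(cleanup)
# Convert SIGTERM into a clean interpreter shutdown, which runs atexit.
signal.signal(signal.SIGTERM, lambda sig, frame: sys.exit(0))

# Write some "spilled" data; it survives until the process exits.
with open(os.path.join(spill_dir, "partition-0"), "wb") as f:
    f.write(b"\x00" * 1024)
```

With many tens of GBs on disk, the `rmtree` call in the exit path is exactly the step that can take far longer than 200ms.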
Fix a bug where the nanny bombarded the worker with a new SIGTERM every 200ms, which is a very unhealthy thing to do and is likely to interrupt the cleanup, leaving the disk in a dirty state.
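The corrected pattern can be sketched as follows: send SIGTERM once, wait for the full grace period, and only escalate to SIGKILL if the process still hasn't exited. This is a minimal, hypothetical illustration assuming a POSIX system, not the nanny's actual implementation; the helper name `terminate_gracefully` and the timeout value are made up for the example.

```python
import signal
import subprocess
import sys
import time

def terminate_gracefully(proc: subprocess.Popen, timeout: float = 30.0) -> None:
    """Send a single SIGTERM, then give the process time to run its
    atexit/__del__ cleanup before escalating to SIGKILL."""
    proc.send_signal(signal.SIGTERM)
    try:
        proc.wait(timeout=timeout)   # wait once; do NOT re-send SIGTERM
    except subprocess.TimeoutExpired:
        proc.kill()                  # last resort after the grace period
        proc.wait()

# Demo: a child that traps SIGTERM, performs slow "cleanup", then exits.
child = subprocess.Popen([sys.executable, "-c", (
    "import signal, sys, time\n"
    "def handler(sig, frame):\n"
    "    time.sleep(1)  # simulate slow deletion of spilled data\n"
    "    sys.exit(0)\n"
    "signal.signal(signal.SIGTERM, handler)\n"
    "time.sleep(60)\n"
)])
time.sleep(0.5)  # give the child time to install its handler
terminate_gracefully(child, timeout=10.0)
assert child.returncode == 0  # clean exit, cleanup ran to completion
```

Repeating the SIGTERM every 200ms instead would re-enter the handler mid-cleanup, which is precisely the behaviour this PR removes.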