Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ensure cleanup of many GBs of spilled data on terminate #6280

Merged
merged 3 commits into from
May 9, 2022

Conversation

crusaderky
Copy link
Collaborator

@crusaderky crusaderky commented May 5, 2022

When the worker passes 95% memory, the nanny sends SIGTERM to it.
This in turn triggers the atexit callbacks, __del__ methods, and signal handlers on the worker, which delete all spilled data.
If the spilled data is many tens of GBs, however, this cleanup may take more than 200ms.
Fix a bug where the nanny bombards the worker with a new SIGTERM every 200ms, which is a very unhealthy thing to do and is likely to cause the disk not to be cleaned properly.

@crusaderky crusaderky self-assigned this May 5, 2022
@crusaderky crusaderky requested a review from ncclementi May 5, 2022 16:02
@crusaderky
Copy link
Collaborator Author

CC @ncclementi @hendrikmakait

@github-actions
Copy link
Contributor

github-actions bot commented May 5, 2022

Unit Test Results

       16 files  ±  0         16 suites  ±0   7h 24m 0s ⏱️ - 21m 36s
  2 769 tests +  2    2 690 ✔️ +  4       78 💤  - 2  1 ±0 
22 114 runs  +16  21 092 ✔️ +16  1 020 💤  - 1  2 +1 

For more details on these failures, see this check.

Results for commit 6bede03. ± Comparison against base commit 8411c2d.

♻️ This comment has been updated with latest results.

@crusaderky crusaderky marked this pull request as draft May 6, 2022 16:03
@crusaderky crusaderky marked this pull request as ready for review May 8, 2022 22:55
@crusaderky
Copy link
Collaborator Author

@ncclementi @hendrikmakait now it's green; ready for review

Copy link
Member

@hendrikmakait hendrikmakait left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, love the tests. Thanks for fixing this!

if memory / self.memory_limit > self.memory_terminate_fraction:
logger.warning(
"Worker exceeded %d%% memory budget. Restarting",
100 * self.memory_terminate_fraction,
f"Worker {nanny.worker_address} (pid={process.pid}) exceeded "
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have any strong opinions on the use of f-strings vs. letting the logger handle the formatting?

Copy link
Collaborator Author

@crusaderky crusaderky May 9, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whatever is more readable.
For very frequent debug statements, the logger way may be slightly more performant as it could avoid converting the arguments to string if disabled. Last time I checked, it doesn't have this kind of optimization (e.g. first everything is converted to string, and then the logger figures out if it's going to go anywhere).

@crusaderky crusaderky merged commit c10476f into dask:main May 9, 2022
@crusaderky crusaderky deleted the slow_terminate branch May 9, 2022 15:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants