Ensure cleanup of many GBs of spilled data on terminate #6280
Conversation
Unit Test Results
16 files ±0   16 suites ±0   7h 24m 0s ⏱️ −21m 36s
For more details on these failures, see this check.
Results for commit 6bede03. ± Comparison against base commit 8411c2d.
@ncclementi @hendrikmakait now it's green; ready for review
LGTM, love the tests. Thanks for fixing this!
```diff
 if memory / self.memory_limit > self.memory_terminate_fraction:
     logger.warning(
-        "Worker exceeded %d%% memory budget. Restarting",
-        100 * self.memory_terminate_fraction,
+        f"Worker {nanny.worker_address} (pid={process.pid}) exceeded "
```
Do we have any strong opinions on the use of f-strings vs. letting the logger handle the formatting?
Whatever is more readable.
For very frequent debug statements, the logger way may be slightly more performant, since it could avoid converting the arguments to strings when the level is disabled. Last time I checked, though, it doesn't have this kind of optimization: everything is converted to a string first, and only then does the logger figure out whether the record goes anywhere.
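A quick way to check the claim above with stdlib `logging`: %-style arguments are only formatted if the record is actually emitted, whereas an f-string is built eagerly regardless of the level. A minimal sketch (the `Expensive` class is purely illustrative):

```python
import logging

class Expensive:
    """Object whose string conversion is costly; we count conversions."""
    def __init__(self):
        self.conversions = 0
    def __str__(self):
        self.conversions += 1
        return "<expensive repr>"

logging.basicConfig(level=logging.WARNING)  # DEBUG records are disabled
logger = logging.getLogger("demo")
obj = Expensive()

# %-style: the logger checks the level first and never formats the argument.
logger.debug("value is %s", obj)
assert obj.conversions == 0

# f-string: the message is built eagerly, before the level check.
logger.debug(f"value is {obj}")
assert obj.conversions == 1
```

For a `warning` that fires at most once per restart, as in this diff, the difference is negligible either way.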
When the worker passes 95% memory, the nanny sends SIGTERM to it. This in turn triggers the atexit callbacks, `__del__` methods, and signal handlers on the worker, which delete all spilled data. If the spilled data is many tens of GBs, however, this cleanup may take more than 200ms.
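A hypothetical sketch of the cleanup pattern described above (not the worker's actual code): a spill directory whose removal is registered with `atexit`, plus a SIGTERM handler that exits via `sys.exit` so the atexit callback actually runs — by default, SIGTERM would terminate the process without running atexit callbacks.

```python
import atexit
import os
import shutil
import signal
import sys
import tempfile

# A stand-in for the worker's on-disk spill directory.
spill_dir = tempfile.mkdtemp(prefix="spill-")

def cleanup():
    """Delete all spilled data on process exit."""
    shutil.rmtree(spill_dir, ignore_errors=True)

atexit.register(cleanup)
# Convert SIGTERM into a clean interpreter shutdown, which runs atexit.
signal.signal(signal.SIGTERM, lambda sig, frame: sys.exit(0))

# Write some "spilled" data; it survives until the process exits.
with open(os.path.join(spill_dir, "partition-0"), "wb") as f:
    f.write(b"\x00" * 1024)
```

With many tens of GBs on disk, the `rmtree` call in the exit path is exactly the step that can take far longer than 200ms.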
Fix a bug where the nanny bombarded the worker with a new SIGTERM every 200ms, which is a very unhealthy thing to do and is likely to interrupt the cleanup, leaving the disk in a dirty state.
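The corrected pattern can be sketched as follows: send SIGTERM once, wait for the full grace period, and only escalate to SIGKILL if the process still hasn't exited. This is a minimal, hypothetical illustration assuming a POSIX system, not the nanny's actual implementation; the helper name `terminate_gracefully` and the timeout value are made up for the example.

```python
import signal
import subprocess
import sys
import time

def terminate_gracefully(proc: subprocess.Popen, timeout: float = 30.0) -> None:
    """Send a single SIGTERM, then give the process time to run its
    atexit/__del__ cleanup before escalating to SIGKILL."""
    proc.send_signal(signal.SIGTERM)
    try:
        proc.wait(timeout=timeout)   # wait once; do NOT re-send SIGTERM
    except subprocess.TimeoutExpired:
        proc.kill()                  # last resort after the grace period
        proc.wait()

# Demo: a child that traps SIGTERM, performs slow "cleanup", then exits.
child = subprocess.Popen([sys.executable, "-c", (
    "import signal, sys, time\n"
    "def handler(sig, frame):\n"
    "    time.sleep(1)  # simulate slow deletion of spilled data\n"
    "    sys.exit(0)\n"
    "signal.signal(signal.SIGTERM, handler)\n"
    "time.sleep(60)\n"
)])
time.sleep(0.5)  # give the child time to install its handler
terminate_gracefully(child, timeout=10.0)
assert child.returncode == 0  # clean exit, cleanup ran to completion
```

Repeating the SIGTERM every 200ms instead would re-enter the handler mid-cleanup, which is precisely the behaviour this PR removes.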