Restart paused workers after a certain timeout #5999
Comments
Implementing this will have unpleasant consequences when the failure to get out of the paused state is not due to memory leaks but to running out of disk space. In that scenario, today, the cluster is deadlocked.
A timeout to retirement is being discussed in #6252. It would not help in this use case, as workers would cycle in and out of the retiring state periodically.
This ticket heavily interacts with:
I know this issue has been open for a while now; apologies for waiting so long to reply. I'm a bit hesitant to pull the trigger on this feature. I'm aware of two sources of significant unmanaged memory:

If either of these sources flooded available RSS, the worker would need to be restarted to free up memory, ideally gracefully. I'm working under the assumption that 2.) is fixed by #6780, which leaves 1.), and I'm not entirely convinced that 1.) alone justifies implementing this feature.

For all other cases, where RSS is full not because of unmanaged memory but because of managed memory + disk, I don't think restarting is the best approach. I believe this is what you are outlining in #5999 (comment), but I have to admit that I didn't fully understand the context of this scenario.

I acknowledge that if all workers are full of memory and paused, regardless of the root cause of the memory pressure, the cluster would deadlock. However, this is a situation we could deal with by, for instance, issuing a warning on the client or even failing the graph (the scheduler knows if all workers are paused).
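The scheduler-side warning suggested above hinges on one simple check: is every worker paused? A minimal sketch of that predicate, as a hypothetical standalone helper (not an actual `distributed` API), could look like this:

```python
def cluster_deadlocked(worker_statuses):
    """Return True if every worker is paused, i.e. no worker can accept tasks.

    worker_statuses: mapping of worker address -> status string,
    e.g. "running" or "paused". An empty cluster is not treated as
    deadlocked here; that is a design choice of this sketch.
    """
    return bool(worker_statuses) and all(
        status == "paused" for status in worker_statuses.values()
    )


print(cluster_deadlocked({"tcp://a:1": "paused", "tcp://b:2": "paused"}))   # True
print(cluster_deadlocked({"tcp://a:1": "paused", "tcp://b:2": "running"}))  # False
```

In practice the scheduler already tracks worker status, so a check like this could run periodically and emit a client-visible warning (or fail the graph) when it fires.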
Use case
Workers pause once they hit the `distributed.worker.memory.pause` threshold (default 80%). This is meant to give a worker time to spill managed data to disk and free up memory so computation can continue; in other words, it covers the case where tasks produce managed data faster than it can be spilled to the backing disk.

However, if memory is full due to unmanaged memory, the worker will never unpause and will remain dead weight forever. In real life, this typically happens when a library leaks memory.
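For context, the pause threshold sits alongside the other memory thresholds in the distributed config. A typical layout, with what I believe are the documented defaults (worth double-checking against your installed version):

```yaml
distributed:
  worker:
    memory:
      target: 0.60     # start spilling managed data to disk
      spill: 0.70      # spill more aggressively
      pause: 0.80      # pause task execution (the threshold discussed here)
      terminate: 0.95  # nanny kills the worker outright
```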
Design
If a worker remains paused beyond a certain user-defined timeout, restart it through graceful worker retirement. This implies moving all of its managed data out of the spill file and to other workers.
[EDIT] Ideally, this timeout should start ticking only once all managed data has been spilled to disk, or once spilling has failed (due to disk full or max_spill).
AC
Caveats
Rejected ideas
Persistent spill buffer which is retrieved after restart. This would prevent the above caveat.
Related