Doc on handling worker with walltime #481
Conversation
Thanks, this looks like great additional documentation! I've pointed out some typos and some suggestions to clarify the language a bit, but otherwise this looks good!
- when you don't have a lot of room on you HPC platform and have only a few workers at a time (less than what you were hopping for when using scale or adapt). These workers will be killed (and others started) before you workload ends.
- when you really don't know how long your workload will take: all your workers could be killed before reaching the end. In this case, you'll want to use adaptive clusters so that Dask ensures some workers are always up.

If you don't set the proper parameters, you'll run into KilleWorker exceptions in those two cases.
Typo in the exception name: KilleWorker -> KilledWorker.
If you don't set the proper parameters, you'll run into KilleWorker exceptions in those two cases.

The solution to this problem is to tell Dask up front that the workers have a finit life time:
Typo: finit -> finite. Similarly lifetime is usually spelled as a single word.
The solution to this problem is to tell Dask up front that the workers have a finit life time:

- Use `--lifetime` worker option. This will enables infinite workloads using adaptive. Workers will be properly shut down before the scheduling system kills them, and all their states moved.
enables -> enable
In dask-jobqueue, every worker processes run inside a job, and all jobs have a time limit in job queueing systems.
Should be "every worker process runs..."
In dask-jobqueue, every worker processes run inside a job, and all jobs have a time limit in job queueing systems.
Reaching walltime can be troublesome in several cases:

- when you don't have a lot of room on you HPC platform and have only a few workers at a time (less than what you were hopping for when using scale or adapt). These workers will be killed (and others started) before you workload ends.
hopping -> hoping, "on you HPC" -> "on your HPC", and "before you workload" -> "before your workload"
The solution to this problem is to tell Dask up front that the workers have a finit life time:

- Use `--lifetime` worker option. This will enables infinite workloads using adaptive. Workers will be properly shut down before the scheduling system kills them, and all their states moved.
- Use `--lifetime-stagger` when dealing with many workers (say > 20): this will allow to avoid workers all terminating at the same time, and so to ease rebalancing tasks and scheduling burden.
"this will allow to avoid workers all" -> "this will prevent workers from"
"and so to ease" -> "and so ease" or (probably better) "thus"
cluster.adapt(minimum=0, maximum=200)

Here is an example of a workflow taking advantage of this, if you wan't to give it a try or adapt it to your use case:
wan't -> want
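To make the advice under review concrete, here is a minimal sketch of how the `--lifetime` and `--lifetime-stagger` worker options could be combined with an adaptive cluster. The `SLURMCluster` keyword arguments shown in the comments (in particular the name of the argument used to forward worker CLI flags) are assumptions and may differ between dask-jobqueue versions:

```python
# Sketch only: choose a worker lifetime a few minutes shorter than the
# job walltime, with a stagger so workers do not all retire at once.
walltime = "01:00:00"  # time limit requested from the job queueing system

# CLI flags forwarded to each dask worker (option names from the docs above)
worker_args = [
    "--lifetime", "55m",         # retire cleanly before the walltime kill
    "--lifetime-stagger", "4m",  # spread shutdowns to ease rebalancing
]

# With dask-jobqueue this would look roughly like (hypothetical kwargs):
#
#   from dask_jobqueue import SLURMCluster
#   cluster = SLURMCluster(cores=4, memory="8GB", walltime=walltime,
#                          worker_extra_args=worker_args)
#   cluster.adapt(minimum=0, maximum=200)  # workers come and go safely

print(" ".join(worker_args))
```

With a lifetime shorter than the walltime, each worker hands its state off before the scheduler's kill, so adaptive scaling can keep an effectively infinite workload running without KilledWorker errors.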
Many thanks @mivade! I need to practice my english...
Thank you for putting this together, @guillaumeeb!
Finally, a little contribution from me, and a doc fix for a long-standing issue.
Fixes #122.