
Doc on handling worker with walltime #481

Merged
merged 3 commits into dask:master
Jan 24, 2021

Conversation

guillaumeeb
Member

Finally, a little contribution from me, and a doc fix to a long standing issue.

Fixes #122.


@mivade mivade left a comment


Thanks, this looks like great additional documentation! I've pointed out some typos and made some suggestions to clarify the language a bit, but otherwise this looks good!

- when you don't have a lot of room on you HPC platform and have only a few workers at a time (less than what you were hopping for when using scale or adapt). These workers will be killed (and others started) before you workload ends.
- when you really don't know how long your workload will take: all your workers could be killed before reaching the end. In this case, you'll want to use adaptive clusters so that Dask ensures some workers are always up.

If you don't set the proper parameters, you'll run into KilleWorker exceptions in those two cases.

Typo in the exception.


If you don't set the proper parameters, you'll run into KilleWorker exceptions in those two cases.

The solution to this problem is to tell Dask up front that the workers have a finit life time:

Typo: finit -> finite. Similarly lifetime is usually spelled as a single word.


The solution to this problem is to tell Dask up front that the workers have a finit life time:

- Use `--lifetime` worker option. This will enables infinite workloads using adaptive. Workers will be properly shut down before the scheduling system kills them, and all their states moved.

enables -> enable


In dask-jobqueue, every worker processes run inside a job, and all jobs have a time limit in job queueing systems.

Should be "every worker process runs..."

In dask-jobqueue, every worker processes run inside a job, and all jobs have a time limit in job queueing systems.
Reaching walltime can be troublesome in several cases:

- when you don't have a lot of room on you HPC platform and have only a few workers at a time (less than what you were hopping for when using scale or adapt). These workers will be killed (and others started) before you workload ends.

hopping -> hoping and "before you workload" -> "before your workload"

The solution to this problem is to tell Dask up front that the workers have a finit life time:

- Use `--lifetime` worker option. This will enables infinite workloads using adaptive. Workers will be properly shut down before the scheduling system kills them, and all their states moved.
- Use `--lifetime-stagger` when dealing with many workers (say > 20): this will allow to avoid workers all terminating at the same time, and so to ease rebalancing tasks and scheduling burden.

"this will allow to avoid workers all" -> "this will prevent workers from"

"and so to ease" -> "and so ease" or (probably better) "thus"

cluster.adapt(minimum=0, maximum=200)


Here is an example of a workflow taking advantage of this, if you wan't to give it a try or adapt it to your use case:

wan't -> want
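For what the "example of a workflow" mentioned above might look like end to end, here is a hedged sketch (the computation, array sizes, and chunking are illustrative and not taken from the PR; it reuses the `cluster` object from the sketch further up):

```python
from dask.distributed import Client
import dask.array as da

# Connect a client to the adaptive, lifetime-aware cluster defined earlier.
client = Client(cluster)

# A long-running computation: individual workers may reach their lifetime and
# retire while it runs, adaptivity starts replacements, and their state is
# migrated, so the workload can outlive any single job's walltime.
x = da.random.random((50_000, 50_000), chunks=(5_000, 5_000))
result = (x + x.T).mean().compute()
print(result)
```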

@guillaumeeb
Member Author

Many thanks @mivade! I need to practice my English...

Member

@andersy005 andersy005 left a comment


Thank you for putting this together, @guillaumeeb!

@andersy005 andersy005 added the documentation Documentation-related label Jan 24, 2021
@guillaumeeb guillaumeeb merged commit 69f27ac into dask:master Jan 24, 2021
Labels
documentation Documentation-related
Development

Successfully merging this pull request may close these issues.

Handling workers with expiring allocation requests
3 participants