Feat: Improve UX of pytorch-elastic plugin by configuring reasonable defaults #2543
…emory volume Signed-off-by: Fabio Grätz <fabiogratz@googlemail.com>
+1 from me, but will let Thomas give the final go.
And thanks for the blurb!
This looks good. There seem to be merge conflicts.
Minor nit, otherwise LGTM
…ved if disable in task config Signed-off-by: Fabio Grätz <fabiogratz@googlemail.com>
d69c146 to d6941d5
…task config to add it Signed-off-by: Fabio Grätz <fabiogratz@googlemail.com>
adc3eaa to cadd905
LGTM
This is awesome. cc @thomasjpfan @samhita-alla: can we improve all our examples to drop the volume config etc.?
…defaults (flyteorg#2543)

* Add flag to Elastic and PyTorch task config which configures shared memory volume
* Set reasonable default timeouts for pytorch elastic task config
* Lint
* Lint
* Add shm size upper limit to docstring
* Correct docstring: multi-threading -> multi-processing
* Add kubernetes dep
* Refactor the check of num containers as proposed in code review
* Lint
* Test that explicitly configured pod template with shm vol is not removed if disable in task config
* Raise if the user explicitly configured shm vol mount and still sets task config to add it

Signed-off-by: Fabio Grätz <fabiogratz@googlemail.com>
Co-authored-by: Fabio Grätz <fabiogratz@googlemail.com>

…defaults (flyteorg#2543): same commit message as above, with the additional trailer
Signed-off-by: mao3267 <chenvincent610@gmail.com>
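The last two commit bullets describe how the flag interacts with a user-supplied pod template. The sketch below is one way that logic could look, not the plugin's actual code. It models the pod spec as a plain dict instead of the Kubernetes client types, and the names `apply_shared_memory`, `has_shm_mount`, the volume name `shm`, and the choice to mount into every container are all assumptions made for illustration.

```python
import copy
from typing import Any, Dict, Optional

SHM_PATH = "/dev/shm"


def has_shm_mount(pod_spec: Dict[str, Any]) -> bool:
    """Return True if any container already mounts something at /dev/shm."""
    return any(
        mount.get("mountPath") == SHM_PATH
        for container in pod_spec.get("containers", [])
        for mount in container.get("volumeMounts", [])
    )


def apply_shared_memory(
    pod_spec: Optional[Dict[str, Any]], increase_shared_mem: bool
) -> Optional[Dict[str, Any]]:
    """Hypothetical helper: add a memory-backed emptyDir volume at /dev/shm."""
    if not increase_shared_mem:
        # Flag disabled: leave whatever the user configured untouched.
        return pod_spec

    # Assumption: with no user pod spec, start from a single placeholder container.
    spec = copy.deepcopy(pod_spec) if pod_spec is not None else {"containers": [{"name": "primary"}]}

    if has_shm_mount(spec):
        # The user configured the mount explicitly and also asked us to add it.
        raise ValueError(
            "A volume is already mounted at /dev/shm; disable the flag or remove the mount."
        )

    spec.setdefault("volumes", []).append({"name": "shm", "emptyDir": {"medium": "Memory"}})
    for container in spec.setdefault("containers", []):
        container.setdefault("volumeMounts", []).append({"name": "shm", "mountPath": SHM_PATH})
    return spec
```

With the flag disabled the helper returns the user's spec unchanged, which mirrors the "not removed" test bullet. With the flag enabled and an existing `/dev/shm` mount it raises, which mirrors the final bullet.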
Tracking issue
Closes flyteorg/flyte#5339
Why are the changes needed?
As outlined in the tracking issue, when using the pytorch elastic plugin, users almost always have to configure the following on their own: a shared memory volume in the pod template, and larger join timeouts.
What changes were proposed in this pull request?
Configure reasonable defaults to improve the UX for users:
* Add a flag to `task_config=Elastic()` and `PyTorch()` which allows adding such a shared memory volume to the pod template. The flag defaults to true.
* Configure reasonable join timeouts of 15 minutes.
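To make the two defaults concrete, here is a minimal, self-contained sketch of a task config carrying them. It is a hypothetical stand-in, not the plugin's class: the class name `ElasticConfigSketch`, the field name `increase_shared_mem`, and the `rdzv_configs` keys `timeout` and `join_timeout` are assumptions. The only values taken from the description are that the flag defaults to true and the join timeouts default to 15 minutes.

```python
from dataclasses import dataclass, field
from typing import Any, Dict

FIFTEEN_MINUTES_S = 15 * 60  # 15 minutes expressed in seconds


@dataclass
class ElasticConfigSketch:
    """Hypothetical stand-in for the Elastic task config (not the plugin's class)."""

    nnodes: int = 1
    nproc_per_node: int = 1
    # Assumed flag name: add a shared memory volume to the pod template by default.
    increase_shared_mem: bool = True
    # Assumed key names for the rendezvous timeouts, in seconds.
    rdzv_configs: Dict[str, Any] = field(
        default_factory=lambda: {
            "timeout": FIFTEEN_MINUTES_S,
            "join_timeout": FIFTEEN_MINUTES_S,
        }
    )


# Relying on the defaults: no volume or timeout configuration needed.
default_cfg = ElasticConfigSketch(nnodes=2, nproc_per_node=4)

# Opting out of the volume and raising the join timeout to 30 minutes.
custom_cfg = ElasticConfigSketch(
    nnodes=2,
    increase_shared_mem=False,
    rdzv_configs={"join_timeout": 30 * 60},
)
```

The `default_factory` gives each config its own timeout dict, so overriding the timeouts on one task does not leak into another.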
15 minutes was chosen as an estimate of the difference in startup time between a pod that is immediately assigned to a running node with the image already pulled (a few seconds) and a pod that requires a node to be scaled up and the image to be pulled.
If users require a larger timeout, they can increase the values, but a gang scheduler is likely the better choice, as described here.
How was this patch tested?
Sentence for the release notes:
@wild-endeavor