Feat: Improve UX of pytorch-elastic plugin by configuring reasonable defaults #2543

Merged: 12 commits merged into master from fg91/feat/kf-pytorch-reasonable-defaults on Jul 21, 2024

Conversation

@fg91 (Member) commented Jul 1, 2024

Tracking issue

Closes flyteorg/flyte#5339

Why are the changes needed?

As outlined in the tracking issue, when using the pytorch elastic plugin, users almost always have to configure the following on their own:

  1. A pod template with a volume/volume mount that increases the shared memory size, because the container's default shared memory segment is very often too small when torch multiprocessing is used (e.g. for multi-process data loader workers). A sketch of this manual workaround follows the list.
  2. A higher join timeout, because some worker pods often start more quickly than others, causing the workers' rendezvous to fail with a timeout.
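
For context, this is roughly what users previously had to configure by hand for item 1 (a sketch only; the container name, volume name, and exact pod template wiring are illustrative assumptions, not taken from this PR):

```python
from flytekit import PodTemplate, task
from flytekitplugins.kfpytorch import Elastic
from kubernetes.client import (
    V1Container,
    V1EmptyDirVolumeSource,
    V1PodSpec,
    V1Volume,
    V1VolumeMount,
)

# Enlarge /dev/shm by mounting a memory-backed emptyDir volume into the
# task's primary container.
shm_pod_template = PodTemplate(
    pod_spec=V1PodSpec(
        containers=[
            V1Container(
                name="primary",  # assumed primary container name
                volume_mounts=[V1VolumeMount(name="dshm", mount_path="/dev/shm")],
            )
        ],
        volumes=[
            V1Volume(name="dshm", empty_dir=V1EmptyDirVolumeSource(medium="Memory"))
        ],
    )
)


@task(
    task_config=Elastic(nnodes=2, nproc_per_node=4),
    pod_template=shm_pod_template,
)
def train() -> None:
    ...
```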

What changes were proposed in this pull request?

Configure reasonable defaults to improve the UX for users:

  • Add an increase_shared_mem flag to task_config=Elastic() and PyTorch() which adds such a shared memory volume to the pod template. The flag defaults to True. (See the usage sketch after this list.)

  • Configure a reasonable default join timeout of 15 minutes.

    15 minutes was chosen as an estimate of the startup-time difference between a pod that is immediately scheduled onto a running node with the image already pulled (a few seconds) and a pod that has to wait for a node to be scaled up and for the image to be pulled.

    If users require a larger timeout, they can of course increase the value, but in that case they should probably use a gang scheduler instead, as described here.
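
A minimal usage sketch of the new behavior (the nnodes/nproc_per_node values are illustrative, and the "join_timeout" key is an assumption based on torch elastic's rendezvous configuration, not confirmed by this PR):

```python
from flytekit import task
from flytekitplugins.kfpytorch import Elastic


# With the new defaults, no pod template or timeout tuning is needed:
# /dev/shm is enlarged automatically and the rendezvous join timeout is 15 minutes.
@task(task_config=Elastic(nnodes=2, nproc_per_node=4))
def train_with_defaults() -> None:
    ...


# Opt out of the shared-memory volume (e.g. when bringing your own pod template)
# and/or override the join timeout (assumed to be given in seconds).
@task(
    task_config=Elastic(
        nnodes=2,
        nproc_per_node=4,
        increase_shared_mem=False,
        rdzv_configs={"join_timeout": 1800},
    )
)
def train_with_overrides() -> None:
    ...
```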

How was this patch tested?

  • Added unit tests
  • Ran pytorch and elastic pytorch tasks in a cluster and ensured that the volume/volume mount is added

Sentence for the release notes:

@wild-endeavor

The distributed pytorch and distributed elastic-pytorch tasks in flytekitplugins-kfpytorch now increase the shared memory limit by default, mounting an emptyDir volume with medium Memory to /dev/shm, as this is almost always required when working with torch multiprocessing (e.g. multi-process data loader workers or the local worker group in distributed training). To disable this, pass increase_shared_mem=False to task_config=PyTorch(...)/Elastic(...). Elastic tasks now also set a default join timeout of 15 minutes to prevent timeouts when some worker pods require a node scale-up. This setting can be modified via task_config=Elastic(rdzv_configs={...}).
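
The same flag is available on the non-elastic PyTorch config; a brief sketch (the Worker import and the replica count are assumptions about the plugin API, not taken from this PR):

```python
from flytekit import task
from flytekitplugins.kfpytorch import PyTorch, Worker


# Disable the automatic shared-memory volume for a (non-elastic) distributed
# pytorch task, e.g. because a custom pod template already provides it.
@task(task_config=PyTorch(worker=Worker(replicas=2), increase_shared_mem=False))
def train_pytorch() -> None:
    ...
```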

Check all the applicable boxes

  • I updated the documentation accordingly.
  • All new and existing tests passed.
  • All commits are signed-off.

Fabio Grätz added 2 commits July 1, 2024 22:10
Add flag to Elastic and PyTorch task config which configures shared memory volume

Signed-off-by: Fabio Grätz <fabiogratz@googlemail.com>
Signed-off-by: Fabio Grätz <fabiogratz@googlemail.com>
Signed-off-by: Fabio Grätz <fabiogratz@googlemail.com>
Signed-off-by: Fabio Grätz <fabiogratz@googlemail.com>
@fg91 fg91 self-assigned this Jul 1, 2024
@fg91 fg91 added the enhancement label Jul 1, 2024
@fg91 fg91 changed the title from "Fg91/feat/kf pytorch reasonable defaults" to "Feat: Improve UX of pytorch-elastic plugin by configuring reasonable defaults" on Jul 2, 2024
Signed-off-by: Fabio Grätz <fabiogratz@googlemail.com>
@fg91 fg91 requested a review from Future-Outlier as a code owner July 3, 2024 06:31
Signed-off-by: Fabio Grätz <fabiogratz@googlemail.com>
@fg91 fg91 requested a review from thomasjpfan July 3, 2024 06:44
@wild-endeavor (Contributor) commented:
+1 from me, but will let thomas give the final go.

@wild-endeavor (Contributor) commented:
and thanks for the blurb!

@thomasjpfan (Member) left a comment:
This looks good. There seem to be merge conflicts.

Fabio Grätz and others added 4 commits July 18, 2024 20:09
Signed-off-by: Fabio Grätz <fabiogratz@googlemail.com>
Signed-off-by: Fabio Grätz <fabiogratz@googlemail.com>
Signed-off-by: Fabio Grätz <fabiogratz@googlemail.com>
@fg91 fg91 requested a review from thomasjpfan July 18, 2024 18:36
@thomasjpfan previously approved these changes Jul 18, 2024
@thomasjpfan (Member) left a comment:
Minor nit, otherwise LGTM

Test that explicitly configured pod template with shm vol is not removed if disabled in task config

Signed-off-by: Fabio Grätz <fabiogratz@googlemail.com>
@fg91 fg91 force-pushed the fg91/feat/kf-pytorch-reasonable-defaults branch from d69c146 to d6941d5 on July 19, 2024 16:17
Raise if the user explicitly configured shm vol mount and still sets task config to add it

Signed-off-by: Fabio Grätz <fabiogratz@googlemail.com>
@fg91 fg91 force-pushed the fg91/feat/kf-pytorch-reasonable-defaults branch from adc3eaa to cadd905 on July 19, 2024 16:27
@fg91 fg91 requested a review from thomasjpfan July 19, 2024 16:32
@thomasjpfan (Member) left a comment:
LGTM

@fg91 fg91 merged commit efed441 into master Jul 21, 2024
46 of 47 checks passed
@kumare3 (Contributor) commented Jul 21, 2024

This is awesome! cc @thomasjpfan @samhita-alla: can we update all our examples to drop the volume config, etc.?

Mecoli1219 pushed a commit to Mecoli1219/flytekit that referenced this pull request Jul 27, 2024
Feat: Improve UX of pytorch-elastic plugin by configuring reasonable defaults (flyteorg#2543)

* Add flag to Elastic and PyTorch task config which configures shared memory volume

Signed-off-by: Fabio Grätz <fabiogratz@googlemail.com>

* Set reasonable default timeouts for pytorch elastic task config

Signed-off-by: Fabio Grätz <fabiogratz@googlemail.com>

* Lint

Signed-off-by: Fabio Grätz <fabiogratz@googlemail.com>

* Lint

Signed-off-by: Fabio Grätz <fabiogratz@googlemail.com>

* Add shm size upper limit to docstring

Signed-off-by: Fabio Grätz <fabiogratz@googlemail.com>

* Correct docstring: multi-threading -> multi-processing

Signed-off-by: Fabio Grätz <fabiogratz@googlemail.com>

* Add kubernetes dep

Signed-off-by: Fabio Grätz <fabiogratz@googlemail.com>

* Refactor the check of num containers as proposed in code review

Signed-off-by: Fabio Grätz <fabiogratz@googlemail.com>

* Lint

Signed-off-by: Fabio Grätz <fabiogratz@googlemail.com>

* Test that explicitly configured pod template with shm vol is not removed if disable in task config

Signed-off-by: Fabio Grätz <fabiogratz@googlemail.com>

* Raise if the user explicitly configured shm vol mount and still sets task config to add it

Signed-off-by: Fabio Grätz <fabiogratz@googlemail.com>

---------

Signed-off-by: Fabio Grätz <fabiogratz@googlemail.com>
Co-authored-by: Fabio Grätz <fabiogratz@googlemail.com>
mao3267 pushed a commit to mao3267/flytekit that referenced this pull request Jul 29, 2024
Feat: Improve UX of pytorch-elastic plugin by configuring reasonable defaults (flyteorg#2543)

Signed-off-by: mao3267 <chenvincent610@gmail.com>
Labels: enhancement (New feature or request)