Refactor Strategy Script Launcher and Deadlock handling #12797

Closed
sisilmehta2000 opened this issue Apr 18, 2022 · 3 comments
Labels
design Includes a design discussion strategy

Comments

@sisilmehta2000
Contributor

sisilmehta2000 commented Apr 18, 2022

Proposed refactor

Currently, when the training scripts are spawned from the strategy itself (as in strategies/ddp.py), the strategy:

  1. Creates the launcher with _SubprocessScriptLauncher(self.cluster_environment, self.num_processes, self.num_nodes).
  2. Detects and handles deadlocks for the launched processes directly in the strategies/ddp.py file.

We can refactor this logic so that other strategies can reuse it.
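
To make the two responsibilities above concrete, here is a minimal sketch of what "launch the script N times and avoid a deadlock when one process dies" can look like once it is pulled out of a strategy. It is not the code in strategies/ddp.py: the function name `launch_and_watch`, its parameters, and the policy of terminating the surviving processes are all assumptions made for illustration, as is passing the rank through a `LOCAL_RANK` environment variable.

```python
import os
import subprocess
import time
from typing import List, Sequence


def launch_and_watch(
    command: Sequence[str], num_processes: int, poll_interval: float = 0.1
) -> List[int]:
    """Start ``num_processes`` copies of ``command`` and return their exit codes.

    Hypothetical helper: if any copy exits with a non-zero code, the copies
    that are still running are terminated so they do not wait forever on a
    peer that is already gone.
    """
    procs = []
    for local_rank in range(num_processes):
        env = os.environ.copy()
        env["LOCAL_RANK"] = str(local_rank)  # assumed way of passing the rank
        procs.append(subprocess.Popen(list(command), env=env))

    while True:
        codes = [p.poll() for p in procs]  # None means "still running"
        failed = any(c is not None and c != 0 for c in codes)
        if failed:
            for p in procs:
                if p.poll() is None:
                    p.terminate()
        if failed or all(c is not None for c in codes):
            break
        time.sleep(poll_interval)

    # Collect the final exit code of every process, in rank order.
    return [p.wait() for p in procs]
```

Because nothing in this helper depends on a particular strategy, any strategy that spawns scripts could call it, which is the kind of reuse this refactor is after.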

Motivation

This will enable us to share the script launch and deadlock handling logic with the FSDP Native Strategy #12447 or any other strategies in the future.

Pitch

We can take one of the following approaches:

  1. Move into the parallel strategy (strategies/parallel.py)
  2. Move into standalone libraries that are referenced in the strategy

I am open to other ideas that come up in the discussion on this issue.
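
As a rough illustration of the second approach, the sketch below shows strategies that hold a reference to a launcher object instead of implementing the launch logic themselves. Every name here (`Launcher`, `SequentialLauncher`, `ToyStrategy`, `ToyDDPStrategy`, `ToyFSDPStrategy`) is hypothetical and not part of Lightning, and the toy launcher simply calls the function once per rank in the current process so the example stays runnable.

```python
from typing import Any, Callable, List, Protocol


class Launcher(Protocol):
    """Hypothetical interface a standalone launcher would expose."""

    def launch(self, function: Callable[..., Any], *args: Any) -> Any:
        ...


class SequentialLauncher:
    """Toy launcher: calls the function once per rank, in this process."""

    def __init__(self, num_processes: int) -> None:
        self.num_processes = num_processes

    def launch(self, function: Callable[..., Any], *args: Any) -> List[Any]:
        return [function(rank, *args) for rank in range(self.num_processes)]


class ToyStrategy:
    """Stand-in for a strategy that delegates launching to a launcher."""

    name = "base"

    def __init__(self, launcher: Launcher) -> None:
        self.launcher = launcher

    def run(self, function: Callable[..., Any]) -> Any:
        # The strategy only decides what to run; how it is launched is
        # entirely the launcher's concern.
        return self.launcher.launch(function, self.name)


class ToyDDPStrategy(ToyStrategy):
    name = "ddp"


class ToyFSDPStrategy(ToyStrategy):
    name = "fsdp"


def train(rank: int, strategy_name: str) -> str:
    return f"{strategy_name}:rank{rank}"


launcher = SequentialLauncher(num_processes=2)
ddp_results = ToyDDPStrategy(launcher).run(train)
fsdp_results = ToyFSDPStrategy(launcher).run(train)
```

Both toy strategies share the same launcher instance without any launch code of their own. With the first approach the shared logic would instead live in a common base class, which ties it to the strategy class hierarchy.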

Additional context



cc @tchaton @justusschock @awaelchli @Borda @akihironitta

@sisilmehta2000 sisilmehta2000 added the needs triage Waiting to be triaged by maintainers label Apr 18, 2022
@carmocca carmocca added design Includes a design discussion strategy and removed needs triage Waiting to be triaged by maintainers labels Apr 19, 2022
@carmocca
Contributor

Hi! I think this makes sense and we already wanted to pursue this direction:

Move into standalone libraries that are referenced in the strategy

Can you elaborate on your proposal in a bit more depth?

@stale

stale bot commented Jun 6, 2022

This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, Pytorch Lightning Team!

@stale stale bot added the won't fix This will not be worked on label Jun 6, 2022
@carmocca carmocca added this to the future milestone Jun 6, 2022
@stale stale bot removed the won't fix This will not be worked on label Jun 6, 2022
@awaelchli awaelchli self-assigned this Dec 26, 2022
@awaelchli
Contributor

See #16410 for a concrete plan for how we will move forward with this feature :)

@awaelchli awaelchli removed this from the future milestone Jan 18, 2023
@carmocca carmocca added this to the 2.0 milestone Jan 19, 2023