Proposed refactor
Currently, when the training scripts are spawned from the strategy itself (as in strategies/ddp.py), the strategy:
Uses _SubprocessScriptLauncher(self.cluster_environment, self.num_processes, self.num_nodes) to launch the processes.
Implements deadlock detection and handling for the launched processes directly in strategies/ddp.py.
We can refactor this logic so it can be reused in other strategies.
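To make the current pattern concrete, here is a minimal, self-contained sketch of a subprocess script launcher, loosely modeled on the behavior of Lightning's _SubprocessScriptLauncher. The class name, method names, and the LOCAL_RANK environment variable handling are illustrative assumptions, not Lightning's actual API.

```python
import os
import subprocess
import sys

# Illustrative sketch only: a launcher that re-invokes the current training
# script once per additional local process, mimicking the script-launch
# pattern currently embedded in strategies/ddp.py.
class SubprocessScriptLauncher:
    def __init__(self, num_processes: int, num_nodes: int) -> None:
        self.num_processes = num_processes
        self.num_nodes = num_nodes

    def launch(self) -> list:
        """Spawn one child process per local rank > 0 (rank 0 is this process)."""
        procs = []
        for local_rank in range(1, self.num_processes):
            # Each child sees its rank through an environment variable
            # (hypothetical name), so the script can configure itself.
            env = {**os.environ, "LOCAL_RANK": str(local_rank)}
            procs.append(subprocess.Popen([sys.executable] + sys.argv, env=env))
        return procs
```

Because nothing in this sketch depends on DDP specifics, the same object could in principle be constructed by any strategy that launches scripts, which is the point of the proposed refactor.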
Motivation
This will enable us to share the script-launch and deadlock-handling logic with the FSDP Native Strategy (#12447) or any other strategies in the future.
Pitch
We can take one of the following approaches:
Move the logic into the parallel strategy (strategies/parallel.py).
Move it into a standalone module that is referenced from each strategy.
We are open to other ideas that come up in the discussion on this issue.
Additional context
If you enjoy Lightning, check out our other projects! ⚡
Metrics: Machine learning metrics for distributed, scalable PyTorch applications.
Lite: Enables pure PyTorch users to scale their existing code on any kind of device while retaining full control over their own loops and optimization logic.
Flash: The fastest way to get a Lightning baseline! A collection of tasks for fast prototyping, baselining, fine-tuning, and solving problems with deep learning.
Bolts: Pretrained SOTA Deep Learning models, callbacks, and more for research and production with PyTorch Lightning and PyTorch.
Lightning Transformers: Flexible interface for high-performance research using SOTA Transformers leveraging PyTorch Lightning, Transformers, and Hydra.
cc @tchaton @justusschock @awaelchli @Borda @akihironitta