Proposed refactor
Currently, when the training scripts are spawned from the strategy itself (as in strategies/ddp.py), the strategy:
Uses _SubprocessScriptLauncher(self.cluster_environment, self.num_processes, self.num_nodes) to launch the processes.
Implements deadlock detection and handling for the launched processes directly in strategies/ddp.py.
We can refactor this logic so it can be reused in other strategies.
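To make the current pattern concrete, here is a minimal, self-contained sketch of a subprocess script launcher, loosely modeled on the behavior of Lightning's _SubprocessScriptLauncher. The class name, method names, and the LOCAL_RANK environment variable handling are illustrative assumptions, not Lightning's actual API.

```python
import os
import subprocess
import sys

# Illustrative sketch only: a launcher that re-invokes the current training
# script once per additional local process, mimicking the script-launch
# pattern currently embedded in strategies/ddp.py.
class SubprocessScriptLauncher:
    def __init__(self, num_processes: int, num_nodes: int) -> None:
        self.num_processes = num_processes
        self.num_nodes = num_nodes

    def launch(self) -> list:
        """Spawn one child process per local rank > 0 (rank 0 is this process)."""
        procs = []
        for local_rank in range(1, self.num_processes):
            # Each child sees its rank through an environment variable
            # (hypothetical name), so the script can configure itself.
            env = {**os.environ, "LOCAL_RANK": str(local_rank)}
            procs.append(subprocess.Popen([sys.executable] + sys.argv, env=env))
        return procs
```

Because nothing in this sketch depends on DDP specifics, the same object could in principle be constructed by any strategy that launches scripts, which is the point of the proposed refactor.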
Motivation
This will enable us to share the script-launch and deadlock-handling logic with the FSDP Native Strategy (#12447) or any other strategies in the future.
Pitch
We can take one of the following approaches:
Move the logic into the parallel strategy (strategies/parallel.py).
Move it into a standalone module that is referenced from each strategy.
We are open to other ideas that come up in the discussion on this issue.
Additional context
If you enjoy Lightning, check out our other projects! ⚡
Metrics: Machine learning metrics for distributed, scalable PyTorch applications.
Lite: Enables pure PyTorch users to scale their existing code on any kind of device while retaining full control over their own loops and optimization logic.
Flash: The fastest way to get a Lightning baseline! A collection of tasks for fast prototyping, baselining, fine-tuning, and solving problems with deep learning.
Bolts: Pretrained SOTA Deep Learning models, callbacks, and more for research and production with PyTorch Lightning and PyTorch.
Lightning Transformers: Flexible interface for high-performance research using SOTA Transformers leveraging PyTorch Lightning, Transformers, and Hydra.
cc @tchaton @justusschock @awaelchli @Borda @akihironitta