
Fault-tolerant checklist #9130

Closed
15 of 38 tasks
awaelchli opened this issue Aug 26, 2021 · 4 comments
Labels
fault tolerance feature Is an improvement or enhancement let's do it! approved to implement

Comments

@awaelchli
Contributor

awaelchli commented Aug 26, 2021

🚀 Fault-tolerant training progress tracker

Fault-tolerant training in PL can be activated by setting the environment variable:

PL_FAULT_TOLERANT_TRAINING=1
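As a minimal sketch, the variable can be set from Python before the Trainer is created; `fault_tolerance_enabled` below is a hypothetical helper for illustration, and the exact parsing inside PL may differ:

```python
import os

# Enable fault-tolerant training before the Trainer is instantiated.
# The feature is gated on this environment variable (experimental).
os.environ["PL_FAULT_TOLERANT_TRAINING"] = "1"

def fault_tolerance_enabled() -> bool:
    # Hypothetical mirror of the flag check; PL's own parsing may accept
    # other truthy values.
    return os.environ.get("PL_FAULT_TOLERANT_TRAINING", "0") == "1"

print(fault_tolerance_enabled())  # True once the variable is set
```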

Description of the architecture around fault-tolerant training

Open questions:

  • Do we need to capture torch.cuda.get_rng_state too?
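For context on this question: a resumed run is only bit-reproducible if every RNG the training loop consumes is captured in the checkpoint and restored on reload. A stdlib-only sketch of that round trip is below; a real PL checkpoint would also need `torch.get_rng_state()` and, per the open question, presumably `torch.cuda.get_rng_state()` for each device (those calls are only referenced in comments here, as assumptions):

```python
import random

def capture_rng_states() -> dict:
    # In a real checkpoint this dict would also hold "torch", "numpy",
    # and possibly "torch.cuda" states.
    return {"python": random.getstate()}

def restore_rng_states(states: dict) -> None:
    random.setstate(states["python"])

random.seed(0)
saved = capture_rng_states()
a = random.random()        # draw taken after the "checkpoint"
restore_rng_states(saved)  # simulate resuming from the checkpoint
b = random.random()        # replays the exact same draw
print(a == b)  # True
```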

Supplementary documents:
State of fault-tolerant training in PyTorch Lightning

cc @Borda @carmocca @justusschock @awaelchli @ninginthecloud

@awaelchli awaelchli added feature Is an improvement or enhancement help wanted Open to be worked on labels Aug 26, 2021
@awaelchli awaelchli added this to the v1.5 milestone Aug 26, 2021
@awaelchli awaelchli removed the help wanted Open to be worked on label Aug 26, 2021
@tchaton
Contributor

tchaton commented Aug 26, 2021

Hey @ananthsub,

As fault-tolerant training is quite complex, would you like someone on your side to be involved too?

Best,
T.C

@aazzolini

Do we have a design document for the proposed work?
Fault tolerance is very easy to get wrong. We have spent a lot of time at Facebook ensuring fault tolerance in our internal toolkit so we have some idea of multiple things that could go wrong.
It would be great if we went through a careful design review for this one.

@ananthsub
Contributor

ananthsub commented Sep 2, 2021

@awaelchli @tchaton - we should cover metrics here as well. I am not clear on the assumptions for where/when metric states are synced, whether they are synced to rank 0 before saving, and whether we assume the checkpoint is saved from rank zero only.

https://github.com/PyTorchLightning/pytorch-lightning/blob/35876bb75f27eb8f44220afd4bfa757a0432d233/pytorch_lightning/trainer/connectors/checkpoint_connector.py#L137-L141

Does this hold for DeepSpeed or other use cases where we save part of the LightningModule state dict across ranks?

@tchaton
Contributor

tchaton commented Sep 2, 2021

Do we have a design document for the proposed work?
Fault tolerance is very easy to get wrong. We have spent a lot of time at Facebook ensuring fault tolerance in our internal toolkit so we have some idea of multiple things that could go wrong.
It would be great if we went through a careful design review for this one.

Hey @aazzolini,

I entirely agree, this is definitely very error prone! We are working on this with @awaelchli and will share it with you as soon as possible.
We would be happy to hold a recorded meeting to present the code too, to ease onboarding.
We would really appreciate it if Facebook could participate in supporting this feature, as I believe PyTorch needs improvements to make this process simpler.

Hey @ananthsub,

We haven't investigated the impact of sharding on TorchMetrics yet and added fault tolerance only for non-sharded models in our fault-tolerant v0.
IMO, we should have a mechanism to prevent sharding by default for Metric, as each rank accumulates its own state.

My intuition is that this assumption should still hold. Before saving, we accumulate the states on all ranks, so each rank will contain the accumulated states. On reload, we just reset the metric on non-rank-0 processes.

But it might become more intricate if metric states get sharded, which might result in state collisions.
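The save/reload flow described above can be sketched with plain-Python stand-ins (`ToyMetric`, `sync_before_save`, and `reload` are hypothetical names; a real implementation would all-reduce metric state with `torch.distributed` and check the actual process rank):

```python
class ToyMetric:
    """Minimal stand-in for a per-rank accumulating metric."""
    def __init__(self):
        self.total = 0.0

    def update(self, value: float) -> None:
        self.total += value

    def reset(self) -> None:
        self.total = 0.0

def sync_before_save(metrics):
    # Stand-in for an all-reduce: every rank ends up with the global sum,
    # so whichever rank writes the checkpoint holds the accumulated state.
    global_total = sum(m.total for m in metrics)
    for m in metrics:
        m.total = global_total

def reload(metrics):
    # On resume, reset everywhere except rank 0 to avoid double counting.
    for rank, m in enumerate(metrics):
        if rank != 0:
            m.reset()

ranks = [ToyMetric(), ToyMetric()]  # one metric object per simulated rank
ranks[0].update(1.0)
ranks[1].update(2.0)
sync_before_save(ranks)  # both ranks now hold 3.0
reload(ranks)            # rank 1 is reset; the global total stays 3.0
print(sum(m.total for m in ranks))  # 3.0
```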

Best,
T.C

@carmocca carmocca closed this as not planned Won't fix, can't repro, duplicate, stale Mar 16, 2023
@carmocca carmocca removed this from the future milestone Mar 16, 2023