Fault-tolerant checklist #9130
Comments
Hey @ananthsub, as fault-tolerant training is quite complex, would you like someone on your side to be involved too? Best,
Do we have a design document for the proposed work?
@awaelchli @tchaton - we should cover metrics here as well. I am not clear on the assumptions for where/when metric states are synced, whether they're synced to rank 0 before saving, and whether we assume that we save the checkpoint from rank 0 only. This does not hold for DeepSpeed or other use cases where we save a part of the LightningModule state dict across ranks, does it?
Hey @aazzolini, I entirely agree, this is definitely very error prone! We are working on this with @awaelchli and will share it with you ASAP.
Hey @ananthsub, we haven't investigated the impact of sharding on TorchMetrics yet and only added fault tolerance for non-sharded models in our fault-tolerant V0. My intuition is that this assumption should still hold. Before saving, we accumulate the states on all ranks, so each rank contains the accumulated states. On reload, we just reset the metric on non-rank-0 processes. But it might become more intricate if metric states get sharded, which might result in state collisions. Best,
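For illustration, here is a minimal sketch of the accumulate-before-save / reset-on-reload approach described above (not Lightning's actual implementation; the helper names and the assumption of a purely additive metric state are illustrative):

```python
import torch
import torch.distributed as dist


def sync_metric_state_before_save(metric_state: torch.Tensor) -> torch.Tensor:
    """Accumulate an additive metric state across all ranks before checkpointing."""
    if dist.is_available() and dist.is_initialized():
        dist.all_reduce(metric_state, op=dist.ReduceOp.SUM)
    return metric_state  # every rank now holds the accumulated state


def reset_metric_state_after_load(metric_state: torch.Tensor) -> torch.Tensor:
    """On reload, keep the accumulated state on rank 0 only and reset it elsewhere."""
    if dist.is_available() and dist.is_initialized() and dist.get_rank() != 0:
        metric_state.zero_()  # avoid double counting on the next sync
    return metric_state
```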
🚀 Fault-tolerant training progress tracker
Fault-tolerant training in PL can be activated by setting the PL_FAULT_TOLERANT_TRAINING=1 environment variable.
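For reference, a minimal sketch of how opting in could look from user code, assuming the PL_FAULT_TOLERANT_TRAINING environment variable exposed by the experimental feature:

```python
import os

from pytorch_lightning import Trainer

# Opt in before the Trainer starts; "1" enables the experimental fault-tolerant mode.
os.environ["PL_FAULT_TOLERANT_TRAINING"] = "1"

trainer = Trainer(max_epochs=10)
# trainer.fit(model, datamodule=dm)  # on failure, rerunning the script resumes from the auto-saved state
```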
Description of the architecture around fault-tolerant training
progress tracking in loops
progress restart of loops
capturing state of iterable datasets
capturing state of map-style datasets (a rough sketch of the general idea follows the TODO links below)
mid epoch restart
failure on mid training epoch loop
failure on mid validation epoch loop: [Feat] Add Fault Tolerant Training for ValidationLoop. #9563, Support skipping to validation #9681
epoch end restart
reloading multiple train dataloaders with individual random states (see add fault-tolerance for global random state in map-style datasets #8950 for todos in added tests)
Fault-tolerant training with multiple dataloaders (Fault tolerant training with multiple dataloaders #11349)
parametrize the important tests in test_auto_restart.py with num_workers > 0 and run as part of slow tests (Slow CI #9086)
end-user guide - how to use and description of limitations ([doc] Add Fault Tolerant Documentation Page #9256)
benchmark iterations/s for small and large models with fault-tolerant training enabled vs. disabled
add test case for different batch structures (dict, list, etc.) fix state extraction from batch when fault-tolerant training #9281
fix progress bar tracking on restart fix progress bar restart with fault-tolerant training enabled #9310
Add logic to resume OptimizerLoop? Is this needed/wanted?
Fix num_workers > 0 causing repeated random state when resuming
Fix TODOs
https://github.com/PyTorchLightning/pytorch-lightning/blob/ce00053002a1bb5385f7e44fceea97af50313d4c/tests/loops/test_loops.py#L899-L900
https://github.com/PyTorchLightning/pytorch-lightning/blob/5841ca97825bd9786ab84d70f0abfa6e673528b4/tests/trainer/connectors/test_signal_connector.py#L27
https://github.com/PyTorchLightning/pytorch-lightning/blob/5841ca97825bd9786ab84d70f0abfa6e673528b4/tests/utilities/test_auto_restart.py#L1181-L1183
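As referenced in the map-style dataset item above, a rough sketch of the general idea behind capturing dataset/sampler state: snapshot the global RNG state from the start of the epoch plus the number of batches already consumed, then fast-forward on restart. The helper names below are illustrative, not the actual Lightning internals:

```python
import torch
from torch.utils.data import RandomSampler


def capture_sampler_state(epoch_rng_state: torch.Tensor, num_batches_fetched: int) -> dict:
    """Snapshot taken at failure time: the global RNG state recorded at the start
    of the epoch plus how many batches were already consumed."""
    return {"rng_state": epoch_rng_state, "num_batches_fetched": num_batches_fetched}


def restore_sampler(state: dict, dataset, batch_size: int):
    """Rebuild the sampler under the original epoch RNG, then skip what was consumed."""
    torch.set_rng_state(state["rng_state"])
    sampler_iter = iter(RandomSampler(dataset))  # draws its seed from the restored global RNG
    for _ in range(state["num_batches_fetched"] * batch_size):
        next(sampler_iter)  # same permutation as the original run, fast-forwarded
    return sampler_iter
```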
Open questions:
Should we capture torch.cuda.get_rng_state too?
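If the answer turns out to be yes, capturing the CUDA state alongside the CPU state could look roughly like this (public torch APIs only; the dict layout is illustrative):

```python
import torch


def collect_rng_states() -> dict:
    states = {"torch": torch.get_rng_state()}
    if torch.cuda.is_available():
        states["torch.cuda"] = torch.cuda.get_rng_state_all()  # one state per visible device
    return states


def set_rng_states(states: dict) -> None:
    torch.set_rng_state(states["torch"])
    if torch.cuda.is_available() and "torch.cuda" in states:
        torch.cuda.set_rng_state_all(states["torch.cuda"])
```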
Supplementary documents:
State of fault-tolerant training in PyTorch Lightning
cc @Borda @carmocca @justusschock @awaelchli @ninginthecloud