Allow training to resume even if RNG states are not properly loaded #14994

sgugger · 2021-12-30T21:40:02Z

What does this PR do?

This PR allows training to resume even if loading the RNG state fails in multi-GPU DataParallel mode because fewer GPUs are used than during the original training.

Fixes #14554
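
For illustration only, the behavior described above can be sketched without PyTorch: try to restore the saved RNG states and, if that fails, log the error and keep going instead of aborting the resume. This is not the code merged in this PR. The helper name `restore_rng_states` is hypothetical, and `random.Random` instances stand in for the per-GPU generators as an assumption made to keep the sketch self-contained.

```python
import logging
import random

logger = logging.getLogger(__name__)


def restore_rng_states(generators, saved_states):
    """Try to restore one saved state per generator; return True on success.

    Hypothetical helper: `generators` stands in for the per-GPU RNGs and
    `saved_states` for the list of states stored in the checkpoint.
    """
    try:
        if len(saved_states) != len(generators):
            raise RuntimeError(
                f"checkpoint has {len(saved_states)} RNG states, "
                f"but {len(generators)} devices are available"
            )
        for gen, state in zip(generators, saved_states):
            gen.setstate(state)
        return True
    except Exception as e:
        # Log the failure and carry on instead of aborting the resume.
        logger.warning(
            f"Could not restore the RNG states because of the following error:\n {e}\n"
            "Training will resume, but results will differ from an uninterrupted run."
        )
        return False


# The original run used 4 "devices"; the first resumed run only has 2.
saved = [random.Random(seed).getstate() for seed in range(4)]
resumed_ok = restore_rng_states([random.Random() for _ in range(2)], saved)

# With a matching device count the states are restored as usual.
same_count_gens = [random.Random() for _ in range(4)]
same_count_ok = restore_rng_states(same_count_gens, saved)
```

In this sketch a mismatch in device count only costs reproducibility of the random stream, and the caller can inspect the return value to decide whether to warn the user further.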


sgugger requested a review from LysandreJik December 30, 2021 21:40

sgugger added 2 commits December 30, 2021 16:40

Allow training to resume even if RNG states are not properly loaded

fee2472

Proper f-string

ebdb910

sgugger merged commit e68c375 into master Dec 30, 2021

sgugger deleted the fix_rng_multigpu branch December 30, 2021 22:03

stevhliu pushed a commit to stevhliu/transformers that referenced this pull request Jan 6, 2022

Allow training to resume even if RNG states are not properly loaded (h…

2f79c04

…uggingface#14994) * Allow training to resume even if RNG states are not properly loaded * Proper f-string

muzhi1991 mentioned this pull request Feb 15, 2022

fix bug for the log of RNG states are not properly loaded lead to exception. #15638

Merged

5 tasks
