
fix: non-atomic checkpoint save #27820

Merged
1 commit merged into huggingface:main on Dec 8, 2023
Conversation

@thundergolfer (Contributor) commented Dec 4, 2023

What does this PR do?

transformers.trainer.Trainer currently does not do atomic checkpointing. On checkpoint save, the code creates a checkpoint directory and then progressively adds more files to it. An exception can occur for various reasons in the middle of the checkpoint save process, leaving the checkpoint directory in an invalid state.

Some examples that can result in partial checkpoint save:

  • Out of disk space
  • Pre-emption of the host
  • Serialization issues in trainer components

On restore, the trainer just looks up the latest available checkpoint by directory name. It does not check for and ignore directories with partial checkpoints.
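
For context, checkpoint discovery on resume looks roughly like the simplified sketch below (modeled on transformers.trainer_utils.get_last_checkpoint, not the exact code). It selects purely by directory name and never validates contents, so a partially written checkpoint-N directory gets picked up, while a tmp-checkpoint-N staging directory wouldn't match the pattern at all:

    import os
    import re

    # Simplified sketch of checkpoint discovery (modeled on
    # transformers.trainer_utils.get_last_checkpoint, not the exact code).
    # Directories are matched purely by name; their contents are never validated.
    _RE_CHECKPOINT = re.compile(r"^checkpoint-(\d+)$")

    def get_last_checkpoint(folder):
        candidates = [
            path
            for path in os.listdir(folder)
            if _RE_CHECKPOINT.match(path) and os.path.isdir(os.path.join(folder, path))
        ]
        if not candidates:
            return None
        latest = max(candidates, key=lambda p: int(_RE_CHECKPOINT.match(p).group(1)))
        return os.path.join(folder, latest)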

This change adjusts the _save_checkpoint method to write everything into a temporary staging directory, which is renamed to its final name only after all save operations have succeeded.

I don't put the temporary directory in /tmp/ because that is sometimes on a different filesystem mount from where checkpoints are saved, and renaming across filesystems produces OSError: [Errno 18] Invalid cross-device link. I don't use shutil.move because it is not atomic for directories.
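
A minimal sketch of the pattern (illustrative only; the actual change is inside Trainer._save_checkpoint, and write_files here is a stand-in for the existing save calls):

    import os

    # Minimal sketch of the atomic-save pattern in this PR, not the exact
    # Trainer code. write_files stands in for the existing save calls
    # (model weights, optimizer/scheduler state, trainer_state.json, ...).
    def save_checkpoint_atomically(run_dir, step, write_files):
        output_dir = os.path.join(run_dir, f"checkpoint-{step}")
        # Stage into a sibling directory so it lives on the same filesystem
        # as the final destination and os.rename() stays atomic.
        staging_output_dir = os.path.join(run_dir, f"tmp-checkpoint-{step}")

        os.makedirs(staging_output_dir, exist_ok=True)
        write_files(staging_output_dir)

        # Only after every file has been written successfully does the
        # directory get its final name; a crash before this point leaves
        # only the tmp-checkpoint-{step} directory, which resume ignores.
        os.rename(staging_output_dir, output_dir)
        return output_dir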

Fixes: # (didn't bother making one, just submitted PR :))

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline, Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

@muellerzr, @pacman100

@thundergolfer (Contributor Author)

Test failures appear unrelated to the diff:

    @classmethod
    def setUpClass(cls):
        cls.user = "huggingface-hub-ci"
        cls.token = os.getenv("HUGGINGFACE_PRODUCTION_USER_TOKEN", None)
    
        if cls.token is None:
>           raise ValueError("Cannot run tests as secret isn't setup.")
E           ValueError: Cannot run tests as secret isn't setup.

tests/test_modeling_utils.py:1178: ValueError


ERROR tests/test_modeling_utils.py::ModelOnTheFlyConversionTester::test_safetensors_on_the_fly_conversion - ValueError: Cannot run tests as secret isn't setup.
ERROR tests/test_modeling_utils.py::ModelOnTheFlyConversionTester::test_safetensors_on_the_fly_conversion_gated - ValueError: Cannot run tests as secret isn't setup.

Also, the checkpoint loading code could arguably use more sophisticated checkpoint validation and automatic fallback in cases of validation failure, but that's a follow-up issue.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

@muellerzr (Contributor) left a comment

Thanks @thundergolfer! Overall this looks good to me, but just to be sure on the tests, can you try rebasing from main? (And push with --force.)

@thundergolfer (Contributor Author) commented Dec 5, 2023

Thanks @muellerzr. Just rebased.

Edit: OK, that revealed an issue! Do the CI tests fail fast? I didn't see test failures related to my diff before 🤔

@muellerzr (Contributor)

@thundergolfer you should see them now. Looks like you may not be cleaning up the temp dirs correctly during testing? (I'd also maybe remove the temp dir after you're done needing it, if possible.)

@thundergolfer (Contributor Author) commented Dec 5, 2023

@muellerzr there are actually some test cases that reveal a behavior change in this diff.

With this diff, a trainer run that tries to reuse an existing checkpoint directory will now fail. I'm not sure how common this is in practice, but one of the test cases writes checkpoints 5, 10, 15, 20 and then restarts training from checkpoint 15. When the trainer reaches the checkpoint at step 20 again, it fails, because my diff does not support overwriting an existing checkpoint.

For now I can adjust the change to make it compatible with non-empty destination checkpoint directories, and log a warning that the checkpoint will be non-atomic.

The alternative of deleting everything in the destination directory is destructive and could break things for people.
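
A rough sketch of that fallback (hypothetical helper name; the real change lives inside _save_checkpoint): if the destination already exists and is non-empty, write into it directly and warn; otherwise stage into the tmp- directory and rename at the end.

    import logging
    import os

    logger = logging.getLogger(__name__)

    # Hypothetical helper sketching the fallback described above; not the
    # exact Trainer code. Returns (staging_output_dir, output_dir).
    def choose_checkpoint_dirs(run_dir, step):
        output_dir = os.path.join(run_dir, f"checkpoint-{step}")
        if os.path.isdir(output_dir) and os.listdir(output_dir):
            # Destination already has files (e.g. training resumed from an
            # earlier checkpoint and reached this step again): save into it
            # directly and warn that the save is no longer atomic.
            logger.warning(
                "Checkpoint destination directory %s already exists and is "
                "non-empty; saving will proceed but will not be atomic.",
                output_dir,
            )
            return output_dir, output_dir
        # Normal case: stage into a sibling tmp- directory; the caller renames
        # it to output_dir once all files are written.
        return os.path.join(run_dir, f"tmp-checkpoint-{step}"), output_dir

When the staging and final directories are the same, the final rename is simply skipped.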

@thundergolfer (Contributor Author)

@muellerzr alright, tests are passing now. There are still code quality CI steps failing because they can't find a token. That doesn't seem like something I can fix? Do I need to rebase again?

@muellerzr (Contributor)

I don't quite see the token issue; can you run pip install -e .[quality] -U; make style; make quality?

@muellerzr (Contributor)

You can also try rebasing again, as I assume this got fixed somewhere else.

@thundergolfer (Contributor Author)

On the build documentation step, if I click through I see:

[screenshot of the build documentation step]

@thundergolfer (Contributor Author)

Will rebase

@thundergolfer (Contributor Author) commented Dec 5, 2023

pip install -e .[quality] -U; make style; make quality works on my branch (exit 0, no diff). The CI issues appear to be a step setup problem, not an actual failure.

@muellerzr (Contributor) left a comment

Thanks!

@ArthurZucker (Collaborator) left a comment

Thanks a lot, looks nice! 🤗

@ArthurZucker merged commit 4c5ed1d into huggingface:main on Dec 8, 2023
21 checks passed
@jonathanasdf commented Dec 12, 2023

Hmm, I'm getting an error when running with DeepSpeed. Is this what #27929 is addressing?

Saving model checkpoint to /scratch/sanitycheck-20231211/tmp-checkpoint-10
Trainer.model is not a `PreTrainedModel`, only saving its state dict.
tokenizer config file saved in /scratch/sanitycheck-20231211/tmp-checkpoint-10/tokenizer_config.json
Special tokens file saved in /scratch/sanitycheck-20231211/tmp-checkpoint-10/special_tokens_map.json
[2023-12-12 06:20:12,261] [INFO] [logging.py:96:log_dist] [Rank 0] [Torch] Checkpoint global_step10 is about to be saved!
[2023-12-12 06:20:12,266] [INFO] [logging.py:96:log_dist] [Rank 0] Saving model checkpoint: /scratch/sanitycheck-20231211/tmp-checkpoint-10/global_step10/zero_pp_rank_0_mp_rank_00_model_states.pt
[2023-12-12 06:20:12,266] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving /scratch/sanitycheck-20231211/tmp-checkpoint-10/global_step10/zero_pp_rank_0_mp_rank_00_model_states.pt...
[2023-12-12 06:20:12,268] [INFO] [torch_checkpoint_engine.py:23:save] [Torch] Saved /scratch/sanitycheck-20231211/tmp-checkpoint-10/global_step10/zero_pp_rank_0_mp_rank_00_model_states.pt.
[2023-12-12 06:20:12,269] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving /scratch/sanitycheck-20231211/tmp-checkpoint-10/global_step10/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt...
[2023-12-12 06:20:12,270] [INFO] [torch_checkpoint_engine.py:23:save] [Torch] Saved /scratch/sanitycheck-20231211/tmp-checkpoint-10/global_step10/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt.
[2023-12-12 06:20:12,271] [INFO] [engine.py:3421:_save_zero_checkpoint] zero checkpoint saved /scratch/sanitycheck-20231211/tmp-checkpoint-10/global_step10/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt
[2023-12-12 06:20:12,274] [INFO] [torch_checkpoint_engine.py:33:commit] [Torch] Checkpoint global_step10 is ready now!
Traceback (most recent call last):
  File "/opt/venv/lib/python3.11/site-packages/transformers/trainer.py", line 1543, in train
    return inner_training_loop(
           ^^^^^^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.11/site-packages/transformers/trainer.py", line 1942, in _inner_training_loop
    self._maybe_log_save_evaluate(tr_loss, model, trial, epoch, ignore_keys_for_eval)
  File "/opt/venv/lib/python3.11/site-packages/transformers/trainer.py", line 2302, in _maybe_log_save_evaluate
    self._save_checkpoint(model, trial, metrics=metrics)
  File "/opt/venv/lib/python3.11/site-packages/transformers/trainer.py", line 2405, in _save_checkpoint
    self.state.save_to_json(os.path.join(staging_output_dir, TRAINER_STATE_NAME))
  File "/opt/venv/lib/python3.11/site-packages/transformers/trainer_callback.py", line 114, in save_to_json
    with open(json_path, "w", encoding="utf-8") as f:
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
FileNotFoundError: [Errno 2] No such file or directory: '/scratch/sanitycheck-20231211/tmp-checkpoint-10/trainer_state.json'
