
fix: non-atomic checkpoint save #27820

Merged
1 commit merged into huggingface:main on Dec 8, 2023
Conversation

@thundergolfer (Contributor) commented Dec 4, 2023

What does this PR do?

transformers.trainer.Trainer currently does not do atomic checkpointing. On checkpoint save, the code creates a checkpoint directory and then progressively adds more files to it. An exception can occur for various reasons in the middle of the checkpoint save process, leaving the checkpoint directory in an invalid state.

Some examples that can result in partial checkpoint save:

  • Out of disk space
  • Pre-emption of the host
  • Serialization issues in trainer components

On restore, the trainer just looks up the latest available checkpoint by directory name. It does not check for and ignore directories with partial checkpoints.
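
For context, checkpoint discovery on resume looks roughly like the simplified sketch below (modeled on transformers.trainer_utils.get_last_checkpoint, not the exact code). It selects purely by directory name and never validates contents, so a partially written checkpoint-N directory gets picked up, while a tmp-checkpoint-N staging directory wouldn't match the pattern at all:

    import os
    import re

    # Simplified sketch of checkpoint discovery (modeled on
    # transformers.trainer_utils.get_last_checkpoint, not the exact code).
    # Directories are matched purely by name; their contents are never validated.
    _RE_CHECKPOINT = re.compile(r"^checkpoint-(\d+)$")

    def get_last_checkpoint(folder):
        candidates = [
            path
            for path in os.listdir(folder)
            if _RE_CHECKPOINT.match(path) and os.path.isdir(os.path.join(folder, path))
        ]
        if not candidates:
            return None
        latest = max(candidates, key=lambda p: int(_RE_CHECKPOINT.match(p).group(1)))
        return os.path.join(folder, latest)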

This change adjusts the _save_checkpoint method to write everything into a temporary staging directory, which is renamed to its final name only after all save operations have succeeded.

I don't put the temporary directory in /tmp/ because that is sometimes on a different filesystem mount from where checkpoints are saved, and renaming across filesystems produces OSError: [Errno 18] Invalid cross-device link. I don't use shutil.move because it is not atomic for directories.
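
A minimal sketch of the pattern (illustrative only; the actual change is inside Trainer._save_checkpoint, and write_files here is a stand-in for the existing save calls):

    import os

    # Minimal sketch of the atomic-save pattern in this PR, not the exact
    # Trainer code. write_files stands in for the existing save calls
    # (model weights, optimizer/scheduler state, trainer_state.json, ...).
    def save_checkpoint_atomically(run_dir, step, write_files):
        output_dir = os.path.join(run_dir, f"checkpoint-{step}")
        # Stage into a sibling directory so it lives on the same filesystem
        # as the final destination and os.rename() stays atomic.
        staging_output_dir = os.path.join(run_dir, f"tmp-checkpoint-{step}")

        os.makedirs(staging_output_dir, exist_ok=True)
        write_files(staging_output_dir)

        # Only after every file has been written successfully does the
        # directory get its final name; a crash before this point leaves
        # only the tmp-checkpoint-{step} directory, which resume ignores.
        os.rename(staging_output_dir, output_dir)
        return output_dir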

Fixes: # (didn't bother making one, just submitted PR :))

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline, Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

@muellerzr, @pacman100

@thundergolfer (Contributor Author)

Test failures appear unrelated to the diff:

    @classmethod
    def setUpClass(cls):
        cls.user = "huggingface-hub-ci"
        cls.token = os.getenv("HUGGINGFACE_PRODUCTION_USER_TOKEN", None)
    
        if cls.token is None:
>           raise ValueError("Cannot run tests as secret isn't setup.")
E           ValueError: Cannot run tests as secret isn't setup.

tests/test_modeling_utils.py:1178: ValueError


ERROR tests/test_modeling_utils.py::ModelOnTheFlyConversionTester::test_safetensors_on_the_fly_conversion - ValueError: Cannot run tests as secret isn't setup.
ERROR tests/test_modeling_utils.py::ModelOnTheFlyConversionTester::test_safetensors_on_the_fly_conversion_gated - ValueError: Cannot run tests as secret isn't setup.

Also, the checkpoint loading code could arguably use more sophisticated checkpoint validation and automatic fallback in cases of validation failure, but that's a follow-up issue.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

@muellerzr (Contributor) left a comment

Thanks @thundergolfer! Overall this looks good to me, but just to be sure on the tests, can you try rebasing from main? (And push with --force.)

@thundergolfer (Contributor Author) commented Dec 5, 2023

Thanks @muellerzr. Just rebased.

Edit: OK, that revealed an issue! Do the CI tests fail fast? I didn't see test failures related to my diff before 🤔

@muellerzr (Contributor)

@thundergolfer you should see them now. Looks like you may not be cleaning up the temp dirs correctly during testing? (I'd also maybe remove the temp dir after you're done needing it, if possible.)

@thundergolfer (Contributor Author) commented Dec 5, 2023

@muellerzr there are actually some test cases that reveal a behavior change in this diff.

With this diff, a trainer run that tries to reuse an existing checkpoint directory will now fail. I'm not sure how common this is in practice, but one of the test cases writes checkpoints 5, 10, 15, 20 and then restarts training from checkpoint 15. When the trainer reaches the checkpoint at step 20 again, it fails, because my diff does not support overwriting an existing checkpoint.

For now I can adjust the change to make it compatible with non-empty destination checkpoint directories, and log a warning that the checkpoint will be non-atomic.

The alternative of deleting everything in the destination directory is destructive and could break things for people.
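
A rough sketch of that fallback (hypothetical helper name; the real change lives inside _save_checkpoint): if the destination already exists and is non-empty, write into it directly and warn; otherwise stage into the tmp- directory and rename at the end.

    import logging
    import os

    logger = logging.getLogger(__name__)

    # Hypothetical helper sketching the fallback described above; not the
    # exact Trainer code. Returns (staging_output_dir, output_dir).
    def choose_checkpoint_dirs(run_dir, step):
        output_dir = os.path.join(run_dir, f"checkpoint-{step}")
        if os.path.isdir(output_dir) and os.listdir(output_dir):
            # Destination already has files (e.g. training resumed from an
            # earlier checkpoint and reached this step again): save into it
            # directly and warn that the save is no longer atomic.
            logger.warning(
                "Checkpoint destination directory %s already exists and is "
                "non-empty; saving will proceed but will not be atomic.",
                output_dir,
            )
            return output_dir, output_dir
        # Normal case: stage into a sibling tmp- directory; the caller renames
        # it to output_dir once all files are written.
        return os.path.join(run_dir, f"tmp-checkpoint-{step}"), output_dir

When the staging and final directories are the same, the final rename is simply skipped.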

@thundergolfer (Contributor Author)

@muellerzr alright, tests are passing now. There are still code quality CI steps failing because they can't find a token. That doesn't seem like something I can fix? Do I need to rebase again?

@muellerzr (Contributor)

I don't quite see the token issue; can you run pip install -e .[quality] -U; make style; make quality?

@muellerzr (Contributor)

You can also try rebasing again, as I assume this got fixed somewhere else.

@thundergolfer (Contributor Author)

On the build documentation step, if I click through I see:

[screenshot of the build documentation step]

@thundergolfer (Contributor Author)

Will rebase

@thundergolfer (Contributor Author) commented Dec 5, 2023

pip install -e .[quality] -U; make style; make quality works on my branch (exit 0, no diff). The CI issues appear to be a step setup problem, not an actual failure.

@muellerzr (Contributor) left a comment

Thanks!

@ArthurZucker (Collaborator) left a comment

Thanks a lot, looks nice! 🤗

@ArthurZucker merged commit 4c5ed1d into huggingface:main on Dec 8, 2023
21 checks passed
@jonathanasdf commented Dec 12, 2023

Hmm, I'm getting an error when running with DeepSpeed. Is this what #27929 is addressing?

Saving model checkpoint to /scratch/sanitycheck-20231211/tmp-checkpoint-10
Trainer.model is not a `PreTrainedModel`, only saving its state dict.
tokenizer config file saved in /scratch/sanitycheck-20231211/tmp-checkpoint-10/tokenizer_config.json
Special tokens file saved in /scratch/sanitycheck-20231211/tmp-checkpoint-10/special_tokens_map.json
[2023-12-12 06:20:12,261] [INFO] [logging.py:96:log_dist] [Rank 0] [Torch] Checkpoint global_step10 is about to be saved!
[2023-12-12 06:20:12,266] [INFO] [logging.py:96:log_dist] [Rank 0] Saving model checkpoint: /scratch/sanitycheck-20231211/tmp-checkpoint-10/global_step10/zero_pp_rank_0_mp_rank_00_model_states.pt
[2023-12-12 06:20:12,266] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving /scratch/sanitycheck-20231211/tmp-checkpoint-10/global_step10/zero_pp_rank_0_mp_rank_00_model_states.pt...
[2023-12-12 06:20:12,268] [INFO] [torch_checkpoint_engine.py:23:save] [Torch] Saved /scratch/sanitycheck-20231211/tmp-checkpoint-10/global_step10/zero_pp_rank_0_mp_rank_00_model_states.pt.
[2023-12-12 06:20:12,269] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving /scratch/sanitycheck-20231211/tmp-checkpoint-10/global_step10/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt...
[2023-12-12 06:20:12,270] [INFO] [torch_checkpoint_engine.py:23:save] [Torch] Saved /scratch/sanitycheck-20231211/tmp-checkpoint-10/global_step10/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt.
[2023-12-12 06:20:12,271] [INFO] [engine.py:3421:_save_zero_checkpoint] zero checkpoint saved /scratch/sanitycheck-20231211/tmp-checkpoint-10/global_step10/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt
[2023-12-12 06:20:12,274] [INFO] [torch_checkpoint_engine.py:33:commit] [Torch] Checkpoint global_step10 is ready now!
Traceback (most recent call last):
  File "/opt/venv/lib/python3.11/site-packages/transformers/trainer.py", line 1543, in train
    return inner_training_loop(
           ^^^^^^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.11/site-packages/transformers/trainer.py", line 1942, in _inner_training_loop
    self._maybe_log_save_evaluate(tr_loss, model, trial, epoch, ignore_keys_for_eval)
  File "/opt/venv/lib/python3.11/site-packages/transformers/trainer.py", line 2302, in _maybe_log_save_evaluate
    self._save_checkpoint(model, trial, metrics=metrics)
  File "/opt/venv/lib/python3.11/site-packages/transformers/trainer.py", line 2405, in _save_checkpoint
    self.state.save_to_json(os.path.join(staging_output_dir, TRAINER_STATE_NAME))
  File "/opt/venv/lib/python3.11/site-packages/transformers/trainer_callback.py", line 114, in save_to_json
    with open(json_path, "w", encoding="utf-8") as f:
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
FileNotFoundError: [Errno 2] No such file or directory: '/scratch/sanitycheck-20231211/tmp-checkpoint-10/trainer_state.json'
