
fix: handle multiprocess properly in trainer checkpointing #27929

Closed
wants to merge 1 commit into from

Conversation

@thundergolfer (Contributor) commented Dec 10, 2023

What does this PR do?

Follow-up to #27820, which is broken for multi-device/multiprocess training. I made the error of assuming that in multiprocess training the ._save_checkpoint() method was already restricted to a single writer.

I've fixed that now and augmented an existing multiprocess test to validate checkpointing functionality.

I've also noted with a TODO something I found pretty confusing in the current code. store_flos() isn't checkpointing-related in my opinion, but it does an all_gather, so if not all processes enter store_flos() the training program hangs. In my opinion this code should be moved out of the checkpointing method, so that the method conceptually supports entry and execution by a single writer (the process with self.args.should_save == True).
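
To make the hang concrete, here is a minimal, self-contained sketch (not the Trainer code; the file name and flos values are made up) of the ordering argued for above: every rank participates in the collective, and only a single writer touches disk afterwards. It uses the gloo backend so it runs on CPU:

import json
import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank, world_size, output_dir):
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29501"
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    # Collective step (analogous to store_flos): ALL ranks must call all_gather,
    # otherwise the ranks that did call it block until the collective times out.
    flos = torch.tensor([float(rank + 1)])
    gathered = [torch.zeros_like(flos) for _ in range(world_size)]
    dist.all_gather(gathered, flos)
    total_flos = sum(t.item() for t in gathered)

    # Single-writer step (analogous to args.should_save): only one rank writes.
    if rank == 0:
        with open(os.path.join(output_dir, "trainer_state.json"), "w") as f:
            json.dump({"total_flos": total_flos}, f)

    dist.barrier()
    dist.destroy_process_group()

if __name__ == "__main__":
    mp.spawn(worker, args=(2, "/tmp"), nprocs=2)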

I didn't set up a multi-GPU VM to run the test, but this multi-GPU Modal script runs and passes it:

import modal
import subprocess

GIT_SHA = "d867b232d46a0652e1bfe6eda7bc0804b9ad5ea4" # my fork's latest commit

image = (
    modal.Image.debian_slim(python_version="3.10")
    .apt_install("git").pip_install("pytest")
    .run_commands(
        "cd /root && git init .",
        "cd /root && git remote add origin https://github.com/thundergolfer/transformers",
        f"cd /root && git fetch --depth=1 origin {GIT_SHA} && git checkout {GIT_SHA}",
        "cd /root && pip install -e \".[dev]\"",
    )
)
stub = modal.Stub(image=image)

@stub.function(
    gpu=modal.gpu.T4(count=2),
    # Can uncomment this to quickly modify local test implementation
    # and sync with remote container.
    # mounts=[modal.Mount.from_local_file(
    #     local_path="./tests/trainer/test_trainer.py",
    #     remote_path="/root/tests/trainer/test_trainer.py",
    # )],
    secrets=[modal.Secret.from_dict({"RUN_SLOW": "1", "NCCL_P2P_LEVEL": "PIX"})],
    timeout=600,
)
def run():
    subprocess.run("nvidia-smi", shell=True, check=True)
    test_module = "tests/trainer/test_trainer.py"
    test_identifier = f"{test_module}::TrainerIntegrationTest::test_end_to_end_example"
    subprocess.run(f"pytest -s -v {test_identifier}", shell=True, check=True)

Fixes #27925

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

@muellerzr, @pacman100

@thundergolfer (Contributor, Author)

Test failures are for "documentation" and "transformers metadata", same as last time (#27820 (comment))

@muellerzr (Contributor) left a comment

Thanks for the fix! Have you confirmed this works on a multi-GPU system?

@thundergolfer (Contributor, Author)

have you confirmed this works on a multi-GPU system?

Yes, that's detailed in the PR description, starting with the sentence: "I didn't set up a multi-GPU VM to run the test, ..."

Also if you agree with the TODO, I'm happy to make a follow-up PR addressing it 🙂

@muellerzr (Contributor)

Sorry for missing it! I'll run it locally here and get back to you on whether the solution indeed works. If so, yes, a follow-up PR for that would be great :)

@amyeroberts (Collaborator) left a comment

Thanks for fixing!

@muellerzr Once you've confirmed things work on your side, I'm happy for it to be merged :)

@lzy37ld commented Dec 13, 2023

Any update here? Thanks!

@manishiitg commented:

Any update here? Waiting for the PR merge.

@hieu-blackbox commented Dec 13, 2023

I tried the changes from this PR, but I ran into another issue:

[E ProcessGroupNCCL.cpp:475] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=387992, OpType=_ALLGATHER_BASE, NumelIn=6291456, NumelOut=25165824, Timeout(ms)=1800000) ran for 1800865 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:489] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:495] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:916] [Rank 0] NCCL watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=387891, OpType=_ALLGATHER_BASE, NumelIn=1024, NumelOut=4096, Timeout(ms)=1800000) ran for 1801725 milliseconds before timing out.
terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 0] NCCL watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=387891, OpType=_ALLGATHER_BASE, NumelIn=1024, NumelOut=4096, Timeout(ms)=1800000) ran for 1801725 milliseconds before timing out.

I wondered why we don't just check for the existence of the folder, like this:

        ...
        if os.path.exists(staging_output_dir):
            if self.args.should_save:
                self.state.save_to_json(os.path.join(staging_output_dir, TRAINER_STATE_NAME))
                
            if self.args.push_to_hub:
                self._push_from_checkpoint(staging_output_dir)
    
            # Place checkpoint in final location after all saving is finished.
            if staging_output_dir != output_dir:
                os.rename(staging_output_dir, output_dir)
        ...

This works smoothly.

Comment on lines +2334 to +2335
# TODO: move out of function. This is not checkpointing, and in multi-device training
# involves coordination b/w processes.

Contributor comment:

This is checkpointing, as we modify self.state.total_flos, which gets saved as part of the checkpoint:

if self.hp_search_backend is None and trial is None:
    self.store_flos()

# Beyond this point, only a single writer should proceed.

Contributor comment:

Not necessarily: when doing things like FSDP or DeepSpeed we save on every worker, so this is not the right solution.
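
For illustration only (this is not the Trainer, FSDP, or DeepSpeed code; the function, file names, and shard_bytes argument are hypothetical), a minimal sketch of the multi-writer pattern being referred to, where every rank persists its own shard and only rank 0 writes checkpoint-wide metadata:

import json
import os

def save_sharded_checkpoint(rank, shard_bytes, output_dir):
    os.makedirs(output_dir, exist_ok=True)

    # Every rank owns part of the model state, so every rank must write its
    # shard; gating the whole save on a single "writer" rank would drop shards.
    with open(os.path.join(output_dir, f"model_shard_rank{rank}.bin"), "wb") as f:
        f.write(shard_bytes)

    # Metadata that applies to the checkpoint as a whole is still written once.
    if rank == 0:
        with open(os.path.join(output_dir, "checkpoint_metadata.json"), "w") as f:
            json.dump({"format": "sharded"}, f)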

@thundergolfer (Contributor, Author) replied:

Damn, OK, I can attempt to make this safe for multiple writers.

Is there an existing test that covers FSDP and DeepSpeed multi-writer functionality?

@muellerzr (Contributor)

@thundergolfer I have a different fix coming in that works; the issue is that you were not checking that the rename of the staging folder happens only on the main process: #28009

@thundergolfer (Contributor, Author)

Closing in favor of #28009 as this change still doesn't handle all multi-GPU scenarios.
