
fix: handle multiprocess properly in trainer checkpointing #27929

Closed
wants to merge 1 commit into from

Conversation

@thundergolfer (Contributor) commented Dec 10, 2023

What does this PR do?

Follow-up to #27820, which is broken for multi-device/multiprocess training. I made the error of assuming that in multiprocess training the ._save_checkpoint() method was already restricted to a single writer.

I've fixed that now and augmented an existing multiprocess test to validate checkpointing functionality.

I've also noted with a TODO something I found pretty confusing in the current code. store_flos() isn't checkpointing-related in my opinion, but it does an all_gather, so if not all processes enter store_flos() the training program hangs. In my opinion this code should be moved out of the checkpointing method, so that the method conceptually supports entry and execution by a single writer (the process with self.args.should_save == True).
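
To make the hang concrete, here is a minimal, self-contained sketch (not the Trainer code; the file name and flos values are made up) of the ordering argued for above: every rank participates in the collective, and only a single writer touches disk afterwards. It uses the gloo backend so it runs on CPU:

import json
import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank, world_size, output_dir):
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29501"
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    # Collective step (analogous to store_flos): ALL ranks must call all_gather,
    # otherwise the ranks that did call it block until the collective times out.
    flos = torch.tensor([float(rank + 1)])
    gathered = [torch.zeros_like(flos) for _ in range(world_size)]
    dist.all_gather(gathered, flos)
    total_flos = sum(t.item() for t in gathered)

    # Single-writer step (analogous to args.should_save): only one rank writes.
    if rank == 0:
        with open(os.path.join(output_dir, "trainer_state.json"), "w") as f:
            json.dump({"total_flos": total_flos}, f)

    dist.barrier()
    dist.destroy_process_group()

if __name__ == "__main__":
    mp.spawn(worker, args=(2, "/tmp"), nprocs=2)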

I didn't set up a multi-GPU VM to run the test, but this multi-GPU Modal script runs and passes it:

import modal
import subprocess

GIT_SHA = "d867b232d46a0652e1bfe6eda7bc0804b9ad5ea4" # my fork's latest commit

image = (
    modal.Image.debian_slim(python_version="3.10")
    .apt_install("git").pip_install("pytest")
    .run_commands(
        "cd /root && git init .",
        "cd /root && git remote add origin https://github.com/thundergolfer/transformers",
        f"cd /root && git fetch --depth=1 origin {GIT_SHA} && git checkout {GIT_SHA}",
        "cd /root && pip install -e \".[dev]\"",
    )
)
stub = modal.Stub(image=image)

@stub.function(
    gpu=modal.gpu.T4(count=2),
    # Can uncomment this to quickly modify local test implementation
    # and sync with remote container.
    # mounts=[modal.Mount.from_local_file(
    #     local_path="./tests/trainer/test_trainer.py",
    #     remote_path="/root/tests/trainer/test_trainer.py",
    # )],
    secrets=[modal.Secret.from_dict({"RUN_SLOW": "1", "NCCL_P2P_LEVEL": "PIX"})],
    timeout=600,
)
def run():
    subprocess.run("nvidia-smi", shell=True, check=True)
    test_module = "tests/trainer/test_trainer.py"
    test_identifier = f"{test_module}::TrainerIntegrationTest::test_end_to_end_example"
    subprocess.run(f"pytest -s -v {test_identifier}", shell=True, check=True)

Fixes #27925

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

@muellerzr, @pacman100

@thundergolfer (Contributor, Author)

Test failures are for "documentation" and "transformers metadata", same as last time (#27820 (comment))

@muellerzr (Contributor) left a comment

Thanks for the fix! Have you confirmed this works on a multi-GPU system?

@thundergolfer (Contributor, Author)

have you confirmed this works on a multi-GPU system?

Yes, that's detailed in the PR description, starting with the sentence: "I didn't set up a multi-GPU VM to run the test, ..."

Also if you agree with the TODO, I'm happy to make a follow-up PR addressing it 🙂

@muellerzr (Contributor)

Sorry for missing it! I'll run it locally here and get back to you on whether the solution indeed works. If so, yes, a follow-up PR for that would be great :)

@amyeroberts (Collaborator) left a comment

Thanks for fixing!

@muellerzr Once you've confirmed things work on your side, I'm happy for it to be merged :)

@lzy37ld commented Dec 13, 2023

Any update here? Thanks!

@manishiitg commented:

Any update here? Waiting for the PR merge.

@hieu-blackbox commented Dec 13, 2023

I tried the changes from this PR, but I ran into another issue:

[E ProcessGroupNCCL.cpp:475] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=387992, OpType=_ALLGATHER_BASE, NumelIn=6291456, NumelOut=25165824, Timeout(ms)=1800000) ran for 1800865 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:489] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:495] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:916] [Rank 0] NCCL watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=387891, OpType=_ALLGATHER_BASE, NumelIn=1024, NumelOut=4096, Timeout(ms)=1800000) ran for 1801725 milliseconds before timing out.
terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 0] NCCL watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=387891, OpType=_ALLGATHER_BASE, NumelIn=1024, NumelOut=4096, Timeout(ms)=1800000) ran for 1801725 milliseconds before timing out.

I wondered why we don't just check for the existence of the folder, like this:

        ...
        if os.path.exists(staging_output_dir):
            if self.args.should_save:
                self.state.save_to_json(os.path.join(staging_output_dir, TRAINER_STATE_NAME))
                
            if self.args.push_to_hub:
                self._push_from_checkpoint(staging_output_dir)
    
            # Place checkpoint in final location after all saving is finished.
            if staging_output_dir != output_dir:
                os.rename(staging_output_dir, output_dir)
        ...

This works smoothly.

Comment on lines +2334 to +2335
# TODO: move out of function. This is not checkpointing, and in multi-device training
# involves coordination b/w processes.

Contributor comment:

This is checkpointing, as we modify self.state.total_flos, which gets saved as part of the checkpoint:

if self.hp_search_backend is None and trial is None:
    self.store_flos()

# Beyond this point, only a single writer should proceed.

Contributor comment:

Not necessarily: when doing things like FSDP or DeepSpeed we save on every worker, so this is not the right solution.
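
For illustration only (this is not the Trainer, FSDP, or DeepSpeed code; the function, file names, and shard_bytes argument are hypothetical), a minimal sketch of the multi-writer pattern being referred to, where every rank persists its own shard and only rank 0 writes checkpoint-wide metadata:

import json
import os

def save_sharded_checkpoint(rank, shard_bytes, output_dir):
    os.makedirs(output_dir, exist_ok=True)

    # Every rank owns part of the model state, so every rank must write its
    # shard; gating the whole save on a single "writer" rank would drop shards.
    with open(os.path.join(output_dir, f"model_shard_rank{rank}.bin"), "wb") as f:
        f.write(shard_bytes)

    # Metadata that applies to the checkpoint as a whole is still written once.
    if rank == 0:
        with open(os.path.join(output_dir, "checkpoint_metadata.json"), "w") as f:
            json.dump({"format": "sharded"}, f)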

@thundergolfer (Contributor, Author) replied:

Damn, OK, I can attempt to make this safe for multiple writers.

Is there an existing test that covers FSDP and DeepSpeed multi-writer functionality?

@muellerzr (Contributor)

@thundergolfer I have a different fix coming in that works; the issue is that you were not checking that the rename of the staging folder happens only on the main process: #28009

@thundergolfer (Contributor, Author)

Closing in favor of #28009 as this change still doesn't handle all multi-GPU scenarios.
