Save model checkpoint error when multi-gpu training #27925
I had this same issue; I temporarily fixed it by neutering the separate staging directory:

```python
if os.path.exists(output_dir) and len(os.listdir(output_dir)) > 0:
    logger.warning(
        f"Checkpoint destination directory {output_dir} already exists and is non-empty."
        "Saving will proceed but saved results may be invalid."
    )
    staging_output_dir = output_dir
else:
    # staging_output_dir = os.path.join(run_dir, f"tmp-{checkpoint_folder}")
    staging_output_dir = output_dir
```
|
Where did you insert this? |
Facing same issue in multi-node training: |
This is a showstopper for training on multi-GPU nodes. The culprit seems to be the following merged PR #27820. |
There is an open PR #27929, which seems to fix the issue. |
Hi all, can you please give the fix a try? Thank you very much for your patience and flagging this! |
@muellerzr @thundergolfer I still get the same checkpoint-saving issue with the latest version of transformers. I used three workers, each with two GPUs, and tried saving the fine-tuned model to both shared and non-shared storage; in both cases I still got the same error:
FileNotFoundError: [Errno 2] No such file or directory: 'model/tmp-checkpoint-49' -> 'model/checkpoint-49'
although the |
@hahmad2008 can you try doing either |
I encountered this issue with the Trainer using the following command line, after recently updating transformers with pip install transformers --upgrade (transformers==4.36.2).
Edit: Here is the full command, launched from a Slurm sbatch job:
|
I encountered a similar error when using the Trainer with DeepSpeed. I had no other choice, so I resorted to using a try block to get around it:

```python
if staging_output_dir != output_dir:
    with self.args.main_process_first(
        desc="Renaming model checkpoint folder to true location", local=self.args.save_on_each_node
    ):
        if os.path.exists(staging_output_dir):
            try:
                os.rename(staging_output_dir, output_dir)
            except Exception as e:
                logger.info(
                    f"Could not rename checkpoint directory from {staging_output_dir} to {output_dir}. Reason: {e}"
                )
```
transformers-4.37.0.dev0 |
Hi @snowyday, @tblattner, and @muellerzr. I ran the trainer with 2 nodes x 8 V100 GPUs and DeepSpeed. When I turned it on, this is the log from the process that waited:
|
I also encountered this with 4.36.2 and HEAD in a multi-node multi-GPU setup. It looks like an obvious race condition, as it happens indeterminately (sometimes on the 2nd save, sometimes on the 7th save, etc.). |
Hi, any update or final conclusion here? :> |
Any solutions? Facing the same issue on multi-node training using DeepSpeed. |
same here, any solutions? |
I've been using a try-except approach for bypassing the issue, and it's been working well for me. However, as xk-huang mentioned, it seems that the root cause is that self.args.main_process_first is not handling multi-node setups properly. |
Curious if there is any reason why we must do this. I haven't tested this code yet, as my compute resources are currently full and I have a long-running experiment set to finish in a couple of days, but I wanted to get some thoughts on this potential solution.
|
I'm using transformers' Trainer; is there any workaround for this? |
As a workaround with the Trainer, I just subclassed it and replaced the _save_checkpoint method with a version that adds a try/except.
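A minimal sketch of that kind of subclass (assuming the _save_checkpoint(self, model, trial, metrics=None) signature of the 4.36/4.37-era Trainer; the class name is made up):

```python
import logging

from transformers import Trainer

logger = logging.getLogger(__name__)


class SafeCheckpointTrainer(Trainer):
    def _save_checkpoint(self, model, trial, metrics=None):
        try:
            return super()._save_checkpoint(model, trial, metrics=metrics)
        except FileNotFoundError as e:
            # Another rank may already have renamed or removed the tmp-checkpoint folder.
            logger.warning("Ignoring checkpoint rename race during save: %s", e)
```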
|
I've checked main_process_first with the following script:

```python
import logging

import deepspeed
import torch
import transformers

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger()

if __name__ == "__main__":
    deepspeed.init_distributed()
    node_rank = torch.distributed.get_rank()
    training_args = transformers.TrainingArguments(
        per_device_train_batch_size=8,
        gradient_accumulation_steps=2,
        num_train_epochs=3,
        deepspeed="ds_config/ds_config_zero3.json",
        output_dir="logs",
    )
    with training_args.main_process_first():
        logger.info(f"Check `main_process_first`. Node rank {node_rank}")
```
The node rankings appear to be correctly allocated, with node rank 0 going to node 1, node rank 4 to node 2, and node rank 8 to node 3; however, there are inaccuracies with the global rankings. In the context of a shared filesystem, if we proceed without waiting for the result from global rank 0, it could cause conflicts during the os.rename operation:

```python
if staging_output_dir != output_dir:
    with self.args.main_process_first(
        desc="Renaming model checkpoint folder to true location", local=self.args.save_on_each_node
    ):
        if os.path.exists(staging_output_dir):
            os.rename(staging_output_dir, output_dir)
```
|
In the case of processes sharing a filesystem, it seems prudent for only one process to wait for the rename operation to complete. However, why does it still fail? |
I'm not sure if it fails or not. From what I understand, the network-attached storage node might not actually complete the operation before the next process comes to check whether the path exists. It will complete, just not in the timeframe allowed (sometimes). But that outlines the core issue here. My suggestion is to use something like the check sketched below: it would only use the main process if save_on_each_node is false, otherwise only the local main processes, which I think is the intended behavior. The part I'm not sure of is whether the renamed file is used later downstream; if so, that could introduce a race condition there... It would be nice if we could have an fsync for the shared filesystem to ensure the rename actually completed.
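A rough sketch of that gating (the original snippet wasn't preserved here; the helper name is made up, but is_local_process_zero, is_world_process_zero, and save_on_each_node are the real Trainer/TrainingArguments members):

```python
import os

from transformers import Trainer


def promote_checkpoint_if_owner(trainer: Trainer, staging_output_dir: str, output_dir: str) -> None:
    # Rename from each node's main process when save_on_each_node is set,
    # otherwise only from the single global main process.
    is_owner = (
        trainer.is_local_process_zero()
        if trainer.args.save_on_each_node
        else trainer.is_world_process_zero()
    )
    if is_owner and staging_output_dir != output_dir and os.path.exists(staging_output_dir):
        os.rename(staging_output_dir, output_dir)
```
|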
It is, so we could have a race condition. An |
FYI, we tested and also experienced this without shared FS (accelerate/ Also, if we rely on full |
I can get a start on a PR. Not sure of the best way to run fsync on a rename operation, but I'll give it a shot.
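For reference, one common POSIX pattern is to fsync the directory that contains the renamed entry after the rename; a minimal sketch (the helper below is illustrative, not the PR's code):

```python
import os


def rename_and_fsync_parent(src: str, dst: str) -> None:
    os.rename(src, dst)
    if os.name != "nt":
        # fsync the destination's parent directory so the updated directory
        # entry is flushed to stable storage (POSIX allows opening a directory read-only).
        parent = os.path.dirname(os.path.abspath(dst)) or "."
        fd = os.open(parent, os.O_RDONLY)
        try:
            os.fsync(fd)
        finally:
            os.close(fd)
```
|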
That's very nice of you to add "self.args.distributed_state.wait_for_everyone()". I found that after saving the model checkpoint, it is still sometimes possible to see: |
any updates? |
This was fixed by the PR, I believe! |
A similar error has now occurred at L.2561 of 89c6481. I am experiencing this issue in a distributed training environment that utilizes a shared file system across 16 nodes, with each node equipped with 4 GPUs. I'm deploying the training using DeepSpeed's OpenMPI launcher. In this setup, I have observed scenarios where the cleanup command shutil.rmtree(staging_output_dir) at L.2561 fails to execute due to the condition self.is_local_process_zero() not being met on the slave nodes. This is intended to "Clean up the remaining staging checkpoint folders on other nodes," but it does not always work as expected. The same FileNotFoundError is raised, interleaved, on several ranks:

```
File "XXX/transformers/src/transformers/trainer.py", line 2561, in _save_checkpoint
    shutil.rmtree(staging_output_dir)
  File "XXX/lib/python3.11/shutil.py", line 681, in _rmtree_safe_fd
    os.unlink(entry.name, dir_fd=topfd)
FileNotFoundError: [Errno 2] No such file or directory: 'rng_state_6.pth'
```

The relevant code at 89c6481:

```python
# Then go through the rewriting process, only renaming and rotating from main process(es)
if self.is_local_process_zero() if self.args.save_on_each_node else self.is_world_process_zero():
    if staging_output_dir != output_dir:
        if os.path.exists(staging_output_dir):
            try:
                os.rename(staging_output_dir, output_dir)
            except Exception as e:
                logger.error(
                    f"Error occurred when attempting to rename checkpoint folder: {e}\n"
                    "The checkpoint folder will not be renamed, but the training will proceed."
                )
            # Ensure rename completed in cases where os.rename is not atomic
            # And can only happen on non-windows based systems
            if os.name != "nt":
                fd = os.open(output_dir, os.O_RDONLY)
                os.fsync(fd)
                os.close(fd)

    # Maybe delete some older checkpoints.
    if self.args.should_save:
        # Solely rely on numerical checkpoint id for rotation.
        # mtime is not reliable especially on some fuse fs in cloud environments.
        self._rotate_checkpoints(use_mtime=False, output_dir=run_dir)
elif self.is_local_process_zero():
    # Clean up the remaining staging checkpoint folders on other nodes
    if staging_output_dir != output_dir and os.path.exists(staging_output_dir):
        shutil.rmtree(staging_output_dir)  # L.2561

self.args.distributed_state.wait_for_everyone()
```

Although self.args.distributed_state.wait_for_everyone() is called, the cleanup can still fail, so for now I am working around it like this:

```python
if staging_output_dir != output_dir and os.path.exists(staging_output_dir):
    try:
        shutil.rmtree(staging_output_dir)  # L.2561
    except Exception as e:
        logger.error(
            f"Error occurred when attempting to delete checkpoint folder: {e}\n"
        )
    if os.name != "nt":
        fd = os.open(staging_output_dir, os.O_RDONLY)
        os.fsync(fd)
        os.close(fd)
```
|
Hi @snowyday - could you open a new issue, including all these details and linking to this issue? This way we can better track what's been addressed and what's a new issue |
Hello @amyeroberts & @snowyday,
|
Hi @chercheurkg, have you tried on the latest release? There was a patch release for 4.37 which should have addressed this. |
@amyeroberts , |
Ah, sorry, wasn't clear, I meant to use either 4.37.2 or 4.38.1 |
In my case, 4.38.2 also has this issue; switching to 4.37.2 on all nodes fixes it. |
@DreamInvoker Could you try running on |
I also met the same problem with 4.38.2. Using 4.37.2 fixes this issue. |
same issue
|
Did you try with |
Hi, were you able to get rid of this error? Thanks |
staging_output_dir = output_dir |
@ArthurZucker Can the AWS HuggingFace DL containers be updated as well? Currently the training images use Transformers 4.36.0 and are impacted by this issue (i.e. all training jobs using distributed training with checkpoints fail with this error; see the log below).
Existing HuggingFace DL container images:
Transformers version:
HuggingFace Trainer on SageMaker logs:
|
cc @philschmid for the AWS container update! |
FWIW, the following requirements.txt entries, updating transformers and accelerate, seem to work well:
- transformers==4.44.2
I am starting with this image:
|
System Info
transformers version: 4.36.0.dev0

Who can help?
@muellerzr and @pacman100

I found that when launching the example trainer code on multiple nodes, the code raises a FileNotFoundError when saving the checkpoint. After debugging, I think the cause is in trainer.py L2382: when one process renames the folder, the other processes encounter the FileNotFoundError. Maybe one can modify the code like this to avoid the error:
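As a rough illustration only (the original snippet was not preserved in this thread; the helper below is hypothetical and simply tolerates the case where another rank already performed the rename):

```python
import os


def safe_promote_checkpoint(staging_output_dir: str, output_dir: str) -> None:
    # Tolerate the race where another process has already renamed the staging folder.
    if staging_output_dir != output_dir and os.path.exists(staging_output_dir):
        try:
            os.rename(staging_output_dir, output_dir)
        except FileNotFoundError:
            pass  # another rank won the race; the final checkpoint folder already exists
```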
Information

Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)

Reproduction
Run the MAE training code from the examples folder.

Expected behavior
Solve the FileNotFoundError.