Save model checkpoint error when multi-gpu training #27925

Closed
1 of 4 tasks
Cospui opened this issue Dec 9, 2023 · 52 comments · Fixed by #28009 or #28364

Cospui commented Dec 9, 2023

System Info

  • transformers version: 4.36.0.dev0
  • Platform: Linux-6.2.0-1017-azure-x86_64-with-glibc2.35
  • Python version: 3.10.13
  • Huggingface_hub version: 0.19.4
  • Safetensors version: 0.4.0
  • Accelerate version: 0.24.1
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.0.1+cu118 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: Yes
  • Using distributed or parallel set-up in script?: Yes

Who can help?

@muellerzr and @pacman100 I found that when launching the example trainer code on multiple nodes, the code raises a FileNotFoundError when saving the checkpoint. After debugging, I think the cause is in trainer.py L2382:

        if staging_output_dir != output_dir:
            os.rename(staging_output_dir, output_dir)

When one process renames the folder, the other processes encounter the FileNotFoundError. Maybe the code can be modified like this to avoid the error:

        if self.args.should_save and staging_output_dir != output_dir:
            os.rename(staging_output_dir, output_dir)
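
To spell the suggestion out, here is a minimal sketch of the intended behavior (assuming the surrounding Trainer context, where self.args.should_save and self.args.distributed_state are available): only the process that is allowed to save performs the rename, and everyone else waits at a barrier before touching the renamed folder.

        if staging_output_dir != output_dir:
            if self.args.should_save:
                # Only the saving process renames the staging folder.
                os.rename(staging_output_dir, output_dir)
            # All processes synchronize so nobody proceeds mid-rename.
            self.args.distributed_state.wait_for_everyone()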

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Run the MAE training code from the example folder.

Expected behavior

The checkpoint should be saved without raising a FileNotFoundError.

@Cospui Cospui changed the title Save model checkpoint error when multiprocess training Save model checkpoint error when multrigpu training Dec 9, 2023
@Cospui Cospui changed the title Save model checkpoint error when multrigpu training Save model checkpoint error when multi-gpu training Dec 9, 2023
@muellerzr muellerzr self-assigned this Dec 9, 2023
@jquesnelle

I had this same issue; I temporarily fixed it by neutering the separate staging directory:

if os.path.exists(output_dir) and len(os.listdir(output_dir)) > 0:
    logger.warning(
        f"Checkpoint destination directory {output_dir} already exists and is non-empty."
        "Saving will proceed but saved results may be invalid."
    )
    staging_output_dir = output_dir
else:
    # staging_output_dir = os.path.join(run_dir, f"tmp-{checkpoint_folder}")
    staging_output_dir = output_dir

@staticpunch

I had this same issue, I temporarily fixed it by neutering the different staging directory:

if os.path.exists(output_dir) and len(os.listdir(output_dir)) > 0:
    logger.warning(
        f"Checkpoint destination directory {output_dir} already exists and is non-empty."
        "Saving will proceed but saved results may be invalid."
    )
    staging_output_dir = output_dir
else:
    # staging_output_dir = os.path.join(run_dir, f"tmp-{checkpoint_folder}")
    staging_output_dir = output_dir

Where did you insert this?

@Andcircle

Facing the same issue in multi-node training:

File "/home/user/.local/lib/python3.8/site-packages/transformers/trainer.py", line 2353, in _save_checkpoint
    self.save_model(staging_output_dir, _internal_call=True)
RuntimeError: Parent directory tmp-checkpoint-200 does not exist.

It also added an annoying tmp- prefix in front of the checkpoint folder.

@peter-sk
Contributor

This is a showstopper for training on multi-GPU nodes. The culprit seems to be the following merged PR #27820.

@peter-sk
Contributor

There is an open PR #27929, which seems to fix the issue.
@ArthurZucker @sgugger @younesbelkada

@muellerzr
Contributor

muellerzr commented Dec 13, 2023

Hi all, can you please do pip install git+https://github.com/huggingface/transformers and rerun your code? This should fix your issue now.

Thank you very much for your patience and flagging this!

@hahmad2008

hahmad2008 commented Dec 19, 2023

@muellerzr @thundergolfer I still get the same checkpoint-saving issue with the latest version of transformers, 4.36, and even with ‘4.37.0.dev0’.

I used three workers, each with two GPUs. I tried saving the fine-tuned checkpoints to both shared and non-shared storage, and in both cases I still got the same error:

FileNotFoundError: [Errno 2] No such file or directory: 'model/tmp-checkpoint-49' -> 'model/checkpoint-49'

File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 1537, in train
  return inner_training_loop(
File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 1929, in _inner_training_loop
  self._maybe_log_save_evaluate(tr_loss, model, trial, epoch, ignore_keys_for_eval)
File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 2279, in _maybe_log_save_evaluate
  self._save_checkpoint(model, trial, metrics=metrics)
File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 2395, in _save_checkpoint
  os.rename(staging_output_dir, output_dir)
FileNotFoundError: [Errno 2] No such file or directory: 'model/tmp-checkpoint-49' -> 'model/checkpoint-49'

even though model/checkpoint-49 was already created!

@muellerzr
Contributor

@hahmad2008 can you try doing either pip install transformers -U or reinstalling from git? From the line numbers, it doesn't add up that you're using a version that includes the fix.

@tblattner
Contributor

tblattner commented Dec 21, 2023

I encountered this issue with the Trainer using the following command line, after recently updating transformers with pip install transformers --upgrade

--save_strategy epoch --save_total_limit 1

transformers==4.36.2

Edit:
One thing to note: this was with 2 nodes, 8x A100s per node.
Looking at the code around the error, I have a feeling this was because main_process_first was used with local=True. Going to try disabling save_on_each_node.

        if staging_output_dir != output_dir:
            with self.args.main_process_first(
                desc="Renaming model checkpoint folder to true location", local=self.args.save_on_each_node
            ):
                if os.path.exists(staging_output_dir):
                    os.rename(staging_output_dir, output_dir)

Edit 2:
Looks like it's still not working even when setting save_on_each_node to false.

Here is the full command, launched from a slurm sbatch job:

srun --kill-on-bad-exit=1 --jobid $SLURM_JOB_ID bash -c "accelerate launch --use_deepspeed --zero_stage 1 --deepspeed_hostfile hostfile --deepspeed_multinode_launcher openmpi --gradient_accumulation_steps 1 --num_processes $(( $NUM_GPUS * $COUNT_NODE )) --num_machines $COUNT_NODE --num_cpu_threads_per_process $CPU_COUNT --mixed_precision bf16 --machine_rank \$SLURM_PROCID --main_process_ip $MASTER_ADDR --main_process_port $MASTER_PORT main.py --source_datasets_filepath source_data/clm --output_dir testing_output_cluster --model_number 2 --overwrite_output_dir --dataloader_num_workers 10 --bf16 --data_fraction 0.1 --save_strategy steps --save_total_limit 1 --save_on_each_node false --dataloader_num_workers 2 --per_device_train_batch_size 1 --per_device_eval_batch_size 1 --max_token_length 1024 --num_train_epochs 1"

@snowyday

snowyday commented Dec 26, 2023

I encountered a similar error when using the Trainer with DeepSpeed.
The error occurs at the exact moment when another process finishes the rename right after if os.path.exists(staging_output_dir): is evaluated.

I had no other choice, so I resorted to using a try block to get around it.

if staging_output_dir != output_dir:
    with self.args.main_process_first(
        desc="Renaming model checkpoint folder to true location", local=self.args.save_on_each_node
    ):
        if os.path.exists(staging_output_dir):
            try:
                os.rename(staging_output_dir, output_dir)
            except Exception as e:
                logger.info(f"Could not rename checkpoint directory from {staging_output_dir} to {output_dir}. Reason: {e}")
    

transformers-4.37.0.dev0

@xk-huang
Contributor

xk-huang commented Dec 27, 2023

Hi @snowyday, @tblattner, and @muellerzr. I think main_process_first may be broken.

I ran the Trainer with 2 nodes × 8 V100 GPUs and DeepSpeed. When I turned on log_level=debug, I found that only one process entered waiting mode, while all the other processes tried to save the checkpoint.

The log from process that waited:

[DEBUG|training_args.py:2119] 2023-12-27 15:11:30,917 >> 4: waiting for the main process to perform Renaming model checkpoint folder to true location
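
For anyone who wants to reproduce this kind of trace, a small sketch of how to turn on the debug logging used above (log_level and log_level_replica are standard TrainingArguments options; the output_dir is a placeholder):

import transformers

training_args = transformers.TrainingArguments(
    output_dir="logs",           # placeholder
    log_level="debug",           # per-process log level
    log_level_replica="debug",   # also show logs from non-main processes
)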

@peter-sk
Contributor

I also encounter this with 4.36.2 and HEAD in a multi-node, multi-GPU setup. It looks like an obvious race condition, as it happens non-deterministically (sometimes on the 2nd save, sometimes on the 7th, etc.).

@lzy37ld

lzy37ld commented Jan 1, 2024

Hi, any update or final conclusion here? :>

@roynirmal

Any solutions? Facing the same issue with multi-node training using DeepSpeed.

@luvwinnie

same here, any solutions?

@snowyday

snowyday commented Jan 4, 2024

I've been using a try-except approach to bypass the issue, and it's been working well for me. However, as xk-huang mentioned, it seems that the root cause is that self.args.main_process_first does not handle multi-node setups properly.

@tblattner
Contributor

Curious if there is any reason why we must do os.path.exists and os.rename in every process; why not just the main process(es)?

Haven't tested this code yet as my compute resources are currently filled and I have a long-running experiment set to finish in a couple days, but wanted to get some thoughts on this potential solution.

        # Only rename from main process to avoid race condition from other processes especially for distributed filesystems
        if staging_output_dir != output_dir:
            if self.args.distributed_state.is_local_main_process if self.args.save_on_each_node else self.args.distributed_state.is_main_process:
                if os.path.exists(staging_output_dir):
                    os.rename(staging_output_dir, output_dir)

            self.args.distributed_state.wait_for_everyone()
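
For readability, the nested conditional in the sketch above is equivalent to something like the following (same Trainer context assumed; should_rename is a hypothetical helper variable):

        # Pick exactly one "renamer": the local main process on each node when
        # save_on_each_node is set, otherwise the single global main process.
        should_rename = (
            self.args.distributed_state.is_local_main_process
            if self.args.save_on_each_node
            else self.args.distributed_state.is_main_process
        )
        if staging_output_dir != output_dir and should_rename and os.path.exists(staging_output_dir):
            os.rename(staging_output_dir, output_dir)

        # Everyone waits until the rename has been attempted.
        self.args.distributed_state.wait_for_everyone()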

@luvwinnie

I'm using transformers' Trainer; is there any workaround for this?

@luvwinnie

As a workaround with Trainer, I just subclassed it and replaced the _save_checkpoint method with one that adds a try/except:

import os

import numpy as np

from transformers import Trainer
from transformers.trainer_utils import PREFIX_CHECKPOINT_DIR
from transformers.trainer import TRAINER_STATE_NAME, logger


class CustomTrainer(Trainer):
    def _save_checkpoint(self, model, trial, metrics=None):
        # In all cases, including ddp/dp/deepspeed, self.model is always a reference to the model we
        # want to save except FullyShardedDDP.
        # assert unwrap_model(model) is self.model, "internal model should be a reference to self.model"

        # Save model checkpoint
        checkpoint_folder = f"{PREFIX_CHECKPOINT_DIR}-{self.state.global_step}"

        if self.hp_search_backend is None and trial is None:
            self.store_flos()

        run_dir = self._get_output_dir(trial=trial)
        output_dir = os.path.join(run_dir, checkpoint_folder)
        if os.path.exists(output_dir) and len(os.listdir(output_dir)) > 0:
            logger.warning(
                f"Checkpoint destination directory {output_dir} already exists and is non-empty."
                "Saving will proceed but saved results may be invalid."
            )
            staging_output_dir = output_dir
        else:
            staging_output_dir = os.path.join(
                run_dir, f"tmp-{checkpoint_folder}")
        self.save_model(staging_output_dir, _internal_call=True)

        if not self.args.save_only_model:
            # Save optimizer and scheduler
            self._save_optimizer_and_scheduler(staging_output_dir)
            # Save RNG state
            self._save_rng_state(staging_output_dir)

        # Determine the new best metric / best model checkpoint
        if metrics is not None and self.args.metric_for_best_model is not None:
            metric_to_check = self.args.metric_for_best_model
            if not metric_to_check.startswith("eval_"):
                metric_to_check = f"eval_{metric_to_check}"
            metric_value = metrics[metric_to_check]

            operator = np.greater if self.args.greater_is_better else np.less
            if (
                self.state.best_metric is None
                or self.state.best_model_checkpoint is None
                or operator(metric_value, self.state.best_metric)
            ):
                self.state.best_metric = metric_value
                self.state.best_model_checkpoint = output_dir

        # Save the Trainer state
        if self.args.should_save:
            self.state.save_to_json(os.path.join(
                staging_output_dir, TRAINER_STATE_NAME))

        if self.args.push_to_hub:
            self._push_from_checkpoint(staging_output_dir)

        # Place checkpoint in final location after all saving is finished.
        # First wait for everyone to finish writing
        self.args.distributed_state.wait_for_everyone()
        # Then go through the rewriting process starting on process 0
        try:
            if staging_output_dir != output_dir:
                with self.args.main_process_first(
                    desc="Renaming model checkpoint folder to true location", local=self.args.save_on_each_node
                ):
                    if os.path.exists(staging_output_dir):
                        os.rename(staging_output_dir, output_dir)

            # Maybe delete some older checkpoints.
            if self.args.should_save:
                self._rotate_checkpoints(use_mtime=True, output_dir=run_dir)
        except Exception:
            print("Error rotating checkpoints skipping")
            pass
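
A minimal usage sketch of the subclass above, assuming model, training_args, and train_dataset are set up exactly as they would be for a stock Trainer:

trainer = CustomTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
)
trainer.train()  # rename/rotation errors are now logged and skipped instead of crashing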

@snowyday

snowyday commented Jan 4, 2024

I've checked the main_process_first using the code snippet below:
Number of nodes: 3
Processes per node (GPUs): 4
Total: 12 processes

import logging

import deepspeed
import transformers
import torch


logging.basicConfig(level=logging.INFO)
logger = logging.getLogger()

if __name__ == "__main__":
    deepspeed.init_distributed()
    node_rank = torch.distributed.get_rank()   
    training_args = transformers.TrainingArguments(per_device_train_batch_size=8,
                                                   gradient_accumulation_steps=2,
                                                   num_train_epochs=3,
                                                   deepspeed="ds_config/ds_config_zero3.json",
                                                   output_dir="logs")

    with training_args.main_process_first():
        logger.info(f"Check `main_process_first`. Node rank {node_rank}")
INFO:root:Check `main_process_first`. Node rank 8
INFO:root:Check `main_process_first`. Node rank 0
INFO:root:Check `main_process_first`. Node rank 4
INFO:root:Check `main_process_first`. Node rank 6
INFO:root:Check `main_process_first`. Node rank 10
INFO:root:Check `main_process_first`. Node rank 5
INFO:root:Check `main_process_first`. Node rank 9
INFO:root:Check `main_process_first`. Node rank 1
INFO:root:Check `main_process_first`. Node rank 2
INFO:root:Check `main_process_first`. Node rank 3
INFO:root:Check `main_process_first`. Node rank 7
INFO:root:Check `main_process_first`. Node rank 11

The node-level rankings appear to be correctly allocated, with rank 0 on node 1, rank 4 on node 2, and rank 8 on node 3; however, the ordering does not respect the global ranking: the local main process on each node proceeds without waiting for global rank 0. On a shared filesystem, proceeding without waiting for global rank 0 can cause conflicts during the os.rename operation.

if staging_output_dir != output_dir:
    with self.args.main_process_first(
        desc="Renaming model checkpoint folder to true location", local=self.args.save_on_each_node
    ):
        if os.path.exists(staging_output_dir):
            os.rename(staging_output_dir, output_dir)
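
To make the observed ordering concrete, here is a rough sketch of what main_process_first(local=True) amounts to. This is an illustration only, not the transformers implementation; it assumes torch.distributed is already initialized and that LOCAL_RANK is set by the launcher.

import os
from contextlib import contextmanager

import torch.distributed as dist


@contextmanager
def main_process_first_sketch(local: bool = True):
    rank = dist.get_rank()
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    # With local=True, *every node's* local rank 0 counts as "main", so one process
    # per node runs the body first; with local=False only global rank 0 does.
    is_main = (local_rank == 0) if local else (rank == 0)
    if not is_main:
        dist.barrier()  # non-main processes wait until a main process is done
    try:
        yield
    finally:
        if is_main:
            dist.barrier()  # release the waiting processes

With several nodes sharing one filesystem, local=True therefore lets one process per node run the rename body concurrently, which matches the conflicts described above.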
        

@snowyday

snowyday commented Jan 4, 2024

When processes share a filesystem, it seems prudent for only one process to perform the rename and for the rest to wait for it to complete. However, why is main_process_first being used here? On a shared filesystem, if the rename() fails, options are limited. Is that why multiple processes are making repeated attempts?

@tblattner
Contributor

I'm not sure if it fails or not. From what I understand, the network attached storage node might not actually complete the operation before the next process comes to check if the path exists. It will complete, just not in the timeframe allowed (sometimes). But that outlines the core issue here.

My suggestion is to use something like this:
if self.args.distributed_state.is_local_main_process if self.args.save_on_each_node else self.args.distributed_state.is_main_process:

Then self.args.distributed_state.wait_for_everyone() to synchronize everyone afterwards.

This would use only the main process if save_on_each_node is false, and otherwise only the local main processes, which I think is the intended behavior. The part I'm not sure of is whether the renamed folder is used later downstream; that could introduce a race condition there...

It would be nice if we could have an fsync for the shared filesystem to ensure the rename actually completed.
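
A small sketch of what an fsync-after-rename could look like (an illustration under those assumptions, not the eventual PR): after os.rename, fsync the destination's parent directory so the directory entry change is flushed before other processes go looking for it.

import os


def rename_and_sync(src: str, dst: str) -> None:
    """Rename src to dst and flush the parent directory entry (POSIX only)."""
    os.rename(src, dst)
    if os.name != "nt":  # directories cannot be opened with os.open on Windows
        parent = os.path.dirname(os.path.abspath(dst))
        fd = os.open(parent, os.O_RDONLY)
        try:
            os.fsync(fd)
        finally:
            os.close(fd)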

@muellerzr
Contributor

It is, so we could have a race condition. An fsync could certainly be done, and your logic makes sense. @tblattner would you like to open a PR on this by chance?

@mjbommar

mjbommar commented Jan 5, 2024

FYI, we tested and also experienced this without a shared FS (accelerate/pdsh, simple two-node setup).

Also, if we rely on a full fsync implementation for the checkpoint folder, it might be good to explicitly call that out in the docs, as not all filesystems/mount options will fail hard on "fake" fsync calls.

@tblattner
Contributor

It is, so we could have a race condition. An fsync could certainly be done, and your logic makes sense. @tblattner would you like to open a PR on this by chance?

I can get a start on a PR. Not sure what the best methodology for running fsync on a rename operation is, but I'll give it a shot.

@yuleiqin

I'm not sure if it fails or not. From what I understand, the network attached storage node might not actually complete the operation before the next process comes to check if the path exists. It will complete, just not in the timeframe allowed (sometimes). But that outlines the core issue here.

My suggestion is to use something like this: if self.args.distributed_state.is_local_main_process if self.args.save_on_each_node else self.args.distributed_state.is_main_process:

Then self.args.distributed_state.wait_for_everyone() to synchronize everyone afterwards.

This would only use the main process if save_on_each_node is false, otherwise only the local main processes. Which I think is the intended behavior. The part I'm not sure of is if the renamed file is used later downstream, then that could introduce a race condition there...

It would be nice if we could have an fsync for the shared filesystem to ensure the rename actually completed.

That's very nice of you to add self.args.distributed_state.wait_for_everyone(). I found that after saving the model checkpoint, I sometimes see:
Watchdog caught collective operation timeout: WorkNCCL(SeqNum=292968, OpType=_ALLGATHER_BASE, NumelIn=1882369, NumelOut=45176856.
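
For what it's worth, if the extra synchronization leaves some ranks waiting longer than the default NCCL timeout, one possible workaround (an assumption on my part, not necessarily the right fix here) is to raise the process-group timeout, for example via accelerate's InitProcessGroupKwargs:

from datetime import timedelta

from accelerate import Accelerator, InitProcessGroupKwargs

# Assumes the job is launched through accelerate; two hours is an arbitrary example value.
accelerator = Accelerator(
    kwargs_handlers=[InitProcessGroupKwargs(timeout=timedelta(hours=2))]
)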

@MaxGonzalezSaez-Diez

any updates?

@ArthurZucker
Collaborator

This was fixed by the PR, I believe!

@snowyday

snowyday commented Feb 23, 2024

A similar error now occurs at L2561 of 89c6481.

I am experiencing this issue in a distributed training environment that uses a shared filesystem across 16 nodes, each equipped with 4 GPUs. I'm launching the training with DeepSpeed's OpenMPI launcher.

In this setup, I have observed scenarios where the cleanup call shutil.rmtree(staging_output_dir) at L2561 fails on the worker nodes, in the branch guarded by self.is_local_process_zero(). This branch is intended to "Clean up the remaining staging checkpoint folders on other nodes," but it does not always work as expected.

File "XXX/transformers/src/transformers/trainer.py", line 2561, in _save_checkpoint
    shutil.rmtree(staging_output_dir)

File "XXX/lib/python3.11/shutil.py", line 681, in _rmtree_safe_fd
FileNotFoundError: [Errno 2] No such file or directory: 'rng_state_6.pth'    os.unlink(entry.name, dir_fd=topfd)
    os.unlink(entry.name, dir_fd=topfd)

FileNotFoundError: [Errno 2] No such file or directory: 'rng_state_6.pth'
FileNotFoundError: [Errno 2] No such file or directory: 'rng_state_6.pth'    os.unlink(entry.name, dir_fd=topfd)
    os.unlink(entry.name, dir_fd=topfd)

FileNotFoundError: FileNotFoundError:     os.unlink(entry.name, dir_fd=topfd)
[Errno 2] No such file or directory: 'rng_state_6.pth'
[Errno 2] No such file or directory: 'rng_state_6.pth'
FileNotFoundError:     os.unlink(entry.name, dir_fd=topfd)
    os.unlink(entry.name, dir_fd=topfd)
    os.unlink(entry.name, dir_fd=topfd)
[Errno 2] No such file or directory: 'rng_state_6.pth'
FileNotFoundError: FileNotFoundError: FileNotFoundError:     os.unlink(entry.name, dir_fd=topfd)
    os.unlink(entry.name, dir_fd=topfd)
[Errno 2] No such file or directory: 'rng_state_6.pth'
[Errno 2] No such file or directory: 'rng_state_6.pth'
[Errno 2] No such file or directory: 'rng_state_6.pth'
FileNotFoundError: FileNotFoundError: [Errno 2] No such file or directory: 'rng_state_6.pth'    os.unlink(entry.name, dir_fd=topfd)
    os.unlink(entry.name, dir_fd=topfd)
    os.unlink(entry.name, dir_fd=topfd)
[Errno 2] No such file or directory: 'rng_state_6.pth'

FileNotFoundError: FileNotFoundError: [Errno 2] No such file or directory: 'rng_state_6.pth'FileNotFoundError: [Errno 2] No such file or directory: 'rng_state_6.pth'[Errno 2] No such file or directory: 'rng_state_6.pth'

[89c6481]

        # Then go through the rewriting process, only renaming and rotating from main process(es)
        if self.is_local_process_zero() if self.args.save_on_each_node else self.is_world_process_zero():
            if staging_output_dir != output_dir:
                if os.path.exists(staging_output_dir):
                    try:
                        os.rename(staging_output_dir, output_dir)
                    except Exception as e:
                        logger.error(
                            f"Error occurred when attempting to rename checkpoint folder: {e}\n"
                            "The checkpoint folder will not be renamed, but the training will proceed."
                        )

                    # Ensure rename completed in cases where os.rename is not atomic
                    # And can only happen on non-windows based systems
                    if os.name != "nt":
                        fd = os.open(output_dir, os.O_RDONLY)
                        os.fsync(fd)
                        os.close(fd)

            # Maybe delete some older checkpoints.
            if self.args.should_save:
                # Solely rely on numerical checkpoint id for rotation.
                # mtime is not reliable especially on some fuse fs in cloud environments.
                self._rotate_checkpoints(use_mtime=False, output_dir=run_dir)
        elif self.is_local_process_zero():
            # Clean up the remaining staging checkpoint folders on other nodes
            if staging_output_dir != output_dir and os.path.exists(staging_output_dir):
                shutil.rmtree(staging_output_dir) @L.2561
    
        self.args.distributed_state.wait_for_everyone()

Although os.path.exists(staging_output_dir) is used as a guard, it seems that staging_output_dir no longer exists by the time shutil.rmtree(staging_output_dir) is executed. It looks like a try-except block needs to be added here as well:

            if staging_output_dir != output_dir and os.path.exists(staging_output_dir):
                try:
                    shutil.rmtree(staging_output_dir)  # @L.2561
                except Exception as e:
                    logger.error(
                        f"Error occurred when attempting to delete checkpoint folder: {e}\n"
                    )

                # Flush the removal; the staging directory itself is gone after rmtree,
                # so fsync its parent directory instead.
                if os.name != "nt":
                    fd = os.open(os.path.dirname(staging_output_dir), os.O_RDONLY)
                    os.fsync(fd)
                    os.close(fd)

@amyeroberts
Collaborator

Hi @snowyday - could you open a new issue, including all these details and linking to this issue? This way we can better track what's been addressed and what's a new issue

@chercheurkg

Hello @amyeroberts & @snowyday,
I just wanted to share that I have encountered an almost identical issue while using transformers 4.37.0 on Windows 10 (as admin) with a single GPU. The error I got reads as follows:

\lib\site-packages\transformers\trainer.py", line 2418, in _save_checkpoint
    fd = os.open(output_dir, os.O_RDONLY)
PermissionError: [Errno 13] Permission denied: '.
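
For reference, more recent versions of trainer.py guard this fsync step so it is skipped on Windows, where os.open() on a directory is not permitted and raises exactly this PermissionError. Roughly (a paraphrase of the guard quoted earlier in this thread, with output_dir as in the traceback, not the exact 4.37.0 code):

if os.name != "nt":
    fd = os.open(output_dir, os.O_RDONLY)
    os.fsync(fd)
    os.close(fd)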

@amyeroberts
Collaborator

Hi @chercheurkg, have you tried on the latest release? There was a patch release for 4.37 which should have addressed this.

@chercheurkg

@amyeroberts,
Thanks for your reply! As per your suggestion, I used transformers version 4.37 on the same machine. However, it did not work for me; I got the same error.

@amyeroberts
Collaborator

Ah, sorry, I wasn't clear: I meant to use either 4.37.2 or 4.38.1.

@DreamInvoker

In my case, 4.38.2 also has this issue. After switching to 4.37.2 on all nodes, it is fixed.

@amyeroberts
Collaborator

@DreamInvoker Could you try running on main? pip install git+https://github.com/huggingface/transformers

@yuzhms

yuzhms commented Mar 21, 2024

I also hit the same problem with 4.38.2. Using 4.37.2 fixes this issue.

@tic-top

tic-top commented Mar 21, 2024

I have tried the latest version v3.40.0 with overwrite_output_dir=False. Everything works well.

I'm working on 4 nodes (32 GPUs) sharing the same filesystem.
When using v4.39.0, No such file or directory: 'model/tmp-checkpoint-100' -> 'model/checkpoint-100' occurs.
After switching to v4.37.2, I encounter a new problem.

My first setting is shown below.

        do_train=True,
        do_eval=False,
        save_strategy="steps",
        save_steps=100
        save_total_limit=5
        overwrite_output_dir=True
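
For clarity, the settings above map onto TrainingArguments roughly like this (a sketch; output_dir is a placeholder and everything else is left at its default):

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="output",   # placeholder
    do_train=True,
    do_eval=False,
    save_strategy="steps",
    save_steps=100,
    save_total_limit=5,
    overwrite_output_dir=True,
)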

The model stopped saving checkpoints after step 900, even though my global step is 1300 (screenshot omitted).

Then I trained a new model with overwrite_output_dir=False (screenshot omitted).

@zhenyuhe00

same issue

FileNotFoundError: [Errno 2] No such file or directory:

@ArthurZucker
Collaborator

Did you try with transformers==4.39.1?

@ruian1

ruian1 commented Jul 29, 2024

I'm not sure if it fails or not. From what I understand, the network attached storage node might not actually complete the operation before the next process comes to check if the path exists. It will complete, just not in the timeframe allowed (sometimes). But that outlines the core issue here.
My suggestion is to use something like this: if self.args.distributed_state.is_local_main_process if self.args.save_on_each_node else self.args.distributed_state.is_main_process:
Then self.args.distributed_state.wait_for_everyone() to synchronize everyone afterwards.
This would only use the main process if save_on_each_node is false, otherwise only the local main processes. Which I think is the intended behavior. The part I'm not sure of is if the renamed file is used later downstream, then that could introduce a race condition there...
It would be nice if we could have an fsync for the shared filesystem to ensure the rename actually completed.

That's very nice of you to add self.args.distributed_state.wait_for_everyone(). I found that after saving the model checkpoint, I sometimes see: Watchdog caught collective operation timeout: WorkNCCL(SeqNum=292968, OpType=_ALLGATHER_BASE, NumelIn=1882369, NumelOut=45176856.

Hi, were you able to get rid of this error? Thanks

@azuryl

azuryl commented Aug 21, 2024

staging_output_dir = output_dir

@solanki-ravi

solanki-ravi commented Aug 30, 2024

@ArthurZucker Can the AWS Hugging Face DL containers be updated as well? The training images currently use transformers 4.36.0 and are impacted by this issue (i.e., all training jobs that use distributed training with checkpoints fail with this error; see the log below).

Existing HuggingFace DL Container Images:
https://github.com/aws/deep-learning-containers/blob/master/available_images.md#huggingface-training-containers

Transformer Version:
PyTorch 2.1.0 with HuggingFace transformers | training | GPU | 3.10 (py310) | 763104351884.dkr.ecr.us-east-1.amazonaws.com/huggingface-pytorch-training:2.1.0-transformers4.36.0-gpu-py310-cu121-ubuntu20.04

HuggingFace Trainer on Sagemaker Logs:

ErrorMessage "FileNotFoundErrorFileNotFoundErrorFileNotFoundError: : : [Errno 2] No such file or directory: '/opt/ml/model/tmp-checkpoint-2903' -> '/opt/ml/model/checkpoint-2903'[Errno 2] No such file or directory: '/opt/ml/model/tmp-checkpoint-2903' -> '/opt/ml/model/checkpoint-2903'[Errno 2] No such file or directory: '/opt/ml/model/tmp-checkpoint-2903' -> '/opt/ml/model/checkpoint-2903'
 100%|██████████| 2903/2903 [2:35:27<00:00,  3.21s/it]
 [2024-08-30 09:19:24,433] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 65 closing signal SIGTERM
 [2024-08-30 09:19:24,997] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 63) of binary: /opt/conda/bin/python
 Traceback (most recent call last)
 File "/opt/conda/bin/torchrun", line 33, in <module>
 sys.exit(load_entry_point('torch==2.1.0', 'console_scripts', 'torchrun')())
 File "/opt/conda/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
 return f(*args, **kwargs)
 File "/opt/conda/lib/python3.10/site-packages/torch/distributed/run.py", line 806, in main
 run(args)
 File "/opt/conda/lib/python3.10/site-packages/torch/distributed/run.py", line 797, in run
 elastic_launch(
 File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
 return launch_agent(self._config, self._entrypoint, list(args))
 File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
 raise ChildFailedError(
 torch.distributed.elastic.multiprocessing.errors.ChildFailedError
 ============================================================
 train_fsdp.py FAILED
 ------------------------------------------------------------
 Failures
 [1]
 time      : 2024-08-30_09:19:24
 host      : algo-3
 rank      : 9 (local_rank: 1)
 exitcode  : 1 (pid: 64)
 error_file: <N/A>
 traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
 [2]
 rank      : 11 (local_rank: 3)
 exitcode  : 1 (pid: 66)
 Root Cause (first observed failure)
 [0]
 rank      : 8 (local_rank: 0)
 exitcode  : 1 (pid: 63)"

@ArthurZucker
Collaborator

cc @philschmid for the AWS container update!

@solanki-ravi

FWIW, the following entries in my requirements.txt update transformers and accelerate, and this seems to work well:

transformers==4.44.2
accelerate==0.34.0

I am starting with this image:

Framework | Job Type | CPU/GPU | Python Version Options | Example URL
-- | -- | -- | -- | --
PyTorch 2.1.0 with HuggingFace transformers | training | GPU | 3.10 (py310) | 763104351884.dkr.ecr.us-east-1.amazonaws.com/huggingface-pytorch-training:2.1.0-transformers4.36.0-gpu-py310-cu121-ubuntu20.04
