Save model checkpoint error when multi-gpu training #27925
I had this same issue; I temporarily fixed it by neutering the separate staging directory:

```python
if os.path.exists(output_dir) and len(os.listdir(output_dir)) > 0:
    logger.warning(
        f"Checkpoint destination directory {output_dir} already exists and is non-empty."
        "Saving will proceed but saved results may be invalid."
    )
    staging_output_dir = output_dir
else:
    # staging_output_dir = os.path.join(run_dir, f"tmp-{checkpoint_folder}")
    staging_output_dir = output_dir
```
|
Where did you insert this? |
Facing same issue in multi-node training: |
This is a showstopper for training on multi-GPU nodes. The culprit seems to be the following merged PR #27820. |
There is an open PR #27929, which seems to fix the issue. |
Hi all, can you please give the fix a try? Thank you very much for your patience and flagging this! |
@muellerzr @thundergolfer I still get the same checkpoint-saving issue with the latest version of transformers. I used three workers, each with two GPUs, and tried saving the fine-tuned model to both shared and non-shared storage; in both cases I still got the same error:
FileNotFoundError: [Errno 2] No such file or directory: 'model/tmp-checkpoint-49' -> 'model/checkpoint-49'
although the |
@hahmad2008 can you try doing either |
I encountered this issue with the Trainer using the following command line, after recently updating transformers with pip install transformers --upgrade (transformers==4.36.2).
Edit: Here is the full command, launched from a Slurm sbatch job:
|
I encountered a similar error when using the Trainer with DeepSpeed. I had no other choice, so I resorted to using a try block to get around it:

```python
if staging_output_dir != output_dir:
    with self.args.main_process_first(
        desc="Renaming model checkpoint folder to true location", local=self.args.save_on_each_node
    ):
        if os.path.exists(staging_output_dir):
            try:
                os.rename(staging_output_dir, output_dir)
            except Exception as e:
                logger.info(
                    f"Could not rename checkpoint directory from {staging_output_dir} to {output_dir}. Reason: {e}"
                )
```
transformers-4.37.0.dev0 |
Hi @snowyday, @tblattner, and @muellerzr. I ran the trainer with 2 nodes x 8 V100 GPUs and DeepSpeed. When I turned it on, this is the log from the process that waited:
|
I also encountered this with 4.36.2 and HEAD in a multi-node multi-GPU setup. It looks like an obvious race condition, as it happens indeterminately (sometimes on the 2nd save, sometimes on the 7th save, etc.). |
Hi, any update or final conclusion here? :> |
Any solutions? Facing the same issue on multi-node training using DeepSpeed. |
same here, any solutions? |
I've been using a try-except approach for bypassing the issue, and it's been working well for me. However, as xk-huang mentioned, it seems that the root cause is that self.args.main_process_first is not handling multi-node setups properly. |
Curious if there is any reason why we must do this. I haven't tested this code yet, as my compute resources are currently full and I have a long-running experiment set to finish in a couple of days, but I wanted to get some thoughts on this potential solution.
|
I'm using transformers' Trainer; is there any workaround for this? |
As a workaround with the Trainer, I just subclassed it and replaced the _save_checkpoint method with a version that adds a try/except.
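A minimal sketch of that kind of subclass (assuming the _save_checkpoint(self, model, trial, metrics=None) signature of the 4.36/4.37-era Trainer; the class name is made up):

```python
import logging

from transformers import Trainer

logger = logging.getLogger(__name__)


class SafeCheckpointTrainer(Trainer):
    def _save_checkpoint(self, model, trial, metrics=None):
        try:
            return super()._save_checkpoint(model, trial, metrics=metrics)
        except FileNotFoundError as e:
            # Another rank may already have renamed or removed the tmp-checkpoint folder.
            logger.warning("Ignoring checkpoint rename race during save: %s", e)
```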
|
I've checked main_process_first with the following script:

```python
import logging

import deepspeed
import torch
import transformers

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger()

if __name__ == "__main__":
    deepspeed.init_distributed()
    node_rank = torch.distributed.get_rank()
    training_args = transformers.TrainingArguments(
        per_device_train_batch_size=8,
        gradient_accumulation_steps=2,
        num_train_epochs=3,
        deepspeed="ds_config/ds_config_zero3.json",
        output_dir="logs",
    )
    with training_args.main_process_first():
        logger.info(f"Check `main_process_first`. Node rank {node_rank}")
```
The node rankings appear to be correctly allocated, with node rank 0 going to node 1, node rank 4 to node 2, and node rank 8 to node 3; however, there are inaccuracies with the global rankings. In the context of a shared filesystem, if we proceed without waiting for the result from global rank 0, it could cause conflicts during the os.rename operation:

```python
if staging_output_dir != output_dir:
    with self.args.main_process_first(
        desc="Renaming model checkpoint folder to true location", local=self.args.save_on_each_node
    ):
        if os.path.exists(staging_output_dir):
            os.rename(staging_output_dir, output_dir)
```
|
In the case of processes sharing a filesystem, it seems prudent for only one process to wait for the rename operation to complete. However, why does it still fail? |
I'm not sure if it fails or not. From what I understand, the network-attached storage node might not actually complete the operation before the next process comes to check whether the path exists. It will complete, just not in the timeframe allowed (sometimes). But that outlines the core issue here. My suggestion is to use something like the check sketched below: it would only use the main process if save_on_each_node is false, otherwise only the local main processes, which I think is the intended behavior. The part I'm not sure of is whether the renamed file is used later downstream; if so, that could introduce a race condition there... It would be nice if we could have an fsync for the shared filesystem to ensure the rename actually completed.
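A rough sketch of that gating (the original snippet wasn't preserved here; the helper name is made up, but is_local_process_zero, is_world_process_zero, and save_on_each_node are the real Trainer/TrainingArguments members):

```python
import os

from transformers import Trainer


def promote_checkpoint_if_owner(trainer: Trainer, staging_output_dir: str, output_dir: str) -> None:
    # Rename from each node's main process when save_on_each_node is set,
    # otherwise only from the single global main process.
    is_owner = (
        trainer.is_local_process_zero()
        if trainer.args.save_on_each_node
        else trainer.is_world_process_zero()
    )
    if is_owner and staging_output_dir != output_dir and os.path.exists(staging_output_dir):
        os.rename(staging_output_dir, output_dir)
```
|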
It is, so we could have a race condition. An |
FYI, we tested and also experienced this without shared FS (accelerate/ Also, if we rely on full |
I can get a start on a PR. Not sure of the best way to run fsync on a rename operation, but I'll give it a shot.
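For reference, one common POSIX pattern is to fsync the directory that contains the renamed entry after the rename; a minimal sketch (the helper below is illustrative, not the PR's code):

```python
import os


def rename_and_fsync_parent(src: str, dst: str) -> None:
    os.rename(src, dst)
    if os.name != "nt":
        # fsync the destination's parent directory so the updated directory
        # entry is flushed to stable storage (POSIX allows opening a directory read-only).
        parent = os.path.dirname(os.path.abspath(dst)) or "."
        fd = os.open(parent, os.O_RDONLY)
        try:
            os.fsync(fd)
        finally:
            os.close(fd)
```
|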
That's very nice of you to add "self.args.distributed_state.wait_for_everyone()". I found that after saving the model checkpoint, it is still sometimes possible to see: |
any updates? |
This was fixed by the PR, I believe! |
A similar error has now occurred at L.2561 of 89c6481. I am experiencing this issue in a distributed training environment that utilizes a shared file system across 16 nodes, with each node equipped with 4 GPUs. I'm deploying the training using DeepSpeed's OpenMPI launcher. In this setup, I have observed scenarios where the cleanup command shutil.rmtree(staging_output_dir) at L.2561 fails to execute due to the condition self.is_local_process_zero() not being met on the slave nodes. This is intended to "Clean up the remaining staging checkpoint folders on other nodes," but it does not always work as expected. The same FileNotFoundError is raised, interleaved, on several ranks:

```
File "XXX/transformers/src/transformers/trainer.py", line 2561, in _save_checkpoint
    shutil.rmtree(staging_output_dir)
  File "XXX/lib/python3.11/shutil.py", line 681, in _rmtree_safe_fd
    os.unlink(entry.name, dir_fd=topfd)
FileNotFoundError: [Errno 2] No such file or directory: 'rng_state_6.pth'
```

The relevant code at 89c6481:

```python
# Then go through the rewriting process, only renaming and rotating from main process(es)
if self.is_local_process_zero() if self.args.save_on_each_node else self.is_world_process_zero():
    if staging_output_dir != output_dir:
        if os.path.exists(staging_output_dir):
            try:
                os.rename(staging_output_dir, output_dir)
            except Exception as e:
                logger.error(
                    f"Error occurred when attempting to rename checkpoint folder: {e}\n"
                    "The checkpoint folder will not be renamed, but the training will proceed."
                )
            # Ensure rename completed in cases where os.rename is not atomic
            # And can only happen on non-windows based systems
            if os.name != "nt":
                fd = os.open(output_dir, os.O_RDONLY)
                os.fsync(fd)
                os.close(fd)

    # Maybe delete some older checkpoints.
    if self.args.should_save:
        # Solely rely on numerical checkpoint id for rotation.
        # mtime is not reliable especially on some fuse fs in cloud environments.
        self._rotate_checkpoints(use_mtime=False, output_dir=run_dir)
elif self.is_local_process_zero():
    # Clean up the remaining staging checkpoint folders on other nodes
    if staging_output_dir != output_dir and os.path.exists(staging_output_dir):
        shutil.rmtree(staging_output_dir)  # L.2561

self.args.distributed_state.wait_for_everyone()
```

Although self.args.distributed_state.wait_for_everyone() is called, the cleanup can still fail, so for now I am working around it like this:

```python
if staging_output_dir != output_dir and os.path.exists(staging_output_dir):
    try:
        shutil.rmtree(staging_output_dir)  # L.2561
    except Exception as e:
        logger.error(
            f"Error occurred when attempting to delete checkpoint folder: {e}\n"
        )
    if os.name != "nt":
        fd = os.open(staging_output_dir, os.O_RDONLY)
        os.fsync(fd)
        os.close(fd)
```
|
Hi @snowyday - could you open a new issue, including all these details and linking to this issue? This way we can better track what's been addressed and what's a new issue |
Hello @amyeroberts & @snowyday,
|
Hi @chercheurkg, have you tried on the latest release? There was a patch release for 4.37 which should have addressed this. |
@amyeroberts , |
Ah, sorry, wasn't clear, I meant to use either 4.37.2 or 4.38.1 |
In my case, 4.38.2 also has this issue; switching to 4.37.2 on all nodes fixes it. |
@DreamInvoker Could you try running on |
I also met the same problem with 4.38.2. Using 4.37.2 fixes this issue. |
same issue
|
Did you try with |
Hi, were you able to get rid of this error? Thanks |
staging_output_dir = output_dir |
@ArthurZucker Can the AWS HuggingFace DL containers be updated as well? Currently the training images use Transformers 4.36.0 and are impacted by this issue (i.e. all training jobs using distributed training with checkpoints fail with this error; see the log below).
Existing HuggingFace DL container images:
Transformers version:
HuggingFace Trainer on SageMaker logs:
|
cc @philschmid for the AWS container update! |
FWIW, the following requirements.txt entries, updating transformers and accelerate, seem to work well:
- transformers==4.44.2
I am starting with this image:
|
System Info
transformers version: 4.36.0.dev0

Who can help?
@muellerzr and @pacman100

I found that when launching the example trainer code on multiple nodes, the code raises a FileNotFoundError when saving the checkpoint. After debugging, I think the cause is in trainer.py L2382: when one process renames the folder, the other processes encounter the FileNotFoundError. Maybe one can modify the code like this to avoid the error:
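As a rough illustration only (the original snippet was not preserved in this thread; the helper below is hypothetical and simply tolerates the case where another rank already performed the rename):

```python
import os


def safe_promote_checkpoint(staging_output_dir: str, output_dir: str) -> None:
    # Tolerate the race where another process has already renamed the staging folder.
    if staging_output_dir != output_dir and os.path.exists(staging_output_dir):
        try:
            os.rename(staging_output_dir, output_dir)
        except FileNotFoundError:
            pass  # another rank won the race; the final checkpoint folder already exists
```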
Information

Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)

Reproduction
Run the MAE training code from the examples folder.

Expected behavior
Solve the FileNotFoundError.