
multi gpu training error Directory not empty #948

Closed

manishiitg opened this issue Dec 13, 2023 · 6 comments

Labels: bug Something isn't working

@manishiitg
Please check that this issue hasn't been reported before.

  • I searched previous Bug Reports and didn't find any similar reports.

Expected Behavior

Multi-GPU training should save checkpoints and complete without errors.

Current behaviour

(axolotl-hi-2-spot, pid=16959)   6%|▌         | 100/1758 [1:24:16<22:13:42, 48.26s/it]/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/nn/modules/module.py:1802: UserWarning: Positional args are being deprecated, use kwargs instead. Refer to https://pytorch.org/docs/master/generated/torch.nn.Module.html#torch.nn.Module.state_dict for details.
(axolotl-hi-2-spot, pid=16959)   warnings.warn(
[... the same UserWarning is repeated by the remaining three ranks ...]
(axolotl-hi-2-spot, pid=16959) Traceback (most recent call last):
(axolotl-hi-2-spot, pid=16959)   File "/root/miniconda3/envs/py3.10/lib/python3.10/runpy.py", line 196, in _run_module_as_main
(axolotl-hi-2-spot, pid=16959)     return _run_code(code, main_globals, None,
(axolotl-hi-2-spot, pid=16959)   File "/root/miniconda3/envs/py3.10/lib/python3.10/runpy.py", line 86, in _run_code
(axolotl-hi-2-spot, pid=16959)     exec(code, run_globals)
(axolotl-hi-2-spot, pid=16959)   File "/workspace/axolotl/src/axolotl/cli/train.py", line 38, in <module>
(axolotl-hi-2-spot, pid=16959)     fire.Fire(do_cli)
(axolotl-hi-2-spot, pid=16959)   File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/fire/core.py", line 141, in Fire
(axolotl-hi-2-spot, pid=16959)     component_trace = _Fire(component, args, parsed_flag_args, context, name)
(axolotl-hi-2-spot, pid=16959)   File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/fire/core.py", line 475, in _Fire
(axolotl-hi-2-spot, pid=16959)     component, remaining_args = _CallAndUpdateTrace(
(axolotl-hi-2-spot, pid=16959)   File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
(axolotl-hi-2-spot, pid=16959)     component = fn(*varargs, **kwargs)
(axolotl-hi-2-spot, pid=16959)   File "/workspace/axolotl/src/axolotl/cli/train.py", line 34, in do_cli
(axolotl-hi-2-spot, pid=16959)     train(cfg=parsed_cfg, cli_args=parsed_cli_args, dataset_meta=dataset_meta)
(axolotl-hi-2-spot, pid=16959)   File "/workspace/axolotl/src/axolotl/train.py", line 126, in train
(axolotl-hi-2-spot, pid=16959)     trainer.train(resume_from_checkpoint=resume_from_checkpoint)
(axolotl-hi-2-spot, pid=16959)   File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/transformers/trainer.py", line 1531, in train
(axolotl-hi-2-spot, pid=16959)     return inner_training_loop(
(axolotl-hi-2-spot, pid=16959)   File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/transformers/trainer.py", line 1917, in _inner_training_loop
(axolotl-hi-2-spot, pid=16959)     self._maybe_log_save_evaluate(tr_loss, model, trial, epoch, ignore_keys_for_eval)
(axolotl-hi-2-spot, pid=16959)   File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/transformers/trainer.py", line 2277, in _maybe_log_save_evaluate
(axolotl-hi-2-spot, pid=16959)     self._save_checkpoint(model, trial, metrics=metrics)
(axolotl-hi-2-spot, pid=16959)   File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/transformers/trainer.py", line 2386, in _save_checkpoint
(axolotl-hi-2-spot, pid=16959)     os.rename(staging_output_dir, output_dir)
(axolotl-hi-2-spot, pid=16959) OSError: [Errno 5] Input/output error: '/sky-notebook/manishiitg/aditi-gpt4-v2-hi-200K-dedupe/tmp-checkpoint-100' -> '/sky-notebook/manishiitg/aditi-gpt4-v2-hi-200K-dedupe/checkpoint-100'
(axolotl-hi-2-spot, pid=16959) Traceback (most recent call last):
(axolotl-hi-2-spot, pid=16959)   File "/root/miniconda3/envs/py3.10/lib/python3.10/runpy.py", line 196, in _run_module_as_main
(axolotl-hi-2-spot, pid=16959)     return _run_code(code, main_globals, None,
(axolotl-hi-2-spot, pid=16959)   File "/root/miniconda3/envs/py3.10/lib/python3.10/runpy.py", line 86, in _run_code
(axolotl-hi-2-spot, pid=16959)     exec(code, run_globals)
(axolotl-hi-2-spot, pid=16959)   File "/workspace/axolotl/src/axolotl/cli/train.py", line 38, in <module>
(axolotl-hi-2-spot, pid=16959)     fire.Fire(do_cli)
(axolotl-hi-2-spot, pid=16959)   File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/fire/core.py", line 141, in Fire
(axolotl-hi-2-spot, pid=16959)     component_trace = _Fire(component, args, parsed_flag_args, context, name)
(axolotl-hi-2-spot, pid=16959)   File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/fire/core.py", line 475, in _Fire
(axolotl-hi-2-spot, pid=16959)     component, remaining_args = _CallAndUpdateTrace(
(axolotl-hi-2-spot, pid=16959)   File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
(axolotl-hi-2-spot, pid=16959)     component = fn(*varargs, **kwargs)
(axolotl-hi-2-spot, pid=16959)   File "/workspace/axolotl/src/axolotl/cli/train.py", line 34, in do_cli
(axolotl-hi-2-spot, pid=16959)     train(cfg=parsed_cfg, cli_args=parsed_cli_args, dataset_meta=dataset_meta)
(axolotl-hi-2-spot, pid=16959)   File "/workspace/axolotl/src/axolotl/train.py", line 126, in train
(axolotl-hi-2-spot, pid=16959)     trainer.train(resume_from_checkpoint=resume_from_checkpoint)
(axolotl-hi-2-spot, pid=16959)   File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/transformers/trainer.py", line 1531, in train
(axolotl-hi-2-spot, pid=16959)     return inner_training_loop(
(axolotl-hi-2-spot, pid=16959)   File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/transformers/trainer.py", line 1917, in _inner_training_loop
(axolotl-hi-2-spot, pid=16959)     self._maybe_log_save_evaluate(tr_loss, model, trial, epoch, ignore_keys_for_eval)
(axolotl-hi-2-spot, pid=16959)   File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/transformers/trainer.py", line 2277, in _maybe_log_save_evaluate
(axolotl-hi-2-spot, pid=16959)     self._save_checkpoint(model, trial, metrics=metrics)
(axolotl-hi-2-spot, pid=16959)   File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/transformers/trainer.py", line 2386, in _save_checkpoint
(axolotl-hi-2-spot, pid=16959)     os.rename(staging_output_dir, output_dir)
(axolotl-hi-2-spot, pid=16959) OSError: [Errno 39] Directory not empty: '/sky-notebook/manishiitg/aditi-gpt4-v2-hi-200K-dedupe/tmp-checkpoint-100' -> '/sky-notebook/manishiitg/aditi-gpt4-v2-hi-200K-dedupe/checkpoint-100'
[... the same "OSError: [Errno 39] Directory not empty" traceback is repeated by the remaining ranks ...]
(axolotl-hi-2-spot, pid=16959) WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 76 closing signal SIGTERM
(axolotl-hi-2-spot, pid=16959) ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 77) of binary: /root/miniconda3/envs/py3.10/bin/python3
(axolotl-hi-2-spot, pid=16959) Traceback (most recent call last):
(axolotl-hi-2-spot, pid=16959)   File "/root/miniconda3/envs/py3.10/bin/accelerate", line 8, in <module>
(axolotl-hi-2-spot, pid=16959)     sys.exit(main())
(axolotl-hi-2-spot, pid=16959)   File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 47, in main
(axolotl-hi-2-spot, pid=16959)     args.func(args)
(axolotl-hi-2-spot, pid=16959)   File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/accelerate/commands/launch.py", line 985, in launch_command
(axolotl-hi-2-spot, pid=16959)     multi_gpu_launcher(args)
(axolotl-hi-2-spot, pid=16959)   File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/accelerate/commands/launch.py", line 654, in multi_gpu_launcher
(axolotl-hi-2-spot, pid=16959)     distrib_run.run(args)
(axolotl-hi-2-spot, pid=16959)   File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
(axolotl-hi-2-spot, pid=16959)     elastic_launch(
(axolotl-hi-2-spot, pid=16959)   File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
(axolotl-hi-2-spot, pid=16959)     return launch_agent(self._config, self._entrypoint, list(args))
(axolotl-hi-2-spot, pid=16959)   File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
(axolotl-hi-2-spot, pid=16959)     raise ChildFailedError(
(axolotl-hi-2-spot, pid=16959) torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
(axolotl-hi-2-spot, pid=16959) ============================================================
(axolotl-hi-2-spot, pid=16959) axolotl.cli.train FAILED
(axolotl-hi-2-spot, pid=16959) ------------------------------------------------------------
(axolotl-hi-2-spot, pid=16959) Failures:
(axolotl-hi-2-spot, pid=16959) [1]:
(axolotl-hi-2-spot, pid=16959)   time      : 2023-12-13_10:08:33
(axolotl-hi-2-spot, pid=16959)   host      : 412c60f30b4c
(axolotl-hi-2-spot, pid=16959)   rank      : 2 (local_rank: 2)
(axolotl-hi-2-spot, pid=16959)   exitcode  : 1 (pid: 78)
(axolotl-hi-2-spot, pid=16959)   error_file: <N/A>
(axolotl-hi-2-spot, pid=16959)   traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
(axolotl-hi-2-spot, pid=16959) [2]:
(axolotl-hi-2-spot, pid=16959)   time      : 2023-12-13_10:08:33
(axolotl-hi-2-spot, pid=16959)   host      : 412c60f30b4c
(axolotl-hi-2-spot, pid=16959)   rank      : 3 (local_rank: 3)
(axolotl-hi-2-spot, pid=16959)   exitcode  : 1 (pid: 79)
(axolotl-hi-2-spot, pid=16959)   error_file: <N/A>
(axolotl-hi-2-spot, pid=16959)   traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
(axolotl-hi-2-spot, pid=16959) ------------------------------------------------------------
(axolotl-hi-2-spot, pid=16959) Root Cause (first observed failure):
(axolotl-hi-2-spot, pid=16959) [0]:
(axolotl-hi-2-spot, pid=16959)   time      : 2023-12-13_10:08:33
(axolotl-hi-2-spot, pid=16959)   host      : 412c60f30b4c
(axolotl-hi-2-spot, pid=16959)   rank      : 1 (local_rank: 1)
(axolotl-hi-2-spot, pid=16959)   exitcode  : 1 (pid: 77)
(axolotl-hi-2-spot, pid=16959)   error_file: <N/A>

I am getting these errors.

Steps to reproduce

Train a Mistral model on multiple GPUs.

Config yaml

No response

Possible solution

No response

Which Operating Systems are you using?

  • Linux

Python Version

3.10

axolotl branch-commit

main

Acknowledgements

  • My issue title is concise, descriptive, and in title casing.
  • I have searched the existing issues to make sure this bug has not been reported yet.
  • I am using the latest version of axolotl.
  • I have provided enough information for the maintainers to reproduce and diagnose the issue.
@manishiitg manishiitg added the bug Something isn't working label Dec 13, 2023
@manishiitg manishiitg changed the title from multi gpu training error to multi gpu training error Directory not empty Dec 13, 2023
@manishiitg (Author)

Should be fixed by huggingface/transformers#27925.

@hahmad2008

@manishiitg I still get the same checkpoint-saving issue with the latest version of transformers (4.36) and even with 4.37.0.dev0.

I used three workers, each with two GPUs. I tried saving the fine-tuned model to both shared and non-shared storage, and in both cases I still got the same error:

FileNotFoundError: [Errno 2] No such file or directory: 'model/tmp-checkpoint-49' -> 'model/checkpoint-49'

File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 1537, in train
  return inner_training_loop(
File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 1929, in _inner_training_loop
  self._maybe_log_save_evaluate(tr_loss, model, trial, epoch, ignore_keys_for_eval)
File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 2279, in _maybe_log_save_evaluate
  self._save_checkpoint(model, trial, metrics=metrics)
File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 2395, in _save_checkpoint
  os.rename(staging_output_dir, output_dir)
FileNotFoundError: [Errno 2] No such file or directory: 'model/tmp-checkpoint-49' -> 'model/checkpoint-49'

even though model/checkpoint-49 had already been created!

@manishiitg (Author)

It works fine for me.

You should raise this issue in the transformers repo if it still exists.

@hahmad2008

@manishiitg did you use multi-node with multiple GPUs, or a single machine with multiple GPUs?

@manishiitg (Author)

Single machine with multiple GPUs.

@hahmad2008

Thanks @manishiitg
