Please check that this issue hasn't been reported before.
I searched previous Bug Reports and didn't find any similar reports.
Expected Behavior
Checkpoints should save correctly and training should continue.
Current behaviour
10%|█████████▍ | 431/4310 [3:32:00<30:48:43, 28.60s/it]
/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/nn/modules/module.py:1802: UserWarning: Positional args are being deprecated, use kwargs instead. Refer to https://pytorch.org/docs/master/generated/torch.nn.Module.html#torch.nn.Module.state_dict for details.
  warnings.warn(
/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/nn/modules/module.py:1802: UserWarning: Positional args are being deprecated, use kwargs instead. Refer to https://pytorch.org/docs/master/generated/torch.nn.Module.html#torch.nn.Module.state_dict for details.
  warnings.warn(
Traceback (most recent call last):
  File "/root/miniconda3/envs/py3.10/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/root/miniconda3/envs/py3.10/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/workspace/axolotl/src/axolotl/cli/train.py", line 38, in <module>
    fire.Fire(do_cli)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/workspace/axolotl/src/axolotl/cli/train.py", line 34, in do_cli
    train(cfg=parsed_cfg, cli_args=parsed_cli_args, dataset_meta=dataset_meta)
  File "/workspace/axolotl/src/axolotl/train.py", line 129, in train
    trainer.train(resume_from_checkpoint=resume_from_checkpoint)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/transformers/trainer.py", line 1540, in train
    return inner_training_loop(
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/transformers/trainer.py", line 1932, in _inner_training_loop
    self._maybe_log_save_evaluate(tr_loss, model, trial, epoch, ignore_keys_for_eval)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/transformers/trainer.py", line 2277, in _maybe_log_save_evaluate
    self._save_checkpoint(model, trial, metrics=metrics)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/transformers/trainer.py", line 2379, in _save_checkpoint
    self.state.save_to_json(os.path.join(staging_output_dir, TRAINER_STATE_NAME))
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/transformers/trainer_callback.py", line 114, in save_to_json
    with open(json_path, "w", encoding="utf-8") as f:
FileNotFoundError: [Errno 2] No such file or directory: './qlora-out/tmp-checkpoint-431/trainer_state.json'
The issue appears to be that transformers expects a staging directory (tmp-checkpoint-431), which doesn't exist because everything is being saved in checkpoint-431.
(FileNotFoundError: [Errno 2] No such file or directory: './qlora-out/tmp-checkpoint-431/trainer_state.json')
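The failure mode can be reproduced outside of training. The following is a minimal sketch, not the transformers implementation: `save_state` and its `ensure_dir` flag are hypothetical names used only for illustration, and the directory layout simply mirrors the paths in the error message. It shows that `open(..., "w")` raises `FileNotFoundError` when the parent directory is missing, and that creating the directory first avoids the error.

```python
import json
import os
import tempfile


def save_state(state: dict, json_path: str, ensure_dir: bool = False) -> None:
    """Write state as JSON; optionally create the parent directory first.

    Hypothetical helper for illustration, not part of transformers.
    """
    if ensure_dir:
        os.makedirs(os.path.dirname(json_path), exist_ok=True)
    with open(json_path, "w", encoding="utf-8") as f:
        json.dump(state, f)


with tempfile.TemporaryDirectory() as output_dir:
    # Mirror the report: the final checkpoint dir exists, the staging dir does not.
    os.makedirs(os.path.join(output_dir, "checkpoint-431"))
    staging_path = os.path.join(output_dir, "tmp-checkpoint-431", "trainer_state.json")

    try:
        save_state({"global_step": 431}, staging_path)
        failed = False
    except FileNotFoundError:
        # Same error class as in the traceback above.
        failed = True

    # Creating the missing parent directory lets the write succeed.
    save_state({"global_step": 431}, staging_path, ensure_dir=True)
    recovered = os.path.exists(staging_path)

print("write without staging dir failed:", failed)
print("write after creating staging dir succeeded:", recovered)
```

This only demonstrates the symptom. Why the staging directory is absent during training is a separate question that the sketch does not answer.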
Steps to reproduce
Boot up the Docker image on a dual-GPU system (4060 Ti cards).
Possible solution
I did a fair amount of digging last night and determined this to be a bug in transformers: huggingface/transformers#27925.
The issue is simple: transformers expects a staging directory named tmp-checkpoint-num, but we're only making checkpoint-num. Possibly tmp-checkpoint is being deleted too early; I'm not sure why.
I tried running pip install git+https://github.com/huggingface/transformers inside Docker, as indicated there, but I'm still getting the issue. I'm somewhat unfamiliar with Python environments, so it's entirely possible I'm just doing it wrong.
Regardless, this seems to be an outstanding issue with the Docker container. I've also updated the axolotl version inside Docker to main, and the issue persists.
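When several Python environments exist in one container (the traceback shows a conda environment named py3.10), one way to rule out installing into the wrong one is to invoke pip through the interpreter that runs training. The sketch below only builds the command; `pip_install_command` is a hypothetical helper, and whether a wrong environment played any role here is an assumption, not something the report confirms.

```python
import sys


def pip_install_command(requirement: str) -> list[str]:
    """Build a pip command bound to the currently running interpreter.

    Hypothetical helper for illustration.
    """
    return [sys.executable, "-m", "pip", "install", requirement]


cmd = pip_install_command("git+https://github.com/huggingface/transformers")
print(" ".join(cmd))
# To actually run the install: subprocess.run(cmd, check=True)
```

Running this with the same interpreter that launches axolotl prints a command that installs into that interpreter's site-packages.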
Which Operating Systems are you using?
Linux
macOS
Windows
Python Version
3.11
axolotl branch-commit
main
Acknowledgements
My issue title is concise, descriptive, and in title casing.
I have searched the existing issues to make sure this bug has not been reported yet.
I am using the latest version of axolotl.
I have provided enough information for the maintainers to reproduce and diagnose the issue.
Updating transformers to main fixes the problem. I was still having issues because I restarted the Docker container afterwards, not realizing that Python packages aren't stored in the volume.
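A quick check after a container restart can show whether the upgraded package is still present. This is a minimal sketch using the standard library's `importlib.metadata`; `installed_version` is a hypothetical helper name, and the version that contains the fix is not stated in this report, so the sketch only prints what is installed.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is absent.

    Hypothetical helper for illustration.
    """
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Run inside the container, with the interpreter used for training.
print("transformers:", installed_version("transformers"))
```

If the printed version reverts after a restart, the install went into the container's writable layer and not into a mounted volume, which matches the explanation above.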