saving checkpoints fails every time in docker on multi-gpu system #961

Closed
SlimeQ opened this issue Dec 15, 2023 · 3 comments
Labels
bug Something isn't working

Comments

SlimeQ commented Dec 15, 2023

Please check that this issue hasn't been reported before.

  • I searched previous Bug Reports and didn't find any similar reports.

Expected Behavior

Checkpoints should save correctly and training should continue.

Current behaviour

 10%|█████████▍                                                                                    | 431/4310 [3:32:00<30:48:43, 28.60s/it]
/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/nn/modules/module.py:1802: UserWarning: Positional args are being deprecated, use kwargs instead. Refer to https://pytorch.org/docs/master/generated/torch.nn.Module.html#torch.nn.Module.state_dict for details.
  warnings.warn(
/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/nn/modules/module.py:1802: UserWarning: Positional args are being deprecated, use kwargs instead. Refer to https://pytorch.org/docs/master/generated/torch.nn.Module.html#torch.nn.Module.state_dict for details.
  warnings.warn(
Traceback (most recent call last):
  File "/root/miniconda3/envs/py3.10/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/root/miniconda3/envs/py3.10/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/workspace/axolotl/src/axolotl/cli/train.py", line 38, in <module>
    fire.Fire(do_cli)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/workspace/axolotl/src/axolotl/cli/train.py", line 34, in do_cli
    train(cfg=parsed_cfg, cli_args=parsed_cli_args, dataset_meta=dataset_meta)
  File "/workspace/axolotl/src/axolotl/train.py", line 129, in train
    trainer.train(resume_from_checkpoint=resume_from_checkpoint)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/transformers/trainer.py", line 1540, in train
    return inner_training_loop(
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/transformers/trainer.py", line 1932, in _inner_training_loop
    self._maybe_log_save_evaluate(tr_loss, model, trial, epoch, ignore_keys_for_eval)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/transformers/trainer.py", line 2277, in _maybe_log_save_evaluate
    self._save_checkpoint(model, trial, metrics=metrics)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/transformers/trainer.py", line 2379, in _save_checkpoint
    self.state.save_to_json(os.path.join(staging_output_dir, TRAINER_STATE_NAME))
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/transformers/trainer_callback.py", line 114, in save_to_json
    with open(json_path, "w", encoding="utf-8") as f:
FileNotFoundError: [Errno 2] No such file or directory: './qlora-out/tmp-checkpoint-431/trainer_state.json'

The issue appears to be that transformers expects a staging directory (tmp-checkpoint-431) which doesn't exist, since everything is being saved directly to checkpoint-431, hence the FileNotFoundError on ./qlora-out/tmp-checkpoint-431/trainer_state.json above.
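
For what it's worth, here is a minimal standalone sketch of the write-to-staging-then-rename pattern and how it can fail when two ranks share one output directory. This is only an illustration, not the actual transformers code, and the file names are made up:

import json, os, shutil, tempfile

run_dir = tempfile.mkdtemp()
staging = os.path.join(run_dir, "tmp-checkpoint-431")
final = os.path.join(run_dir, "checkpoint-431")

# "Rank 0": writes its files into the staging dir, then renames it to the final name.
os.makedirs(staging, exist_ok=True)
open(os.path.join(staging, "adapter_model.bin"), "wb").close()
os.rename(staging, final)  # the staging dir no longer exists after this

# "Rank 1", a moment later: still expects the staging dir to be there.
try:
    with open(os.path.join(staging, "trainer_state.json"), "w", encoding="utf-8") as f:
        json.dump({"global_step": 431}, f)
except FileNotFoundError as err:
    print(err)  # [Errno 2] No such file or directory: '.../tmp-checkpoint-431/trainer_state.json'

shutil.rmtree(run_dir)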

Steps to reproduce

  1. Boot up the Docker image on a dual-GPU system (two RTX 4060 Tis)
  2. Adjust the Mistral qlora.yml example config
  3. Run training via accelerate

Config yaml

base_model: teknium/OpenHermes-2.5-Mistral-7B
model_type: MistralForCausalLM
tokenizer_type: LlamaTokenizer
is_mistral_derived_model: true

load_in_8bit: false
load_in_4bit: true
strict: false

datasets:
  - path: <REDACTED>
    type: completion
dataset_prepared_path: last_run_prepared
val_set_size: 0.05
output_dir: ./qlora-out

adapter: qlora
lora_model_dir:

sequence_len: 4096
eval_sample_packing: false
sample_packing: false
pad_to_sequence_len: true

lora_r: 64
lora_alpha: 32
lora_dropout: 0.05
lora_target_linear: true
lora_fan_in_fan_out:
lora_target_modules:
  - gate_proj
  - down_proj
  - up_proj
  - q_proj
  - v_proj
  - k_proj
  - o_proj

wandb_project:
wandb_entity:
wandb_watch:
wandb_name:
wandb_log_model:

gradient_accumulation_steps: 4
micro_batch_size: 2
num_epochs: 10
optimizer: adamw_bnb_8bit
lr_scheduler: cosine
learning_rate: 0.0002

train_on_inputs: false
group_by_length: false
bf16: true
fp16: false
tf32: false

gradient_checkpointing: true
early_stopping_patience:
resume_from_checkpoint:
local_rank:
logging_steps: 1
xformers_attention:
flash_attention: true

loss_watchdog_threshold: 5.0
loss_watchdog_patience: 3

warmup_steps: 10
eval_steps: 0.05
eval_table_size:
eval_table_max_new_tokens: 128
save_steps:
debug:
deepspeed: ../deepspeed/zero1.json
weight_decay: 0.0
fsdp:
fsdp_config:
special_tokens:
  bos_token: "<s>"
  eos_token: "</s>"
  unk_token: "<unk>"

Possible solution

I did a fair amount of digging last night and determined this to be a bug in transformers.

huggingface/transformers#27925

The issue is extremely simple: the checkpoint save expects a staging directory named tmp-checkpoint-<num>, but we're only creating checkpoint-<num>. Possibly tmp-checkpoint is being deleted too early; I'm not really sure why.

I tried simply running pip install git+https://github.com/huggingface/transformers inside of Docker, as indicated there, but I'm still getting the issue. I'm somewhat unfamiliar with Python environments, so it's entirely possible I'm just doing it wrong.

Regardless, this seems to be an outstanding issue with the Docker container. I've also attempted to update the axolotl version inside of Docker to main, and the issue persists.
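
In case it helps anyone else debugging this, one way to confirm which transformers build the container is actually importing (and whether that build still writes to a tmp- staging directory) is a quick check along these lines; this is just a suggested snippet, not part of axolotl:

import inspect
import transformers
from transformers import Trainer

print(transformers.__version__)   # the version actually being imported
print(transformers.__file__)      # which site-packages it was loaded from
source = inspect.getsource(Trainer._save_checkpoint)
print("tmp-" in source)           # True if this build still saves via a tmp- staging dir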

Which Operating Systems are you using?

  • Linux
  • macOS
  • Windows

Python Version

3.11

axolotl branch-commit

main

Acknowledgements

  • My issue title is concise, descriptive, and in title casing.
  • I have searched the existing issues to make sure this bug has not been reported yet.
  • I am using the latest version of axolotl.
  • I have provided enough information for the maintainers to reproduce and diagnose the issue.
SlimeQ added the bug label on Dec 15, 2023
dumpmemory (Contributor) commented:

update to transformers main branch

SlimeQ (Author) commented Dec 22, 2023

Updating transformers to main fixes the problem. The reason I was still having issues was that I had restarted the Docker container afterwards, not realizing that Python packages aren't stored in the volume.

NanoCode012 (Collaborator) commented:

Yep! I think this was a big issue previously due to the 4.36 patch. We also updated the pinned transformers version to fix this.
