saving checkpoints fails every time in docker on multi-gpu system #961

Closed
SlimeQ opened this issue Dec 15, 2023 · 3 comments
Labels
bug Something isn't working

Comments

SlimeQ commented Dec 15, 2023

Please check that this issue hasn't been reported before.

  • I searched previous Bug Reports and didn't find any similar reports.

Expected Behavior

Checkpoints should save correctly and training should continue.

Current behaviour

 10%|█████████▍                                                                                    | 431/4310 [3:32:00<30:48:43, 28.60s/it]
/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/nn/modules/module.py:1802: UserWarning: Positional args are being deprecated, use kwargs instead. Refer to https://pytorch.org/docs/master/generated/torch.nn.Module.html#torch.nn.Module.state_dict for details.
  warnings.warn(
/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/nn/modules/module.py:1802: UserWarning: Positional args are being deprecated, use kwargs instead. Refer to https://pytorch.org/docs/master/generated/torch.nn.Module.html#torch.nn.Module.state_dict for details.
  warnings.warn(
Traceback (most recent call last):
  File "/root/miniconda3/envs/py3.10/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/root/miniconda3/envs/py3.10/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/workspace/axolotl/src/axolotl/cli/train.py", line 38, in <module>
    fire.Fire(do_cli)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/workspace/axolotl/src/axolotl/cli/train.py", line 34, in do_cli
    train(cfg=parsed_cfg, cli_args=parsed_cli_args, dataset_meta=dataset_meta)
  File "/workspace/axolotl/src/axolotl/train.py", line 129, in train
    trainer.train(resume_from_checkpoint=resume_from_checkpoint)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/transformers/trainer.py", line 1540, in train
    return inner_training_loop(
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/transformers/trainer.py", line 1932, in _inner_training_loop
    self._maybe_log_save_evaluate(tr_loss, model, trial, epoch, ignore_keys_for_eval)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/transformers/trainer.py", line 2277, in _maybe_log_save_evaluate
    self._save_checkpoint(model, trial, metrics=metrics)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/transformers/trainer.py", line 2379, in _save_checkpoint
    self.state.save_to_json(os.path.join(staging_output_dir, TRAINER_STATE_NAME))
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/transformers/trainer_callback.py", line 114, in save_to_json
    with open(json_path, "w", encoding="utf-8") as f:
FileNotFoundError: [Errno 2] No such file or directory: './qlora-out/tmp-checkpoint-431/trainer_state.json'

The issue appears to be that transformers expects a staging directory (tmp-checkpoint-431) which doesn't exist, since everything is being saved directly to checkpoint-431, hence the FileNotFoundError on ./qlora-out/tmp-checkpoint-431/trainer_state.json above.
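
For what it's worth, here is a minimal standalone sketch of the write-to-staging-then-rename pattern and how it can fail when two ranks share one output directory. This is only an illustration, not the actual transformers code, and the file names are made up:

import json, os, shutil, tempfile

run_dir = tempfile.mkdtemp()
staging = os.path.join(run_dir, "tmp-checkpoint-431")
final = os.path.join(run_dir, "checkpoint-431")

# "Rank 0": writes its files into the staging dir, then renames it to the final name.
os.makedirs(staging, exist_ok=True)
open(os.path.join(staging, "adapter_model.bin"), "wb").close()
os.rename(staging, final)  # the staging dir no longer exists after this

# "Rank 1", a moment later: still expects the staging dir to be there.
try:
    with open(os.path.join(staging, "trainer_state.json"), "w", encoding="utf-8") as f:
        json.dump({"global_step": 431}, f)
except FileNotFoundError as err:
    print(err)  # [Errno 2] No such file or directory: '.../tmp-checkpoint-431/trainer_state.json'

shutil.rmtree(run_dir)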

Steps to reproduce

  1. Boot up the Docker image on a dual-GPU system (two RTX 4060 Tis)
  2. Adjust the Mistral qlora.yml example config
  3. Run training via accelerate

Config yaml

base_model: teknium/OpenHermes-2.5-Mistral-7B
model_type: MistralForCausalLM
tokenizer_type: LlamaTokenizer
is_mistral_derived_model: true

load_in_8bit: false
load_in_4bit: true
strict: false

datasets:
  - path: <REDACTED>
    type: completion
dataset_prepared_path: last_run_prepared
val_set_size: 0.05
output_dir: ./qlora-out

adapter: qlora
lora_model_dir:

sequence_len: 4096
eval_sample_packing: false
sample_packing: false
pad_to_sequence_len: true

lora_r: 64
lora_alpha: 32
lora_dropout: 0.05
lora_target_linear: true
lora_fan_in_fan_out:
lora_target_modules:
  - gate_proj
  - down_proj
  - up_proj
  - q_proj
  - v_proj
  - k_proj
  - o_proj

wandb_project:
wandb_entity:
wandb_watch:
wandb_name:
wandb_log_model:

gradient_accumulation_steps: 4
micro_batch_size: 2
num_epochs: 10
optimizer: adamw_bnb_8bit
lr_scheduler: cosine
learning_rate: 0.0002

train_on_inputs: false
group_by_length: false
bf16: true
fp16: false
tf32: false

gradient_checkpointing: true
early_stopping_patience:
resume_from_checkpoint:
local_rank:
logging_steps: 1
xformers_attention:
flash_attention: true

loss_watchdog_threshold: 5.0
loss_watchdog_patience: 3

warmup_steps: 10
eval_steps: 0.05
eval_table_size:
eval_table_max_new_tokens: 128
save_steps:
debug:
deepspeed: ../deepspeed/zero1.json
weight_decay: 0.0
fsdp:
fsdp_config:
special_tokens:
  bos_token: "<s>"
  eos_token: "</s>"
  unk_token: "<unk>"

Possible solution

I did a fair amount of digging last night and determined this to be a bug in transformers.

huggingface/transformers#27925

The issue is extremely simple: the checkpoint save expects a staging directory named tmp-checkpoint-<num>, but we're only creating checkpoint-<num>. Possibly tmp-checkpoint is being deleted too early; I'm not really sure why.

I tried simply running pip install git+https://github.com/huggingface/transformers inside of Docker, as indicated there, but I'm still getting the issue. I'm somewhat unfamiliar with Python environments, so it's entirely possible I'm just doing it wrong.

Regardless, this seems to be an outstanding issue with the Docker container. I've also attempted to update the axolotl version inside of Docker to main, and the issue persists.
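
In case it helps anyone else debugging this, one way to confirm which transformers build the container is actually importing (and whether that build still writes to a tmp- staging directory) is a quick check along these lines; this is just a suggested snippet, not part of axolotl:

import inspect
import transformers
from transformers import Trainer

print(transformers.__version__)   # the version actually being imported
print(transformers.__file__)      # which site-packages it was loaded from
source = inspect.getsource(Trainer._save_checkpoint)
print("tmp-" in source)           # True if this build still saves via a tmp- staging dir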

Which Operating Systems are you using?

  • Linux
  • macOS
  • Windows

Python Version

3.11

axolotl branch-commit

main

Acknowledgements

  • My issue title is concise, descriptive, and in title casing.
  • I have searched the existing issues to make sure this bug has not been reported yet.
  • I am using the latest version of axolotl.
  • I have provided enough information for the maintainers to reproduce and diagnose the issue.
SlimeQ added the bug label on Dec 15, 2023
dumpmemory (Contributor) commented:

update to transformers main branch

SlimeQ (Author) commented Dec 22, 2023

Updating transformers to main fixes the problem. The reason I was still having issues was that I had restarted the Docker container afterwards, not realizing that Python packages aren't stored in the volume.

NanoCode012 (Collaborator) commented:

Yep! I think this was a big issue previously due to the 4.36 patch. We also updated the pinned transformers version to fix this.
