
Multiple training runs not working with deepspeed #35073

Open
2 of 4 tasks
H-Simpson123 opened this issue Dec 4, 2024 · 4 comments
Comments

@H-Simpson123

System Info

  • transformers version: 4.46.1
  • Platform: Linux-5.15.0-126-generic-x86_64-with-glibc2.35
  • Python version: 3.10.12
  • Huggingface_hub version: 0.26.2
  • Safetensors version: 0.4.5
  • Accelerate version: 1.0.1
  • Accelerate config: - compute_environment: LOCAL_MACHINE
    - distributed_type: MULTI_GPU
    - mixed_precision: fp16
    - use_cpu: False
    - debug: False
    - num_processes: 4
    - machine_rank: 0
    - num_machines: 1
    - gpu_ids: 0,1,2,3
    - rdzv_backend: static
    - same_network: True
    - main_training_function: main
    - enable_cpu_affinity: True
    - downcast_bf16: no
    - tpu_use_cluster: False
    - tpu_use_sudo: False
    - tpu_env: []
  • PyTorch version (GPU?): 2.5.1+cu124 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using distributed or parallel set-up in script?:
  • Using GPU in script?:
  • GPU type: NVIDIA H100 80GB HBM3

Who can help?

@muellerzr

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Hi,

I'm working on a setup where a frozen language model is kept persistently in GPU memory and arbitrarily many adapters can be fine-tuned at runtime. My "script" is therefore a server that receives fine-tune requests through a REST API. Each time a request comes in, I create a new PEFT model on top of the base model, build a new train dataset, create a new HF Trainer object, train, and then save the adapter to disk.
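For reference, here is a minimal sketch of that flow (launched under a distributed launcher such as accelerate launch). The model name, toy dataset, LoRA settings, and DeepSpeed config path are placeholders for illustration, not my actual server code, and the adapter handling is simplified:

from datasets import Dataset
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, Trainer, TrainingArguments

# Frozen base model kept in GPU memory for the lifetime of the server
# (placeholder checkpoint; my real setup uses a Qwen2 model).
base_model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-0.5B")

# Toy dataset standing in for the per-request training data.
train_ds = Dataset.from_dict({
    "input_ids": [[1, 2, 3, 4]] * 8,
    "attention_mask": [[1, 1, 1, 1]] * 8,
    "labels": [[1, 2, 3, 4]] * 8,
})

for run_idx in range(2):  # the second iteration is where things break
    # New adapter for this "request".
    peft_model = get_peft_model(
        base_model,
        LoraConfig(task_type="CAUSAL_LM"),
        adapter_name=f"adapter_{run_idx}",
    )
    args = TrainingArguments(
        output_dir=f"adapter_{run_idx}",
        per_device_train_batch_size=1,
        max_steps=5,
        deepspeed="ds_zero3.json",  # placeholder config path; without it, both runs pass
    )
    trainer = Trainer(model=peft_model, args=args, train_dataset=train_ds)
    trainer.train()
    trainer.save_model(f"adapter_{run_idx}")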
This basically works fine, but as soon as I provide a DeepSpeed config, I get the following exception on the second training run:

[rank0]:   File "/somewhere/venv/lib/python3.10/site-packages/transformers/trainer.py", line 431, in __init__
[rank0]:     self.create_accelerator_and_postprocess()
[rank0]:   File "/somewhere/venv/lib/python3.10/site-packages/transformers/trainer.py", line 4953, in create_accelerator_and_postprocess
[rank0]:     self.accelerator = Accelerator(**args)
[rank0]:   File "/somewhere/venv/lib/python3.10/site-packages/accelerate/accelerator.py", line 305, in __init__
[rank0]:     raise NotImplementedError(
[rank0]: NotImplementedError: You cannot pass in a `deepspeed_plugin` when creating a second `Accelerator`. Please make sure the first `Accelerator` is initialized with all the plugins you want to use.

Obviously this is because accelerate keeps its state in singletons. So, in my own script, I tried to reset that state after each training run, before creating the next Trainer object, with these calls (a placement sketch follows the snippet):

from accelerate.state import AcceleratorState, GradientState

AcceleratorState._reset_state(True)
GradientState._reset_state()
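For context, this is roughly where the resets sit in the per-request flow (handle_finetune_request and build_trainer are placeholder names, not my actual code):

from accelerate.state import AcceleratorState, GradientState

def handle_finetune_request(request):
    trainer = build_trainer(request)  # new PEFT adapter, dataset and Trainer, as sketched above
    trainer.train()
    trainer.save_model(request.output_dir)
    # Attempted workaround: clear accelerate's singleton state so the next
    # Trainer can create a fresh Accelerator with its own DeepSpeed plugin.
    AcceleratorState._reset_state(True)
    GradientState._reset_state()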

Resetting the state this way, however, leads to the following exception during the second training run:

[rank0]:   File "/somewhere/venv/lib/python3.10/site-packages/transformers/trainer.py", line 2122, in train
[rank0]:     return inner_training_loop(
[rank0]:   File "/somewhere/venv/lib/python3.10/site-packages/transformers/trainer.py", line 2474, in _inner_training_loop
[rank0]:     tr_loss_step = self.training_step(model, inputs, num_items_in_batch)
[rank0]:   File "/somewhere/venv/lib/python3.10/site-packages/transformers/trainer.py", line 3572, in training_step
[rank0]:     loss = self.compute_loss(model, inputs, num_items_in_batch=num_items_in_batch)
[rank0]:   File "/somewhere/3rdparty-llama-factory/src/llamafactory/train/sft/trainer.py", line 88, in compute_loss
[rank0]:     loss = super().compute_loss(model, inputs, return_outputs, **kwargs)
[rank0]:   File "/somewhere/venv/lib/python3.10/site-packages/transformers/trainer.py", line 3625, in compute_loss
[rank0]:     outputs = model(**inputs)
[rank0]:   File "/somewhere/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/somewhere/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/somewhere/venv/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
[rank0]:     ret_val = func(*args, **kwargs)
[rank0]:   File "/somewhere/venv/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1846, in forward
[rank0]:     loss = self.module(*inputs, **kwargs)
[rank0]:   File "/somewhere/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/somewhere/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1844, in _call_impl
[rank0]:     return inner()
[rank0]:   File "/somewhere/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1790, in inner
[rank0]:     result = forward_call(*args, **kwargs)
[rank0]:   File "/somewhere/venv/lib/python3.10/site-packages/peft/peft_model.py", line 1577, in forward
[rank0]:     return self.base_model(
[rank0]:   File "/somewhere/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/somewhere/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1844, in _call_impl
[rank0]:     return inner()
[rank0]:   File "/somewhere/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1790, in inner
[rank0]:     result = forward_call(*args, **kwargs)
[rank0]:   File "/somewhere/venv/lib/python3.10/site-packages/peft/tuners/tuners_utils.py", line 188, in forward
[rank0]:     return self.model.forward(*args, **kwargs)
[rank0]:   File "/somewhere/venv/lib/python3.10/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 1164, in forward
[rank0]:     outputs = self.model(
[rank0]:   File "/somewhere/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/somewhere/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1844, in _call_impl
[rank0]:     return inner()
[rank0]:   File "/somewhere/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1790, in inner
[rank0]:     result = forward_call(*args, **kwargs)
[rank0]:   File "/somewhere/venv/lib/python3.10/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 854, in forward
[rank0]:     inputs_embeds = self.embed_tokens(input_ids)
[rank0]:   File "/somewhere/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/somewhere/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1844, in _call_impl
[rank0]:     return inner()
[rank0]:   File "/somewhere/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1779, in inner
[rank0]:     args_result = hook(self, args)
[rank0]:   File "/somewhere/venv/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
[rank0]:     ret_val = func(*args, **kwargs)
[rank0]:   File "/somewhere/venv/lib/python3.10/site-packages/deepspeed/runtime/zero/parameter_offload.py", line 278, in _pre_forward_module_hook
[rank0]:     self.pre_sub_module_forward_function(module)
[rank0]:   File "/somewhere/venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/somewhere/venv/lib/python3.10/site-packages/deepspeed/runtime/zero/parameter_offload.py", line 452, in pre_sub_module_forward_function
[rank0]:     param_coordinator.fetch_sub_module(sub_module, forward=True)
[rank0]:   File "/somewhere/venv/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 632, in _fn
[rank0]:     return fn(*args, **kwargs)
[rank0]:   File "/somewhere/venv/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
[rank0]:     ret_val = func(*args, **kwargs)
[rank0]:   File "/somewhere/venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/somewhere/venv/lib/python3.10/site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 317, in fetch_sub_module
[rank0]:     assert param.ds_status == ZeroParamStatus.AVAILABLE, param.ds_summary()
[rank0]: AssertionError: {'id': 0, 'status': 'INFLIGHT', 'numel': 136134656, 'ds_numel': 136134656, 'shape': (151936, 896), 'ds_shape': (151936, 896), 'requires_grad': False, 'grad_shape': None, 'persist': False, 'active_sub_modules': {1308}, 'ds_tensor.shape': torch.Size([34033664])}

Expected behavior

The second and any subsequent training runs should complete successfully when a DeepSpeed config is provided, just as they do without one.

@Rocketknight1
Member

cc @muellerzr @SunMarc


github-actions bot commented Jan 3, 2025

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@believewhat

I also encountered this issue.

@SunMarc
Member

SunMarc commented Jan 22, 2025

Can you share a minimal reproducer, @believewhat?

@SunMarc SunMarc reopened this Jan 22, 2025