
Multiple training runs not working with deepspeed #35073

Open
2 of 4 tasks
H-Simpson123 opened this issue Dec 4, 2024 · 4 comments
Comments

@H-Simpson123

System Info

  • transformers version: 4.46.1
  • Platform: Linux-5.15.0-126-generic-x86_64-with-glibc2.35
  • Python version: 3.10.12
  • Huggingface_hub version: 0.26.2
  • Safetensors version: 0.4.5
  • Accelerate version: 1.0.1
  • Accelerate config: - compute_environment: LOCAL_MACHINE
    - distributed_type: MULTI_GPU
    - mixed_precision: fp16
    - use_cpu: False
    - debug: False
    - num_processes: 4
    - machine_rank: 0
    - num_machines: 1
    - gpu_ids: 0,1,2,3
    - rdzv_backend: static
    - same_network: True
    - main_training_function: main
    - enable_cpu_affinity: True
    - downcast_bf16: no
    - tpu_use_cluster: False
    - tpu_use_sudo: False
    - tpu_env: []
  • PyTorch version (GPU?): 2.5.1+cu124 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using distributed or parallel set-up in script?:
  • Using GPU in script?:
  • GPU type: NVIDIA H100 80GB HBM3

Who can help?

@muellerzr

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Hi,

I'm working on a setup where a frozen language model is kept persistently in GPU memory and arbitrarily many adapters can be fine-tuned at runtime. My "script" is therefore a server that receives fine-tune requests through a REST API. Each time a request comes in, I create a new PEFT model on top of the base model, build a new train dataset, create a new HF Trainer object, train, and then save the adapter to disk.
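For reference, here is a minimal sketch of that flow (launched under a distributed launcher such as accelerate launch). The model name, toy dataset, LoRA settings, and DeepSpeed config path are placeholders for illustration, not my actual server code, and the adapter handling is simplified:

from datasets import Dataset
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, Trainer, TrainingArguments

# Frozen base model kept in GPU memory for the lifetime of the server
# (placeholder checkpoint; my real setup uses a Qwen2 model).
base_model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-0.5B")

# Toy dataset standing in for the per-request training data.
train_ds = Dataset.from_dict({
    "input_ids": [[1, 2, 3, 4]] * 8,
    "attention_mask": [[1, 1, 1, 1]] * 8,
    "labels": [[1, 2, 3, 4]] * 8,
})

for run_idx in range(2):  # the second iteration is where things break
    # New adapter for this "request".
    peft_model = get_peft_model(
        base_model,
        LoraConfig(task_type="CAUSAL_LM"),
        adapter_name=f"adapter_{run_idx}",
    )
    args = TrainingArguments(
        output_dir=f"adapter_{run_idx}",
        per_device_train_batch_size=1,
        max_steps=5,
        deepspeed="ds_zero3.json",  # placeholder config path; without it, both runs pass
    )
    trainer = Trainer(model=peft_model, args=args, train_dataset=train_ds)
    trainer.train()
    trainer.save_model(f"adapter_{run_idx}")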
This basically works fine, but as soon as I provide a DeepSpeed config, I get the following exception on the second training run:

[rank0]:   File "/somewhere/venv/lib/python3.10/site-packages/transformers/trainer.py", line 431, in __init__
[rank0]:     self.create_accelerator_and_postprocess()
[rank0]:   File "/somewhere/venv/lib/python3.10/site-packages/transformers/trainer.py", line 4953, in create_accelerator_and_postprocess
[rank0]:     self.accelerator = Accelerator(**args)
[rank0]:   File "/somewhere/venv/lib/python3.10/site-packages/accelerate/accelerator.py", line 305, in __init__
[rank0]:     raise NotImplementedError(
[rank0]: NotImplementedError: You cannot pass in a `deepspeed_plugin` when creating a second `Accelerator`. Please make sure the first `Accelerator` is initialized with all the plugins you want to use.

Obviously this is because accelerate keeps its state in singletons. So, in my own script, I tried to reset that state after each training run, before creating the next Trainer object, with these calls (a placement sketch follows the snippet):

from accelerate.state import AcceleratorState, GradientState

AcceleratorState._reset_state(True)
GradientState._reset_state()
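For context, this is roughly where the resets sit in the per-request flow (handle_finetune_request and build_trainer are placeholder names, not my actual code):

from accelerate.state import AcceleratorState, GradientState

def handle_finetune_request(request):
    trainer = build_trainer(request)  # new PEFT adapter, dataset and Trainer, as sketched above
    trainer.train()
    trainer.save_model(request.output_dir)
    # Attempted workaround: clear accelerate's singleton state so the next
    # Trainer can create a fresh Accelerator with its own DeepSpeed plugin.
    AcceleratorState._reset_state(True)
    GradientState._reset_state()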

Resetting the state this way, however, leads to the following exception during the second training run:

[rank0]:   File "/somewhere/venv/lib/python3.10/site-packages/transformers/trainer.py", line 2122, in train
[rank0]:     return inner_training_loop(
[rank0]:   File "/somewhere/venv/lib/python3.10/site-packages/transformers/trainer.py", line 2474, in _inner_training_loop
[rank0]:     tr_loss_step = self.training_step(model, inputs, num_items_in_batch)
[rank0]:   File "/somewhere/venv/lib/python3.10/site-packages/transformers/trainer.py", line 3572, in training_step
[rank0]:     loss = self.compute_loss(model, inputs, num_items_in_batch=num_items_in_batch)
[rank0]:   File "/somewhere/3rdparty-llama-factory/src/llamafactory/train/sft/trainer.py", line 88, in compute_loss
[rank0]:     loss = super().compute_loss(model, inputs, return_outputs, **kwargs)
[rank0]:   File "/somewhere/venv/lib/python3.10/site-packages/transformers/trainer.py", line 3625, in compute_loss
[rank0]:     outputs = model(**inputs)
[rank0]:   File "/somewhere/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/somewhere/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/somewhere/venv/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
[rank0]:     ret_val = func(*args, **kwargs)
[rank0]:   File "/somewhere/venv/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1846, in forward
[rank0]:     loss = self.module(*inputs, **kwargs)
[rank0]:   File "/somewhere/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/somewhere/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1844, in _call_impl
[rank0]:     return inner()
[rank0]:   File "/somewhere/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1790, in inner
[rank0]:     result = forward_call(*args, **kwargs)
[rank0]:   File "/somewhere/venv/lib/python3.10/site-packages/peft/peft_model.py", line 1577, in forward
[rank0]:     return self.base_model(
[rank0]:   File "/somewhere/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/somewhere/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1844, in _call_impl
[rank0]:     return inner()
[rank0]:   File "/somewhere/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1790, in inner
[rank0]:     result = forward_call(*args, **kwargs)
[rank0]:   File "/somewhere/venv/lib/python3.10/site-packages/peft/tuners/tuners_utils.py", line 188, in forward
[rank0]:     return self.model.forward(*args, **kwargs)
[rank0]:   File "/somewhere/venv/lib/python3.10/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 1164, in forward
[rank0]:     outputs = self.model(
[rank0]:   File "/somewhere/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/somewhere/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1844, in _call_impl
[rank0]:     return inner()
[rank0]:   File "/somewhere/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1790, in inner
[rank0]:     result = forward_call(*args, **kwargs)
[rank0]:   File "/somewhere/venv/lib/python3.10/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 854, in forward
[rank0]:     inputs_embeds = self.embed_tokens(input_ids)
[rank0]:   File "/somewhere/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/somewhere/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1844, in _call_impl
[rank0]:     return inner()
[rank0]:   File "/somewhere/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1779, in inner
[rank0]:     args_result = hook(self, args)
[rank0]:   File "/somewhere/venv/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
[rank0]:     ret_val = func(*args, **kwargs)
[rank0]:   File "/somewhere/venv/lib/python3.10/site-packages/deepspeed/runtime/zero/parameter_offload.py", line 278, in _pre_forward_module_hook
[rank0]:     self.pre_sub_module_forward_function(module)
[rank0]:   File "/somewhere/venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/somewhere/venv/lib/python3.10/site-packages/deepspeed/runtime/zero/parameter_offload.py", line 452, in pre_sub_module_forward_function
[rank0]:     param_coordinator.fetch_sub_module(sub_module, forward=True)
[rank0]:   File "/somewhere/venv/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 632, in _fn
[rank0]:     return fn(*args, **kwargs)
[rank0]:   File "/somewhere/venv/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
[rank0]:     ret_val = func(*args, **kwargs)
[rank0]:   File "/somewhere/venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/somewhere/venv/lib/python3.10/site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 317, in fetch_sub_module
[rank0]:     assert param.ds_status == ZeroParamStatus.AVAILABLE, param.ds_summary()
[rank0]: AssertionError: {'id': 0, 'status': 'INFLIGHT', 'numel': 136134656, 'ds_numel': 136134656, 'shape': (151936, 896), 'ds_shape': (151936, 896), 'requires_grad': False, 'grad_shape': None, 'persist': False, 'active_sub_modules': {1308}, 'ds_tensor.shape': torch.Size([34033664])}

Expected behavior

The second and any subsequent training runs should complete successfully when a DeepSpeed config is provided, just as they do without one.

@Rocketknight1
Member

cc @muellerzr @SunMarc


github-actions bot commented Jan 3, 2025

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@believewhat

I also encountered this issue.

@SunMarc
Member

SunMarc commented Jan 22, 2025

Can you share a minimal reproducer, @believewhat?

@SunMarc SunMarc reopened this Jan 22, 2025