System Info
transformers version: 4.46.1
- distributed_type: MULTI_GPU
- mixed_precision: fp16
- use_cpu: False
- debug: False
- num_processes: 4
- machine_rank: 0
- num_machines: 1
- gpu_ids: 0,1,2,3
- rdzv_backend: static
- same_network: True
- main_training_function: main
- enable_cpu_affinity: True
- downcast_bf16: no
- tpu_use_cluster: False
- tpu_use_sudo: False
- tpu_env: []
Who can help?
@muellerzr
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
My own task or dataset (give details below)
Reproduction
Hi,
I'm working on a setup where I have a frozen language model persisted in GPU memory and can fine-tune arbitrarily many adapters at runtime. My "script" is a server that receives fine-tune requests through a REST API. Each time a fine-tune request comes in, I create a new PEFT model on top of my base model, build a new train dataset, create a new HF Trainer object, train, and then save the adapter to disk.
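For reference, the per-request flow is roughly the sketch below (a minimal sketch, not my actual server code; the model id, the toy dataset handling, and handle_finetune_request are placeholders):

from datasets import Dataset
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments

# Frozen base model kept resident in (GPU) memory for the lifetime of the server.
base_model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token


def handle_finetune_request(texts, adapter_dir):
    # Wrap the shared base model with a fresh LoRA adapter for this request.
    peft_model = get_peft_model(base_model, LoraConfig(task_type="CAUSAL_LM"))

    def tokenize(batch):
        out = tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)
        out["labels"] = out["input_ids"].copy()
        return out

    train_dataset = Dataset.from_dict({"text": texts}).map(tokenize, batched=True)

    trainer = Trainer(
        model=peft_model,
        # Adding deepspeed="ds_config.json" (placeholder path) to these args is what triggers the problem below.
        args=TrainingArguments(output_dir=adapter_dir, num_train_epochs=1),
        train_dataset=train_dataset,
    )
    trainer.train()
    # Persist only the adapter weights for this request.
    peft_model.save_pretrained(adapter_dir)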
This basically works fine, but as soon as I provide a DeepSpeed config, I get the following exception on the second training run:
[rank0]: File "/somewhere/venv/lib/python3.10/site-packages/transformers/trainer.py", line 431, in __init__
[rank0]: self.create_accelerator_and_postprocess()
[rank0]: File "/somewhere/venv/lib/python3.10/site-packages/transformers/trainer.py", line 4953, in create_accelerator_and_postprocess
[rank0]: self.accelerator = Accelerator(**args)
[rank0]: File "/somewhere/venv/lib/python3.10/site-packages/accelerate/accelerator.py", line 305, in __init__
[rank0]: raise NotImplementedError(
[rank0]: NotImplementedError: You cannot pass in a `deepspeed_plugin` when creating a second `Accelerator`. Please make sure the first `Accelerator` is initialized with all the plugins you want to use.
Obviously this is because the accelerate state is stored in singletons. So I've tried to reset that state after each training run (in my own script) with a couple of calls before creating the new Trainer object.
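Roughly along these lines (a minimal sketch of the kind of reset meant, assuming accelerate's singleton state classes; _reset_state is private accelerate API, so the exact calls may differ):

from accelerate.state import AcceleratorState, GradientState, PartialState


def reset_accelerate_state():
    # Clear accelerate's cached singletons so the next Trainer can build a fresh Accelerator.
    AcceleratorState._reset_state()
    GradientState._reset_state()
    PartialState._reset_state()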
This, however, leads to the following exception during the second training run:
Expected behavior
See above
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.