
You can't train a model that has been loaded in 8-bit precision on multiple devices #414

Closed
mn9891 opened this issue May 8, 2023 · 5 comments

@mn9891

mn9891 commented May 8, 2023

Hi,

I was trying to fine-tune Whisper using the script shared in the examples (this one here) on multiple GPUs (4); however, that throws an error saying:
You can't train a model that has been loaded in 8-bit precision on multiple devices

Is training an 8-bit Whisper model on multiple GPUs supported?
I note that the same script works fine on a single GPU. Am I missing something?
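
For reference, the launch was along these lines (the exact flags here are illustrative, not quoted from my run):

accelerate launch --multi_gpu --num_processes 4 peft_adalora_whisper_large_training.py ...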

The full error message:

Detected 8-bit loading: activating 8-bit loading for this model
All model checkpoint weights were used when initializing WhisperForConditionalGeneration.

All the weights of WhisperForConditionalGeneration were initialized from the model checkpoint at openai/whisper-large-v2.
If your task is similar to the task the model of the checkpoint was trained on, you can already use WhisperForConditionalGeneration for predictions without further training.
loading configuration file generation_config.json from cache at /home/user17/.cache/huggingface/hub/models--openai--whisper-large-v2/snapshots/1f66457e6e36eeb6d89078882a39003e55c330b8/generation_config.json
Generation config file not found, using a generation config created from the model config.
trainable params: 21633024 || all params: 1564938496 || trainable%: 1.3823561791913386
Traceback (most recent call last):
File "peft_adalora_whisper_large_training.py", line 777, in
main()
File "peft_adalora_whisper_large_training.py", line 619, in main
model, optimizer, train_dataloader, eval_dataloader, lr_scheduler = accelerator.prepare(
File "/home/user17/.local/lib/python3.8/site-packages/accelerate/accelerator.py", line 1122, in prepare
result = tuple(
File "/home/user17/.local/lib/python3.8/site-packages/accelerate/accelerator.py", line 1123, in
self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
File "/home/user17/.local/lib/python3.8/site-packages/accelerate/accelerator.py", line 977, in _prepare_one
return self.prepare_model(obj, device_placement=device_placement)
File "/home/user17/.local/lib/python3.8/site-packages/accelerate/accelerator.py", line 1179, in prepare_model
raise ValueError(
ValueError: You can't train a model that has been loaded in 8-bit precision on multiple devices.
(the same trainable-parameter summary and traceback are repeated by each of the other three ranks)
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 2099383) of binary: /usr/bin/python3
Traceback (most recent call last):
File "/usr/local/bin/accelerate", line 8, in
sys.exit(main())
File "/home/user17/.local/lib/python3.8/site-packages/accelerate/commands/accelerate_cli.py", line 45, in main
args.func(args)
File "/home/user17/.local/lib/python3.8/site-packages/accelerate/commands/launch.py", line 914, in launch_command
multi_gpu_launcher(args)
File "/home/user17/.local/lib/python3.8/site-packages/accelerate/commands/launch.py", line 603, in multi_gpu_launcher
distrib_run.run(args)
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/run.py", line 785, in run
elastic_launch(
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 134, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError

@Silypie

Silypie commented May 10, 2023

I got the same error. My model is initialized as follows:

model = AutoModelForSeq2SeqLM.from_pretrained(
    args.model_name_or_path,
    config=config,
    load_in_8bit=True,  # quantize the base weights to int8
    device_map="auto",  # let accelerate place layers across available devices
)

It seems that accelerator.prepare throws the exception. I would also like to confirm whether int8 training supports multiple GPUs.
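
One pattern that sidesteps splitting a single 8-bit model across devices is to give each training process a full replica pinned to its local GPU and let data parallelism do the rest. A minimal sketch, assuming the launcher sets the usual LOCAL_RANK environment variable (the checkpoint name is illustrative):

import os

from transformers import AutoModelForSeq2SeqLM

# Pin the whole 8-bit model to this process's GPU instead of using
# device_map="auto", so each rank holds a full replica of the model.
local_rank = int(os.environ.get("LOCAL_RANK", 0))
model = AutoModelForSeq2SeqLM.from_pretrained(
    "openai/whisper-large-v2",    # illustrative checkpoint
    load_in_8bit=True,
    device_map={"": local_rank},  # the empty key maps the entire model to one device
)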

@github-actions

github-actions bot commented Jun 8, 2023

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

@ingo-m

ingo-m commented Jun 9, 2023

AFAIK one cannot backpropagate gradients through an 8-bit model. The idea is to load the base model in 8-bit and the additional LoRA parameters in higher precision; during fine-tuning, only the higher-precision LoRA parameters are updated. To achieve this, we use load_in_8bit=True and model = prepare_model_for_int8_training(model), as explained here: huggingface/accelerate#1147 (comment)
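
For concreteness, that setup looks roughly as follows; a minimal sketch assuming PEFT's prepare_model_for_int8_training helper, with illustrative LoRA hyperparameters and target modules (not taken from the example script):

from peft import LoraConfig, get_peft_model, prepare_model_for_int8_training
from transformers import WhisperForConditionalGeneration

model = WhisperForConditionalGeneration.from_pretrained(
    "openai/whisper-large-v2",
    load_in_8bit=True,  # the base weights stay frozen in int8
    device_map="auto",
)
# Casts layer norms and the output head to fp32 and enables input gradients,
# so the higher-precision adapters stacked on top can be trained.
model = prepare_model_for_int8_training(model)

lora_config = LoraConfig(
    r=32,                                 # illustrative rank
    lora_alpha=64,
    target_modules=["q_proj", "v_proj"],  # assumed attention projections
    lora_dropout=0.05,
)
model = get_peft_model(model, lora_config)  # only the LoRA parameters require grad
model.print_trainable_parameters()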

However, with that approach, I also get

ValueError: You can't train a model that has been loaded in 8-bit precision on multiple devices.

😐

@younesbelkada
Contributor

This should have been fixed in huggingface/accelerate#1523.
If you install the latest version of accelerate, everything should work:

pip install --upgrade accelerate

Feel free to re-open the issue if you don't think so
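
A quick way to confirm the installed version after upgrading; the first release that contains the fix is not stated in the thread, so the version hint in the comment below is an assumption:

import accelerate

# huggingface/accelerate#1523 was merged around June 2023; releases from that
# point on (roughly v0.20 and later, an assumption) should include the fix.
print(accelerate.__version__)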

@ingo-m

ingo-m commented Jun 9, 2023

@younesbelkada Great, thanks! I will try with the latest version.
