
You can't train a model that has been loaded in 8-bit precision on multiple devices #414

Closed
mn9891 opened this issue May 8, 2023 · 5 comments

@mn9891

mn9891 commented May 8, 2023

Hi,

I was trying to fine-tune Whisper using the script shared in the examples (this one here) on multiple GPUs (4); however, that throws an error saying:
You can't train a model that has been loaded in 8-bit precision on multiple devices

Is training an 8-bit Whisper model on multiple GPUs supported?
I note that the same script works fine on a single GPU. Am I missing something?
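
For reference, the launch was along these lines (the exact flags here are illustrative, not quoted from my run):

accelerate launch --multi_gpu --num_processes 4 peft_adalora_whisper_large_training.py ...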

The full error message:

Detected 8-bit loading: activating 8-bit loading for this model
All model checkpoint weights were used when initializing WhisperForConditionalGeneration.

All the weights of WhisperForConditionalGeneration were initialized from the model checkpoint at openai/whisper-large-v2.
If your task is similar to the task the model of the checkpoint was trained on, you can already use WhisperForConditionalGeneration for predictions without further training.
loading configuration file generation_config.json from cache at /home/user17/.cache/huggingface/hub/models--openai--whisper-large-v2/snapshots/1f66457e6e36eeb6d89078882a39003e55c330b8/generation_config.json
Generation config file not found, using a generation config created from the model config.
trainable params: 21633024 || all params: 1564938496 || trainable%: 1.3823561791913386
Traceback (most recent call last):
File "peft_adalora_whisper_large_training.py", line 777, in
main()
File "peft_adalora_whisper_large_training.py", line 619, in main
model, optimizer, train_dataloader, eval_dataloader, lr_scheduler = accelerator.prepare(
File "/home/user17/.local/lib/python3.8/site-packages/accelerate/accelerator.py", line 1122, in prepare
result = tuple(
File "/home/user17/.local/lib/python3.8/site-packages/accelerate/accelerator.py", line 1123, in
self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
File "/home/user17/.local/lib/python3.8/site-packages/accelerate/accelerator.py", line 977, in _prepare_one
return self.prepare_model(obj, device_placement=device_placement)
File "/home/user17/.local/lib/python3.8/site-packages/accelerate/accelerator.py", line 1179, in prepare_model
raise ValueError(
ValueError: You can't train a model that has been loaded in 8-bit precision on multiple devices.
(the same trainable-parameter summary and traceback are repeated by each of the other three ranks)
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 2099383) of binary: /usr/bin/python3
Traceback (most recent call last):
File "/usr/local/bin/accelerate", line 8, in
sys.exit(main())
File "/home/user17/.local/lib/python3.8/site-packages/accelerate/commands/accelerate_cli.py", line 45, in main
args.func(args)
File "/home/user17/.local/lib/python3.8/site-packages/accelerate/commands/launch.py", line 914, in launch_command
multi_gpu_launcher(args)
File "/home/user17/.local/lib/python3.8/site-packages/accelerate/commands/launch.py", line 603, in multi_gpu_launcher
distrib_run.run(args)
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/run.py", line 785, in run
elastic_launch(
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 134, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError

@Silypie

Silypie commented May 10, 2023

I got the same error. My model is initialized as follows:

model = AutoModelForSeq2SeqLM.from_pretrained(
    args.model_name_or_path,
    config=config,
    load_in_8bit=True,  # quantize the base weights to int8
    device_map="auto",  # let accelerate place layers across available devices
)

It seems that accelerator.prepare throws the exception. I would also like to confirm whether int8 training supports multiple GPUs.
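
One pattern that sidesteps splitting a single 8-bit model across devices is to give each training process a full replica pinned to its local GPU and let data parallelism do the rest. A minimal sketch, assuming the launcher sets the usual LOCAL_RANK environment variable (the checkpoint name is illustrative):

import os

from transformers import AutoModelForSeq2SeqLM

# Pin the whole 8-bit model to this process's GPU instead of using
# device_map="auto", so each rank holds a full replica of the model.
local_rank = int(os.environ.get("LOCAL_RANK", 0))
model = AutoModelForSeq2SeqLM.from_pretrained(
    "openai/whisper-large-v2",    # illustrative checkpoint
    load_in_8bit=True,
    device_map={"": local_rank},  # the empty key maps the entire model to one device
)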

@github-actions

github-actions bot commented Jun 8, 2023

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

@ingo-m

ingo-m commented Jun 9, 2023

AFAIK one cannot backpropagate gradients through an 8-bit model. The idea is to load the base model in 8-bit and the additional LoRA parameters in higher precision; during fine-tuning, only the higher-precision LoRA parameters are updated. To achieve this, we use load_in_8bit=True and model = prepare_model_for_int8_training(model), as explained here: huggingface/accelerate#1147 (comment)
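
For concreteness, that setup looks roughly as follows; a minimal sketch assuming PEFT's prepare_model_for_int8_training helper, with illustrative LoRA hyperparameters and target modules (not taken from the example script):

from peft import LoraConfig, get_peft_model, prepare_model_for_int8_training
from transformers import WhisperForConditionalGeneration

model = WhisperForConditionalGeneration.from_pretrained(
    "openai/whisper-large-v2",
    load_in_8bit=True,  # the base weights stay frozen in int8
    device_map="auto",
)
# Casts layer norms and the output head to fp32 and enables input gradients,
# so the higher-precision adapters stacked on top can be trained.
model = prepare_model_for_int8_training(model)

lora_config = LoraConfig(
    r=32,                                 # illustrative rank
    lora_alpha=64,
    target_modules=["q_proj", "v_proj"],  # assumed attention projections
    lora_dropout=0.05,
)
model = get_peft_model(model, lora_config)  # only the LoRA parameters require grad
model.print_trainable_parameters()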

However, with that approach, I also get

ValueError: You can't train a model that has been loaded in 8-bit precision on multiple devices.

😐

@younesbelkada
Contributor

This should have been fixed in huggingface/accelerate#1523.
If you install the latest version of accelerate, everything should work:

pip install --upgrade accelerate

Feel free to re-open the issue if you don't think so
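
A quick way to confirm the installed version after upgrading; the first release that contains the fix is not stated in the thread, so the version hint in the comment below is an assumption:

import accelerate

# huggingface/accelerate#1523 was merged around June 2023; releases from that
# point on (roughly v0.20 and later, an assumption) should include the fix.
print(accelerate.__version__)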

@ingo-m

ingo-m commented Jun 9, 2023

@younesbelkada Great, thanks! I will try with the latest version.
