
Multi-gpu training still has issues #242

Closed · macabdul9 opened this issue Mar 31, 2023 · 12 comments
Labels: solved

macabdul9 commented Mar 31, 2023

With int8
Error: RuntimeError: module must have its parameters and buffers on device cuda:0 (device_ids[0]) but found one of them on device: cuda:1

Without int8
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0! (when checking argument for argument weight in method wrapper__native_layer_norm)

Even after #145

Details: I am training whisper-large-v2 and using pretty much everything from the example notebook here: https://github.com/huggingface/peft/blob/main/examples/int8_training/peft_bnb_whisper_large_v2_training.ipynb

Similar issue: #205

cc: @pacman100
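
For reference, the failing configuration from that notebook loads and wraps the model roughly as below. This is a minimal sketch; the model id and LoRA hyperparameters are assumptions based on the public notebook, not details given in this issue.

from transformers import WhisperForConditionalGeneration
from peft import LoraConfig, get_peft_model, prepare_model_for_int8_training

# load whisper-large-v2 in 8-bit, sharded across the visible GPUs by accelerate
model = WhisperForConditionalGeneration.from_pretrained(
    "openai/whisper-large-v2",  # assumed model id
    load_in_8bit=True,
    device_map="auto",
)
model = prepare_model_for_int8_training(model)

# assumed LoRA settings, roughly as in the example notebook
lora_config = LoraConfig(r=32, lora_alpha=64, target_modules=["q_proj", "v_proj"], lora_dropout=0.05, bias="none")
model = get_peft_model(model, lora_config)
# training this with the Trainer on multiple GPUs produces the device-mismatch errors above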

pacman100 (Contributor)

Looks like multi-GPU training with a naive pipeline using accelerate's device map fails for encoder-decoder models (#205 had T5, and this issue observes it for Whisper). @younesbelkada, any ideas on what might be happening?

pacman100 (Contributor) commented Mar 31, 2023

For Whisper multi-GPU naive pipeline parallelism using accelerate, peft, and the Trainer, the following changes are required:

1. Base model loading:
from transformers import WhisperForConditionalGeneration
from accelerate import dispatch_model

model = WhisperForConditionalGeneration.from_pretrained(model_name_or_path, load_in_8bit=True, device_map="auto")
device_map = model.hf_device_map.copy()

# required because `labels` are on the main execution device (0) while the output of `proj_out` is on another device.
# This leads to a device mismatch error when calculating cross-entropy between logits and labels.
# It won't arise during inference because `labels` aren't supplied then.
# Instead of changing the device of just one of the tied modules, this has to be done for all tied modules,
# otherwise the execution device of the remaining tied modules isn't changed.
device_map["model.decoder.embed_tokens"] = model._hf_hook.execution_device
device_map["model.decoder.embed_positions"] = model._hf_hook.execution_device
device_map["proj_out"] = model._hf_hook.execution_device
dispatch_model(model, device_map=device_map)

print(model.model.decoder.embed_tokens._hf_hook)
print(model.proj_out._hf_hook)
model.hf_device_map

output logs:

AlignDeviceHook(execution_device=0, offload=False, io_same_device=False, offload_buffers=False, place_submodules=True)
AlignDeviceHook(execution_device=0, offload=False, io_same_device=False, offload_buffers=False, place_submodules=True)
{'model.encoder.conv1': 0,
 'model.encoder.conv2': 0,
 'model.encoder.embed_positions': 0,
 'model.encoder.layers.0': 0,
 'model.encoder.layers.1': 0,
 'model.encoder.layers.2': 0,
 'model.encoder.layers.3': 0,
 'model.encoder.layers.4': 0,
 'model.encoder.layers.5': 0,
 'model.encoder.layers.6': 0,
 'model.encoder.layers.7': 0,
 'model.encoder.layers.8': 0,
 'model.encoder.layers.9': 0,
 'model.encoder.layers.10': 0,
 'model.encoder.layers.11': 0,
 'model.encoder.layers.12': 0,
 'model.encoder.layers.13': 0,
 'model.encoder.layers.14': 0,
 'model.encoder.layers.15': 0,
 'model.encoder.layers.16': 0,
 'model.encoder.layers.17': 0,
 'model.encoder.layers.18': 0,
 'model.encoder.layers.19': 1,
 'model.encoder.layers.20': 1,
 'model.encoder.layers.21': 1,
 'model.encoder.layers.22': 1,
 'model.encoder.layers.23': 1,
 'model.encoder.layers.24': 1,
 'model.encoder.layers.25': 1,
 'model.encoder.layers.26': 1,
 'model.encoder.layers.27': 1,
 'model.encoder.layers.28': 1,
 'model.encoder.layers.29': 1,
 'model.encoder.layers.30': 1,
 'model.encoder.layers.31': 1,
 'model.encoder.layer_norm': 1,
 'model.decoder.embed_tokens': 0,
 'proj_out': 0,
 'model.decoder.embed_positions': 0,
 'model.decoder.layers.0': 1,
 'model.decoder.layers.1': 1,
 'model.decoder.layers.2': 1,
 'model.decoder.layers.3': 1,
 'model.decoder.layers.4': 2,
 'model.decoder.layers.5': 2,
 'model.decoder.layers.6': 2,
 'model.decoder.layers.7': 2,
 'model.decoder.layers.8': 2,
 'model.decoder.layers.9': 2,
 'model.decoder.layers.10': 2,
 'model.decoder.layers.11': 2,
 'model.decoder.layers.12': 2,
 'model.decoder.layers.13': 2,
 'model.decoder.layers.14': 2,
 'model.decoder.layers.15': 2,
 'model.decoder.layers.16': 2,
 'model.decoder.layers.17': 2,
 'model.decoder.layers.18': 2,
 'model.decoder.layers.19': 2,
 'model.decoder.layers.20': 2,
 'model.decoder.layers.21': 3,
 'model.decoder.layers.22': 3,
 'model.decoder.layers.23': 3,
 'model.decoder.layers.24': 3,
 'model.decoder.layers.25': 3,
 'model.decoder.layers.26': 3,
 'model.decoder.layers.27': 3,
 'model.decoder.layers.28': 3,
 'model.decoder.layers.29': 3,
 'model.decoder.layers.30': 3,
 'model.decoder.layers.31': 3,
 'model.decoder.layer_norm': 3}
2. To stop the Trainer from using DataParallel, since this is naive pipeline parallelism (i.e. model parallelism):
setattr(model, 'model_parallel', True)
setattr(model, 'is_parallelizable', True)
3. It should now work properly.

[Screenshot 2023-03-31 at 1.07.15 PM]
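
Assuming `model` is the dispatched 8-bit model from step 1, already wrapped with PEFT/LoRA as in the example notebook, the remaining Trainer wiring would look roughly like the sketch below; the training-argument values and the `train_dataset`/`data_collator` names are placeholders, not values given in this thread.

from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments

# step 2: mark the model as already split across GPUs (naive pipeline parallelism),
# so the Trainer does not wrap it in DataParallel
setattr(model, "model_parallel", True)
setattr(model, "is_parallelizable", True)

training_args = Seq2SeqTrainingArguments(
    output_dir="whisper-large-v2-lora",  # hypothetical output path
    per_device_train_batch_size=8,
    learning_rate=1e-3,
    num_train_epochs=3,
    fp16=True,
    remove_unused_columns=False,  # required with PEFT models, as in the notebook
    label_names=["labels"],
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,  # placeholder: preprocessed dataset from the notebook
    data_collator=data_collator,  # placeholder: padding collator from the notebook
)
trainer.train()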

younesbelkada (Contributor)

Hi @pacman100 @macabdul9,
please see #205 (comment); these changes were required to enable multi-GPU training with the Trainer.

pacman100 (Contributor)

Hello @younesbelkada, I did mention it in the comment above. Thank you for mentioning it on the other issue too:

To stop the Trainer from using DataParallel, since this is naive pipeline parallelism (i.e. model parallelism):
setattr(model, 'model_parallel', True)
setattr(model, 'is_parallelizable', True)

pacman100 added the solved label on Mar 31, 2023
younesbelkada (Contributor)

Awesome thanks a lot!

macabdul9 (Author)

It's still not working. @pacman100 can you share your working notebook if possible?

AttentionAllUNeed commented Apr 6, 2023

Hi, I have the same problem. Did you manage to solve it? @macabdul9

mn9891 commented Apr 7, 2023

Same here, any solutions?

cchen-dialpad

Not sure why, but I found that using DDP instead of DP works without this problem, i.e., python -m torch.distributed.launch --nproc_per_node=2 your_script.py
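
If going the DDP route, the key difference from the naive-pipeline setup above is that every process loads the full model onto its own GPU rather than sharding it with device_map="auto". A minimal sketch follows; the LOCAL_RANK-based placement is an assumption about how the script would be adapted, not something stated in this thread.

import os
from transformers import WhisperForConditionalGeneration

# torchrun sets LOCAL_RANK for each worker (older torch.distributed.launch may pass --local_rank instead)
local_rank = int(os.environ.get("LOCAL_RANK", 0))

# place the whole 8-bit model on this process's GPU instead of spreading it across devices
model = WhisperForConditionalGeneration.from_pretrained(
    "openai/whisper-large-v2",  # assumed model id
    load_in_8bit=True,
    device_map={"": local_rank},
)
# wrap with PEFT/LoRA and pass to the Trainer as usual; with one full replica per process,
# the Trainer uses DistributedDataParallel rather than naive pipeline parallelism.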

github-actions

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

rrva commented Aug 24, 2023

I have made PR #855 with some of the changes mentioned in this thread, but I still have issues with the lora_config's rank_pattern becoming None.

phineas-pta

Here's what I did to enable distributed data parallelism:

from accelerate import Accelerator

accelerator = Accelerator(…)
…
model = WhisperForConditionalGeneration.from_pretrained(…, device_map={"": accelerator.device})
…
training_args = Seq2SeqTrainingArguments(…, accelerator_config={"split_batches": True})
…
trainer.train()
accelerator.wait_for_everyone()
if accelerator.is_main_process:
	trainer.save_model()

Then launch with python -m accelerate.commands.launch --multi_gpu --num_machines 1 --num_processes 2 your_script.py if you have 2 GPUs, for example.
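
As a follow-up, the adapter saved by trainer.save_model() on the main process can be reloaded for inference roughly as below; the model id and output path are placeholders, not values from this thread.

from transformers import WhisperForConditionalGeneration
from peft import PeftModel

base_model = WhisperForConditionalGeneration.from_pretrained(
    "openai/whisper-large-v2",  # assumed base model id
    load_in_8bit=True,
    device_map="auto",
)
# "path/to/output_dir" is a placeholder for the directory trainer.save_model() wrote to
model = PeftModel.from_pretrained(base_model, "path/to/output_dir")
model.eval()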
