
Enabling model parallelism (training 30b on 2x 3090s and beyond) #131

Merged
3 commits merged into tloen:main on Mar 28, 2023

Conversation

kooshi
Contributor

@kooshi kooshi commented Mar 23, 2023

Does what it says on the tin.
Now multi-GPU users can choose to use them for faster training (DDP) or bigger models (MP).

This required a minor change to the transformers library. It has been merged: PR. Just update by reinstalling the transformers module.

This also serves as a workaround for #8

@HideLord

Perhaps it doesn't work on Windows?
A fresh install of this fork with MICRO_BATCH_SIZE = 64 only uses a single GPU, even though gpus = torch.cuda.device_count() correctly detects 2 GPUs.
ddp remains False because WORLD_SIZE is not set.
Setting PIPE_CHUNKS = 1 gives an error:

Traceback (most recent call last):
  File "E:\LLaMA-train\alpaca-lora\finetune.py", line 142, in <module>
    from torch.distributed.pipeline.sync import Pipe
  File "E:\LLaMA-train\installer_files\env\lib\site-packages\torch\distributed\pipeline\sync\__init__.py", line 9, in <module>
    from .pipe import Pipe, WithDevice
  File "E:\LLaMA-train\installer_files\env\lib\site-packages\torch\distributed\pipeline\sync\pipe.py", line 13, in <module>
    from torch.distributed.rpc import RRef
ImportError: cannot import name 'RRef' from 'torch.distributed.rpc' (E:\LLaMA-train\installer_files\env\lib\site-packages\torch\distributed\rpc\__init__.py)

Here are some relevant libs (transformers is also installed from the correct fork git+https://github.com/kooshi/transformers.git@llama-parallelism):

tokenizers              0.13.2
torch                   2.0.0
torchaudio              2.0.0
torchvision             0.15.0
tqdm                    4.65.0
transformers            4.28.0.dev0

@kooshi
Contributor Author

kooshi commented Mar 23, 2023

Hm, maybe the rpc bit isn't supported on Windows. Keep PIPE_CHUNKS = 0, and try manually setting max_memory as below.

device_map = "auto" is supposed to distribute the model across GPUs, but for some reason, when loading in 8bit, it simply doesn't. You can force the behavior with max_memory.

Each GPU should get roughly (size of model) / (number of GPUs).
So for 30b at 8bit on 2 GPUs it should be max_memory={0: "15GB", 1: "15GB"},
for 13b, max_memory={0: "7GB", 1: "7GB"},
and so on. It's unfortunate that you need to fiddle with it manually... maybe I'll look into why 8bit causes device_map to be ignored tonight.
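In finetune.py that would look something like this (just a sketch; the model path is a placeholder for wherever your weights live):

from transformers import LlamaForCausalLM

# Force an even split of the 8bit 30b model across two GPUs.
# Budget per GPU is roughly (8bit model size) / (number of GPUs).
model = LlamaForCausalLM.from_pretrained(
    "path/to/llama-30b-hf",  # placeholder path
    load_in_8bit=True,
    device_map="auto",
    max_memory={0: "15GB", 1: "15GB"},
)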

Since you're on Windows, use MSI Afterburner or something similar to monitor the VRAM usage, to make sure the model is loaded across both GPUs before training begins. For anyone on Linux, use nvtop.

@HideLord

That kind of worked.
[attached screenshot]

@kooshi
Contributor Author

kooshi commented Mar 23, 2023

Yeah, that's what you should expect to see. Naive model parallelism can't keep both GPUs busy at the same time; it just goes back and forth between them, but it means you can load larger models or use larger batch sizes. Some newer frameworks have fancy tricks to reduce the inefficiency, but this is the simplest case. See here for more info: https://pytorch.org/docs/stable/pipeline.html
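For the curious, the Pipe API from those docs looks roughly like this (a toy two-stage sketch with made-up layer sizes, separate from the finetune.py integration; it needs the RPC framework initialized and two visible GPUs):

import torch
import torch.nn as nn
from torch.distributed import rpc
from torch.distributed.pipeline.sync import Pipe

# Pipe requires the RPC framework, even in a single process.
rpc.init_rpc("worker", rank=0, world_size=1)

# Two toy stages, one per GPU.
stage0 = nn.Linear(1024, 1024).to("cuda:0")
stage1 = nn.Linear(1024, 1024).to("cuda:1")

# chunks=4 splits each batch into 4 micro-batches that flow through the stages,
# so both GPUs can be busy at the same time.
model = Pipe(nn.Sequential(stage0, stage1), chunks=4)

x = torch.randn(32, 1024, device="cuda:0")
out = model(x).local_value()  # forward returns an RRef; the result lives on cuda:1
print(out.shape)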

The Pipeline should be one of those tricks, but I don't think I fully implemented it. I'll need to play with that some more later.

But for now, yeah, looks like it's working as expected for you.

@kooshi kooshi marked this pull request as draft March 24, 2023 16:35
@kooshi
Contributor Author

kooshi commented Mar 24, 2023

I found the root cause of the device_map="auto" discrepancy in the transformers repo, so I'm going to keep this as a draft until I get that fixed and merged.

@sgsdxzy

sgsdxzy commented Mar 25, 2023

Is it possible to support DeepSpeed stage 3 parameter partitioning to fit a large model across multiple GPUs? Would it be faster than the current naive (non-overlapping) pipeline parallelism?

@kooshi
Contributor Author

kooshi commented Mar 25, 2023

I'm not sure; I'll need to look into DeepSpeed more. I played with it for a minute and I think it didn't support 8bit. I'll add it to my list of things to look at, because better parallelism would be awesome. I mostly know how to get full pipelining working, but DeepSpeed would be more valuable.

@sgsdxzy

sgsdxzy commented Mar 25, 2023

How are you planning to implement the full pipeline? I searched for examples and docs, and everything seems to lead to modifying the llama implementation in transformers, which I would consider a last resort.

@kooshi
Contributor Author

kooshi commented Mar 25, 2023

Correct. It's not trivial, but it's not terrible either.
I started working on a quick and dirty experiment of it the other day. It's in the llama-parallelism branch of my transformers fork.

I stopped when I realized I also need to batch the inputs into micro-batches in a single tensor. I was also using the pipeline for a little more than what it was designed for, so it was breaking in weird ways. Please do take a look if you're interested, though be warned it's very hacky and broken.

@AngainorDev
Contributor

Hi,

Thanks for this work!

I'm experimenting with multiple configs to find the best match for my use cases.
Linux, 2x3090. I'm able to train 7b and 13b on both of them with ddp.

I'm now trying to train the 30b, but I keep getting OOM.

  • transformers was upgraded
  • WORLD_SIZE set to 1, and I made sure ddp was off.
  • 2 gpus detected
  • I tried to force 15GB/15GB as max_memory

Still, during "Loading checkpoint shards:" it breaks with OOM, having filled up the first GPU while the second one sits almost unused.

Any idea what I could do wrong?

@sgsdxzy

sgsdxzy commented Mar 26, 2023

@AngainorDev how did you force max_memory? I edited finetune.py line 78 to be

model = LlamaForCausalLM.from_pretrained(
        base_model,
        load_in_8bit=True,
        device_map=device_map,
        max_memory={0: "11776MB", 1: "11776MB", 2: "11776MB"}
    )

And I can train 30B on 2080Ti 22G x 3 with micro_batch_size=16. But one epoch would take >30h because naive model-parallel training is very inefficient.

@kooshi
Contributor Author

kooshi commented Mar 26, 2023

@AngainorDev I just pushed a change that references my fork of transformers. I was hoping they would merge the PR in quickly, but since they're a company, it seems like they won't get to it till Monday. To install it:

git pull
pip uninstall transformers
pip install -r transformers.txt

With that, you won't need to use a hard-coded max_memory, and you can just use the "auto" device map for a perfect distribution.

@sgsdxzy

sgsdxzy commented Mar 26, 2023

Correct. It's not trivial, but it's not terrible either. I started working on a quick and dirty experiment of it the other day. It's in the llama-parallelism branch of my transformers fork. [...]

I find DeepSpeed pipeline parallelism very promising: you just need to change the input and output of each layer to a tuple of tensors, and DeepSpeed can do the rest for you, including micro-batching, etc. It has much more relaxed constraints than the PyTorch Pipe: you don't need to express the model as an nn.Sequential (just a list of Python callables), each layer does not need to be an nn.Module (any Python callable works), and the input/output can be a tuple of tensors, not limited to one tensor.
Because Llama has only one layer type, LlamaDecoderLayer, I think it could be relatively easy to wrap the layer in a wrapper that simply packs and unpacks parameters as tuples (roughly like the sketch below).
Are you interested in implementing this? I might try to do it as well, but I am new to ML (I just installed torch weeks ago) so it might take me a long time before it can work.
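Something like this wrapper, maybe (an untested sketch; the tuple layout is made up for illustration and the DeepSpeed PipelineModule wiring is omitted):

import torch.nn as nn

class DecoderLayerPipe(nn.Module):
    """Hypothetical adapter: lets a LlamaDecoderLayer pass a tuple of tensors
    between pipeline stages, which is what DeepSpeed's pipeline engine expects."""
    def __init__(self, decoder_layer):
        super().__init__()
        self.layer = decoder_layer

    def forward(self, inputs):
        # Unpack the tuple coming from the previous stage.
        hidden_states, attention_mask, position_ids = inputs
        layer_outputs = self.layer(
            hidden_states,
            attention_mask=attention_mask,
            position_ids=position_ids,
        )
        # Repack for the next stage (drop attention weights / kv cache during training).
        return (layer_outputs[0], attention_mask, position_ids)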

@AngainorDev
Contributor

I just updated to git+https://github.com/kooshi/transformers.git@balanced_memory_8bit

how did you force max_memory? I edited finetune.py line 78 to be

I used max_memory={0: "15GB", 1: "15GB"},

This seems to have no effect; GPU 0 takes it all and OOMs at 70% of model loading.

@kooshi
Contributor Author

kooshi commented Mar 26, 2023

Sounds like it's not even seeing the second GPU as available or something. Make sure CUDA_VISIBLE_DEVICES is set correctly.

@AngainorDev
Contributor

Yeah,

But torch.cuda.device_count() correctly detects the 2 GPUs.
CUDA_VISIBLE_DEVICES was not set; I explicitly set it to CUDA_VISIBLE_DEVICES=0,1, no change.

The second one gets a bit of VRAM used when running, around 1GB.
Both are successfully used with ddp on smaller models.

@kooshi
Contributor Author

kooshi commented Mar 27, 2023

My second PR for transformers was merged in, so now the only thing required to use model parallelism is reinstalling transformers and merging the few lines left in this PR. I'm not sure what's going on in @AngainorDev's case, because it's behaving as if both proven fixes are being ignored: the manual max_memory and the updated load_in_8bit logic in transformers. I have to imagine something is configured incorrectly or is somehow overriding the correct behavior.

@AngainorDev my next suggestion would be to attempt a clean slate: set up a brand-new conda environment, install the latest supported libraries, and run this code, unmodified, just to see if it can work at all before making changes.
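Something along these lines (env name, Python version, and model path are just placeholders):

conda create -n alpaca-lora-clean python=3.10 -y
conda activate alpaca-lora-clean
git clone https://github.com/tloen/alpaca-lora && cd alpaca-lora
pip install -r requirements.txt
python finetune.py --base_model 'path/to/llama-30b-hf' --data_path 'yahma/alpaca-cleaned' --output_dir './lora-alpaca'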

This PR is ready to be merged.

@kooshi kooshi marked this pull request as ready for review March 27, 2023 15:23
@AngainorDev
Contributor

Thanks for the follow-up.
Agreed, something could be broken in my setup. I'll start from a clean one next time I try, thanks!

@KohakuBlueleaf
Contributor

I've already used this update to train with MP and it works well!
I trained a 13B model on 2x 3090s with cutoff len 512 + batch size 24.

Owner

@tloen tloen left a comment


LGTM — thanks for this!

@tloen tloen merged commit 55b664f into tloen:main Mar 28, 2023
@AAAZSF

AAAZSF commented Mar 29, 2023

I successfully finetuned the 30b model on multiple GPUs with pipeline parallelism.
But when I set load_in_8bit=False, it causes a RuntimeError:

  File "/home/usr/project/alpaca-lora/finetune.py", line 288, in <module>
    fire.Fire(train)
  File "/home/usr/.conda/envs/lora/lib/python3.9/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/home/usr/.conda/envs/lora/lib/python3.9/site-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/home/usr/.conda/envs/lora/lib/python3.9/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/home/usr/project/alpaca-lora/finetune.py", line 255, in train
    trainer.train(resume_from_checkpoint=resume_from_checkpoint)
  File "/media/data/6/usr/tmp/transformers/src/transformers/trainer.py", line 1636, in train
    return inner_training_loop(
  File "/media/data/6/usr/tmp/transformers/src/transformers/trainer.py", line 1903, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/media/data/6/usr/tmp/transformers/src/transformers/trainer.py", line 2649, in training_step
    loss = self.compute_loss(model, inputs)
  File "/media/data/6/usr/tmp/transformers/src/transformers/trainer.py", line 2681, in compute_loss
    outputs = model(**inputs)
  File "/home/usr/.conda/envs/lora/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/usr/.conda/envs/lora/lib/python3.9/site-packages/peft/peft_model.py", line 530, in forward
    return self.base_model(
  File "/home/usr/.conda/envs/lora/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/usr/.conda/envs/lora/lib/python3.9/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/media/data/6/usr/tmp/transformers/src/transformers/models/llama/modeling_llama.py", line 765, in forward
    outputs = self.model(
  File "/home/usr/.conda/envs/lora/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/media/data/6/usr/tmp/transformers/src/transformers/models/llama/modeling_llama.py", line 614, in forward
    layer_outputs = decoder_layer(
  File "/home/usr/.conda/envs/lora/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/usr/.conda/envs/lora/lib/python3.9/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/media/data/6/usr/tmp/transformers/src/transformers/models/llama/modeling_llama.py", line 309, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "/home/usr/.conda/envs/lora/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/usr/.conda/envs/lora/lib/python3.9/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/media/data/6/usr/tmp/transformers/src/transformers/models/llama/modeling_llama.py", line 209, in forward
    query_states = self.q_proj(hidden_states).view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
  File "/home/usr/.conda/envs/lora/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/usr/.conda/envs/lora/lib/python3.9/site-packages/peft/tuners/lora.py", line 350, in forward
    result += self.lora_B(self.lora_A(self.lora_dropout(x))) * self.scaling
  File "/home/usr/.conda/envs/lora/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/usr/.conda/envs/lora/lib/python3.9/site-packages/torch/nn/modules/linear.py", line 114, in forward
    return F.linear(input, self.weight, self.bias)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument mat2 in method wrapper_CUDA_mm)

Hoping for some help, thanks!

@sgsdxzy

sgsdxzy commented Mar 29, 2023

I successfully finetuned the 30b model on multiple GPUs with pipeline parallelism. But when I set load_in_8bit=False, it causes a RuntimeError: [...]

Can you fit the fp16 model in your VRAM? It seems you don't have enough VRAM and some layers are being put on the CPU.

@AAAZSF

AAAZSF commented Mar 29, 2023

I successfully finetuned the 30b model on multiple GPUs with pipeline parallelism. But when I set load_in_8bit=False, it causes a RuntimeError: [...]

Can you fit the fp16 model in your VRAM? It seems you don't have enough VRAM and some layers are being put on the CPU.

Sorry, I forgot to say that I set load_in_8bit=False on the 7b model.
I tested the 7b fp16 model on 2x 24G GPUs, so I think memory is enough.
More detail while running:
[nvidia-smi screenshot]

>>> model.hf_device_map
{'model.embed_tokens': 0, 'model.layers.0': 0, 'model.layers.1': 0, 'model.layers.2': 0, 'model.layers.3': 0, 'model.layers.4': 0, 'model.layers.5': 0, 'model.layers.6': 0, 'model.layers.7': 0, 'model.layers.8': 0, 'model.layers.9': 0, 'model.layers.10': 0, 'model.layers.11': 0, 'model.layers.12': 0, 'model.layers.13': 0, 'model.layers.14': 0, 'model.layers.15': 0, 'model.layers.16': 1, 'model.layers.17': 1, 'model.layers.18': 1, 'model.layers.19': 1, 'model.layers.20': 1, 'model.layers.21': 1, 'model.layers.22': 1, 'model.layers.23': 1, 'model.layers.24': 1, 'model.layers.25': 1, 'model.layers.26': 1, 'model.layers.27': 1, 'model.layers.28': 1, 'model.layers.29': 1, 'model.layers.30': 1, 'model.layers.31': 1, 'model.norm': 1, 'lm_head': 1}

@KohakuBlueleaf
Contributor

Make sure you guys have something like model.parallized = True set (check the changed files; roughly sketched below), or your model will blow up.

And this error is not caused by OOM.
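For reference, the switch being referred to is roughly the following (a sketch from memory, using variables from finetune.py; check the changed files in this PR for the exact code):

# When DDP is off and more than one GPU is visible, mark the model as
# model-parallel so the HF Trainer doesn't wrap it in its own DataParallel
# and the device_map placement is respected.
if not ddp and torch.cuda.device_count() > 1:
    model.is_parallelizable = True
    model.model_parallel = True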

@RunhuiWang

Hi, thanks for this work! [...] I'm now trying to train the 30b, but I keep getting OOM. [...] Still, during "Loading checkpoint shards:" it breaks with OOM, having filled up the first GPU while the second one sits almost unused. Any idea what I could do wrong?

Are you using torchrun?

@RunhuiWang

RunhuiWang commented Apr 14, 2023

Does what it says on the tin. Now multi-GPU users can choose to use them for faster training (DDP) or bigger models (MP). [...]

Could you provide a command-line example that uses model parallelism on multiple GPUs? I have tried

CUDA_VISIBLE_DEVICES=0,1 python finetune.py --base_model '/data/980pro2tb/LLAMA-hf/30B' --data_path 'yahma/alpaca-cleaned' --output_dir './lora-alpaca'

The model was split across the two GPUs about evenly, but I got the error "../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [30,0,0], thread: [96,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed."

If I train a smaller model on a single GPU, then the error won't show up. DDP also works well on 2 GPUs.

I have also tried different versions of CUDA, NVIDIA drivers, transformers, bitsandbytes, and llama models (even converted from the original weights), but this error is still here.

@kooshi
Contributor Author

kooshi commented Apr 14, 2023

Yeah... this is new. It was also reported here: huggingface/transformers#22546

One guy there noticed the only difference was his driver version: huggingface/transformers#22546 (comment)

I haven't seen it yet, but I haven't been training recently. I may have some time to check it out this weekend, but it's likely beyond my knowledge.

@RunhuiWang

Thanks for pointing me to that thread. I forgot to mention that I was using 4090s. I had also checked that thread earlier and tried his driver version, but no luck on the 4090s. MP works well on 2x 3090s though.

@RunhuiWang

Yeah... this is new. It was also reported here: huggingface/transformers#22546 [...]

I think this issue might be related to Ubuntu 22.04. I downgraded my system to Ubuntu 20.04 and everything works fine. Thanks a lot for your effort in this project!

@kongbohu

Does what it says on the tin. Now multi-GPU users can choose to use them for faster training (DDP) or bigger models (MP). [...]

So it's faster training (DDP) "or" bigger models (MP). I have been looking for ways to do DDP "and" MP but have had no luck so far. Neither DeepSpeed nor torchrun gives a clear clue.

@kongbohu

[...] So it's faster training (DDP) "or" bigger models (MP). I have been looking for ways to do DDP "and" MP but have had no luck so far.

DeepSpeed does support MP, but it seems to be only for inference -- hope someone can correct me if I'm wrong.

@vans163

vans163 commented Apr 27, 2023

[...] I think this issue might be related to Ubuntu 22.04. I downgraded my system to Ubuntu 20.04 and everything works fine.

It doesn't work on Ubuntu 18.04 either; is 20.04 some magic version? I doubt it. What version of CUDA do you have?

@JiexingQi

JiexingQi commented May 15, 2023

I think this issue might be related to Ubuntu 22.04. I downgraded my system to Ubuntu 20.04 and everything works fine. [...]
It doesn't work on Ubuntu 18.04 either; is 20.04 some magic version? I doubt it. What version of CUDA do you have?

My system is Ubuntu 20.04, but I still hit this problem. Have you solved it? @vans163

@JiexingQi

My 4090 does not work, but the A100 works.

@vans163

vans163 commented May 15, 2023

Nope, I did not get it working.

@RunhuiWang

[...] It doesn't work on Ubuntu 18.04 either; is 20.04 some magic version? I doubt it. What version of CUDA do you have?

python3.8, CUDA 12.0, Driver 525.105.17
