
Enabling model parallelism (training 30b on 2x 3090s and beyond) #131

Merged
3 commits merged into tloen:main on Mar 28, 2023

Conversation

kooshi
Contributor

@kooshi kooshi commented Mar 23, 2023

Does what it says on the tin.
Now multi-GPU users can choose to use them for faster training (DDP) or bigger models (MP).

This required a minor change to the transformers library. It has been merged: PR. Just update by reinstalling the transformers module.

This also serves as a workaround for #8

@HideLord

Perhaps it doesn't work on Windows?
A fresh install of this fork with MICRO_BATCH_SIZE = 64 only uses a single GPU, even though gpus = torch.cuda.device_count() correctly detects 2 GPUs.
ddp remains False because WORLD_SIZE is not set.
Setting PIPE_CHUNKS = 1 gives an error:

Traceback (most recent call last):
  File "E:\LLaMA-train\alpaca-lora\finetune.py", line 142, in <module>
    from torch.distributed.pipeline.sync import Pipe
  File "E:\LLaMA-train\installer_files\env\lib\site-packages\torch\distributed\pipeline\sync\__init__.py", line 9, in <module>
    from .pipe import Pipe, WithDevice
  File "E:\LLaMA-train\installer_files\env\lib\site-packages\torch\distributed\pipeline\sync\pipe.py", line 13, in <module>
    from torch.distributed.rpc import RRef
ImportError: cannot import name 'RRef' from 'torch.distributed.rpc' (E:\LLaMA-train\installer_files\env\lib\site-packages\torch\distributed\rpc\__init__.py)

Here are some relevant libs (transformers is also installed from the correct fork git+https://github.com/kooshi/transformers.git@llama-parallelism):

tokenizers              0.13.2
torch                   2.0.0
torchaudio              2.0.0
torchvision             0.15.0
tqdm                    4.65.0
transformers            4.28.0.dev0

@kooshi
Contributor Author

kooshi commented Mar 23, 2023

Hm, maybe the rpc bit isn't supported on Windows. Keep PIPE_CHUNKS = 0, and try manually setting max_memory as below.

device_map = "auto" is supposed to distribute the model across GPUs, but for some reason, when loading in 8bit, it simply doesn't. You can force the behavior with max_memory.

Each GPU should get roughly (size of model) / (number of GPUs).
So for 30b at 8bit on 2 GPUs it should be max_memory={0: "15GB", 1: "15GB"},
for 13b, max_memory={0: "7GB", 1: "7GB"},
and so on. It's unfortunate that you need to fiddle with it manually... maybe I'll look into why 8bit causes device_map to be ignored tonight.
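In finetune.py that would look something like this (just a sketch; the model path is a placeholder for wherever your weights live):

from transformers import LlamaForCausalLM

# Force an even split of the 8bit 30b model across two GPUs.
# Budget per GPU is roughly (8bit model size) / (number of GPUs).
model = LlamaForCausalLM.from_pretrained(
    "path/to/llama-30b-hf",  # placeholder path
    load_in_8bit=True,
    device_map="auto",
    max_memory={0: "15GB", 1: "15GB"},
)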

Since you're on Windows, use MSI Afterburner or something similar to monitor the VRAM usage, to make sure the model is loaded across both GPUs before training begins. For anyone on Linux, use nvtop.

@HideLord

That kind of worked.
[attached screenshot]

@kooshi
Contributor Author

kooshi commented Mar 23, 2023

Yeah, that's what you should expect to see. Naive model parallelism can't keep both GPUs busy at the same time; it just goes back and forth between them, but it means you can load larger models or use larger batch sizes. Some newer frameworks have fancy tricks to reduce the inefficiency, but this is the simplest case. See here for more info: https://pytorch.org/docs/stable/pipeline.html
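For the curious, the Pipe API from those docs looks roughly like this (a toy two-stage sketch with made-up layer sizes, separate from the finetune.py integration; it needs the RPC framework initialized and two visible GPUs):

import torch
import torch.nn as nn
from torch.distributed import rpc
from torch.distributed.pipeline.sync import Pipe

# Pipe requires the RPC framework, even in a single process.
rpc.init_rpc("worker", rank=0, world_size=1)

# Two toy stages, one per GPU.
stage0 = nn.Linear(1024, 1024).to("cuda:0")
stage1 = nn.Linear(1024, 1024).to("cuda:1")

# chunks=4 splits each batch into 4 micro-batches that flow through the stages,
# so both GPUs can be busy at the same time.
model = Pipe(nn.Sequential(stage0, stage1), chunks=4)

x = torch.randn(32, 1024, device="cuda:0")
out = model(x).local_value()  # forward returns an RRef; the result lives on cuda:1
print(out.shape)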

The Pipeline should be one of those tricks, but I don't think I fully implemented it. I'll need to play with that some more later.

But for now, yeah, looks like it's working as expected for you.

@kooshi kooshi marked this pull request as draft March 24, 2023 16:35
@kooshi
Contributor Author

kooshi commented Mar 24, 2023

I found the root cause of the device_map="auto" discrepancy in the transformers repo, so I'm going to keep this as a draft until I get that fixed and merged.

@sgsdxzy

sgsdxzy commented Mar 25, 2023

Is it possible to support DeepSpeed stage 3 parameter partitioning to fit a large model across multiple GPUs? Would it be faster than the current naive (non-overlapping) pipeline parallelism?

@kooshi
Contributor Author

kooshi commented Mar 25, 2023

I'm not sure; I'll need to look into DeepSpeed more. I played with it for a minute and I think it didn't support 8bit. I'll add it to my list of things to look at, because better parallelism would be awesome. I mostly know how to get full pipelining working, but DeepSpeed would be more valuable.

@sgsdxzy

sgsdxzy commented Mar 25, 2023

How are you planning to implement the full pipeline? I searched for examples and docs, and everything seems to lead to modifying the llama implementation in transformers, which I would consider a last resort.

@kooshi
Contributor Author

kooshi commented Mar 25, 2023

Correct. It's not trivial, but it's not terrible either.
I started working on a quick and dirty experiment of it the other day. It's in the llama-parallelism branch of my transformers fork.

I stopped when I realized I also need to batch the inputs into micro-batches in a single tensor. I was also using the pipeline for a little more than what it was designed for, so it was breaking in weird ways. Please do take a look if you're interested, though be warned it's very hacky and broken.

@AngainorDev
Contributor

Hi,

Thanks for this work!

I'm experimenting with multiple configs to find the best match for my use cases.
Linux, 2x3090. I'm able to train 7b and 13b on both of them with ddp.

I'm now trying to train the 30b, but I keep getting OOM.

  • transformers was upgraded
  • WORLD_SIZE set to 1, and I made sure ddp was off.
  • 2 gpus detected
  • I tried to force 15GB/15GB as max_memory

Still, during "Loading checkpoint shards:" it breaks with OOM, having filled up the first GPU while the second one sits almost unused.

Any idea what I could do wrong?

@sgsdxzy

sgsdxzy commented Mar 26, 2023

@AngainorDev how did you force max_memory? I edited finetune.py line 78 to be

model = LlamaForCausalLM.from_pretrained(
        base_model,
        load_in_8bit=True,
        device_map=device_map,
        max_memory={0: "11776MB", 1: "11776MB", 2: "11776MB"}
    )

And I can train 30B on 2080Ti 22G x 3 with micro_batch_size=16. But one epoch would take >30h because naive model-parallel training is very inefficient.

@kooshi
Contributor Author

kooshi commented Mar 26, 2023

@AngainorDev I just pushed a change that references my fork of transformers. I was hoping they would merge the PR in quickly, but since they're a company, it seems like they won't get to it till Monday. To install it:

git pull
pip uninstall transformers
pip install -r transformers.txt

With that, you won't need to use a hard-coded max_memory, and you can just use the "auto" device map for a perfect distribution.

@sgsdxzy

sgsdxzy commented Mar 26, 2023

Correct. It's not trivial, but it's not terrible either. I started working on a quick and dirty experiment of it the other day. It's in the llama-parallelism branch of my transformers fork. [...]

I find DeepSpeed pipeline parallelism very promising: you just need to change the input and output of each layer to a tuple of tensors, and DeepSpeed can do the rest for you, including micro-batching, etc. It has much more relaxed constraints than the PyTorch Pipe: you don't need to express the model as an nn.Sequential (just a list of Python callables), each layer does not need to be an nn.Module (any Python callable works), and the input/output can be a tuple of tensors, not limited to one tensor.
Because Llama has only one layer type, LlamaDecoderLayer, I think it could be relatively easy to wrap the layer in a wrapper that simply packs and unpacks parameters as tuples (roughly like the sketch below).
Are you interested in implementing this? I might try to do it as well, but I am new to ML (I just installed torch weeks ago) so it might take me a long time before it can work.
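Something like this wrapper, maybe (an untested sketch; the tuple layout is made up for illustration and the DeepSpeed PipelineModule wiring is omitted):

import torch.nn as nn

class DecoderLayerPipe(nn.Module):
    """Hypothetical adapter: lets a LlamaDecoderLayer pass a tuple of tensors
    between pipeline stages, which is what DeepSpeed's pipeline engine expects."""
    def __init__(self, decoder_layer):
        super().__init__()
        self.layer = decoder_layer

    def forward(self, inputs):
        # Unpack the tuple coming from the previous stage.
        hidden_states, attention_mask, position_ids = inputs
        layer_outputs = self.layer(
            hidden_states,
            attention_mask=attention_mask,
            position_ids=position_ids,
        )
        # Repack for the next stage (drop attention weights / kv cache during training).
        return (layer_outputs[0], attention_mask, position_ids)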

@AngainorDev
Contributor

I just updated to git+https://github.com/kooshi/transformers.git@balanced_memory_8bit

how did you force max_memory? I edited finetune.py line 78 to be

I used max_memory={0: "15GB", 1: "15GB"},

This seems to have no effect; GPU 0 takes it all and OOMs at 70% of model loading.

@kooshi
Contributor Author

kooshi commented Mar 26, 2023

Sounds like it's not even seeing the second GPU as available or something. Make sure CUDA_VISIBLE_DEVICES is set correctly.

@AngainorDev
Contributor

Yeah,

But torch.cuda.device_count() correctly detects the 2 GPUs.
CUDA_VISIBLE_DEVICES was not set; I explicitly set it to CUDA_VISIBLE_DEVICES=0,1, no change.

The second one gets a bit of VRAM used when running, around 1GB.
Both are successfully used with ddp on smaller models.

@kooshi
Contributor Author

kooshi commented Mar 27, 2023

My second PR for transformers was merged in, so now the only thing required to use model parallelism is reinstalling transformers and merging the few lines left in this PR. I'm not sure what's going on in @AngainorDev's case, because it's behaving as if both proven fixes are being ignored: the manual max_memory and the updated load_in_8bit logic in transformers. I have to imagine something is configured incorrectly or is somehow overriding the correct behavior.

@AngainorDev my next suggestion would be to attempt a clean slate: set up a brand-new conda environment, install the latest supported libraries, and run this code, unmodified, just to see if it can work at all before making changes.
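Something along these lines (env name, Python version, and model path are just placeholders):

conda create -n alpaca-lora-clean python=3.10 -y
conda activate alpaca-lora-clean
git clone https://github.com/tloen/alpaca-lora && cd alpaca-lora
pip install -r requirements.txt
python finetune.py --base_model 'path/to/llama-30b-hf' --data_path 'yahma/alpaca-cleaned' --output_dir './lora-alpaca'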

This PR is ready to be merged.

@kooshi kooshi marked this pull request as ready for review March 27, 2023 15:23
@AngainorDev
Contributor

Thanks for the follow-up.
Agreed, something could be broken in my setup. I'll start from a clean one next time I try, thanks!

@KohakuBlueleaf
Contributor

I've already used this update to train with MP and it works well!
I trained a 13B model on 2x 3090s with cutoff len 512 + batch size 24.

Owner

@tloen tloen left a comment


LGTM — thanks for this!

@tloen tloen merged commit 55b664f into tloen:main Mar 28, 2023
@AAAZSF

AAAZSF commented Mar 29, 2023

I successfully finetuned the 30b model on multiple GPUs with pipeline parallelism.
But when I set load_in_8bit=False, it causes a RuntimeError:

  File "/home/usr/project/alpaca-lora/finetune.py", line 288, in <module>
    fire.Fire(train)
  File "/home/usr/.conda/envs/lora/lib/python3.9/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/home/usr/.conda/envs/lora/lib/python3.9/site-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/home/usr/.conda/envs/lora/lib/python3.9/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/home/usr/project/alpaca-lora/finetune.py", line 255, in train
    trainer.train(resume_from_checkpoint=resume_from_checkpoint)
  File "/media/data/6/usr/tmp/transformers/src/transformers/trainer.py", line 1636, in train
    return inner_training_loop(
  File "/media/data/6/usr/tmp/transformers/src/transformers/trainer.py", line 1903, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/media/data/6/usr/tmp/transformers/src/transformers/trainer.py", line 2649, in training_step
    loss = self.compute_loss(model, inputs)
  File "/media/data/6/usr/tmp/transformers/src/transformers/trainer.py", line 2681, in compute_loss
    outputs = model(**inputs)
  File "/home/usr/.conda/envs/lora/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/usr/.conda/envs/lora/lib/python3.9/site-packages/peft/peft_model.py", line 530, in forward
    return self.base_model(
  File "/home/usr/.conda/envs/lora/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/usr/.conda/envs/lora/lib/python3.9/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/media/data/6/usr/tmp/transformers/src/transformers/models/llama/modeling_llama.py", line 765, in forward
    outputs = self.model(
  File "/home/usr/.conda/envs/lora/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/media/data/6/usr/tmp/transformers/src/transformers/models/llama/modeling_llama.py", line 614, in forward
    layer_outputs = decoder_layer(
  File "/home/usr/.conda/envs/lora/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/usr/.conda/envs/lora/lib/python3.9/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/media/data/6/usr/tmp/transformers/src/transformers/models/llama/modeling_llama.py", line 309, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "/home/usr/.conda/envs/lora/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/usr/.conda/envs/lora/lib/python3.9/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/media/data/6/usr/tmp/transformers/src/transformers/models/llama/modeling_llama.py", line 209, in forward
    query_states = self.q_proj(hidden_states).view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
  File "/home/usr/.conda/envs/lora/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/usr/.conda/envs/lora/lib/python3.9/site-packages/peft/tuners/lora.py", line 350, in forward
    result += self.lora_B(self.lora_A(self.lora_dropout(x))) * self.scaling
  File "/home/usr/.conda/envs/lora/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/usr/.conda/envs/lora/lib/python3.9/site-packages/torch/nn/modules/linear.py", line 114, in forward
    return F.linear(input, self.weight, self.bias)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument mat2 in method wrapper_CUDA_mm)

Hoping for some help, thanks!

@sgsdxzy

sgsdxzy commented Mar 29, 2023

I successfully finetuned the 30b model on multiple GPUs with pipeline parallelism. But when I set load_in_8bit=False, it causes a RuntimeError: [...]

Can you fit the fp16 model in your VRAM? It seems you don't have enough VRAM and some layers are being put on the CPU.

@AAAZSF

AAAZSF commented Mar 29, 2023

I successfully finetuned the 30b model on multiple GPUs with pipeline parallelism. But when I set load_in_8bit=False, it causes a RuntimeError: [...]

Can you fit the fp16 model in your VRAM? It seems you don't have enough VRAM and some layers are being put on the CPU.

Sorry, I forgot to say that I set load_in_8bit=False on the 7b model.
I tested the 7b fp16 model on 2x 24G GPUs, so I think memory is enough.
More detail while running:
[nvidia-smi screenshot]

>>> model.hf_device_map
{'model.embed_tokens': 0, 'model.layers.0': 0, 'model.layers.1': 0, 'model.layers.2': 0, 'model.layers.3': 0, 'model.layers.4': 0, 'model.layers.5': 0, 'model.layers.6': 0, 'model.layers.7': 0, 'model.layers.8': 0, 'model.layers.9': 0, 'model.layers.10': 0, 'model.layers.11': 0, 'model.layers.12': 0, 'model.layers.13': 0, 'model.layers.14': 0, 'model.layers.15': 0, 'model.layers.16': 1, 'model.layers.17': 1, 'model.layers.18': 1, 'model.layers.19': 1, 'model.layers.20': 1, 'model.layers.21': 1, 'model.layers.22': 1, 'model.layers.23': 1, 'model.layers.24': 1, 'model.layers.25': 1, 'model.layers.26': 1, 'model.layers.27': 1, 'model.layers.28': 1, 'model.layers.29': 1, 'model.layers.30': 1, 'model.layers.31': 1, 'model.norm': 1, 'lm_head': 1}

@KohakuBlueleaf
Contributor

Make sure you guys have something like model.parallized = True set (check the changed files; roughly sketched below), or your model will blow up.

And this error is not caused by OOM.
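For reference, the switch being referred to is roughly the following (a sketch from memory, using variables from finetune.py; check the changed files in this PR for the exact code):

# When DDP is off and more than one GPU is visible, mark the model as
# model-parallel so the HF Trainer doesn't wrap it in its own DataParallel
# and the device_map placement is respected.
if not ddp and torch.cuda.device_count() > 1:
    model.is_parallelizable = True
    model.model_parallel = True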

@RunhuiWang

Hi, thanks for this work! [...] I'm now trying to train the 30b, but I keep getting OOM. [...] Still, during "Loading checkpoint shards:" it breaks with OOM, having filled up the first GPU while the second one sits almost unused. Any idea what I could do wrong?

Are you using torchrun?

@RunhuiWang

RunhuiWang commented Apr 14, 2023

Does what it says on the tin. Now multi-GPU users can choose to use them for faster training (DDP) or bigger models (MP). [...]

Could you provide a command-line example that uses model parallelism on multiple GPUs? I have tried

CUDA_VISIBLE_DEVICES=0,1 python finetune.py --base_model '/data/980pro2tb/LLAMA-hf/30B' --data_path 'yahma/alpaca-cleaned' --output_dir './lora-alpaca'

The model was split across the two GPUs about evenly, but I got the error "../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [30,0,0], thread: [96,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed."

If I train a smaller model on a single GPU, then the error won't show up. DDP also works well on 2 GPUs.

I have also tried different versions of CUDA, NVIDIA drivers, transformers, bitsandbytes, and llama models (even converted from the original weights), but this error is still here.

@kooshi
Contributor Author

kooshi commented Apr 14, 2023

Yeah... this is new. It was also reported here: huggingface/transformers#22546

One guy there noticed the only difference was his driver version: huggingface/transformers#22546 (comment)

I haven't seen it yet, but I haven't been training recently. I may have some time to check it out this weekend, but it's likely beyond my knowledge.

@RunhuiWang

Thanks for pointing me to that thread. I forgot to mention that I was using 4090s. I had also checked that thread earlier and tried his driver version, but no luck on the 4090s. MP works well on 2x 3090s though.

@RunhuiWang

Yeah... this is new. It was also reported here: huggingface/transformers#22546 [...]

I think this issue might be related to Ubuntu 22.04. I downgraded my system to Ubuntu 20.04 and everything works fine. Thanks a lot for your effort in this project!

@kongbohu

Does what it says on the tin. Now multi-GPU users can choose to use them for faster training (DDP) or bigger models (MP). [...]

So it's faster training (DDP) "or" bigger models (MP). I have been looking for ways to do DDP "and" MP but have had no luck so far. Neither DeepSpeed nor torchrun gives a clear clue.

@kongbohu

[...] So it's faster training (DDP) "or" bigger models (MP). I have been looking for ways to do DDP "and" MP but have had no luck so far.

DeepSpeed does support MP, but it seems to be only for inference -- hope someone can correct me if I'm wrong.

@vans163

vans163 commented Apr 27, 2023

[...] I think this issue might be related to Ubuntu 22.04. I downgraded my system to Ubuntu 20.04 and everything works fine.

It doesn't work on Ubuntu 18.04 either; is 20.04 some magic version? I doubt it. What version of CUDA do you have?

@JiexingQi

JiexingQi commented May 15, 2023

I think this issue might be related to Ubuntu 22.04. I downgraded my system to Ubuntu 20.04 and everything works fine. [...]
It doesn't work on Ubuntu 18.04 either; is 20.04 some magic version? I doubt it. What version of CUDA do you have?

My system is Ubuntu 20.04, but I still hit this problem. Have you solved it? @vans163

@JiexingQi

My 4090 does not work, but the A100 works.

@vans163

vans163 commented May 15, 2023

Nope, I did not get it working.

@RunhuiWang

[...] It doesn't work on Ubuntu 18.04 either; is 20.04 some magic version? I doubt it. What version of CUDA do you have?

python3.8, CUDA 12.0, Driver 525.105.17
