runtime error: mat1 and mat2 shapes cannot be multiplied #8
A6000 here; the same error happened whether training on one GPU or two.
Yeah, seems to be bitsandbytes-foundation/bitsandbytes#162
Strange: on my friend's machine with 2x 3090s (Ubuntu 20.04, CUDA 11.2), he gets this error even though he is training on a single 3090. However, on another cloud machine I'm using with similar specs but only 1x 3090 (Ubuntu 20.04, CUDA 12.1), I do not get this error. Could it have something to do with having two 3090s installed, even though only one is used for training? Or maybe it's the CUDA version? The working machine with a single 3090 is running CUDA 12.
Ran into the same with 4x V100. Any tips appreciated.
Forcing training onto a single GPU fixed it for us:
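For example, a minimal sketch of that workaround, assuming GPU 0 is the device you want (add your usual finetune.py arguments):

# expose only one GPU to PyTorch so Trainer never wraps the model in DataParallel
CUDA_VISIBLE_DEVICES=0 python finetune.py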
Had the same issue when training on 2 GPUs using just python finetune.py. Got it running on both GPUs using torchrun.
Make sure your settings (gradient accumulation, micro steps) are consistent.
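Roughly, the launch looks like this. A sketch for a 2-GPU machine; the port is arbitrary and you still pass your normal finetune.py arguments:

# one process per GPU; WORLD_SIZE and --nproc_per_node both match the number of GPUs
WORLD_SIZE=2 CUDA_VISIBLE_DEVICES=0,1 torchrun --nproc_per_node=2 --master_port=1234 finetune.py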
I had the same issue with a 2x A6000 setup. Forcing training onto a single GPU fixed it.
I fixed the issue with the code above.
For everyone dealing with this: it happens because BitsAndBytes doesn't play nicely with Trainer when it tries to do DataParallelism. We're not actually missing out, as DataParallelism is quite slow; as referenced above, finetune.py supports DistributedDataParallelism, which you can launch with torchrun and which is much faster. I also just submitted a PR to run this with model parallelism, so you can use multiple GPUs to run larger models.
@kooshi Would this PR allow pipeline parallelism for inference on llama as well? Would it be possible to have a parallel example for generate.py?
That's a good question. I haven't tried it, but I think it should at least run. If you have more than two GPUs, take the whole block from under ... I'm not sure if pipelining even helps with inference, though.
What if I have five? Changing WORLD_SIZE=5 and --nproc_per_node=5 and setting CUDA_VISIBLE_DEVICES to 0,1,2,3,4 gets me a segfault.
Edit: actually, I get this with just one as well.
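For context, the launch being described is roughly this sketch (assuming five visible devices and otherwise default finetune.py arguments):

# five processes, one per GPU
WORLD_SIZE=5 CUDA_VISIBLE_DEVICES=0,1,2,3,4 torchrun --nproc_per_node=5 finetune.py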
@sfxworks Yes, me too. Have you solved it?
I get a similar issue with Falcon, but not on their official Colab:
How do you force training onto a single GPU when running python finetune.py?
In my case, it doesn't work. I added the above code into finetune.py, but when running python finetune.py all GPUs are still used. Here is the traceback:
File "/opt/miniconda3/envs/textgen/lib/python3.10/site-packages/transformers/trainer.py", line 1628, in train
return inner_training_loop(
File "/opt/miniconda3/envs/textgen/lib/python3.10/site-packages/transformers/trainer.py", line 1895, in _inner_training_loop
tr_loss_step = self.training_step(model, inputs)
File "/opt/miniconda3/envs/textgen/lib/python3.10/site-packages/transformers/trainer.py", line 2637, in training_step
loss = self.compute_loss(model, inputs)
File "/opt/miniconda3/envs/textgen/lib/python3.10/site-packages/transformers/trainer.py", line 2669, in compute_loss
outputs = model(**inputs)
File "/opt/miniconda3/envs/textgen/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/miniconda3/envs/textgen/lib/python3.10/site-packages/torch/nn/parallel/data_parallel.py", line 171, in forward
outputs = self.parallel_apply(replicas, inputs, kwargs)
File "/opt/miniconda3/envs/textgen/lib/python3.10/site-packages/torch/nn/parallel/data_parallel.py", line 181, in parallel_apply
return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
File "/opt/miniconda3/envs/textgen/lib/python3.10/site-packages/torch/nn/parallel/parallel_apply.py", line 89, in parallel_apply
output.reraise()
File "/opt/miniconda3/envs/textgen/lib/python3.10/site-packages/torch/_utils.py", line 543, in reraise
raise exception
RuntimeError: Caught RuntimeError in replica 0 on device 0.
Original Traceback (most recent call last):
File "/opt/miniconda3/envs/textgen/lib/python3.10/site-packages/torch/nn/parallel/parallel_apply.py", line 64, in _worker
output = module(*input, **kwargs)
File "/opt/miniconda3/envs/textgen/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/miniconda3/envs/textgen/lib/python3.10/site-packages/peft/peft_model.py", line 529, in forward
return self.base_model(
File "/opt/miniconda3/envs/textgen/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/miniconda3/envs/textgen/lib/python3.10/site-packages/accelerate/hooks.py", line 158, in new_forward
output = old_forward(*args, **kwargs)
File "/opt/miniconda3/envs/textgen/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 852, in forward
outputs = self.model.decoder(
File "/opt/miniconda3/envs/textgen/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/miniconda3/envs/textgen/lib/python3.10/site-packages/accelerate/hooks.py", line 158, in new_forward
output = old_forward(*args, **kwargs)
File "/opt/miniconda3/envs/textgen/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 616, in forward
layer_outputs = torch.utils.checkpoint.checkpoint(
File "/opt/miniconda3/envs/textgen/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 249, in checkpoint
return CheckpointFunction.apply(function, preserve, *args)
File "/opt/miniconda3/envs/textgen/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 107, in forward
outputs = run_function(*args)
File "/opt/miniconda3/envs/textgen/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 612, in custom_forward
return module(*inputs, output_attentions, None)
File "/opt/miniconda3/envs/textgen/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/miniconda3/envs/textgen/lib/python3.10/site-packages/accelerate/hooks.py", line 158, in new_forward
output = old_forward(*args, **kwargs)
File "/opt/miniconda3/envs/textgen/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 305, in forward
hidden_states, self_attn_weights, present_key_value = self.self_attn(
File "/opt/miniconda3/envs/textgen/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/miniconda3/envs/textgen/lib/python3.10/site-packages/accelerate/hooks.py", line 158, in new_forward
output = old_forward(*args, **kwargs)
File "/opt/miniconda3/envs/textgen/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 167, in forward
value_states = self.v_proj(hidden_states).view(bsz, tgt_len, self.num_heads, self.head_dim)
File "/opt/miniconda3/envs/textgen/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/miniconda3/envs/textgen/lib/python3.10/site-packages/peft/tuners/lora.py", line 522, in forward
result = super().forward(x)
File "/opt/miniconda3/envs/textgen/lib/python3.10/site-packages/bitsandbytes/nn/modules.py", line 242, in forward
out = bnb.matmul(x, self.weight, bias=self.bias, state=self.state)
File "/opt/miniconda3/envs/textgen/lib/python3.10/site-packages/bitsandbytes/autograd/_functions.py", line 488, in matmul
return MatMul8bitLt.apply(A, B, out, bias, state)
File "/opt/miniconda3/envs/textgen/lib/python3.10/site-packages/bitsandbytes/autograd/_functions.py", line 397, in forward
output += torch.matmul(subA, state.subB)
RuntimeError: mat1 and mat2 shapes cannot be multiplied (1024x7 and 8x4096)
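The DataParallel frames in this traceback (parallel_apply over replicas) suggest the run is still using multiple GPUs. One possible reason, as an assumption: if CUDA_VISIBLE_DEVICES is set inside finetune.py only after torch has already initialized CUDA, it has no effect. A sketch of setting it in the shell instead, before Python starts:

# set the variable in the environment before launching, rather than inside finetune.py
export CUDA_VISIBLE_DEVICES=0
python finetune.py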