runtime error: mat1 and mat2 shapes cannot be multiplied #8

Open
ijustloveses opened this issue Mar 15, 2023 · 17 comments
Comments

@ijustloveses

File "/opt/miniconda3/envs/textgen/lib/python3.10/site-packages/transformers/trainer.py", line 1628, in train
return inner_training_loop(
File "/opt/miniconda3/envs/textgen/lib/python3.10/site-packages/transformers/trainer.py", line 1895, in _inner_training_loop
tr_loss_step = self.training_step(model, inputs)
File "/opt/miniconda3/envs/textgen/lib/python3.10/site-packages/transformers/trainer.py", line 2637, in training_step
loss = self.compute_loss(model, inputs)
File "/opt/miniconda3/envs/textgen/lib/python3.10/site-packages/transformers/trainer.py", line 2669, in compute_loss
outputs = model(**inputs)
File "/opt/miniconda3/envs/textgen/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/miniconda3/envs/textgen/lib/python3.10/site-packages/torch/nn/parallel/data_parallel.py", line 171, in forward
outputs = self.parallel_apply(replicas, inputs, kwargs)
File "/opt/miniconda3/envs/textgen/lib/python3.10/site-packages/torch/nn/parallel/data_parallel.py", line 181, in parallel_apply
return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
File "/opt/miniconda3/envs/textgen/lib/python3.10/site-packages/torch/nn/parallel/parallel_apply.py", line 89, in parallel_apply
output.reraise()
File "/opt/miniconda3/envs/textgen/lib/python3.10/site-packages/torch/_utils.py", line 543, in reraise
raise exception
RuntimeError: Caught RuntimeError in replica 0 on device 0.
Original Traceback (most recent call last):
File "/opt/miniconda3/envs/textgen/lib/python3.10/site-packages/torch/nn/parallel/parallel_apply.py", line 64, in _worker
output = module(*input, **kwargs)
File "/opt/miniconda3/envs/textgen/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/miniconda3/envs/textgen/lib/python3.10/site-packages/peft/peft_model.py", line 529, in forward
return self.base_model(
File "/opt/miniconda3/envs/textgen/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/miniconda3/envs/textgen/lib/python3.10/site-packages/accelerate/hooks.py", line 158, in new_forward
output = old_forward(*args, **kwargs)
File "/opt/miniconda3/envs/textgen/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 852, in forward
outputs = self.model.decoder(
File "/opt/miniconda3/envs/textgen/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/miniconda3/envs/textgen/lib/python3.10/site-packages/accelerate/hooks.py", line 158, in new_forward
output = old_forward(*args, **kwargs)
File "/opt/miniconda3/envs/textgen/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 616, in forward
layer_outputs = torch.utils.checkpoint.checkpoint(
File "/opt/miniconda3/envs/textgen/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 249, in checkpoint
return CheckpointFunction.apply(function, preserve, *args)
File "/opt/miniconda3/envs/textgen/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 107, in forward
outputs = run_function(*args)
File "/opt/miniconda3/envs/textgen/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 612, in custom_forward
return module(*inputs, output_attentions, None)
File "/opt/miniconda3/envs/textgen/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/miniconda3/envs/textgen/lib/python3.10/site-packages/accelerate/hooks.py", line 158, in new_forward
output = old_forward(*args, **kwargs)
File "/opt/miniconda3/envs/textgen/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 305, in forward
hidden_states, self_attn_weights, present_key_value = self.self_attn(
File "/opt/miniconda3/envs/textgen/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/miniconda3/envs/textgen/lib/python3.10/site-packages/accelerate/hooks.py", line 158, in new_forward
output = old_forward(*args, **kwargs)
File "/opt/miniconda3/envs/textgen/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 167, in forward
value_states = self.v_proj(hidden_states).view(bsz, tgt_len, self.num_heads, self.head_dim)
File "/opt/miniconda3/envs/textgen/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/miniconda3/envs/textgen/lib/python3.10/site-packages/peft/tuners/lora.py", line 522, in forward
result = super().forward(x)
File "/opt/miniconda3/envs/textgen/lib/python3.10/site-packages/bitsandbytes/nn/modules.py", line 242, in forward
out = bnb.matmul(x, self.weight, bias=self.bias, state=self.state)
File "/opt/miniconda3/envs/textgen/lib/python3.10/site-packages/bitsandbytes/autograd/_functions.py", line 488, in matmul
return MatMul8bitLt.apply(A, B, out, bias, state)
File "/opt/miniconda3/envs/textgen/lib/python3.10/site-packages/bitsandbytes/autograd/_functions.py", line 397, in forward
output += torch.matmul(subA, state.subB)
RuntimeError: mat1 and mat2 shapes cannot be multiplied (1024x7 and 8x4096)

@ijustloveses
Author

A6000 here; the same error happens whether I train on one GPU or two.
Is it caused by bitsandbytes?

@kooshi
Contributor

kooshi commented Mar 15, 2023

Yeah, seems to be: bitsandbytes-foundation/bitsandbytes#162

@gururise
Contributor

gururise commented Mar 17, 2023

Strange: on my friend's machine with 2x 3090 (Ubuntu 20.04, CUDA 11.2) he gets this error even though he's training on a single 3090. However, on another machine I'm using in the cloud with similar specs but only 1x 3090 (Ubuntu 20.04, CUDA 12.1), I do not get this error.

Could it have something to do with having 2x 3090s installed, even though only one is used for training? Or maybe it's the CUDA version? The working machine with a single 3090 is running CUDA 12.

@saimarpaka

Ran into the same issue with 4x V100. Any tips appreciated.

@gururise
Contributor

Force training only on a single GPU fixed it for us:
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
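
For reference, a minimal sketch of where that line has to go, assuming the stock finetune.py entry point: CUDA_VISIBLE_DEVICES is only honored if it is set before torch initializes CUDA, so it belongs at the very top of the script, or equivalently on the command line.

# Sketch: restrict training to GPU 0. The environment variable must be set
# before torch initializes CUDA, so this goes at the very top of finetune.py.
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

import torch  # imported only after the variable is set
print(torch.cuda.device_count())  # should now report 1

# Equivalent, without editing the script:
#   CUDA_VISIBLE_DEVICES=0 python finetune.py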

@AngainorDev
Contributor

Had the same error when training on 2 GPUs using just python finetune.py.

Got it running on both GPUs using torchrun:

WORLD_SIZE=2 CUDA_VISIBLE_DEVICES=0,1 torchrun --nproc_per_node=2 --master_port=1234 finetune.py

Make sure your settings are consistent (gradient accumulation, micro batch size). For instance, with 2x 3090 I can run batch size 128 with micro batch size 64, which makes gradient accumulation 1; the arithmetic is sketched below. Using values that don't divide evenly here can lead to longer training or a smaller effective batch size than planned.
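
A sketch of that arithmetic, assuming the usual convention that the effective batch is micro_batch_size × gradient_accumulation_steps × world_size (the variable names here are illustrative, not necessarily the ones used in finetune.py):

# Illustrative names; assumed convention:
#   effective batch = micro_batch_size * gradient_accumulation_steps * world_size
batch_size = 128        # target effective batch size
micro_batch_size = 64   # per-GPU batch that fits in memory
world_size = 2          # GPUs launched by torchrun

gradient_accumulation_steps = batch_size // (micro_batch_size * world_size)
print(gradient_accumulation_steps)  # -> 1 for the 2x 3090 example above

# If batch_size is not divisible by micro_batch_size * world_size, the integer
# division rounds down and the effective batch ends up smaller than planned.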

@vsevolodl

I had the same issue with a 2x A6000 setup. Forcing training on a single GPU fixed it.

@Ludobico

(quoting @AngainorDev above)

WORLD_SIZE=2 CUDA_VISIBLE_DEVICES=0,1 torchrun --nproc_per_node=2 --master_port=1234 finetune.py

Running the torchrun command above fixed the issue for me. Thanks.

@kooshi
Contributor

kooshi commented Mar 23, 2023

For everyone dealing with this: it happens because bitsandbytes doesn't play nicely with Trainer when Trainer tries to use DataParallel.

We're not actually missing out, since DataParallel is quite slow. As referenced above, finetune.py supports DistributedDataParallel via torchrun, which is much faster.

I also just submitted a PR to run this with Model Parallelism, so you can use multiple GPUs to run larger models.

#131
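
A rough sketch of why the torchrun path avoids the error (an illustration of the pattern, not the repo's exact code): each torchrun process reads LOCAL_RANK and loads the whole model onto its own GPU, so Trainer uses DistributedDataParallel rather than wrapping the model in nn.DataParallel, and every 8-bit linear layer only ever sees inputs on its own device.

# Rough sketch of the pattern, not the repo's exact code.
import os

local_rank = int(os.environ.get("LOCAL_RANK", 0))
world_size = int(os.environ.get("WORLD_SIZE", 1))
ddp = world_size > 1

# Under torchrun, each process pins the full model to its own GPU;
# a single-process run can let accelerate place it automatically.
device_map = {"": local_rank} if ddp else "auto"

# e.g. model = AutoModelForCausalLM.from_pretrained(
#          base_model, load_in_8bit=True, device_map=device_map)
# With one model replica per process, HF Trainer uses DistributedDataParallel
# instead of nn.DataParallel, and the 8-bit matmul never crosses devices.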

@Qubitium

@kooshi Would this PR allow pipeline parallelism for inference on llama as well? Would it be possible to have a parallel sample for generate.py?

@kooshi
Contributor

kooshi commented Mar 23, 2023

@kooshi Would this PR allow pipeline parallelism for inference on llama as well? Would it be possible to have a parallel sample for generate.py?

That's a good question. I haven't tried it, but I think it should at least run. If you have more than two GPUs, take the whole block from under if PIPE_CHUNKS > 0 and add it to the inference code.

I'm not sure if pipelining even helps with inference, though.

@ddingwang12

(quoting @AngainorDev above)

WORLD_SIZE=2 CUDA_VISIBLE_DEVICES=0,1 torchrun --nproc_per_node=2 --master_port=1234 finetune.py

This also fixed the issue for me, thanks.

@sfxworks

sfxworks commented Mar 29, 2023

What if I have five GPUs?

Changing WORLD_SIZE=5 and --nproc_per_node=5 and setting CUDA_VISIBLE_DEVICES=0,1,2,3,4 gets me a crash (SIGBUS, per the log below):

100%|██████████| 1/1 [00:00<00:00, 291.86it/s]
trainable params: 4194304 || all params: 6742609920 || trainable%: 0.06220594176090199
Map:   6%|▌         | 2994/49715 [00:09<02:25, 320.01 examples/s]Loading cached split indices for dataset at /root/.cache/huggingface/datasets/yahma___json/yahma--alpaca-cleaned-27eb3c5e2aefa645/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51/cache-30ac7dbe29aff00a.arrow and /root/.cache/huggingface/datasets/yahma___json/yahma--alpaca-cleaned-27eb3c5e2aefa645/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51/cache-5cbcf99c07ccbefe.arrow
Map:  39%|███▉      | 19328/49715 [01:13<02:01, 250.71 examples/s]Found cached dataset json (/root/.cache/huggingface/datasets/yahma___json/yahma--alpaca-cleaned-27eb3c5e2aefa645/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51)
100%|██████████| 1/1 [00:00<00:00, 218.68it/s]
trainable params: 4194304 || all params: 6742609920 || trainable%: 0.06220594176090199
Loading cached split indices for dataset at /root/.cache/huggingface/datasets/yahma___json/yahma--alpaca-cleaned-27eb3c5e2aefa645/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51/cache-30ac7dbe29aff00a.arrow and /root/.cache/huggingface/datasets/yahma___json/yahma--alpaca-cleaned-27eb3c5e2aefa645/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51/cache-5cbcf99c07ccbefe.arrow
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 8 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 9 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -7) local_rank: 0 (pid: 6) of binary: /usr/bin/python
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
==================================================
finetune.py FAILED
--------------------------------------------------
Failures:
[1]:
  time      : 2023-03-29_19:19:20
  host      : finetune-job-7mpzm
  rank      : 1 (local_rank: 1)
  exitcode  : -7 (pid: 7)
  error_file: <N/A>
  traceback : Signal 7 (SIGBUS) received by PID 7
[2]:
  time      : 2023-03-29_19:19:20
  host      : finetune-job-7mpzm
  rank      : 4 (local_rank: 4)
  exitcode  : -7 (pid: 10)
  error_file: <N/A>
  traceback : Signal 7 (SIGBUS) received by PID 10
--------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-03-29_19:19:20
  host      : finetune-job-7mpzm
  rank      : 0 (local_rank: 0)
  exitcode  : -7 (pid: 6)
  error_file: <N/A>
  traceback : Signal 7 (SIGBUS) received by PID 6

Edit: actually, I get this with just one GPU as well.

@yangYJT

yangYJT commented May 24, 2023

@sfxworks Yes, same here. Have you solved it?

@brando90

I get a similar issue with Falcon, but not in their official Colab:

===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please run

python -m bitsandbytes

 and submit this information together with your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
================================================================================
bin /lfs/hyperturing1/0/brando9/miniconda/envs/data_quality/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda113.so
/lfs/hyperturing1/0/brando9/miniconda/envs/data_quality/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: /lfs/hyperturing1/0/brando9/miniconda/envs/data_quality did not contain ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] as expected! Searching further paths...
  warn(msg)
/lfs/hyperturing1/0/brando9/miniconda/envs/data_quality/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('/usr/local/cuda-11.7/lib64')}
  warn(msg)
/lfs/hyperturing1/0/brando9/miniconda/envs/data_quality/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: /usr/local/cuda-11.7/lib64: did not contain ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] as expected! Searching further paths...
  warn(msg)
/lfs/hyperturing1/0/brando9/miniconda/envs/data_quality/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('Updated by package ocaml')}
  warn(msg)
/lfs/hyperturing1/0/brando9/miniconda/envs/data_quality/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('FILE')}
  warn(msg)
CUDA_SETUP: WARNING! libcudart.so not found in any environmental path. Searching in backup paths...
/lfs/hyperturing1/0/brando9/miniconda/envs/data_quality/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: Found duplicate ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] files: {PosixPath('/usr/local/cuda/lib64/libcudart.so'), PosixPath('/usr/local/cuda/lib64/libcudart.so.11.0')}.. We'll flip a coin and try one of these, in order to fail forward.
Either way, this might cause trouble in the future:
If you get `CUDA error: invalid device function` errors, the above might be the cause and the solution is to make sure only one ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] in the paths that we search based on your env.
  warn(msg)
CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 7.5
CUDA SETUP: Detected CUDA version 113
CUDA SETUP: Loading binary /lfs/hyperturing1/0/brando9/miniconda/envs/data_quality/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda113.so...
mode='disabled'
run=
report_to='none'
{'report_to': 'none', 'path2config': '/lfs/hyperturing1/0/brando9/ultimate-utils/ultimate-utils-proj-src/uutils/wandb_uu/sweep_configs/debug_config.yaml', 'program': '~/ultimate-utils/ultimate-utils-proj-src/uutils/wandb_uu/sweeps_common.py', 'project': 'playground', 'entity': 'brando', 'name': 'debug-logging-to-wandb-plataform-test', 'description': 'debug-not-logging-to-wandb-plataform-test', 'metric': {'name': 'train_loss', 'goal': 'minimize'}, 'method': 'random', 'optimizer': 'nadam', 'scheduler': 'cosine', 'lr': 0.0001, 'batch_size': 32, 'num_its': 2, 'run_cap': 1}
Found cached dataset json (/lfs/hyperturing1/0/brando9/.cache/huggingface/datasets/timdettmers___json/timdettmers--openassistant-guanaco-6126c710748182cf/0.0.0/8bb11242116d547c741b2e8a1f18598ffdd40a1d4f2a2872c7a28b697434bc96)
Found cached dataset json (/lfs/hyperturing1/0/brando9/.cache/huggingface/datasets/timdettmers___json/timdettmers--openassistant-guanaco-6126c710748182cf/0.0.0/8bb11242116d547c741b2e8a1f18598ffdd40a1d4f2a2872c7a28b697434bc96)
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:12<00:00,  1.61s/it]
Loading cached processed dataset at /lfs/hyperturing1/0/brando9/.cache/huggingface/datasets/timdettmers___json/timdettmers--openassistant-guanaco-6126c710748182cf/0.0.0/8bb11242116d547c741b2e8a1f18598ffdd40a1d4f2a2872c7a28b697434bc96/cache-889fee109929377a.arrow
  0%|                                                                                                                           | 0/500 [00:00<?, ?it/s]You're using a PreTrainedTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
Traceback (most recent call last):
  File "/lfs/hyperturing1/0/brando9/miniconda/envs/data_quality/lib/python3.10/pdb.py", line 1723, in main
    pdb._runscript(mainpyfile)
  File "/lfs/hyperturing1/0/brando9/miniconda/envs/data_quality/lib/python3.10/pdb.py", line 1583, in _runscript
    self.run(statement)
  File "/lfs/hyperturing1/0/brando9/miniconda/envs/data_quality/lib/python3.10/bdb.py", line 598, in run
    exec(cmd, globals, locals)
  File "<string>", line 1, in <module>
  File "/afs/cs.stanford.edu/u/brando9/ultimate-utils/ultimate-utils-proj-src/uutils/hf_uu/mains_hf/falcon_uu/main_falcon_uu.py", line 34, in <module>
    main_falcon()
  File "/afs/cs.stanford.edu/u/brando9/ultimate-utils/ultimate-utils-proj-src/uutils/hf_uu/mains_hf/falcon_uu/main_falcon_uu.py", line 21, in main_falcon
    train(args)
  File "/afs/cs.stanford.edu/u/brando9/ultimate-utils/ultimate-utils-proj-src/uutils/hf_uu/train/sft/qlora_ft.py", line 58, in train_falcon
    trainer.train()
  File "/lfs/hyperturing1/0/brando9/miniconda/envs/data_quality/lib/python3.10/site-packages/transformers/trainer.py", line 1645, in train
    return inner_training_loop(
  File "/lfs/hyperturing1/0/brando9/miniconda/envs/data_quality/lib/python3.10/site-packages/transformers/trainer.py", line 1938, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/lfs/hyperturing1/0/brando9/miniconda/envs/data_quality/lib/python3.10/site-packages/transformers/trainer.py", line 2759, in training_step
    loss = self.compute_loss(model, inputs)
  File "/lfs/hyperturing1/0/brando9/miniconda/envs/data_quality/lib/python3.10/site-packages/transformers/trainer.py", line 2784, in compute_loss
    outputs = model(**inputs)
  File "/lfs/hyperturing1/0/brando9/miniconda/envs/data_quality/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/lfs/hyperturing1/0/brando9/miniconda/envs/data_quality/lib/python3.10/site-packages/accelerate/utils/operations.py", line 553, in forward
    return model_forward(*args, **kwargs)
  File "/lfs/hyperturing1/0/brando9/miniconda/envs/data_quality/lib/python3.10/site-packages/accelerate/utils/operations.py", line 541, in __call__
    return convert_to_fp32(self.model_forward(*args, **kwargs))
  File "/lfs/hyperturing1/0/brando9/miniconda/envs/data_quality/lib/python3.10/site-packages/torch/amp/autocast_mode.py", line 14, in decorate_autocast
    return func(*args, **kwargs)
  File "/lfs/hyperturing1/0/brando9/miniconda/envs/data_quality/lib/python3.10/site-packages/peft/peft_model.py", line 678, in forward
    return self.base_model(
  File "/lfs/hyperturing1/0/brando9/miniconda/envs/data_quality/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/lfs/hyperturing1/0/brando9/miniconda/envs/data_quality/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/lfs/hyperturing1/0/brando9/.cache/huggingface/modules/transformers_modules/tiiuae/falcon-7b/2f5c3cd4eace6be6c0f12981f377fb35e5bf6ee5/modelling_RW.py", line 753, in forward
    transformer_outputs = self.transformer(
  File "/lfs/hyperturing1/0/brando9/miniconda/envs/data_quality/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/lfs/hyperturing1/0/brando9/miniconda/envs/data_quality/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/lfs/hyperturing1/0/brando9/.cache/huggingface/modules/transformers_modules/tiiuae/falcon-7b/2f5c3cd4eace6be6c0f12981f377fb35e5bf6ee5/modelling_RW.py", line 648, in forward
    outputs = block(
  File "/lfs/hyperturing1/0/brando9/miniconda/envs/data_quality/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/lfs/hyperturing1/0/brando9/miniconda/envs/data_quality/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/lfs/hyperturing1/0/brando9/.cache/huggingface/modules/transformers_modules/tiiuae/falcon-7b/2f5c3cd4eace6be6c0f12981f377fb35e5bf6ee5/modelling_RW.py", line 385, in forward
    attn_outputs = self.self_attention(
  File "/lfs/hyperturing1/0/brando9/miniconda/envs/data_quality/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/lfs/hyperturing1/0/brando9/miniconda/envs/data_quality/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/lfs/hyperturing1/0/brando9/.cache/huggingface/modules/transformers_modules/tiiuae/falcon-7b/2f5c3cd4eace6be6c0f12981f377fb35e5bf6ee5/modelling_RW.py", line 242, in forward
    fused_qkv = self.query_key_value(hidden_states)  # [batch_size, seq_length, 3 x hidden_size]
  File "/lfs/hyperturing1/0/brando9/miniconda/envs/data_quality/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/lfs/hyperturing1/0/brando9/miniconda/envs/data_quality/lib/python3.10/site-packages/peft/tuners/lora.py", line 565, in forward
    result = F.linear(x, transpose(self.weight, self.fan_in_fan_out), bias=self.bias)
RuntimeError: mat1 and mat2 shapes cannot be multiplied (2048x4544 and 1x10614784)
Uncaught exception. Entering post mortem debugging
Running 'cont' or 'step' will restart the program
> /lfs/hyperturing1/0/brando9/miniconda/envs/data_quality/lib/python3.10/site-packages/peft/tuners/lora.py(565)forward()
-> result = F.linear(x, transpose(self.weight, self.fan_in_fan_out), bias=self.bias)

@NewEricWang

I had the same issue with a 2x A6000 setup. Forcing training on a single GPU fixed it.

How do you force training on a single GPU when running python finetune.py?

@NewEricWang

os.environ["CUDA_VISIBLE_DEVICES"] = "0"

In my case, it doesn't work. I added the above line to finetune.py, but when running python finetune.py all GPUs are still used.
