CUDA out of memory #350

Open
yangxuan14nlp opened this issue Apr 17, 2023 · 16 comments

@yangxuan14nlp

I'm using a Tesla T4 16 GB GPU and want to fine-tune the 7B model. Every time it reaches iteration 200 of the first epoch, it throws a GPU out-of-memory error; it looks like there isn't enough memory when the model is saved during validation. But according to https://zhuanlan.zhihu.com/p/616504594, fine-tuning on a 12 GB RTX 4070 works, so what could be the cause? What I've tried:
1. --micro_batch_size 1: didn't help

Training Alpaca-LoRA model with params:
base_model: decapoda-research/llama-7b-hf
data_path: ./trans_chinese_alpaca_data.json
output_dir: ./lora-alpaca-zh
batch_size: 128
micro_batch_size: 2
num_epochs: 2
learning_rate: 0.0003
cutoff_len: 256
val_set_size: 2000
lora_r: 8
lora_alpha: 16
lora_dropout: 0.05
lora_target_modules: ['q_proj', 'v_proj']
train_on_inputs: True
group_by_length: False
wandb_project:
wandb_run_name:
wandb_watch:
wandb_log_model:
resume_from_checkpoint: False
prompt template: alpaca

Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████| 33/33 [00:18<00:00, 1.79it/s]
The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization.
The tokenizer class you load from this checkpoint is 'LLaMATokenizer'.
The class this function is called from is 'LlamaTokenizer'.
Found cached dataset json (/root/.cache/huggingface/datasets/json/default-d1370d3ed27da33a/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51)
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 428.82it/s]
trainable params: 4194304 || all params: 6742609920 || trainable%: 0.06220594176090199
Loading cached split indices for dataset at /root/.cache/huggingface/datasets/json/default-d1370d3ed27da33a/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51/cache-baf974d16126c7f1.arrow and /root/.cache/huggingface/datasets/json/default-d1370d3ed27da33a/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51/cache-6013f18c705337f9.arrow
{'loss': 2.2953, 'learning_rate': 2.9999999999999997e-05, 'epoch': 0.03}
{'loss': 2.208, 'learning_rate': 5.9999999999999995e-05, 'epoch': 0.05}
{'loss': 2.0048, 'learning_rate': 8.999999999999999e-05, 'epoch': 0.08}
{'loss': 1.6192, 'learning_rate': 0.00011999999999999999, 'epoch': 0.1}
{'loss': 1.381, 'learning_rate': 0.00015, 'epoch': 0.13}
{'loss': 1.2977, 'learning_rate': 0.00017999999999999998, 'epoch': 0.15}
{'loss': 1.2597, 'learning_rate': 0.00020999999999999998, 'epoch': 0.18}
{'loss': 1.2318, 'learning_rate': 0.00023999999999999998, 'epoch': 0.21}
{'loss': 1.2307, 'learning_rate': 0.00027, 'epoch': 0.23}
{'loss': 1.2053, 'learning_rate': 0.0003, 'epoch': 0.26}
{'loss': 1.1919, 'learning_rate': 0.0002955621301775148, 'epoch': 0.28}
{'loss': 1.1657, 'learning_rate': 0.00029112426035502955, 'epoch': 0.31}
{'loss': 1.1413, 'learning_rate': 0.00028668639053254437, 'epoch': 0.33}
{'loss': 1.1372, 'learning_rate': 0.00028224852071005914, 'epoch': 0.36}
{'loss': 1.1229, 'learning_rate': 0.00027781065088757395, 'epoch': 0.39}
{'loss': 1.1173, 'learning_rate': 0.0002733727810650887, 'epoch': 0.41}
{'loss': 1.1279, 'learning_rate': 0.00026893491124260353, 'epoch': 0.44}
{'loss': 1.1182, 'learning_rate': 0.0002644970414201183, 'epoch': 0.46}
{'loss': 1.112, 'learning_rate': 0.0002600591715976331, 'epoch': 0.49}
{'loss': 1.0954, 'learning_rate': 0.00025562130177514793, 'epoch': 0.52}
{'eval_loss': 1.1259599924087524, 'eval_runtime': 328.7811, 'eval_samples_per_second': 6.083, 'eval_steps_per_second': 0.76, 'epoch': 0.52}
26%|███████████████████████████████▏ | 200/776 [6:33:46<18:07:50, 113.32s/it]
Traceback (most recent call last):
File "/new_data/yangxuan/alpaca-lora/finetune.py", line 276, in
fire.Fire(train)
File "/root/miniconda3/envs/python39/lib/python3.9/site-packages/fire/core.py", line 141, in Fire
component_trace = _Fire(component, args, parsed_flag_args, context, name)
File "/root/miniconda3/envs/python39/lib/python3.9/site-packages/fire/core.py", line 475, in _Fire
component, remaining_args = _CallAndUpdateTrace(
File "/root/miniconda3/envs/python39/lib/python3.9/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
component = fn(*varargs, **kwargs)
File "/new_data/yangxuan/alpaca-lora/finetune.py", line 266, in train
trainer.train(resume_from_checkpoint=resume_from_checkpoint)
File "/root/miniconda3/envs/python39/lib/python3.9/site-packages/transformers/trainer.py", line 1662, in train
return inner_training_loop(
File "/root/miniconda3/envs/python39/lib/python3.9/site-packages/transformers/trainer.py", line 2006, in _inner_training_loop
self._maybe_log_save_evaluate(tr_loss, model, trial, epoch, ignore_keys_for_eval)
File "/root/miniconda3/envs/python39/lib/python3.9/site-packages/transformers/trainer.py", line 2291, in _maybe_log_save_evaluate
self._save_checkpoint(model, trial, metrics=metrics)
File "/root/miniconda3/envs/python39/lib/python3.9/site-packages/transformers/trainer.py", line 2348, in _save_checkpoint
self.save_model(output_dir, _internal_call=True)
File "/root/miniconda3/envs/python39/lib/python3.9/site-packages/transformers/trainer.py", line 2830, in save_model
self._save(output_dir)
File "/root/miniconda3/envs/python39/lib/python3.9/site-packages/transformers/trainer.py", line 2873, in _save
state_dict = self.model.state_dict()
File "/new_data/yangxuan/alpaca-lora/finetune.py", line 259, in
self, old_state_dict()
File "/root/miniconda3/envs/python39/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1818, in state_dict
module.state_dict(destination=destination, prefix=prefix + name + '.', keep_vars=keep_vars)
File "/root/miniconda3/envs/python39/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1818, in state_dict
module.state_dict(destination=destination, prefix=prefix + name + '.', keep_vars=keep_vars)
File "/root/miniconda3/envs/python39/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1818, in state_dict
module.state_dict(destination=destination, prefix=prefix + name + '.', keep_vars=keep_vars)
[Previous line repeated 4 more times]
File "/root/miniconda3/envs/python39/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1815, in state_dict
self._save_to_state_dict(destination, prefix, keep_vars)
File "/root/miniconda3/envs/python39/lib/python3.9/site-packages/bitsandbytes/nn/modules.py", line 268, in _save_to_state_dict
self.weight.data = undo_layout(self.state.CxB, self.state.tile_indices)
File "/root/miniconda3/envs/python39/lib/python3.9/site-packages/bitsandbytes/autograd/_functions.py", line 96, in undo_layout
outputs = torch.empty_like(tensor) # note: not using .index_copy because it was slower on cuda
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 16.00 MiB (GPU 0; 14.58 GiB total capacity; 13.37 GiB already allocated; 14.56 MiB free; 13.86 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Thanks in advance!
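
A note not in the original report: the error message itself suggests setting max_split_size_mb to reduce allocator fragmentation. A minimal sketch of wiring that in, assuming the variable is set before torch initializes CUDA (the value 128 is only an illustrative choice):

# Sketch only: apply the allocator option mentioned in the OOM message
# before torch touches the GPU (e.g. at the very top of finetune.py).
import os
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "max_split_size_mb:128")

import torch  # imported after the env var so the CUDA caching allocator picks it up

This only targets fragmentation; it does not remove the extra allocation made while saving the model, which is discussed further down the thread.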

@lksysML

lksysML commented Apr 17, 2023

Same error: #344

It errors out at 200 iterations.

@tloen

@KukumavMozolo

KukumavMozolo commented Apr 17, 2023

This seems to be related to saving the model. My memory usage is around 16 GB, but when the trainer tries to save the model, or when model.save_pretrained is called, the OOM occurs. So for some reason this line
self.weight.data = undo_layout(self.state.CxB, self.state.tile_indices)
tries to allocate more than an additional 8 GB of memory.
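
A possible workaround sketch, not verified in this thread (the helper name save_lora_only is made up): only the LoRA adapter weights need to be checkpointed, and they can be collected via named_parameters(), which never calls state_dict() on the bitsandbytes Linear8bitLt modules and therefore skips the undo_layout() allocation entirely.

# Hypothetical workaround: save just the LoRA weights without
# materializing the full 8-bit state_dict.
import torch

def save_lora_only(model, path="adapter_model.bin"):
    # named_parameters() bypasses Module.state_dict(), so the
    # Linear8bitLt._save_to_state_dict / undo_layout path is never hit.
    lora_state = {
        name: param.detach().cpu()
        for name, param in model.named_parameters()
        if "lora_" in name
    }
    torch.save(lora_state, path)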

@lksysML

lksysML commented Apr 17, 2023

I was able to fix this issue by rolling back accelerate, peft, bitsandbytes and transformers to commits dated around April 5-6, when my previous finetunes were successful. Didn't change any parameters and everything worked.

It's definitely an issue with one of these dependencies; I need to pinpoint which one. The issue is not in PyTorch.

@KukumavMozolo

KukumavMozolo commented Apr 17, 2023

I checked, and bitsandbytes got bumped to 0.38.0 a few days ago. Using bitsandbytes==0.37.2 fixes it for me.
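
For anyone applying the pin, a quick sanity-check sketch (nothing project-specific assumed) to confirm the running environment actually picked up the downgrade:

# Confirm the interpreter sees the pinned version.
from importlib.metadata import version
print(version("bitsandbytes"))  # expect 0.37.2 after downgrading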

@lksysML

lksysML commented Apr 17, 2023

I checked, and bitsandbytes got bumped to 0.38.0 a few days ago. Using bitsandbytes==0.37.2 fixes it for me.

Super!

@yangxuan14nlp
Author

yangxuan14nlp commented Apr 18, 2023 via email

@Stark-zheng

Why do I get CUDA out of memory running llama-7b on a 3090 with 24 GB?? I also tried two 3090s and got the same error.
model = LlamaForCausalLM.from_pretrained(
It fails right at this model-loading step with: RuntimeError: CUDA error: out of memory
These are my parameter settings:
Training Alpaca-LoRA model with params:
base_model: ../LLaMA-7B
data_path: ./instruction_data.json
output_dir: ./lora-alpaca
batch_size: 24
micro_batch_size: 1
num_epochs: 3
learning_rate: 0.0003
cutoff_len: 400
val_set_size: 2000
lora_r: 8
lora_alpha: 16
lora_dropout: 0.05
lora_target_modules: ['q_proj', 'v_proj']
train_on_inputs: True
add_eos_token: False
group_by_length: False
wandb_project:
wandb_run_name:
wandb_watch:
wandb_log_model:
resume_from_checkpoint: False
prompt template: alpaca_short
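
For reference, and not an answer from this thread: finetune.py loads the base model in 8-bit, roughly as in the sketch below, so a 7B model should need far less than 24 GB at load time. If from_pretrained itself raises 'CUDA error: out of memory', it is worth checking that load_in_8bit is actually taking effect (i.e. bitsandbytes imports cleanly) and that nothing else is holding the GPU. The kwargs shown are illustrative, not a guaranteed match for every version of the script.

# Illustrative 8-bit load (approximately what finetune.py does).
# An int8-quantized 7B model should occupy roughly 7-8 GB of VRAM.
import torch
from transformers import LlamaForCausalLM

model = LlamaForCausalLM.from_pretrained(
    "../LLaMA-7B",             # base_model path from the comment above
    load_in_8bit=True,         # requires a working bitsandbytes install
    torch_dtype=torch.float16,
    device_map="auto",
)
print(torch.cuda.memory_allocated() / 2**30, "GiB allocated after load")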

@luxuriance19

I tried peft==0.2.0 and bitsandbytes==0.37.2, but it still runs out of memory at the second validation. 7B model on 24 GB VRAM.

@zh25714

zh25714 commented Apr 29, 2023

(quoting @Stark-zheng's comment above about the 3090 24 GB OOM at LlamaForCausalLM.from_pretrained)

I have the same problem.

@teknium1

Happening for me right now on the latest transformers and bnb 0.37.2.

PeiqinSun pushed a commit to megvii-research/Sparsebit that referenced this issue May 1, 2023

* Update README.md: Add Huggingface repo for 7B and 13B quantization

* Update requirements.txt to pin PEFT and BNB version

Reason -

For BNB: tloen/alpaca-lora#350

For PEFT: huggingface/peft@c21afbe#diff-b3b90f453dea37bf90203fd395e9dedc21b21c9a38464c6b1572368c049ef8b2L116-L128
@freelerobot

Same issue. Tried reverting versions to no avail. Currently on 64 GB VRAM.

@luxuriance19

Has anybody solved this problem?

@teknium1

teknium1 commented May 6, 2023

Can anyone try peft 0.2.0, like @cnbeining's change in his repo referencing this issue?

@jasonvanf

jasonvanf commented May 6, 2023

Using bitsandbytes==0.37.2:

If you get 'undefined symbol: cget_col_row_stats' when doing this step, try the following:

cp libbitsandbytes_cuda117.so libbitsandbytes_cpu.so
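
The .so files live inside the installed bitsandbytes package directory; here is a small sketch to locate it without importing the (possibly broken) library, so the cp above can be run in the right place. The cuda117 library name matches the command above and assumes a CUDA 11.7 toolkit.

# Print the directory that contains the libbitsandbytes_*.so files,
# without importing bitsandbytes itself (its import may be what is failing).
import importlib.util
import os

spec = importlib.util.find_spec("bitsandbytes")
print(os.path.dirname(spec.origin))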

@afnanhabib787

I checked, and bitsandbytes got bumped to 0.38.0 a few days ago. Using bitsandbytes==0.37.2 fixes it for me.

Super!

Worked for me!
