CUDA out of memory #350

Open
yangxuan14nlp opened this issue Apr 17, 2023 · 16 comments

@yangxuan14nlp

I'm using a Tesla T4 16 GB GPU and want to fine-tune the 7B model. Every time it reaches iteration 200 of the first epoch, it throws a GPU out-of-memory error; it looks like there isn't enough memory when the model is saved during validation. But according to https://zhuanlan.zhihu.com/p/616504594, fine-tuning on a 12 GB RTX 4070 works, so what could be the cause? What I've tried:
1. --micro_batch_size 1: didn't help

Training Alpaca-LoRA model with params:
base_model: decapoda-research/llama-7b-hf
data_path: ./trans_chinese_alpaca_data.json
output_dir: ./lora-alpaca-zh
batch_size: 128
micro_batch_size: 2
num_epochs: 2
learning_rate: 0.0003
cutoff_len: 256
val_set_size: 2000
lora_r: 8
lora_alpha: 16
lora_dropout: 0.05
lora_target_modules: ['q_proj', 'v_proj']
train_on_inputs: True
group_by_length: False
wandb_project:
wandb_run_name:
wandb_watch:
wandb_log_model:
resume_from_checkpoint: False
prompt template: alpaca

Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████| 33/33 [00:18<00:00, 1.79it/s]
The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization.
The tokenizer class you load from this checkpoint is 'LLaMATokenizer'.
The class this function is called from is 'LlamaTokenizer'.
Found cached dataset json (/root/.cache/huggingface/datasets/json/default-d1370d3ed27da33a/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51)
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 428.82it/s]
trainable params: 4194304 || all params: 6742609920 || trainable%: 0.06220594176090199
Loading cached split indices for dataset at /root/.cache/huggingface/datasets/json/default-d1370d3ed27da33a/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51/cache-baf974d16126c7f1.arrow and /root/.cache/huggingface/datasets/json/default-d1370d3ed27da33a/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51/cache-6013f18c705337f9.arrow
{'loss': 2.2953, 'learning_rate': 2.9999999999999997e-05, 'epoch': 0.03}
{'loss': 2.208, 'learning_rate': 5.9999999999999995e-05, 'epoch': 0.05}
{'loss': 2.0048, 'learning_rate': 8.999999999999999e-05, 'epoch': 0.08}
{'loss': 1.6192, 'learning_rate': 0.00011999999999999999, 'epoch': 0.1}
{'loss': 1.381, 'learning_rate': 0.00015, 'epoch': 0.13}
{'loss': 1.2977, 'learning_rate': 0.00017999999999999998, 'epoch': 0.15}
{'loss': 1.2597, 'learning_rate': 0.00020999999999999998, 'epoch': 0.18}
{'loss': 1.2318, 'learning_rate': 0.00023999999999999998, 'epoch': 0.21}
{'loss': 1.2307, 'learning_rate': 0.00027, 'epoch': 0.23}
{'loss': 1.2053, 'learning_rate': 0.0003, 'epoch': 0.26}
{'loss': 1.1919, 'learning_rate': 0.0002955621301775148, 'epoch': 0.28}
{'loss': 1.1657, 'learning_rate': 0.00029112426035502955, 'epoch': 0.31}
{'loss': 1.1413, 'learning_rate': 0.00028668639053254437, 'epoch': 0.33}
{'loss': 1.1372, 'learning_rate': 0.00028224852071005914, 'epoch': 0.36}
{'loss': 1.1229, 'learning_rate': 0.00027781065088757395, 'epoch': 0.39}
{'loss': 1.1173, 'learning_rate': 0.0002733727810650887, 'epoch': 0.41}
{'loss': 1.1279, 'learning_rate': 0.00026893491124260353, 'epoch': 0.44}
{'loss': 1.1182, 'learning_rate': 0.0002644970414201183, 'epoch': 0.46}
{'loss': 1.112, 'learning_rate': 0.0002600591715976331, 'epoch': 0.49}
{'loss': 1.0954, 'learning_rate': 0.00025562130177514793, 'epoch': 0.52}
{'eval_loss': 1.1259599924087524, 'eval_runtime': 328.7811, 'eval_samples_per_second': 6.083, 'eval_steps_per_second': 0.76, 'epoch': 0.52}
26%|███████████████████████████████▏ | 200/776 [6:33:46<18:07:50, 113.32s/it]
Traceback (most recent call last):
File "/new_data/yangxuan/alpaca-lora/finetune.py", line 276, in
fire.Fire(train)
File "/root/miniconda3/envs/python39/lib/python3.9/site-packages/fire/core.py", line 141, in Fire
component_trace = _Fire(component, args, parsed_flag_args, context, name)
File "/root/miniconda3/envs/python39/lib/python3.9/site-packages/fire/core.py", line 475, in _Fire
component, remaining_args = _CallAndUpdateTrace(
File "/root/miniconda3/envs/python39/lib/python3.9/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
component = fn(*varargs, **kwargs)
File "/new_data/yangxuan/alpaca-lora/finetune.py", line 266, in train
trainer.train(resume_from_checkpoint=resume_from_checkpoint)
File "/root/miniconda3/envs/python39/lib/python3.9/site-packages/transformers/trainer.py", line 1662, in train
return inner_training_loop(
File "/root/miniconda3/envs/python39/lib/python3.9/site-packages/transformers/trainer.py", line 2006, in _inner_training_loop
self._maybe_log_save_evaluate(tr_loss, model, trial, epoch, ignore_keys_for_eval)
File "/root/miniconda3/envs/python39/lib/python3.9/site-packages/transformers/trainer.py", line 2291, in _maybe_log_save_evaluate
self._save_checkpoint(model, trial, metrics=metrics)
File "/root/miniconda3/envs/python39/lib/python3.9/site-packages/transformers/trainer.py", line 2348, in _save_checkpoint
self.save_model(output_dir, _internal_call=True)
File "/root/miniconda3/envs/python39/lib/python3.9/site-packages/transformers/trainer.py", line 2830, in save_model
self._save(output_dir)
File "/root/miniconda3/envs/python39/lib/python3.9/site-packages/transformers/trainer.py", line 2873, in _save
state_dict = self.model.state_dict()
File "/new_data/yangxuan/alpaca-lora/finetune.py", line 259, in
self, old_state_dict()
File "/root/miniconda3/envs/python39/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1818, in state_dict
module.state_dict(destination=destination, prefix=prefix + name + '.', keep_vars=keep_vars)
File "/root/miniconda3/envs/python39/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1818, in state_dict
module.state_dict(destination=destination, prefix=prefix + name + '.', keep_vars=keep_vars)
File "/root/miniconda3/envs/python39/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1818, in state_dict
module.state_dict(destination=destination, prefix=prefix + name + '.', keep_vars=keep_vars)
[Previous line repeated 4 more times]
File "/root/miniconda3/envs/python39/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1815, in state_dict
self._save_to_state_dict(destination, prefix, keep_vars)
File "/root/miniconda3/envs/python39/lib/python3.9/site-packages/bitsandbytes/nn/modules.py", line 268, in _save_to_state_dict
self.weight.data = undo_layout(self.state.CxB, self.state.tile_indices)
File "/root/miniconda3/envs/python39/lib/python3.9/site-packages/bitsandbytes/autograd/_functions.py", line 96, in undo_layout
outputs = torch.empty_like(tensor) # note: not using .index_copy because it was slower on cuda
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 16.00 MiB (GPU 0; 14.58 GiB total capacity; 13.37 GiB already allocated; 14.56 MiB free; 13.86 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Thanks in advance!
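
A note not in the original report: the error message itself suggests setting max_split_size_mb to reduce allocator fragmentation. A minimal sketch of wiring that in, assuming the variable is set before torch initializes CUDA (the value 128 is only an illustrative choice):

# Sketch only: apply the allocator option mentioned in the OOM message
# before torch touches the GPU (e.g. at the very top of finetune.py).
import os
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "max_split_size_mb:128")

import torch  # imported after the env var so the CUDA caching allocator picks it up

This only targets fragmentation; it does not remove the extra allocation made while saving the model, which is discussed further down the thread.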

@lksysML

lksysML commented Apr 17, 2023

Same error: #344

It errors out at 200 iterations.

@tloen

@KukumavMozolo

KukumavMozolo commented Apr 17, 2023

This seems to be related to saving the model. My memory usage is around 16 GB, but when the trainer tries to save the model, or when model.save_pretrained is called, the OOM occurs. So for some reason this line
self.weight.data = undo_layout(self.state.CxB, self.state.tile_indices)
tries to allocate more than an additional 8 GB of memory.
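
A possible workaround sketch, not verified in this thread (the helper name save_lora_only is made up): only the LoRA adapter weights need to be checkpointed, and they can be collected via named_parameters(), which never calls state_dict() on the bitsandbytes Linear8bitLt modules and therefore skips the undo_layout() allocation entirely.

# Hypothetical workaround: save just the LoRA weights without
# materializing the full 8-bit state_dict.
import torch

def save_lora_only(model, path="adapter_model.bin"):
    # named_parameters() bypasses Module.state_dict(), so the
    # Linear8bitLt._save_to_state_dict / undo_layout path is never hit.
    lora_state = {
        name: param.detach().cpu()
        for name, param in model.named_parameters()
        if "lora_" in name
    }
    torch.save(lora_state, path)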

@lksysML

lksysML commented Apr 17, 2023

I was able to fix this issue by rolling back accelerate, peft, bitsandbytes and transformers to commits dated around April 5-6, when my previous finetunes were successful. Didn't change any parameters and everything worked.

It's definitely an issue with one of these dependencies; I need to pinpoint which one. The issue is not in PyTorch.

@KukumavMozolo

KukumavMozolo commented Apr 17, 2023

I checked, and bitsandbytes got bumped to 0.38.0 a few days ago. Using bitsandbytes==0.37.2 fixes it for me.
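
For anyone applying the pin, a quick sanity-check sketch (nothing project-specific assumed) to confirm the running environment actually picked up the downgrade:

# Confirm the interpreter sees the pinned version.
from importlib.metadata import version
print(version("bitsandbytes"))  # expect 0.37.2 after downgrading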

@lksysML

lksysML commented Apr 17, 2023

I checked, and bitsandbytes got bumped to 0.38.0 a few days ago. Using bitsandbytes==0.37.2 fixes it for me.

Super!

@yangxuan14nlp
Author

yangxuan14nlp commented Apr 18, 2023 via email

@Stark-zheng

Why do I get CUDA out of memory running llama-7b on a 3090 with 24 GB?? I also tried two 3090s and got the same error.
model = LlamaForCausalLM.from_pretrained(
It fails right at this model-loading step with: RuntimeError: CUDA error: out of memory
These are my parameter settings:
Training Alpaca-LoRA model with params:
base_model: ../LLaMA-7B
data_path: ./instruction_data.json
output_dir: ./lora-alpaca
batch_size: 24
micro_batch_size: 1
num_epochs: 3
learning_rate: 0.0003
cutoff_len: 400
val_set_size: 2000
lora_r: 8
lora_alpha: 16
lora_dropout: 0.05
lora_target_modules: ['q_proj', 'v_proj']
train_on_inputs: True
add_eos_token: False
group_by_length: False
wandb_project:
wandb_run_name:
wandb_watch:
wandb_log_model:
resume_from_checkpoint: False
prompt template: alpaca_short
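
For reference, and not an answer from this thread: finetune.py loads the base model in 8-bit, roughly as in the sketch below, so a 7B model should need far less than 24 GB at load time. If from_pretrained itself raises 'CUDA error: out of memory', it is worth checking that load_in_8bit is actually taking effect (i.e. bitsandbytes imports cleanly) and that nothing else is holding the GPU. The kwargs shown are illustrative, not a guaranteed match for every version of the script.

# Illustrative 8-bit load (approximately what finetune.py does).
# An int8-quantized 7B model should occupy roughly 7-8 GB of VRAM.
import torch
from transformers import LlamaForCausalLM

model = LlamaForCausalLM.from_pretrained(
    "../LLaMA-7B",             # base_model path from the comment above
    load_in_8bit=True,         # requires a working bitsandbytes install
    torch_dtype=torch.float16,
    device_map="auto",
)
print(torch.cuda.memory_allocated() / 2**30, "GiB allocated after load")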

@luxuriance19

I tried peft==0.2.0 and bitsandbytes==0.37.2, but it still runs out of memory at the second validation. 7B model on 24 GB VRAM.

@zh25714

zh25714 commented Apr 29, 2023

(quoting @Stark-zheng's comment above about the 3090 24 GB OOM at LlamaForCausalLM.from_pretrained)

I have the same problem.

@teknium1

Happening for me right now on the latest transformers and bnb 0.37.2.

PeiqinSun pushed a commit to megvii-research/Sparsebit that referenced this issue May 1, 2023

* Update README.md: Add Huggingface repo for 7B and 13B quantization

* Update requirements.txt to pin PEFT and BNB version

Reason -

For BNB: tloen/alpaca-lora#350

For PEFT: huggingface/peft@c21afbe#diff-b3b90f453dea37bf90203fd395e9dedc21b21c9a38464c6b1572368c049ef8b2L116-L128
@freelerobot

Same issue. Tried reverting versions to no avail. Currently on 64 GB VRAM.

@luxuriance19

Has anybody solved this problem?

@teknium1

teknium1 commented May 6, 2023

Can anyone try peft 0.2.0, like @cnbeining's change in his repo referencing this issue?

@jasonvanf

jasonvanf commented May 6, 2023

Using bitsandbytes==0.37.2:

If you get 'undefined symbol: cget_col_row_stats' when doing this step, try the following:

cp libbitsandbytes_cuda117.so libbitsandbytes_cpu.so
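
The .so files live inside the installed bitsandbytes package directory; here is a small sketch to locate it without importing the (possibly broken) library, so the cp above can be run in the right place. The cuda117 library name matches the command above and assumes a CUDA 11.7 toolkit.

# Print the directory that contains the libbitsandbytes_*.so files,
# without importing bitsandbytes itself (its import may be what is failing).
import importlib.util
import os

spec = importlib.util.find_spec("bitsandbytes")
print(os.path.dirname(spec.origin))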

@afnanhabib787

I checked, and bitsandbytes got bumped to 0.38.0 a few days ago. Using bitsandbytes==0.37.2 fixes it for me.

Super!

Worked for me!
