CUDA out of memory #350
This seems to be related to saving the model. My memory usage is around 16 GB, but when the trainer tries to save the model, or when model.save_pretrained is called, the OOM occurs. So for some reason this line |
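For context, the line in question appears to be the state_dict override in finetune.py that makes the Trainer save only the LoRA adapter weights. Below is a minimal sketch of that pattern, reconstructed from the traceback further down, so details may differ from the exact source; the `patch_state_dict` wrapper is just for illustration, while `get_peft_model_state_dict` is the PEFT helper the script relies on:

```python
from peft import get_peft_model_state_dict


def patch_state_dict(model):
    """Make model.state_dict() return only the LoRA adapter weights.

    `model` is assumed to be a PeftModel returned by get_peft_model().
    """
    old_state_dict = model.state_dict
    model.state_dict = (
        # Call the original state_dict(), then filter it down to the
        # PEFT/LoRA parameters so the saved checkpoint stays small.
        lambda self, *_, **__: get_peft_model_state_dict(self, old_state_dict())
    ).__get__(model, type(model))
    return model
```

Note that `old_state_dict()` still walks every module of the 8-bit base model, so bitsandbytes' save hook (the `undo_layout` call visible in the traceback below) still runs, and that is where the extra allocation happens.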
I was able to fix this issue by rolling back accelerate, peft, bitsandbytes and transformers to commits dated around 5-6 April, when my previous finetunes were successful. I didn't change any parameters and everything worked. It's definitely an issue with one of these dependencies; I need to pinpoint which one. The issue is not in PyTorch. |
I checked and bitsandbytes got bumped to 0.38.0 a few days ago; using bitsandbytes==0.37.2 fixes it for me. |
Super! |
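To help narrow down which of those dependencies regressed, here is a quick way to list the installed versions of the suspect packages; it uses only the standard library, nothing project-specific:

```python
# Print the installed versions of the packages suspected above, so the
# regression can be bisected by downgrading them one at a time.
from importlib.metadata import PackageNotFoundError, version

for pkg in ("accelerate", "peft", "bitsandbytes", "transformers", "torch"):
    try:
        print(f"{pkg}=={version(pkg)}")
    except PackageNotFoundError:
        print(f"{pkg}: not installed")
```

Per the comments above, downgrading bitsandbytes to 0.37.2 is what resolved the OOM here.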
Thanks, it is useful. |
Why does my 3090 (24 GB) report CUDA out of memory when running llama-7b?? I also tried two 3090s and got the same error. |
I tried peft==0.2.0 and bitsandbytes==0.37.2, but it still runs out of memory at the second validation. 7B model on 24 GB VRAM. |
Same here. |
Happening for me right now on the latest transformers and bnb 0.37.2. |
* Update README.md: Add Huggingface repo for 7B and 13B quantization
* Update requirements.txt to pin PEFT and BNB version
Reason - For BNB: tloen/alpaca-lora#350; For PEFT: huggingface/peft@c21afbe#diff-b3b90f453dea37bf90203fd395e9dedc21b21c9a38464c6b1572368c049ef8b2L116-L128 |
Same issue. Tried reverting versions to no avail. Currently on 64 GB VRAM. |
Has anybody solved this problem? |
Can anyone try peft 0.2.0, like @cnbeining's change in his repo referencing this issue? |
If you get 'undefined symbol: cget_col_row_stats' when doing this step, try the following:
|
Worked for me! |
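If you want to be sure you are actually running the pinned versions discussed in this thread (peft 0.2.0 as in @cnbeining's requirements.txt change, bitsandbytes 0.37.2 as above), a small guard like the sketch below can be dropped in front of the training script; the exact pins are this thread's suggestion, not an official requirement:

```python
# Fail fast if the environment drifted away from the versions this thread
# reports as working (peft==0.2.0, bitsandbytes==0.37.2).
from importlib.metadata import version

PINS = {"peft": "0.2.0", "bitsandbytes": "0.37.2"}

for pkg, wanted in PINS.items():
    installed = version(pkg)
    if installed != wanted:
        raise SystemExit(
            f"{pkg}=={installed} is installed, but this setup expects {pkg}=={wanted}"
        )
print("version pins OK")
```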
I'm using a Tesla T4 16 GB GPU and want to fine-tune the 7B model. Every time it reaches iteration 200 of the first epoch, it reports a GPU out-of-memory error. It looks like there isn't enough memory when the model file is exported during validation. But according to https://zhuanlan.zhihu.com/p/616504594, fine-tuning on a 12 GB RTX 4070 works, so what is the reason? What I've tried:
1. --micro_batch_size 1 - didn't help
Training Alpaca-LoRA model with params:
base_model: decapoda-research/llama-7b-hf
data_path: ./trans_chinese_alpaca_data.json
output_dir: ./lora-alpaca-zh
batch_size: 128
micro_batch_size: 2
num_epochs: 2
learning_rate: 0.0003
cutoff_len: 256
val_set_size: 2000
lora_r: 8
lora_alpha: 16
lora_dropout: 0.05
lora_target_modules: ['q_proj', 'v_proj']
train_on_inputs: True
group_by_length: False
wandb_project:
wandb_run_name:
wandb_watch:
wandb_log_model:
resume_from_checkpoint: False
prompt template: alpaca
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████| 33/33 [00:18<00:00, 1.79it/s]
The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization.
The tokenizer class you load from this checkpoint is 'LLaMATokenizer'.
The class this function is called from is 'LlamaTokenizer'.
Found cached dataset json (/root/.cache/huggingface/datasets/json/default-d1370d3ed27da33a/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51)
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 428.82it/s]
trainable params: 4194304 || all params: 6742609920 || trainable%: 0.06220594176090199
Loading cached split indices for dataset at /root/.cache/huggingface/datasets/json/default-d1370d3ed27da33a/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51/cache-baf974d16126c7f1.arrow and /root/.cache/huggingface/datasets/json/default-d1370d3ed27da33a/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51/cache-6013f18c705337f9.arrow
{'loss': 2.2953, 'learning_rate': 2.9999999999999997e-05, 'epoch': 0.03}
{'loss': 2.208, 'learning_rate': 5.9999999999999995e-05, 'epoch': 0.05}
{'loss': 2.0048, 'learning_rate': 8.999999999999999e-05, 'epoch': 0.08}
{'loss': 1.6192, 'learning_rate': 0.00011999999999999999, 'epoch': 0.1}
{'loss': 1.381, 'learning_rate': 0.00015, 'epoch': 0.13}
{'loss': 1.2977, 'learning_rate': 0.00017999999999999998, 'epoch': 0.15}
{'loss': 1.2597, 'learning_rate': 0.00020999999999999998, 'epoch': 0.18}
{'loss': 1.2318, 'learning_rate': 0.00023999999999999998, 'epoch': 0.21}
{'loss': 1.2307, 'learning_rate': 0.00027, 'epoch': 0.23}
{'loss': 1.2053, 'learning_rate': 0.0003, 'epoch': 0.26}
{'loss': 1.1919, 'learning_rate': 0.0002955621301775148, 'epoch': 0.28}
{'loss': 1.1657, 'learning_rate': 0.00029112426035502955, 'epoch': 0.31}
{'loss': 1.1413, 'learning_rate': 0.00028668639053254437, 'epoch': 0.33}
{'loss': 1.1372, 'learning_rate': 0.00028224852071005914, 'epoch': 0.36}
{'loss': 1.1229, 'learning_rate': 0.00027781065088757395, 'epoch': 0.39}
{'loss': 1.1173, 'learning_rate': 0.0002733727810650887, 'epoch': 0.41}
{'loss': 1.1279, 'learning_rate': 0.00026893491124260353, 'epoch': 0.44}
{'loss': 1.1182, 'learning_rate': 0.0002644970414201183, 'epoch': 0.46}
{'loss': 1.112, 'learning_rate': 0.0002600591715976331, 'epoch': 0.49}
{'loss': 1.0954, 'learning_rate': 0.00025562130177514793, 'epoch': 0.52}
{'eval_loss': 1.1259599924087524, 'eval_runtime': 328.7811, 'eval_samples_per_second': 6.083, 'eval_steps_per_second': 0.76, 'epoch': 0.52}
26%|███████████████████████████████▏ | 200/776 [6:33:46<18:07:50, 113.32s/it]
Traceback (most recent call last):
File "/new_data/yangxuan/alpaca-lora/finetune.py", line 276, in
fire.Fire(train)
File "/root/miniconda3/envs/python39/lib/python3.9/site-packages/fire/core.py", line 141, in Fire
component_trace = _Fire(component, args, parsed_flag_args, context, name)
File "/root/miniconda3/envs/python39/lib/python3.9/site-packages/fire/core.py", line 475, in _Fire
component, remaining_args = _CallAndUpdateTrace(
File "/root/miniconda3/envs/python39/lib/python3.9/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
component = fn(*varargs, **kwargs)
File "/new_data/yangxuan/alpaca-lora/finetune.py", line 266, in train
trainer.train(resume_from_checkpoint=resume_from_checkpoint)
File "/root/miniconda3/envs/python39/lib/python3.9/site-packages/transformers/trainer.py", line 1662, in train
return inner_training_loop(
File "/root/miniconda3/envs/python39/lib/python3.9/site-packages/transformers/trainer.py", line 2006, in _inner_training_loop
self._maybe_log_save_evaluate(tr_loss, model, trial, epoch, ignore_keys_for_eval)
File "/root/miniconda3/envs/python39/lib/python3.9/site-packages/transformers/trainer.py", line 2291, in _maybe_log_save_evaluate
self._save_checkpoint(model, trial, metrics=metrics)
File "/root/miniconda3/envs/python39/lib/python3.9/site-packages/transformers/trainer.py", line 2348, in _save_checkpoint
self.save_model(output_dir, _internal_call=True)
File "/root/miniconda3/envs/python39/lib/python3.9/site-packages/transformers/trainer.py", line 2830, in save_model
self._save(output_dir)
File "/root/miniconda3/envs/python39/lib/python3.9/site-packages/transformers/trainer.py", line 2873, in _save
state_dict = self.model.state_dict()
File "/new_data/yangxuan/alpaca-lora/finetune.py", line 259, in
self, old_state_dict()
File "/root/miniconda3/envs/python39/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1818, in state_dict
module.state_dict(destination=destination, prefix=prefix + name + '.', keep_vars=keep_vars)
File "/root/miniconda3/envs/python39/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1818, in state_dict
module.state_dict(destination=destination, prefix=prefix + name + '.', keep_vars=keep_vars)
File "/root/miniconda3/envs/python39/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1818, in state_dict
module.state_dict(destination=destination, prefix=prefix + name + '.', keep_vars=keep_vars)
[Previous line repeated 4 more times]
File "/root/miniconda3/envs/python39/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1815, in state_dict
self._save_to_state_dict(destination, prefix, keep_vars)
File "/root/miniconda3/envs/python39/lib/python3.9/site-packages/bitsandbytes/nn/modules.py", line 268, in _save_to_state_dict
self.weight.data = undo_layout(self.state.CxB, self.state.tile_indices)
File "/root/miniconda3/envs/python39/lib/python3.9/site-packages/bitsandbytes/autograd/_functions.py", line 96, in undo_layout
outputs = torch.empty_like(tensor) # note: not using .index_copy because it was slower on cuda
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 16.00 MiB (GPU 0; 14.58 GiB total capacity; 13.37 GiB already allocated; 14.56 MiB free; 13.86 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Thanks in advance!
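For what it's worth, the traceback shows the OOM happens inside bitsandbytes' `undo_layout` while the checkpoint is being written, i.e. it needs a temporary copy of the 8-bit weights on top of an already nearly full 16 GB card. The error message itself suggests `max_split_size_mb`; here is a minimal sketch of setting it from Python (the 128 MiB value is only an example, and this helps only if the failure is due to allocator fragmentation rather than genuinely running out of memory):

```python
import os

# Must be set before the CUDA caching allocator is initialised, i.e. before
# the first CUDA allocation (safest: before importing torch in finetune.py).
# 128 MiB is an arbitrary example value.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch  # imported after setting the allocator config on purpose
```

The same thing can be done from the shell by exporting PYTORCH_CUDA_ALLOC_CONF before launching finetune.py.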