Finetuning Llama-2-13B with 1x A100 80GB? torch.cuda.OutOfMemoryError #356
-
I'm trying to finetune Llama-2-13B on a single A100 80GB, but it gives me torch.cuda.OutOfMemoryError.
-
Full-finetuning a 13B model may be too tough for a single A100 80GB 😭 You can try QLoRA, which is optimized for low VRAM usage. To save even more memory, you can try zero3_offload; see the explanation here.
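For context, a minimal QLoRA setup with transformers + peft + bitsandbytes looks roughly like the sketch below. This is illustrative, not this repo's recipe: the model id, LoRA rank, and target modules are my own assumptions.

```python
# Minimal QLoRA sketch (hypothetical hyperparameters, not this repo's recipe).
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Load the frozen base model in 4-bit NF4 so the 13B weights fit on one card.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-13b-hf",
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

# Attach small trainable LoRA adapters; only these receive gradients,
# so the optimizer states stay tiny compared to full finetuning.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% trainable
```

Because the 4-bit base weights are frozen, only the adapter parameters and their optimizer states consume training memory, which is why this tends to fit on a single 80GB card.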
-
Ah, thanks for your response! Do you know the minimum number of A100 80GB GPUs required to finetune Llama-2-13B? I don't have a local machine, so I'm renting a cloud GPU without a persistent boot drive. If I can, I'd like to avoid 8x A100 80GB idling during environment setup; building and installing flash-attn alone takes about 30 minutes. Also, could you explain when to point to zero2.json vs. zero3.json for DeepSpeed?
-
When you have enough VRAM, zero2 is slightly faster than zero3. If you want to finetune the full model, you may need 8 GPUs; if you are okay with LoRA or QLoRA, a single GPU may be fine (I currently do not have a free machine to test these).
AFAIK, Google Cloud and AWS EC2 both let you "persist" a disk so that it is not deleted after the server is terminated. You can also create a custom image based on the disk state. So you can basically compile everything on a single-GPU server, make sure things run properly, and create a custom image; next time, create new instances from the custom image instead of the default image. All packages will already be there and you won't need to compile from scratch. You can search for this using keywords like "google cloud create custom image".
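On the zero2 vs. zero3 question: ZeRO-2 shards optimizer states and gradients across GPUs, while ZeRO-3 additionally shards the parameters themselves, and offloading moves those states into CPU RAM. A zero3_offload-style config looks roughly like the sketch below; the keys and values are illustrative, not this repo's actual file, and the HF Trainer happens to accept the config as a Python dict in place of a json path, which is convenient on throwaway cloud instances.

```python
# A hypothetical ZeRO-3 + CPU-offload DeepSpeed config, expressed as a dict.
ds_config = {
    "zero_optimization": {
        "stage": 3,  # stage 2 shards optimizer+grads; stage 3 also shards params
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        "offload_param": {"device": "cpu", "pin_memory": True},
        "overlap_comm": True,
        "contiguous_gradients": True,
    },
    "bf16": {"enabled": True},
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
}

from transformers import TrainingArguments

# "auto" entries above are filled in from these arguments at launch time.
args = TrainingArguments(
    output_dir="out",           # hypothetical output path
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    bf16=True,
    deepspeed=ds_config,        # a dict works in place of a zero3.json path
)
```

The tradeoff mentioned above applies here: with enough VRAM, stage 2 skips the extra parameter gathering stage 3 does and runs slightly faster, while stage 3 with offload trades step time for memory headroom.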
-
Thanks so much for the info.
-
That's also a great option! Thanks!!!