OOM error running the demo script sft_llama2.py on A100 GPU #824

Closed
Emerald01 opened this issue Sep 28, 2023 · 6 comments

Emerald01 commented Sep 28, 2023

Hello,

I am testing trl/examples/research_projects/stack_llama_2/scripts/sft_llama2.py.
I pulled the latest main branch and pip installed it locally.

Running environment: GCP with eight A100 GPUs (40 GB memory each).

I followed the README without any change:

accelerate launch sft_llama2.py --output_dir=XXX

The accelerate configuration is the example config for 8 GPUs:

compute_environment: LOCAL_MACHINE
debug: false
distributed_type: MULTI_GPU
downcast_bf16: 'no'
gpu_ids: all
machine_rank: 0
main_training_function: main
mixed_precision: 'bf16'
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

But I got an OOM error:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 256.00 MiB (GPU 1; 39.42 GiB total capacity; 37.39 GiB already allocated; 245.00 MiB free; 38.08 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
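As an aside, the traceback itself points at PYTORCH_CUDA_ALLOC_CONF. A minimal way to experiment with that hint (the value below is purely illustrative, not a setting suggested anywhere in this thread, and it must take effect before CUDA is initialized):

# Illustrative only: the allocator hint from the traceback above; 128 is a guess.
# Set this before any CUDA work, e.g. at the very top of sft_llama2.py or in the
# shell before running accelerate launch.
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"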

I checked the memory usage. After the model load finishes, i.e., base_model = AutoModelForCausalLM.from_pretrained(...), the GPU memory usage is only about 4707MiB / 40960MiB, so the load_in_4bit loading used by the script seems quite effective.

Right before calling trainer.train(), the memory usage is still reasonable, about 8071MiB / 40960MiB.
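The MiB readings above look like nvidia-smi output; if one wants an equivalent in-process check, a sketch (not part of sft_llama2.py) could look like this:

# Sketch only, not part of sft_llama2.py: log per-GPU memory from inside the script,
# e.g. right before trainer.train(), to see where usage jumps. These numbers track
# only the PyTorch caching allocator, so they will be lower than nvidia-smi totals.
import torch

def report_gpu_memory(tag: str) -> None:
    for i in range(torch.cuda.device_count()):
        allocated_mib = torch.cuda.memory_allocated(i) / 2**20
        reserved_mib = torch.cuda.memory_reserved(i) / 2**20
        print(f"[{tag}] GPU {i}: allocated={allocated_mib:.0f} MiB, reserved={reserved_mib:.0f} MiB")

report_gpu_memory("before trainer.train()")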

But as soon as trainer.train() starts executing, the memory quickly blows up. Is there any obvious problem here, such as a configuration issue or a bug that eats up the memory? Since I ran the demo code without changing a single line, I hope someone can answer this question so I can gain confidence in this codebase and move on.

Thank you so much for your help!

younesbelkada (Contributor) commented:

Hi @Emerald01!
I think gradient checkpointing is disabled in your case (the default value changed in recent updates). Can you try adding gradient_checkpointing=True here: https://github.com/huggingface/trl/blob/main/examples/research_projects/stack_llama_2/scripts/sft_llama2.py#L188
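For concreteness, a sketch of what that change could look like in the script's TrainingArguments (every value other than gradient_checkpointing is a placeholder, not taken from the actual script):

# Sketch only: enable gradient checkpointing where the script builds its
# TrainingArguments. All other values here are placeholders.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",          # placeholder
    per_device_train_batch_size=4,   # placeholder
    gradient_accumulation_steps=2,   # placeholder
    bf16=True,                       # matches the bf16 mixed precision in the accelerate config
    gradient_checkpointing=True,     # the flag suggested above
)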

Emerald01 (Author) commented Oct 13, 2023

Yes, that seems to have resolved the issue with this flag on, plus setting ddp_find_unused_parameters=False.
Thank you for your help @younesbelkada!

Now I hit another issue when running the same script with DeepSpeed ZeRO-3.
Nothing special, I just reused the config from the example: https://github.com/huggingface/trl/blob/main/examples/accelerate_configs/deepspeed_zero3.yaml

It reports OOM again:

model = prepare_model_for_kbit_training(
  File "/usr/local/lib/python3.8/dist-packages/peft/utils/other.py", line 101, in prepare_model_for_kbit_training
    param.data = param.data.to(torch.float32)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 500.00 MiB (GPU 0; 39.42 GiB total capacity; 3.97 GiB already allocated; 481.00 MiB free; 4.12 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

I believe DeepSpeed ZeRO-3 tries to partition a single model across all available devices, so obviously this does not work as expected here. I think trl claims these scripts should support DeepSpeed without any extra code, so I guess it is some configuration issue again...

Actually I am a little confused here. The original sft_llama2.py uses 4-bit quantization to load the entire model onto each GPU (which seems quite efficient; each GPU only used about 10 GB). When DeepSpeed takes over, what should be expected: 4-bit quantization + ZeRO-3, or just ZeRO-3? And finally, does trl support torch.distributed.FSDP? To me, torch provides a much cleaner ZeRO-3-style solution. Anyway, more detailed documentation or blog posts discussing these topics would be very helpful.

allanj (Contributor) commented Oct 16, 2023

It looks like the Trainer will automatically set ddp_find_unused_parameters=False if gradient checkpointing is enabled:

https://github.com/huggingface/transformers/blob/21dc5859421cf0d7d82d374b10f533611745a8c5/src/transformers/trainer.py#L1392-L1395

I still have the issue with the following config:

compute_environment: LOCAL_MACHINE
debug: false
deepspeed_config:
  gradient_accumulation_steps: 1
  offload_optimizer_device: none
  offload_param_device: none
  zero3_init_flag: false
  zero_stage: 2
distributed_type: DEEPSPEED
downcast_bf16: 'no'
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

allanj (Contributor) commented Oct 16, 2023

It seems something happens here that makes the GPU memory explode:

https://github.com/huggingface/peft/blob/aaa7e9f44a6405af819e721d7ee7fc6dd190c980/src/peft/utils/other.py#L82-L86

    if not is_gptq_quantized:
        # cast all non INT8 parameters to fp32
        for param in model.parameters():
            if (param.dtype == torch.float16) or (param.dtype == torch.bfloat16):
                param.data = param.data.to(torch.float32)

My model initialization is as follows:

model = AutoModelForCausalLM.from_pretrained(
    script_args.model_name_or_path,
    low_cpu_mem_usage=True,
    quantization_config=bnb_config,
    device_map={"": Accelerator().local_process_index},
    torch_dtype=torch.bfloat16,
    load_in_4bit=True,
)

If I remove torch_dtype=torch.bfloat16, it seems to work.
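For clarity, a sketch of the variant described as working, i.e. the same call with only the torch_dtype argument dropped:

# Same loading call as above, minus torch_dtype=torch.bfloat16; dtype handling is
# left to the quantization config instead.
model = AutoModelForCausalLM.from_pretrained(
    script_args.model_name_or_path,
    low_cpu_mem_usage=True,
    quantization_config=bnb_config,
    device_map={"": Accelerator().local_process_index},
    load_in_4bit=True,
)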

github-actions bot commented Nov 9, 2023

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
