
Workaround Low-Mem-Mode Patch for GPTQ-LoRA #26

Merged: 7 commits merged into foundation-model-stack:dev from gptq-low-mem-mode-fix on May 29, 2024

Conversation

achew010 (Contributor) commented May 29, 2024

Description

This PR addresses #18 with the following contributions:

  • Introduce a patch on AutoGPTQ's make_sure_no_tensor_in_meta_device to avoid raising an error when the model has no bias in low-memory mode (a sketch of this idea is shown after this list).
  • Workaround that configures device_map to cpu when loading checkpoints, to avoid GPU memory consumption before trainer initialization (see the loading sketch after the TODO list below).
    Note: This approach diverts consumption to CPU memory, which could still become a bottleneck; a better approach would be to load the model to the meta device. QLoRA currently loads quantized models to CPU in low-memory mode as well. See here.
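For illustration, here is a minimal sketch of how such a patch could be applied at runtime. The module path `auto_gptq.modeling._utils` and the exact check being relaxed are assumptions about AutoGPTQ's layout; this is not the exact code merged in this PR.

```python
# Illustrative monkey-patch (not the exact code in this PR): relax AutoGPTQ's
# meta-device check so that modules with no bias tensor are skipped instead of
# raising an error in low-memory mode.
import torch
import auto_gptq.modeling._utils as gptq_utils  # module path assumed

def patched_make_sure_no_tensor_in_meta_device(model, *args, **kwargs):
    for name, module in model.named_modules():
        bias = getattr(module, "bias", None)
        if bias is None:
            continue  # nothing to verify for bias-less modules
        if bias.device == torch.device("meta"):
            raise ValueError(f"bias of module {name} is still on the meta device")

# Apply the patch before loading the quantized model.
gptq_utils.make_sure_no_tensor_in_meta_device = patched_make_sure_no_tensor_in_meta_device
```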

TODO:

  • Actual device mapping to meta device
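As a rough illustration of the CPU workaround and the meta-device TODO above, the sketch below uses a generic Hugging Face `from_pretrained` call; the helper name `load_checkpoint_low_mem` and its `low_memory_mode` flag are hypothetical and not the plugin's actual API.

```python
# Hypothetical helper illustrating the workaround: in low-memory mode the
# checkpoint is mapped to CPU so that no GPU memory is consumed before the
# trainer initializes. The TODO above would replace the CPU placement with
# the "meta" device, as QLoRA-style low-memory loading does.
from transformers import AutoModelForCausalLM

def load_checkpoint_low_mem(model_name_or_path: str, low_memory_mode: bool = True):
    device_map = {"": "cpu"} if low_memory_mode else None
    return AutoModelForCausalLM.from_pretrained(
        model_name_or_path,
        device_map=device_map,
        torch_dtype="auto",
    )

# Example: keep the 70B GPTQ checkpoint off the GPU until training starts.
# model = load_checkpoint_low_mem("TheBloke/Llama-2-70B-GPTQ")
```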

Tests

Reproduction command

accelerate launch --config_file scripts/benchmarks/accelerate.yaml --num_processes=2 --main_process_port=29500 -m tuning.sft_trainer --model_name_or_path TheBloke/Llama-2-70B-GPTQ --acceleration_framework_config_file /data/aaron/experimental/test3/scripts/benchmarks/../../sample-configurations/accelerated-peft-autogptq-sample-configuration.yaml --packing True --max_seq_len 4096 --learning_rate 2e-4 --fp16 True --torch_dtype float16 --peft_method lora --r 16 --lora_alpha 16 --lora_dropout 0.0 --target_modules q_proj k_proj v_proj o_proj --use_flash_attn True --response_template '\n### Response:' --dataset_text_field 'output' --include_tokens_per_second True --num_train_epochs 1 --gradient_accumulation_steps 1 --gradient_checkpointing True --evaluation_strategy no --save_strategy no --weight_decay 0.01 --warmup_steps 10 --adam_epsilon 1e-4 --lr_scheduler_type linear --logging_strategy steps --logging_steps 10 --max_steps 10 --training_data_path /data/aaron/experimental/test3/benchmark_outputs_final/data/cache.json --per_device_train_batch_size 2 --output_dir benchmark_outputs/exp_57/hf --skip_memory_metrics False

Comparison

Before Fix:

Before the fix, GPTQ-LoRA without low-memory mode shows a memory explosion in the metrics, with 78.80 GiB Nvidia memory reserved and 36.14 GiB Torch memory allocated, compared to QLoRA with low-memory mode enabled.

| model name | framework config | num gpus | per device train batch size | nvidia mem reserved (GiB) | peak torch mem alloc (GiB) | torch mem alloc (GiB) | throughput (toks/sec) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| NousResearch/Llama-2-70b-hf | accelerated-peft-bnb | 2 | 2 | 51.40 | 46.52 | 19.17 | 417 |
| TheBloke/Llama-2-70B-GPTQ | accelerated-peft-autogptq | 2 | 2 | 78.80 | 45.40 | 36.14 | 429 |

After Fix:

With low-memory mode enabled, GPTQ-LoRA now has lower memory consumption, with 49.44 GiB Nvidia memory reserved and 18.13 GiB Torch memory allocated, and is comparable with QLoRA.

| model name | framework config | num gpus | per device train batch size | nvidia mem reserved (GiB) | peak torch mem alloc (GiB) | torch mem alloc (GiB) | throughput (toks/sec) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| NousResearch/Llama-2-70b-hf | accelerated-peft-bnb | 2 | 2 | 51.40 | 46.52 | 19.17 | 414 |
| TheBloke/Llama-2-70B-GPTQ | accelerated-peft-autogptq | 2 | 2 | 49.44 | 44.87 | 18.13 | 428 |

@achew010 achew010 requested a review from fabianlim as a code owner May 29, 2024 02:43
@achew010 achew010 self-assigned this May 29, 2024
achew010 and others added 2 commits May 29, 2024 14:31
fabianlim (Contributor) commented:

@achew010 can you update the top-level comment with what the previous memory allocation was, and verify that the new measurements are obtained after reversing the hack in 80d631e?

@fabianlim fabianlim linked an issue (Allow AutoGPTQ to work in low cpu memory mode, #18) May 29, 2024 that may be closed by this pull request
@achew010 achew010 removed their assignment May 29, 2024
@fabianlim fabianlim merged commit 25171a0 into foundation-model-stack:dev May 29, 2024
3 checks passed
fabianlim added a commit to fabianlim/fms-acceleration that referenced this pull request May 31, 2024
fabianlim added a commit that referenced this pull request Jun 2, 2024
* refactor

Signed-off-by: Yu Chin Fabian Lim <flim@sg.ibm.com>

* fixes

Signed-off-by: Yu Chin Fabian Lim <flim@sg.ibm.com>

* refactor mistral

Signed-off-by: Yu Chin Fabian Lim <flim@sg.ibm.com>

* add mixtral

Signed-off-by: Yu Chin Fabian Lim <flim@sg.ibm.com>

* some refactoring after introducing mlp

Signed-off-by: Yu Chin Fabian Lim <flim@sg.ibm.com>

* remove extranous files

Signed-off-by: Yu Chin Fabian Lim <flim@sg.ibm.com>

* add bnb

Signed-off-by: Yu Chin Fabian Lim <flim@sg.ibm.com>

* lint + fmt and improvements to readme

Signed-off-by: Yu Chin Fabian Lim <flim@sg.ibm.com>

* bench fixes

* need to handle lora adapters device due to #26

* allow replay of failed benches, addressing comment in #14

* update benches (remove l40)

---------

Signed-off-by: Yu Chin Fabian Lim <flim@sg.ibm.com>
@achew010 achew010 deleted the gptq-low-mem-mode-fix branch July 26, 2024 04:05