
Workaround Low-Mem-Mode Patch for GPTQ-LoRA #26

Merged: 7 commits merged into foundation-model-stack:dev from gptq-low-mem-mode-fix on May 29, 2024

Conversation

achew010 (Contributor) commented May 29, 2024

Description

This PR addresses #18 with the following contributions:

  • Introduce a patch on AutoGPTQ's make_sure_no_tensor_in_meta_device to avoid raising an error when the model has no bias in low-memory mode (a sketch of this idea is shown after this list).
  • Workaround that configures device_map to cpu when loading checkpoints, to avoid GPU memory consumption before trainer initialization (see the loading sketch after the TODO list below).
    Note: This approach diverts consumption to CPU memory, which could still become a bottleneck; a better approach would be to load the model to the meta device. QLoRA currently loads quantized models to CPU in low-memory mode as well. See here.
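For illustration, here is a minimal sketch of how such a patch could be applied at runtime. The module path `auto_gptq.modeling._utils` and the exact check being relaxed are assumptions about AutoGPTQ's layout; this is not the exact code merged in this PR.

```python
# Illustrative monkey-patch (not the exact code in this PR): relax AutoGPTQ's
# meta-device check so that modules with no bias tensor are skipped instead of
# raising an error in low-memory mode.
import torch
import auto_gptq.modeling._utils as gptq_utils  # module path assumed

def patched_make_sure_no_tensor_in_meta_device(model, *args, **kwargs):
    for name, module in model.named_modules():
        bias = getattr(module, "bias", None)
        if bias is None:
            continue  # nothing to verify for bias-less modules
        if bias.device == torch.device("meta"):
            raise ValueError(f"bias of module {name} is still on the meta device")

# Apply the patch before loading the quantized model.
gptq_utils.make_sure_no_tensor_in_meta_device = patched_make_sure_no_tensor_in_meta_device
```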

TODO:

  • Actual device mapping to meta device
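As a rough illustration of the CPU workaround and the meta-device TODO above, the sketch below uses a generic Hugging Face `from_pretrained` call; the helper name `load_checkpoint_low_mem` and its `low_memory_mode` flag are hypothetical and not the plugin's actual API.

```python
# Hypothetical helper illustrating the workaround: in low-memory mode the
# checkpoint is mapped to CPU so that no GPU memory is consumed before the
# trainer initializes. The TODO above would replace the CPU placement with
# the "meta" device, as QLoRA-style low-memory loading does.
from transformers import AutoModelForCausalLM

def load_checkpoint_low_mem(model_name_or_path: str, low_memory_mode: bool = True):
    device_map = {"": "cpu"} if low_memory_mode else None
    return AutoModelForCausalLM.from_pretrained(
        model_name_or_path,
        device_map=device_map,
        torch_dtype="auto",
    )

# Example: keep the 70B GPTQ checkpoint off the GPU until training starts.
# model = load_checkpoint_low_mem("TheBloke/Llama-2-70B-GPTQ")
```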

Tests

Reproduction command

accelerate launch --config_file scripts/benchmarks/accelerate.yaml --num_processes=2 --main_process_port=29500 -m tuning.sft_trainer --model_name_or_path TheBloke/Llama-2-70B-GPTQ --acceleration_framework_config_file /data/aaron/experimental/test3/scripts/benchmarks/../../sample-configurations/accelerated-peft-autogptq-sample-configuration.yaml --packing True --max_seq_len 4096 --learning_rate 2e-4 --fp16 True --torch_dtype float16 --peft_method lora --r 16 --lora_alpha 16 --lora_dropout 0.0 --target_modules q_proj k_proj v_proj o_proj --use_flash_attn True --response_template '\n### Response:' --dataset_text_field 'output' --include_tokens_per_second True --num_train_epochs 1 --gradient_accumulation_steps 1 --gradient_checkpointing True --evaluation_strategy no --save_strategy no --weight_decay 0.01 --warmup_steps 10 --adam_epsilon 1e-4 --lr_scheduler_type linear --logging_strategy steps --logging_steps 10 --max_steps 10 --training_data_path /data/aaron/experimental/test3/benchmark_outputs_final/data/cache.json --per_device_train_batch_size 2 --output_dir benchmark_outputs/exp_57/hf --skip_memory_metrics False

Comparison

Before Fix:

Before the fix, GPTQ-LoRA without low-memory mode shows a memory explosion in the metrics, with 78.80 GiB Nvidia memory reserved and 36.14 GiB Torch memory allocated, compared to QLoRA with low-memory mode enabled.

| model name | framework config | num gpus | per device train batch size | nvidia mem reserved (GiB) | peak torch mem alloc (GiB) | torch mem alloc (GiB) | throughput (toks/sec) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| NousResearch/Llama-2-70b-hf | accelerated-peft-bnb | 2 | 2 | 51.40 | 46.52 | 19.17 | 417 |
| TheBloke/Llama-2-70B-GPTQ | accelerated-peft-autogptq | 2 | 2 | 78.80 | 45.40 | 36.14 | 429 |

After Fix:

With low-memory mode enabled, GPTQ-LoRA now has lower memory consumption, with 49.44 GiB Nvidia memory reserved and 18.13 GiB Torch memory allocated, and is comparable with QLoRA.

| model name | framework config | num gpus | per device train batch size | nvidia mem reserved (GiB) | peak torch mem alloc (GiB) | torch mem alloc (GiB) | throughput (toks/sec) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| NousResearch/Llama-2-70b-hf | accelerated-peft-bnb | 2 | 2 | 51.40 | 46.52 | 19.17 | 414 |
| TheBloke/Llama-2-70B-GPTQ | accelerated-peft-autogptq | 2 | 2 | 49.44 | 44.87 | 18.13 | 428 |

@achew010 achew010 requested a review from fabianlim as a code owner May 29, 2024 02:43
@achew010 achew010 self-assigned this May 29, 2024
achew010 and others added 2 commits May 29, 2024 14:31
fabianlim (Contributor) commented:

@achew010 can you update the top-level comment with what the previous memory allocation was, and verify that the new measurements are obtained after reversing the hack in 80d631e?

@fabianlim fabianlim linked an issue (Allow AutoGPTQ to work in low cpu memory mode, #18) May 29, 2024 that may be closed by this pull request
@achew010 achew010 removed their assignment May 29, 2024
@fabianlim fabianlim merged commit 25171a0 into foundation-model-stack:dev May 29, 2024
3 checks passed
fabianlim added a commit to fabianlim/fms-acceleration that referenced this pull request May 31, 2024
fabianlim added a commit that referenced this pull request Jun 2, 2024
* refactor

Signed-off-by: Yu Chin Fabian Lim <flim@sg.ibm.com>

* fixes

Signed-off-by: Yu Chin Fabian Lim <flim@sg.ibm.com>

* refactor mistral

Signed-off-by: Yu Chin Fabian Lim <flim@sg.ibm.com>

* add mixtral

Signed-off-by: Yu Chin Fabian Lim <flim@sg.ibm.com>

* some refactoring after introducing mlp

Signed-off-by: Yu Chin Fabian Lim <flim@sg.ibm.com>

* remove extranous files

Signed-off-by: Yu Chin Fabian Lim <flim@sg.ibm.com>

* add bnb

Signed-off-by: Yu Chin Fabian Lim <flim@sg.ibm.com>

* lint + fmt and improvements to readme

Signed-off-by: Yu Chin Fabian Lim <flim@sg.ibm.com>

* bench fixes

* need to handle lora adapters device due to #26

* allow replay of failed benches, addressing comment in #14

* update benches (remove l40)

---------

Signed-off-by: Yu Chin Fabian Lim <flim@sg.ibm.com>
@achew010 achew010 deleted the gptq-low-mem-mode-fix branch July 26, 2024 04:05