Workaround Low-Mem-Mode Patch for GPTQ-LoRA #26
Merged
fabianlim merged 7 commits into foundation-model-stack:dev from achew010:gptq-low-mem-mode-fix on May 29, 2024
Conversation
fabianlim reviewed May 29, 2024: plugins/accelerated-peft/src/fms_acceleration_peft/framework_plugin_autogptq.py (outdated)
achew010 force-pushed the gptq-low-mem-mode-fix branch from e4e32b6 to 6764755 on May 29, 2024 03:28
fabianlim reviewed May 29, 2024: plugins/accelerated-peft/src/fms_acceleration_peft/autogptq_utils.py (outdated)
fabianlim reviewed May 29, 2024
fabianlim reviewed May 29, 2024: plugins/accelerated-peft/src/fms_acceleration_peft/framework_plugin_autogptq.py
Co-authored-by: Yu Chin Fabian Lim <fabianlim@users.noreply.github.com>
fabianlim reviewed May 29, 2024: plugins/accelerated-peft/src/fms_acceleration_peft/framework_plugin_autogptq.py (outdated)
fabianlim approved these changes May 29, 2024
fabianlim added a commit to fabianlim/fms-acceleration that referenced this pull request on May 31, 2024
fabianlim added a commit that referenced this pull request on Jun 2, 2024
* refactor
* fixes
* refactor mistral
* add mixtral
* some refactoring after introducing mlp
* remove extraneous files
* add bnb
* lint + fmt and improvements to readme
* bench fixes
  * need to handle lora adapters device due to #26
  * allow replay of failed benches, addressing comment in #14
  * update benches (remove l40)

Signed-off-by: Yu Chin Fabian Lim <flim@sg.ibm.com>
Description

This PR addresses #18 with the following contributions:

* Patches `make_sure_no_tensor_in_meta_device` so that no error is raised when the model has no bias in low-memory mode.
* Sets `device_map` to `cpu` when loading checkpoints, to avoid GPU memory consumption before the trainer is initialized.

Note: this approach diverts the consumption to CPU memory, which could still become a bottleneck; a better approach could be to load the checkpoint onto the `meta` device instead. QLoRA currently loads quantized models to `cpu` in low memory mode as well. See here. A sketch of the two workarounds is shown below.
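A minimal sketch of the two workarounds, assuming AutoGPTQ's `from_quantized` API; this is not the exact code merged in this PR, and the checkpoint name is only an example:

```python
# Hedged sketch of the two workarounds described above (not the exact plugin code).
import torch

import auto_gptq.modeling._utils as autogptq_utils
from auto_gptq import AutoGPTQForCausalLM


def _relaxed_make_sure_no_tensor_in_meta_device(model, *args, **kwargs):
    # Only flag parameters that actually exist and are still on the meta device;
    # a module whose bias is simply None should not trigger an error.
    for name, param in model.named_parameters():
        if param.device == torch.device("meta"):
            raise ValueError(f"Parameter {name} is still on the meta device.")


# Workaround 1: replace the strict check. In practice the symbol has to be
# patched where AutoGPTQ looks it up (the plugin uses a patching utility for
# this); overriding the module attribute here is only illustrative.
autogptq_utils.make_sure_no_tensor_in_meta_device = (
    _relaxed_make_sure_no_tensor_in_meta_device
)

# Workaround 2: keep the quantized checkpoint on CPU until the trainer is
# initialized, so no GPU memory is consumed up front.
model = AutoGPTQForCausalLM.from_quantized(
    "TheBloke/Mistral-7B-v0.1-GPTQ",  # example checkpoint, not from this PR
    device_map="cpu",
)
```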
TODO:

* Load the checkpoint onto the `meta` device instead; a hedged sketch of that idea follows.

Tests
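A hedged sketch of the `meta`-device idea using generic `accelerate`/`transformers` APIs rather than this repo's code; the model name is only an example:

```python
# Hedged sketch of the meta-device alternative mentioned in the TODO above.
from accelerate import init_empty_weights
from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained("mistralai/Mistral-7B-v0.1")  # example model

# Build the model skeleton on the meta device: parameters have shapes and dtypes
# but no storage, so neither CPU nor GPU memory is consumed at this point.
with init_empty_weights():
    empty_model = AutoModelForCausalLM.from_config(config)

print(next(empty_model.parameters()).device)  # device(type='meta')

# The quantized weights would then be materialized directly onto the target
# device (e.g. with accelerate.load_checkpoint_and_dispatch, or a manual
# state-dict load after Module.to_empty()), skipping the intermediate CPU copy.
```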
Reproduction command
Comparison
Before Fix:
Before the fix, a memory explosion is observed for GPTQ-LoRA without low-memory mode in the memory metrics, Nvidia (78.80 GiB) and Torch (36.1 GiB), compared to QLoRA with low-memory mode enabled.
| name | config | gpus | train batch size | mem reserved (GiB) | mem alloc (GiB) | mem alloc (GiB) |
| --- | --- | --- | --- | --- | --- | --- |
After Fix:
With low-memory mode enabled, GPTQ-LoRA now has lower memory consumption, Nvidia (49.4 GiB) and Torch (18.1 GiB), and is comparable with QLoRA.
| name | config | gpus | train batch size | mem reserved (GiB) | mem alloc (GiB) | mem alloc (GiB) |
| --- | --- | --- | --- | --- | --- | --- |