Check if the buffers fit GPU memory after device map auto inferred #2412

notsyncing · 2024-02-02T06:56:42Z

What does this PR do?

Hello, I'm trying to use accelerate to offload a large model (https://huggingface.co/TheBloke/WizardCoder-33B-V1.1-GPTQ) to CPU, with the following code (requires #2383 if using an Intel GPU, and huggingface/transformers#28755):

import datetime

import torch
import intel_extension_for_pytorch

import accelerate
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

model_id = "/mnt/external2/LLMs/WizardCoder-33B-V1.1-GPTQ"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Without offload_buffers (OOM)
pipe = pipeline("text-generation", model=model_id, model_kwargs={"torch_dtype": torch.bfloat16, "device_map": "auto", "max_memory": {0: "16GB", "cpu": "128GB"}})

# With offload_buffers (won't OOM)
#pipe = pipeline("text-generation", model=model_id, model_kwargs={"torch_dtype": torch.bfloat16, "device_map": "auto", "max_memory": {0: "16GB", "cpu": "128GB"}, "offload_buffers": True})
print(str(pipe.model.hf_device_map))

print(str(datetime.datetime.now()) + " Generating...")
results = pipe("public void helloWorld() {")

print(str(datetime.datetime.now()) + " Output:")
print(results)

I have one Intel Arc A770 16GB GPU in my machine, but the code above always runs out of memory (OOM) on the GPU.

After some digging, I found that this model, TheBloke/WizardCoder-33B-V1.1-GPTQ, contains huge buffers (the model is 17GB, but the buffers are more than 16GB), so it will not fit into my GPU at all without offload_buffers.

So I created this PR to check whether the buffers fit on the GPU and to raise an exception if they do not.

Before submitting

This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
Did you read the contributor guideline, Pull Request section?
Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.

HuggingFaceDocBuilderDev · 2024-02-09T14:16:14Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

SunMarc

Thanks for digging into the buffer issue, @notsyncing! Could you add some tests to check that the buffer calculation is right and that we raise the error when the buffers don't fit in the remaining space on the GPU? Can you have a second look, @muellerzr?

src/accelerate/utils/modeling.py

SunMarc · 2024-02-29T18:49:27Z

Hi @notsyncing, thanks for your work ! I've replied to your questions. This is not pressing but are you planning to finish this PR ? Otherwise, I can finish it and add you as co-author =)

notsyncing · 2024-03-01T00:59:19Z

> Hi @notsyncing, thanks for your work ! I've replied to your questions. This is not pressing but are you planning to finish this PR ? Otherwise, I can finish it and add you as co-author =)

Yes, I'm planning to finish it. But I cannot test the multi-GPU scenario because I have only one GPU. Would you mind helping to test this? Thanks!

muellerzr · 2024-03-01T01:02:24Z

We can run the relevant tests for you when you are ready/the normal CI here is green and report back any failures (on a multi-gpu runner), yes :)

notsyncing · 2024-03-01T04:46:20Z

> Basically, what I mean is that if the user sets a device_map that places most of the modules on the second GPU and leaves space available on the first GPU, the buffers will most likely fit on the first GPU (if we suppose the first GPU is the execution device). For example, this is the case with the device_map="balanced_low_0" strategy.

I have now added a check to see whether the buffers fit on any GPU, and the warning is only printed when they fit on none of them. Will this work in your scenario?
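
A minimal sketch of the check described in this comment, not the code in this PR: the function name `buffers_fit_any_gpu` and the shape of its inputs are assumptions made for illustration. It takes the total buffer size and the bytes still free on each GPU after the device map has been inferred, and warns only when no GPU has room.

```python
import warnings


def buffers_fit_any_gpu(buffer_bytes, free_bytes_per_gpu):
    """Return True if the buffers fit on at least one GPU.

    Hypothetical helper: `free_bytes_per_gpu` maps a GPU index to the bytes
    still free on it after the device map has assigned the modules.
    """
    fits = any(free >= buffer_bytes for free in free_bytes_per_gpu.values())
    if not fits:
        # Warn once, and only when none of the GPUs can hold the buffers.
        warnings.warn(
            f"Buffers need {buffer_bytes} bytes but no GPU has that much free; "
            "consider passing offload_buffers=True.",
            UserWarning,
        )
    return fits
```

With a "balanced_low_0"-style layout, where the first GPU is left mostly empty, the function would be expected to return True without warning, which matches the scenario raised in the quoted comment.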

notsyncing · 2024-03-01T13:15:55Z

> Could you add some tests to see that the buffer calculation is right and that we raise the error when the buffers don't fit the remaining space on the gpu.

As for the tests, it seems difficult to write a test for this if it only prints a warning. I propose using an environment variable (like ACCELERATE_BUFFER_SIZE_CHECK = "raise") to make it raise an exception instead:

if os.getenv("ACCELERATE_BUFFER_SIZE_CHECK") == "raise":
    raise ValueError("xxx")
else:
    logger.warning("yyy")

Would this be better?

muellerzr · 2024-03-01T13:29:50Z

You can use warnings.warn() to do so, and unittest has a method to check for a warning: with self.assertWarns(Warning):
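
A small sketch of how that suggestion could look in a test. The function `check_buffers` is a hypothetical stand-in for the real check, and the byte counts are arbitrary. Only `warnings.warn`, `warnings.catch_warnings`, and `unittest.TestCase.assertWarns` are standard-library calls.

```python
import unittest
import warnings


def check_buffers(buffer_bytes, free_bytes):
    # Hypothetical stand-in for the real check: warn when the buffers do not fit.
    if buffer_bytes > free_bytes:
        warnings.warn("Buffers do not fit in the remaining GPU memory.", UserWarning)


class BufferCheckTest(unittest.TestCase):
    def test_warns_when_buffers_do_not_fit(self):
        # assertWarns fails the test if no matching warning is emitted.
        with self.assertWarns(UserWarning):
            check_buffers(buffer_bytes=17, free_bytes=16)

    def test_silent_when_buffers_fit(self):
        with warnings.catch_warnings():
            # Turn any warning into an exception so a stray warning fails the test.
            warnings.simplefilter("error")
            check_buffers(buffer_bytes=8, free_bytes=16)
```

Because the check emits a warning rather than raising, no environment variable is needed to make the behavior testable.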

muellerzr · 2024-03-01T13:30:18Z

Also, there's no need to force-push; we squash the commit history when merging.

SunMarc · 2024-03-01T15:50:19Z

> Now I added a check to see if the buffer can fit any GPU, and only print the warning when every GPU does not fit. Will this work in your scenario?

Thanks! Let's do that first and see the feedback. We can always change this behavior if needed. Most probably, users will receive this warning if all their GPUs are full and some of the layers are offloaded to CPU/disk. Hence, the buffers won't fit on any of the GPUs.

SunMarc

Thanks for the iteration. I left a few comments. Could you also add tests that trigger the warning for single- and multi-GPU setups? We can run the multi-GPU test if needed!

src/accelerate/utils/modeling.py

notsyncing · 2024-03-02T12:33:52Z

@SunMarc I have added two test cases in test_modeling_utils.py and fixed a problem where my previous code did not account for the memory reserved for the largest layer. It looks like we don't need an actual multi-GPU setup to test this.
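
To illustrate the largest-layer fix mentioned above, here is a rough sketch under stated assumptions. The helper `remaining_gpu_bytes` and its parameters are hypothetical, not part of accelerate. It assumes that, when some layers are offloaded, room for the largest layer is held back on the GPU so that an offloaded layer can be brought there to run.

```python
def remaining_gpu_bytes(max_bytes, used_bytes, largest_layer_bytes, has_offloaded_layers):
    """Bytes left on one GPU for buffers (hypothetical helper).

    `max_bytes` is the memory budget for the GPU and `used_bytes` is what the
    modules assigned to it already take. The reservation for the largest
    layer is an assumption of this sketch, applied only when offloading occurs.
    """
    reserved = largest_layer_bytes if has_offloaded_layers else 0
    # Clamp at zero so an over-committed GPU reports no free space.
    return max(max_bytes - used_bytes - reserved, 0)


GB = 1024 ** 3
# 16GB budget, 10GB of modules, 2GB largest layer, some layers offloaded.
free = remaining_gpu_bytes(16 * GB, 10 * GB, 2 * GB, has_offloaded_layers=True)
print(free // GB)  # 4
```

Ignoring the reservation would report 6GB free in this example, so buffers between 4GB and 6GB would wrongly appear to fit.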

* Some models, like TheBloke/WizardCoder-33B-V1.1-GPTQ, contain huge buffers, which may cause OOM on the GPU if offload_buffers is not used. This commit adds a check for such cases.

SunMarc

Thanks for working on this PR and adding the tests ! I ran them and they passed. I left a couple of minor comments/suggestions. We can merge this PR after fixing these =)

src/accelerate/utils/modeling.py

tests/test_modeling_utils.py

notsyncing · 2024-03-06T07:45:00Z

> Thanks for working on this PR and adding the tests ! I ran them and they passed. I left a couple of minor comments/suggestions. We can merge this PR after fixing these =)

@SunMarc all fixed.

SunMarc

Thanks for iterating ! LGTM. I re-opened a comment I made. Could you have a quick look ? Also, can you have a second look @muellerzr ?

notsyncing · 2024-03-08T04:23:40Z

> Thanks for iterating ! LGTM. I re-opened a comment I made. Could you have a quick look ? Also, can you have a second look @muellerzr ?

Added the missing assertions. I did not notice them before, sorry 😂

muellerzr

Thanks! Overall this looks sound to me :)

notsyncing changed the title ~~Check if the buffers fits GPU memory after device map auto inferred~~ Check if the buffers fit GPU memory after device map auto inferred Feb 2, 2024

notsyncing force-pushed the buffer-size-check branch from 60b749c to fb87eb1 Compare February 2, 2024 06:57

muellerzr requested a review from SunMarc February 2, 2024 13:00

notsyncing force-pushed the buffer-size-check branch from fb87eb1 to a3e5732 Compare February 4, 2024 01:13

notsyncing marked this pull request as ready for review February 4, 2024 02:21

notsyncing force-pushed the buffer-size-check branch from a3e5732 to a1eac15 Compare February 10, 2024 05:16

SunMarc reviewed Feb 14, 2024

notsyncing force-pushed the buffer-size-check branch from a1eac15 to 0e4c945 Compare March 1, 2024 04:43

SunMarc reviewed Mar 1, 2024

notsyncing force-pushed the buffer-size-check branch from 0e4c945 to 5a3dc39 Compare March 2, 2024 12:30

notsyncing force-pushed the buffer-size-check branch from 5a3dc39 to 15f6544 Compare March 2, 2024 12:52

muellerzr requested a review from SunMarc March 4, 2024 14:10

notsyncing force-pushed the buffer-size-check branch from 15f6544 to 01c7a56 Compare March 5, 2024 00:39

Check if the buffers fit GPU memory after device map auto inferred

2a278f4

notsyncing force-pushed the buffer-size-check branch from 01c7a56 to 2a278f4 Compare March 5, 2024 00:40

SunMarc reviewed Mar 5, 2024

Minor refactors.

91ed9de

SunMarc approved these changes Mar 7, 2024

Add missing assertions

f67ac00

muellerzr approved these changes Mar 8, 2024

SunMarc merged commit e3d3242 into huggingface:main Mar 8, 2024
23 checks passed
