
077 autoquant gpt fast #361

Merged: 8 commits merged into main on Jun 21, 2024

Conversation

@HDCharles (Contributor) commented Jun 14, 2024

Summary:

Autoquant wasn't working for the llama benchmarks for a few reasons, the main
ones being that we were doing autoquant logging on prefill rather than decode_one_token
(an issue since the two have different shapes), and that we weren't torch.compiling
prefill, which defeated the whole point of autoquant benchmarking torch.compiled prefill shapes.

To fix this, autoquant needed new functionality: an option to not automatically
end logging after a single model.forward call. The manual flag now controls whether
you have to call model.finalize_autoquant() yourself once logging is done, or
whether finalization happens automatically after a single forward run.
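A minimal sketch of the two modes this describes, assuming the top-level torchao.autoquant entry point and the manual / finalize_autoquant names above; the toy model, shapes, and dtype are placeholders rather than the llama benchmark code:

import torch
import torch.nn as nn
import torchao

def toy_model():
    # placeholder stand-in for the benchmarked llama model
    return nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 64)).cuda().to(torch.bfloat16)

# default: shape logging ends after one forward pass, then benchmarking and
# quantization happen automatically
model = torchao.autoquant(toy_model())
model(torch.randn(1, 8, 64, device="cuda", dtype=torch.bfloat16))

# manual=True: shapes keep being logged across forwards (e.g. prefill-shaped and
# decode-shaped inputs) until finalize_autoquant() is called explicitly
model = torchao.autoquant(toy_model(), manual=True)
model(torch.randn(1, 8, 64, device="cuda", dtype=torch.bfloat16))   # prefill-like
model(torch.randn(1, 1, 64, device="cuda", dtype=torch.bfloat16))   # decode_one_token-like
model.finalize_autoquant()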

A few other small fixes were also made:

  1. Moved where generate.py resets CUDA memory stats so they aren't confounded
     with torch.compile memory usage
  2. Updated the README with new numbers
  3. Improved the autoquant docstring
  4. Reordered benchmarks so they match what's in the README
  5. Cleaned up a few autoquant multi-shape printing bugs

Test Plan: sh benchmarks.sh

python test_integration.py -k "test_autoquant_manual"

pytorch-bot (bot) commented Jun 14, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/361

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit c16593e with merge base 6b0ca2d:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot added the CLA Signed label on Jun 14, 2024
Review comments (now resolved) on:
torchao/_models/llama/generate.py
torchao/quantization/autoquant.py
torchao/utils.py
test/integration/test_integration.py
@msaroufim (Member) left a comment:

Just a few more minor pieces of feedback

@parameterized.expand(COMMON_DEVICE_DTYPE)
@unittest.skipIf(not TORCH_VERSION_AFTER_2_3, "autoquant requires 2.3+.")
def test_autoquant_manual(self, device, dtype):
    if device != "cuda" and dtype != torch.bfloat16:
@msaroufim (Member) commented Jun 18, 2024:

So is the idea here to skip if the device is cpu and using bf16? If so, can we flip the negatives to make this clearer?

@HDCharles (Contributor, Author) replied:

Removed this; I copied something from other tests at some point.

Wraps the given model in an AutoQuantWrapper. If `example_input` is provided, performs a forward pass on the input.
Otherwise, returns the wrapped model. The AutoQuantWrapper manages cases where the model is torch-compiled by first
performing autoquantization on the original model and then allowing the torch.compile run/tracing to occur.
Autoquantization is a process which identifies the fastest way to quantize each layer of a model over some set of potential
@msaroufim (Member) commented Jun 18, 2024:

Cool, this helped quite a bit. Can you also make sure it renders correctly here: https://github.com/pytorch/ao/tree/main/docs/source
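For reference, a minimal sketch of the example_input path the docstring quoted above describes; the toy model, shapes, and dtype here are placeholders, and this assumes the example_input keyword exactly as named in that docstring:

import torch
import torch.nn as nn
import torchao

# toy stand-in model; autoquant targets the nn.Linear layers inside it
m = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 64)).cuda().to(torch.bfloat16)
x = torch.randn(1, 8, 64, device="cuda", dtype=torch.bfloat16)

# wrapping plus a forward pass in one call: the wrapper logs the shape of x,
# then benchmarking/quantization run and the quantized model is returned
m = torchao.autoquant(m, example_input=x)

Without example_input, the same finalization happens on the first model(...) call instead (or on finalize_autoquant() when manual=True).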

fixing peak memory stats for benchmark

Summary: we were hitting the peak upon model load, not during model runtime;
this is an issue since users can load the model to cpu/meta, which significantly
reduces memory usage during model load/quant.

Test Plan: sh benchmarks.sh
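
A hedged sketch of the measurement pattern this fix implies: reset the CUDA peak-memory counter only after the model has been loaded (and compiled/quantized), so the reported peak reflects runtime rather than load. The model and loop below are toy stand-ins, not the generate.py code:

import torch
import torch.nn as nn

# toy stand-ins for "load the model" and "run the benchmark loop"
model = nn.Linear(4096, 4096, device="cuda", dtype=torch.bfloat16)

torch.cuda.reset_peak_memory_stats()               # start peak tracking after load/compile
with torch.no_grad():
    for _ in range(10):                            # stand-in for the generate loop
        model(torch.randn(1, 4096, device="cuda", dtype=torch.bfloat16))

peak_gb = torch.cuda.max_memory_allocated() / 1e9  # now reflects runtime, not load, memory
print(f"peak runtime memory: {peak_gb:.2f} GB")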

Autoquantization work for benchmarks

Summary:

Autoquant wasn't working for the llama benchmarks for a few reasons, the main
one being that we were doing logging on prefill rather than decode_one_token. We
also weren't torch.compiling prefill, which defeated the whole point of
autoquant benchmarking torch.compiled prefill shapes.

To fix this, autoquant needed new functionality: an option to not automatically
end logging after a single model.forward call. The manual_do_autoquant flag now
controls whether you have to call model.do_autoquant() yourself once logging is
done, or whether it happens automatically after a single forward run.

A few other small fixes were also made:
1) Moved where generate.py resets CUDA memory stats so they aren't confounded
   with torch.compile memory usage
2) Updated the README with new numbers
3) Improved the autoquant docstring
4) Reordered benchmarks so they match what's in the README

Test Plan: sh benchmarks.sh

python test_integration.py -k "test_autoquant_manual"

updating api name and improving docstrings

oops missed a few manual_do_autoquant -> manual

Test Plan: sh benchmarks.sh

fix forward_log_only

improving test conditions

Comment on lines 485 to 488
torch.autoquant(model, manual=True)
model(*example_input1)
model(*example_input2)
model.do_autoquant()
@jerryzh168 (Contributor) commented Jun 18, 2024:

Having both autoquant and do_autoquant seems a bit confusing.

Also, can do_autoquant (maybe with a different name) also be a function like autoquant?

@jerryzh168 (Contributor) left a comment:

Looks good overall, just the do_autoquant API feels a bit weird; I think we can make it more intuitive.

Comment on lines 208 to 220
model = autoquant(model, manual=True)

generate(
    model,
    encode_tokens(tokenizer, prompt, bos=True, device=device),
    max_new_tokens,
    interactive=False,
    temperature=temperature,
    top_k=top_k,
)

# do autoquantization
model.do_autoquant()
A contributor commented:

Can you also comment on why this has to autoquant in this way?

@HDCharles (Contributor, Author) replied:

Because we optimize for the shapes autoquant sees during shape calibration, we have to run the full generate loop, which means we need a way to manually end shape calibration and start benchmarking/quantization.

@HDCharles (Contributor, Author) replied:

We need to do shape calibration with the actual shapes used by generate, so we set up autoquant, tell it to wait until we manually end shape calibration, run generate so it logs the correct shapes, and then call do_autoquant to actually run the benchmarks with the shapes we've logged.

@jerryzh168 self-requested a review June 19, 2024 00:20

fixing nits

@HDCharles (Contributor, Author) commented:

Looks good overall, just the do_autoquant API feels a bit weird; I think we can make it more intuitive.

What terms would you use? The difficulty is that we want the API to stand on its own, but also work in a manual mode.

So:

torchao.autoquant(model)

seems fine

torchao.autoquant(model, manual=True)

seems OK; maybe manual could be different? It doesn't seem super off though: the flag means 'prolong shape calibration over multiple inputs (rather than ending after a single input and doing the benchmarks + quantization), and the user has to manually end shape calibration'.

Other terms could be 'manual_shape_calibration_end', 'multi_input', 'defer_finalization'... of these, manual seems like the best of a bad bunch, tbh.

Lastly there's:

model.do_autoquant()

which ends shape calibration, benchmarks the calibrated shapes, picks the best option for each layer, and then quantizes the layers. This could be 'finalize', 'finalize_autoquant', or something along those lines. This is the step where autoquantization actually happens, though, so do_autoquant is literal; it's just that, when manual=True, the original autoquant API is less than completely accurate, but I don't see a good way around that.
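
To make that step concrete, here is a toy mock of the selection logic do_autoquant / finalize_autoquant is described as performing (time each candidate on the logged shapes, keep the fastest); the candidates below are plain copies of one linear layer rather than real quantized variants, and this is not torchao's implementation:

import copy
import time

import torch
import torch.nn as nn

def pick_fastest(candidates, logged_inputs):
    # Illustrative only: time each candidate version of a layer on the logged
    # shapes and return the name of the fastest one.
    timings = {}
    with torch.no_grad():
        for name, layer in candidates.items():
            start = time.perf_counter()
            for x in logged_inputs:
                layer(x)
            timings[name] = time.perf_counter() - start
    return min(timings, key=timings.get)

base = nn.Linear(64, 64)
candidates = {
    "unquantized": base,
    "candidate_a": copy.deepcopy(base),  # stand-ins for e.g. int8/int4 variants
    "candidate_b": copy.deepcopy(base),
}
# one prefill-like and one decode-like logged shape
logged_inputs = [torch.randn(4, 8, 64), torch.randn(4, 1, 64)]
print("fastest:", pick_fastest(candidates, logged_inputs))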

final tests and change do_autoquant to finalize_autoquant

@HDCharles merged commit dd35079 into main Jun 21, 2024
13 checks passed
dbyoung18 pushed a commit to dbyoung18/ao that referenced this pull request Jul 31, 2024
* fixing peak memory stats for benchmark
* Autoquantization work for benchmarks
* updating api name and improving docstrings
* oops missed a few manual_do_autoquant -> manual
* fix forward_log_only
* improving test conditions
* fixing nits
* final tests and change do_autoquant to finalize_autoquant
Labels: CLA Signed
4 participants