
Float8 inference examples #732

Closed
wants to merge 81 commits into experimental_float8_aqt from float8-inference-examples

Conversation

jainapurva
Contributor

No description provided.


pytorch-bot bot commented Aug 22, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/732

Note: Links to docs will display an error until the docs builds have been completed.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Aug 22, 2024
@jainapurva jainapurva changed the base branch from main to experimental_float8_aqt August 22, 2024 18:54
msaroufim and others added 24 commits August 25, 2024 14:05
* 1 more doc revamp

* update

* supriya feedback

* upd
* README typos

* Update README.md
The current version is <Version('0.4.0.dev20240827+cu121')>, which compares as smaller than "0.4.0", so the current version should be bumped to 0.5.0 instead.

It failed this check: https://github.com/huggingface/transformers/blob/d47a9e8ce556be790ac98c0a9024dd41c6328fb0/src/transformers/utils/quantization_config.py#L1138
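For reference, a minimal repro of why the check fails, assuming the comparison goes through the `packaging` library (as the linked transformers check does):

```python
# Minimal sketch: PEP 440 ordering of a dev pre-release vs. its release.
from packaging.version import Version

current = Version("0.4.0.dev20240827+cu121")
required = Version("0.4.0")

# A .devN pre-release sorts *before* the corresponding release, so the nightly
# build fails a ">= 0.4.0" requirement even though it is 0.4.0-based.
print(current < required)   # True
print(current >= required)  # False
```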
* Support for Llama 3.1 and kv_cache_quantization

Summary:

This PR adds support for Llama 3.1, along with improvements to kv_cache quantization and general peak-memory performance for Llama.

# summary of changes

1) add Llama 3.1 support
2) change quantized_kv_cache init so it doesn't create a full-precision peak: see below (and the sketch after the table)
3) reorder causal mask init: see below
4) add option for linear causal mask: see below
5) add option for cache_size: the default generate.py behavior requires you to generate 32k tokens if you want a size-32k kv_cache/causal_mask; the cache_size option lets you simply set the cache size while generating fewer tokens, making it easier to benchmark
6) add option to generate a memory profile: used to generate the images below

| context length (tokens) | normal peak (GB) | kv_quant peak (GB) | kv quant+causal fix peak (GB) |
|-------------------------|------------------|--------------------|-------------------------------|
|                    8192 |            17.86 |              17.52 |                         17.47 |
|                   16384 |            19.81 |              18.75 |                         18.48 |
|                   32768 |            23.83 |              21.72 |                         20.64 |
|                   65536 |             33.5 |              29.54 |                         25.24 |
|                  131072 |            59.27 |              52.62 |                         34.18 |
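As an aside, a rough sketch of the init change in (2), with hypothetical names (not the actual generate.py / torchao code): allocating the cache buffers in int8 from the start avoids ever materializing a full-precision k/v cache.

```python
import torch

class QuantizedKVCacheSketch(torch.nn.Module):
    """Illustrative only: the quantized cache is created directly in int8,
    rather than allocating a bf16 cache first and quantizing it afterwards,
    so there is no transient full-precision peak."""
    def __init__(self, batch, n_heads, max_seq_len, head_dim, device="cuda"):
        super().__init__()
        shape = (batch, n_heads, max_seq_len, head_dim)
        self.register_buffer("k_int8", torch.zeros(shape, dtype=torch.int8, device=device))
        self.register_buffer("v_int8", torch.zeros(shape, dtype=torch.int8, device=device))
        # per-token scales for dequantizing on read
        self.register_buffer("k_scale", torch.ones(batch, n_heads, max_seq_len, 1, device=device))
        self.register_buffer("v_scale", torch.ones(batch, n_heads, max_seq_len, 1, device=device))
```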

tests:

see benchmarks.sh

Reviewers:

Subscribers:

Tasks:

Tags:

* further kv_cache investigation

Summary:

Test Plan:

Reviewers:

Subscribers:

Tasks:

Tags:

* memory fixes for kv_cache quant

Summary:

Test Plan:

Reviewers:

Subscribers:

Tasks:

Tags:

* fix benchmarks

Summary:

Test Plan:

Reviewers:

Subscribers:

Tasks:

Tags:
* bayesian optimization tool for mixed precision quantization

* refactor

* code refactor

* integer type parameters

* improve multi-process

* fix a bug in symmetric quant

* refactor BO optimize for model accuracy

* add BO for inference speed optimization

* rename BO for inference speed

* refactor code

* add utils

* add some TODOs

* renamed BO scripts

* renamed to BO_acc_throughput

* add TODOs
Summary: The recent refactor into tensor subclasses (#585) broke
some existing use cases that rely on DDP and FSDP1, since the
new flow only supports FSDP2 currently. This commit adds back
the module swap API for now to provide a backdoor for these
use cases. In the long term, we still plan to deprecate the
module swap flow.

Test Plan:
python test/quantization/test_qat.py -k test_qat_8da4w_quantizer_module_swap
python test/quantization/test_qat.py -k test_qat_4w_quantizer_module_swap

Reviewers: jerryzh168, msaroufim

Subscribers: jerryzh168, msaroufim
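For context, a minimal sketch of the module-swap pattern being restored here (hypothetical class names, not the exact torchao API): every nn.Linear is replaced in place by a fake-quantized linear, which keeps the model a plain module tree that DDP/FSDP1 can wrap.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FakeQuantLinearSketch(nn.Linear):
    """Illustrative fake-quantized linear: weights are round-tripped through a
    coarse int4-style quantizer on each forward. A real QAT implementation
    would use a straight-through estimator and groupwise scales."""
    def forward(self, x):
        scale = self.weight.abs().amax(dim=1, keepdim=True).clamp(min=1e-6) / 7.0
        w_fq = torch.clamp(torch.round(self.weight / scale), -8, 7) * scale
        return F.linear(x, w_fq, self.bias)

def swap_linears_(module: nn.Module) -> nn.Module:
    """Recursively replace nn.Linear children with the fake-quantized version."""
    for name, child in module.named_children():
        if isinstance(child, nn.Linear):
            new = FakeQuantLinearSketch(child.in_features, child.out_features,
                                        bias=child.bias is not None)
            new.weight.data.copy_(child.weight.data)
            if child.bias is not None:
                new.bias.data.copy_(child.bias.data)
            setattr(module, name, new)
        else:
            swap_linears_(child)
    return module
```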
Summary:
We want to add quant_llm to affine quantized tensor as a general fp2-fp7 dtype;
before that, we need to refactor the current implementation to work with AffineQuantizedTensor first.
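As a back-of-envelope illustration of what a general fp2-fp7 dtype covers, assuming the usual ExMy minifloat layout with no inf/nan encodings reserved (a hedged sketch, not the quant_llm kernels themselves):

```python
def fpx_max(ebits: int, mbits: int) -> float:
    """Largest representable magnitude of a 1 + ebits + mbits minifloat
    when the all-ones exponent field still encodes a normal number."""
    bias = 2 ** (ebits - 1) - 1
    max_exp = (2 ** ebits - 1) - bias
    max_mantissa = 2.0 - 2.0 ** (-mbits)  # 1.11...1 in binary
    return max_mantissa * 2.0 ** max_exp

print(fpx_max(3, 2))  # e.g. a 6-bit E3M2 format -> 28.0
print(fpx_max(2, 2))  # e.g. a 5-bit E2M2 format -> 7.0
```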

Test Plan:
python test/prototype/test_quant_llm.py

Reviewers:

Subscribers:

Tasks:

Tags:
Revert "Refactor quant_llm to work with affine quantized tensor (#696)"

This reverts commit 0fed444.
* more empathy fixes

* Update README.md

* Update README.md

* Update README.md
Revert "more empathy fixes (#759)"

This reverts commit d916b9b.
Differential Revision: D61501686

Pull Request resolved: #766
Differential Revision: D61744019

Pull Request resolved: #773
Summary:

This is useful for things such as:
* activation_with_bounded_range -> linear (can set static scale to
  activation range)
* bounding weight scales to known quantities if the modeling user
  can guarantee their magnitude throughout training

We don't have signal yet that this is useful for production things,
but it would be good to land this to enable easy experimentation.
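To make the first bullet concrete, a hedged sketch of the idea (the scaling convention here, x_fp8 = x * scale, is illustrative rather than the exact float8 API):

```python
import torch

FP8_E4M3_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0

def static_scale_for_bounded_activation(known_amax: float) -> torch.Tensor:
    # For a sigmoid feeding the linear, known_amax = 1.0, so the scale is a
    # constant and no runtime amax reduction (or distributed sync) is needed.
    return torch.tensor(FP8_E4M3_MAX / known_amax)

scale = static_scale_for_bounded_activation(1.0)
x = torch.sigmoid(torch.randn(16, 32))
x_fp8 = (x * scale).to(torch.float8_e4m3fn)    # quantize with the static scale
x_ref = x_fp8.to(torch.float32) / scale        # dequantize for comparison
```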

Test Plan:

Unit and integration tests pass:
```
./test/test_everything.sh
# note that there is a failure in `test_fsdp2.py` which is present on main
```

Use float8 profiling script to see GPU kernel time go down as we
enable static scaling on a toy model:
https://gist.github.com/vkuzo/b2cf46f7cccb691125566873859ca39d

Reviewers:

Subscribers:

Tasks:

Tags:
* add readme and code refactor

* edit readme

* update README
Differential Revision: D60867909

Pull Request resolved: #774
… (#772)

* [reland] Refactor quant_llm to work with affine quantized tensor (#696)

Summary:
We want to add quant_llm to affine quantized tensor as a general fp2-fp7 dtype;
before that, we need to refactor the current implementation to work with AffineQuantizedTensor first.

Test Plan:
python test/prototype/test_quant_llm.py

Reviewers:

Subscribers:

Tasks:

Tags:
* mixin

* fix memory being held by autograd
ghstack-source-id: 34fe56595eac4d1a2fecb07a230307c0b2b767d7
Pull Request resolved: #688
* add Llama3.1-8B finetune bench

* update doc

* Update README.md

---------

Co-authored-by: Mark Saroufim <marksaroufim@gmail.com>
…damFp8 torch requirement (#755)

* update doc on torch version

* update doc

* update

* fix 4-bit problem

* update doc

* update
@jainapurva jainapurva closed this Sep 6, 2024
@jainapurva jainapurva deleted the float8-inference-examples branch September 9, 2024 23:08