
Float8 inference examples #732

Closed
wants to merge 81 commits into experimental_float8_aqt from float8-inference-examples

Conversation

jainapurva
Contributor

No description provided.


pytorch-bot bot commented Aug 22, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/732

Note: Links to docs will display an error until the docs builds have been completed.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Aug 22, 2024
@jainapurva jainapurva changed the base branch from main to experimental_float8_aqt August 22, 2024 18:54
msaroufim and others added 24 commits August 25, 2024 14:05
* 1 more doc revamp

* update

* supriya feedback

* upd
* README typos

* Update README.md
The current version is <Version('0.4.0.dev20240827+cu121')>, which compares as smaller than "0.4.0", so the current version should be bumped to 0.5.0 instead.

It failed this check: https://github.com/huggingface/transformers/blob/d47a9e8ce556be790ac98c0a9024dd41c6328fb0/src/transformers/utils/quantization_config.py#L1138
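For reference, a minimal repro of why the check fails, assuming the comparison goes through the `packaging` library (as the linked transformers check does):

```python
# Minimal sketch: PEP 440 ordering of a dev pre-release vs. its release.
from packaging.version import Version

current = Version("0.4.0.dev20240827+cu121")
required = Version("0.4.0")

# A .devN pre-release sorts *before* the corresponding release, so the nightly
# build fails a ">= 0.4.0" requirement even though it is 0.4.0-based.
print(current < required)   # True
print(current >= required)  # False
```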
* Support for Llama 3.1 and kv_cache_quantization

Summary:

This PR adds support for Llama 3.1, along with improvements to kv_cache quantization and general peak-memory performance for Llama.

# summary of changes

1) add Llama 3.1 support
2) change quantized_kv_cache init so it doesn't create a full-precision peak: see below (and the sketch after the table)
3) reorder causal mask init: see below
4) add option for linear causal mask: see below
5) add option for cache_size: the default generate.py behavior requires you to generate 32k tokens if you want a size-32k kv_cache/causal_mask; the cache_size option lets you simply set the cache size while generating fewer tokens, making it easier to benchmark
6) add option to generate a memory profile: used to generate the images below

| context length (tokens) | normal peak (GB) | kv_quant peak (GB) | kv quant+causal fix peak (GB) |
|-------------------------|------------------|--------------------|-------------------------------|
|                    8192 |            17.86 |              17.52 |                         17.47 |
|                   16384 |            19.81 |              18.75 |                         18.48 |
|                   32768 |            23.83 |              21.72 |                         20.64 |
|                   65536 |             33.5 |              29.54 |                         25.24 |
|                  131072 |            59.27 |              52.62 |                         34.18 |
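As an aside, a rough sketch of the init change in (2), with hypothetical names (not the actual generate.py / torchao code): allocating the cache buffers in int8 from the start avoids ever materializing a full-precision k/v cache.

```python
import torch

class QuantizedKVCacheSketch(torch.nn.Module):
    """Illustrative only: the quantized cache is created directly in int8,
    rather than allocating a bf16 cache first and quantizing it afterwards,
    so there is no transient full-precision peak."""
    def __init__(self, batch, n_heads, max_seq_len, head_dim, device="cuda"):
        super().__init__()
        shape = (batch, n_heads, max_seq_len, head_dim)
        self.register_buffer("k_int8", torch.zeros(shape, dtype=torch.int8, device=device))
        self.register_buffer("v_int8", torch.zeros(shape, dtype=torch.int8, device=device))
        # per-token scales for dequantizing on read
        self.register_buffer("k_scale", torch.ones(batch, n_heads, max_seq_len, 1, device=device))
        self.register_buffer("v_scale", torch.ones(batch, n_heads, max_seq_len, 1, device=device))
```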

tests:

see benchmarks.sh

Reviewers:

Subscribers:

Tasks:

Tags:

* further kv_cache investigation

Summary:

Test Plan:

Reviewers:

Subscribers:

Tasks:

Tags:

* memory fixes for kv_cache quant

Summary:

Test Plan:

Reviewers:

Subscribers:

Tasks:

Tags:

* fix benchmarks

Summary:

Test Plan:

Reviewers:

Subscribers:

Tasks:

Tags:
* bayesian optimization tool for mixed precision quantization

* refactor

* code refactor

* integer type parameters

* improve multi-process

* fix a bug in symmetric quant

* refactor BO optimize for model accuracy

* add BO for inference speed optimization

* rename BO for inference speed

* refactor code

* add utils

* add some TODOs

* renamed BO scripts

* renamed to BO_acc_throughput

* add TODOs
Summary: The recent refactor into tensor subclasses (#585) broke
some existing use cases that rely on DDP and FSDP1, since the
new flow only supports FSDP2 currently. This commit adds back
the module swap API for now to provide a backdoor for these
use cases. In the long term, we still plan to deprecate the
module swap flow.

Test Plan:
python test/quantization/test_qat.py -k test_qat_8da4w_quantizer_module_swap
python test/quantization/test_qat.py -k test_qat_4w_quantizer_module_swap

Reviewers: jerryzh168, msaroufim

Subscribers: jerryzh168, msaroufim
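For context, a minimal sketch of the module-swap pattern being restored here (hypothetical class names, not the exact torchao API): every nn.Linear is replaced in place by a fake-quantized linear, which keeps the model a plain module tree that DDP/FSDP1 can wrap.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FakeQuantLinearSketch(nn.Linear):
    """Illustrative fake-quantized linear: weights are round-tripped through a
    coarse int4-style quantizer on each forward. A real QAT implementation
    would use a straight-through estimator and groupwise scales."""
    def forward(self, x):
        scale = self.weight.abs().amax(dim=1, keepdim=True).clamp(min=1e-6) / 7.0
        w_fq = torch.clamp(torch.round(self.weight / scale), -8, 7) * scale
        return F.linear(x, w_fq, self.bias)

def swap_linears_(module: nn.Module) -> nn.Module:
    """Recursively replace nn.Linear children with the fake-quantized version."""
    for name, child in module.named_children():
        if isinstance(child, nn.Linear):
            new = FakeQuantLinearSketch(child.in_features, child.out_features,
                                        bias=child.bias is not None)
            new.weight.data.copy_(child.weight.data)
            if child.bias is not None:
                new.bias.data.copy_(child.bias.data)
            setattr(module, name, new)
        else:
            swap_linears_(child)
    return module
```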
Summary:
We want to add quant_llm to affine quantized tensor as a general fp2-fp7 dtype;
before that, we need to refactor the current implementation to work with AffineQuantizedTensor first.
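As a back-of-envelope illustration of what a general fp2-fp7 dtype covers, assuming the usual ExMy minifloat layout with no inf/nan encodings reserved (a hedged sketch, not the quant_llm kernels themselves):

```python
def fpx_max(ebits: int, mbits: int) -> float:
    """Largest representable magnitude of a 1 + ebits + mbits minifloat
    when the all-ones exponent field still encodes a normal number."""
    bias = 2 ** (ebits - 1) - 1
    max_exp = (2 ** ebits - 1) - bias
    max_mantissa = 2.0 - 2.0 ** (-mbits)  # 1.11...1 in binary
    return max_mantissa * 2.0 ** max_exp

print(fpx_max(3, 2))  # e.g. a 6-bit E3M2 format -> 28.0
print(fpx_max(2, 2))  # e.g. a 5-bit E2M2 format -> 7.0
```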

Test Plan:
python test/prototype/test_quant_llm.py

Reviewers:

Subscribers:

Tasks:

Tags:
Revert "Refactor quant_llm to work with affine quantized tensor (#696)"

This reverts commit 0fed444.
* more empathy fixes

* Update README.md

* Update README.md

* Update README.md
Revert "more empathy fixes (#759)"

This reverts commit d916b9b.
Differential Revision: D61501686

Pull Request resolved: #766
Differential Revision: D61744019

Pull Request resolved: #773
Summary:

This is useful for things such as:
* activation_with_bounded_range -> linear (can set static scale to
  activation range)
* bounding weight scales to known quantities if the modeling user
  can guarantee their magnitude throughout training

We don't have signal yet that this is useful for production things,
but it would be good to land this to enable easy experimentation.
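To make the first bullet concrete, a hedged sketch of the idea (the scaling convention here, x_fp8 = x * scale, is illustrative rather than the exact float8 API):

```python
import torch

FP8_E4M3_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0

def static_scale_for_bounded_activation(known_amax: float) -> torch.Tensor:
    # For a sigmoid feeding the linear, known_amax = 1.0, so the scale is a
    # constant and no runtime amax reduction (or distributed sync) is needed.
    return torch.tensor(FP8_E4M3_MAX / known_amax)

scale = static_scale_for_bounded_activation(1.0)
x = torch.sigmoid(torch.randn(16, 32))
x_fp8 = (x * scale).to(torch.float8_e4m3fn)    # quantize with the static scale
x_ref = x_fp8.to(torch.float32) / scale        # dequantize for comparison
```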

Test Plan:

Unit and integration tests pass:
```
./test/test_everything.sh
# note that there is a failure in `test_fsdp2.py` which is present on main
```

Use float8 profiling script to see GPU kernel time go down as we
enable static scaling on a toy model:
https://gist.github.com/vkuzo/b2cf46f7cccb691125566873859ca39d

Reviewers:

Subscribers:

Tasks:

Tags:
* add readme and code refactor

* edit readme

* update README
Differential Revision: D60867909

Pull Request resolved: #774
… (#772)

* [reland] Refactor quant_llm to work with affine quantized tensor (#696)

Summary:
We want to add quant_llm to affine quantized tensor as a general fp2-fp7 dtype;
before that, we need to refactor the current implementation to work with AffineQuantizedTensor first.

Test Plan:
python test/prototype/test_quant_llm.py

Reviewers:

Subscribers:

Tasks:

Tags:
* mixin

* fix memory being held by autograd
ghstack-source-id: 34fe56595eac4d1a2fecb07a230307c0b2b767d7
Pull Request resolved: #688
* add Llama3.1-8B finetune bench

* update doc

* Update README.md

---------

Co-authored-by: Mark Saroufim <marksaroufim@gmail.com>
…damFp8 torch requirement (#755)

* update doc on torch version

* update doc

* update

* fix 4-bit problem

* update doc

* update
@jainapurva jainapurva closed this Sep 6, 2024
@jainapurva jainapurva deleted the float8-inference-examples branch September 9, 2024 23:08