Float8 inference examples #732
Closed
Conversation
No description provided.
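Since the PR carries no description, here is a minimal sketch of what a float8 inference example with torchao might look like, using the quantize_ API with float8 weight-only quantization. This is an illustration, not the PR's actual example code; config names and hardware requirements vary across torchao versions.

```python
# A minimal sketch (not this PR's actual example) of float8 weight-only
# inference with torchao's quantize_ API. Config names and hardware
# requirements vary across torchao versions.
import torch
from torchao.quantization import quantize_, float8_weight_only

model = torch.nn.Sequential(torch.nn.Linear(1024, 1024)).half().cuda()
quantize_(model, float8_weight_only())  # store weights as float8_e4m3fn

with torch.inference_mode():
    x = torch.randn(8, 1024, dtype=torch.half, device="cuda")
    y = model(x)
```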
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/732. Note: links to docs will display an error until the docs builds have completed. This comment was automatically generated by Dr. CI and updates every 15 minutes.
facebook-github-bot added the CLA Signed label on Aug 22, 2024. (This label is managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed.)
add default
* 1 more doc revamp
* update
* supriya feedback
* upd
* README typos
* Update README.md
Make developer experience better
Bump the version: the current version, Version('0.4.0.dev20240827+cu121'), compares as smaller than "0.4.0" under PEP 440 ordering (dev releases sort before their final release), so it failed this check in transformers: https://github.com/huggingface/transformers/blob/d47a9e8ce556be790ac98c0a9024dd41c6328fb0/src/transformers/utils/quantization_config.py#L1138. The current version should therefore be updated to 0.5.0 instead.
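A minimal sketch of the PEP 440 ordering behind the failing check: a dev release of 0.4.0 sorts before 0.4.0 itself, so the nightly fails a ">= 0.4.0" requirement until the version is bumped.

```python
# PEP 440 ordering, as implemented by the packaging library: a dev release
# of 0.4.0 sorts *before* 0.4.0 itself, so the nightly fails the check.
from packaging.version import Version

nightly = Version("0.4.0.dev20240827+cu121")
print(nightly < Version("0.4.0"))                              # True -> check fails
print(Version("0.5.0.dev20240827+cu121") >= Version("0.4.0"))  # True after the bump
```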
* Support for Llama 3.1 and kv_cache_quantization

Summary: this PR adds support for Llama 3.1 plus some improvements to kv_cache quantization and general peak memory performance for llama.

Summary of changes:
1) add 3.1 support for llama
2) change quantized_kv_cache init so it doesn't create a full precision peak (see below; a sketch follows this commit entry)
3) reorder causal mask init (see below)
4) add option for linear causal mask (see below)
5) add option for cache_size: the default generate.py behavior requires you to generate 32k tokens if you want a size-32k kv_cache/causal_mask; the cache_size option lets you simply set the cache size while generating a smaller number of tokens, making benchmarking easier
6) add option to generate a memory profile: used to generate the numbers below

| context length (tokens) | normal peak (GB) | kv_quant peak (GB) | kv quant + causal fix peak (GB) |
|-------------------------|------------------|--------------------|---------------------------------|
| 8192                    | 17.86            | 17.52              | 17.47                           |
| 16384                   | 19.81            | 18.75              | 18.48                           |
| 32768                   | 23.83            | 21.72              | 20.64                           |
| 65536                   | 33.5             | 29.54              | 25.24                           |
| 131072                  | 59.27            | 52.62              | 34.18                           |

Tests: see benchmarks.sh

* further kv_cache investigation
* memory fixes for kv_cache quant
* fix benchmarks
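A hypothetical sketch of the init change in (2): allocating the quantized cache buffers directly in int8, rather than materializing a full-precision cache and quantizing it, removes the transient full-precision memory peak. All names below are illustrative, not the PR's actual code.

```python
# Hypothetical sketch of change (2): allocate quantized kv_cache buffers
# directly in int8 instead of building a full-precision cache first and
# then quantizing it, avoiding a transient full-precision memory peak.
import torch

class QuantizedKVCache(torch.nn.Module):
    def __init__(self, batch, n_heads, seq_len, head_dim):
        super().__init__()
        shape = (batch, n_heads, seq_len, head_dim)
        # int8 values allocated directly: no bf16/fp16 cache ever exists
        self.register_buffer("k_vals", torch.zeros(shape, dtype=torch.int8))
        self.register_buffer("v_vals", torch.zeros(shape, dtype=torch.int8))
        # per-position scales used to dequantize on read
        self.register_buffer("k_scales", torch.ones(batch, n_heads, seq_len, 1))
        self.register_buffer("v_scales", torch.ones(batch, n_heads, seq_len, 1))

# the cache size can now be set independently of how many tokens are generated
cache = QuantizedKVCache(batch=1, n_heads=32, seq_len=8192, head_dim=128)
```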
* bayesian optimization tool for mixed precision quantization
* refactor
* code refactor
* integer type parameters
* improve multi-process
* fix a bug in symmetric quant
* refactor BO optimize for model accuracy
* add BO for inference speed optimization
* rename BO for inference speed
* refactor code
* add utils
* add some TODOs
* renamed BO scripts
* renamed to BO_acc_throughput
* add TODOs
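As a rough illustration of the tool's idea, here is a minimal Bayesian-optimization sketch over per-layer bit widths, using scikit-optimize as an assumed backend (the PR's tool may use a different one); evaluate() is a hypothetical stand-in for a real accuracy or throughput measurement.

```python
# A rough sketch of Bayesian optimization over per-layer bit widths, with
# scikit-optimize assumed as the BO backend. evaluate() is a hypothetical
# stand-in for a real accuracy/throughput run.
from skopt import gp_minimize
from skopt.space import Categorical

n_layers = 4
space = [Categorical([2, 4, 8], name=f"bits_layer_{i}") for i in range(n_layers)]

def evaluate(bit_widths):
    # Toy objective: smaller bit widths shrink the model but (by construction
    # here) cost accuracy; a real objective would quantize the model with
    # these widths and measure eval accuracy or tokens/sec.
    size_cost = sum(bit_widths) / (8 * n_layers)
    accuracy_proxy = sum(b / 8 for b in bit_widths) / n_layers
    return size_cost - 2 * accuracy_proxy  # minimize

result = gp_minimize(evaluate, space, n_calls=15, random_state=0)
print("best per-layer bits:", result.x, "objective:", result.fun)
```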
Summary: Recent refactor into tensor subclasses (#585) broke some existing use cases that rely on DDP and FSDP1, since the new flow only supports FSDP2 currently. This commit adds back the module swap API for now to provide a backdoor for these use cases. In the long term, we still plan to deprecate the module swap flow.

Test Plan:
python test/quantization/test_qat.py -k test_qat_8da4w_quantizer_module_swap
python test/quantization/test_qat.py -k test_qat_4w_quantizer_module_swap

Reviewers: jerryzh168, msaroufim
Subscribers: jerryzh168, msaroufim
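For reference, a minimal sketch of the restored module-swap QAT flow, based on the prepare/convert quantizer API in torchao's QAT prototype around this time; the exact import path is an assumption and has moved between torchao versions.

```python
# A minimal sketch of the module-swap QAT flow (prepare/convert). The import
# path below reflects torchao's QAT prototype of this era and is an
# assumption; it has moved in later versions.
import torch
from torchao.quantization.prototype.qat import Int8DynActInt4WeightQATQuantizer

model = torch.nn.Sequential(torch.nn.Linear(256, 256))
quantizer = Int8DynActInt4WeightQATQuantizer()
model = quantizer.prepare(model)   # swap in fake-quantized linear modules
# ... fine-tune here; the module-swap path keeps DDP/FSDP1 working, unlike
# the tensor-subclass flow, which currently supports only FSDP2 ...
model = quantizer.convert(model)   # swap to actually-quantized modules
```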
Summary: We want to add quant_llm to affine quantized tensor as a general fp2-fp7 dtype; before that, we need to refactor the current implementation to work with AffineQuantizedTensor first.

Test Plan: python test/prototype/test_quant_llm.py
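A sketch of where this refactor is headed, assuming torchao's fpx_weight_only config (the affine-quantized-tensor entry point for the fp2-fp7 formats); exact names and availability depend on the torchao version, and the fpx kernels assume a CUDA fp16 model.

```python
# A sketch of what the quant_llm refactor enables, assuming the
# fpx_weight_only config from torchao; names/availability vary by version.
import torch
from torchao.quantization import quantize_, fpx_weight_only

model = torch.nn.Sequential(torch.nn.Linear(256, 256)).half().cuda()
# fp6 here: 1 sign bit + 3 exponent bits + 2 mantissa bits
quantize_(model, fpx_weight_only(3, 2))
```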
* more empathy fixes
* Update README.md
* Update README.md
* Update README.md
Differential Revision: D61501686 Pull Request resolved: #766
Differential Revision: D61744019 Pull Request resolved: #773
Summary: This is useful for things such as:
* activation_with_bounded_range -> linear (can set the static scale to the activation range)
* bounding weight scales to known quantities if the modeling user can guarantee their magnitude throughout training

We don't have signal yet that this is useful for production use cases, but it would be good to land this to enable easy experimentation.

Test Plan: Unit and integration tests pass:
```
./test/test_everything.sh
// note that there is a failure in `test_fsdp2.py` which is present on main
```

Use the float8 profiling script to see GPU kernel time go down as we enable static scaling on a toy model: https://gist.github.com/vkuzo/b2cf46f7cccb691125566873859ca39d
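A minimal sketch of the static-scaling idea for the bounded-activation case, using the standard float8 convention scale = fp8_max / amax; the helper below is illustrative, not this PR's API.

```python
# A minimal sketch of a static float8 scale for a bounded activation,
# using the standard convention scale = fp8_max / amax. The helper is
# illustrative, not this PR's API.
import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn

def static_scale_from_bound(amax_bound: float) -> torch.Tensor:
    # map the known max magnitude onto the float8 representable range
    return torch.tensor(FP8_MAX / amax_bound)

scale = static_scale_from_bound(1.0)  # e.g. sigmoid output is bounded by 1
x = torch.rand(8, 16)
x_fp8 = (x * scale).to(torch.float8_e4m3fn)  # quantize with the static scale
x_back = x_fp8.float() / scale               # dequantize
```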
* add readme and code refactor
* edit readme
* update README
Differential Revision: D60867909 Pull Request resolved: #774
… (#772)
* [reland] Refactor quant_llm to work with affine quantized tensor (#696)

Summary: We want to add quant_llm to affine quantized tensor as a general fp2-fp7 dtype; before that, we need to refactor the current implementation to work with AffineQuantizedTensor first.

Test Plan: python test/prototype/test_quant_llm.py
* mixin
* fix memory being held by autograd
ghstack-source-id: 34fe56595eac4d1a2fecb07a230307c0b2b767d7
Pull Request resolved: #688
* add Llama3.1-8B finetune bench
* update doc
* Update README.md
---------
Co-authored-by: Mark Saroufim <marksaroufim@gmail.com>
…damFp8 torch requirement (#755)
* update doc on torch version
* update doc
* update
* fix 4-bit problem
* update doc
* update
jainapurva force-pushed the float8-inference-examples branch from af718a9 to 9743a6b on September 6, 2024 at 00:53.
Labels: CLA Signed