# Update README #187 (Merged)

**Commits (8):**

* e187343 Update README (msaroufim)
* c798138 yolo (msaroufim)
* 96ba718 yolo (msaroufim)
* a8e5609 Merge branch 'main' into msaroufim/docfixes (msaroufim)
* b36adf3 Update README.md (msaroufim)
* 30627b5 yolo (msaroufim)
* 815c9ad yolo (msaroufim)
* 5ca0574 yolo (msaroufim)

**README.md after this PR:**

# torchao: PyTorch Architecture Optimization

[](discord.gg/cudamode)

This repository is currently under heavy development; if you have suggestions on the API or use cases you'd like covered, please open an [issue](https://github.com/pytorch/ao/issues).

## Introduction
`torchao` is a PyTorch library for quantization and sparsity.

## Get Started

### Installation
`torchao` makes liberal use of several new features in PyTorch, so it's recommended to use it with the current nightly or the latest stable release of PyTorch.

Stable release:
```Shell
pip install torchao
```

Nightly release:
```Shell
pip install torchao-nightly
```

From source:
```Shell
git clone https://github.com/pytorch/ao
cd ao
python setup.py develop
```
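
A quick way to confirm the install worked is a smoke test in Python (a minimal sketch, not part of the README; it assumes only that the package installed under the name `torchao`):

```python
# Post-install smoke test: import the package and report versions.
import importlib.metadata

import torch
import torchao  # should import without raising

print("torchao:", importlib.metadata.version("torchao"))
print("torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
```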

### Quantization

```python
import torch
import torchao

# inductor settings which improve torch.compile performance for quantized modules
torch._inductor.config.force_fuse_int_mm_with_mul = True
torch._inductor.config.use_mixed_mm = True

# Plug in your model and an example input
model = torch.nn.Sequential(torch.nn.Linear(32, 64)).cuda().to(torch.bfloat16)
input = torch.randn(32, 32, dtype=torch.bfloat16, device='cuda')

# perform autoquantization
torchao.autoquant(model, input)

# compile the model to recover performance
model = torch.compile(model, mode='max-autotune')
model(input)
```
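
`autoquant` picks quantization per layer automatically; for explicit control, torchao also shipped one-shot module-swap helpers around this release. A hedged sketch (the `change_linear_weights_to_int8_woqtensors` name is from `torchao/quantization/quant_api.py` at the time and has since been superseded by newer APIs):

```python
import torch
from torchao.quantization.quant_api import change_linear_weights_to_int8_woqtensors

# bfloat16 CUDA model
model = torch.nn.Sequential(torch.nn.Linear(32, 64)).cuda().to(torch.bfloat16)

# swap every nn.Linear weight for an int8 weight-only quantized tensor subclass
change_linear_weights_to_int8_woqtensors(model)

# compile so inductor fuses dequantization into the matmul
model = torch.compile(model, mode='max-autotune')
out = model(torch.randn(32, 32, dtype=torch.bfloat16, device='cuda'))
```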

### Sparsity

```python
import torch
from torch.sparse import to_sparse_semi_structured, SparseSemiStructuredTensor
from torch.ao.pruning import WeightNormSparsifier

# bfloat16 CUDA model
model = torch.nn.Sequential(torch.nn.Linear(64, 64)).cuda().to(torch.bfloat16)

# Accuracy: Finding a sparse subnetwork
sparse_config = []
for name, mod in model.named_modules():
    if isinstance(mod, torch.nn.Linear):
        sparse_config.append({"tensor_fqn": f"{name}.weight"})

sparsifier = WeightNormSparsifier(sparsity_level=1.0,
                                  sparse_block_shape=(1, 4),
                                  zeros_per_block=2)

# attach FakeSparsity
sparsifier.prepare(model, sparse_config)
sparsifier.step()
sparsifier.squash_mask()
# now we have a dense model with sparse weights

# Performance: Accelerated sparse inference
for name, mod in model.named_modules():
    if isinstance(mod, torch.nn.Linear):
        mod.weight = torch.nn.Parameter(to_sparse_semi_structured(mod.weight))
```
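
To check that the semi-structured swap actually speeds anything up, the dense and sparse paths can be timed side by side. A minimal sketch using `torch.utils.benchmark` (not from the README; the 10240x10240 size is an illustrative assumption, since 2:4 kernels need large matmuls and an Ampere-or-newer GPU to show a win):

```python
import torch
import torch.utils.benchmark as benchmark
from torch.sparse import to_sparse_semi_structured

def median_us(fn, *args):
    # median latency in microseconds via torch.utils.benchmark
    t = benchmark.Timer(stmt="fn(*args)", globals={"fn": fn, "args": args})
    return t.blocked_autorange().median * 1e6

lin = torch.nn.Linear(10240, 10240).cuda().to(torch.bfloat16)
x = torch.randn(10240, 10240, dtype=torch.bfloat16, device="cuda")

# hand-build a valid 2:4 pattern by zeroing two of every four weight columns
w = lin.weight.detach().clone()
w[:, ::4] = 0
w[:, 1::4] = 0
lin.weight = torch.nn.Parameter(w)

dense_us = median_us(lin, x)
lin.weight = torch.nn.Parameter(to_sparse_semi_structured(lin.weight))
sparse_us = median_us(lin, x)
print(f"dense: {dense_us:.0f} us, 2:4 sparse: {sparse_us:.0f} us")
```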

To learn more, check out the API examples in
* [quantization](./torchao/quantization)
* [sparsity](./torchao/sparsity)
* [dtypes](./torchao/dtypes)

## Supported Features
1. [Quantization algorithms](./torchao/quantization)
    - [Int8 weight-only](https://github.com/pytorch/ao/blob/main/torchao/quantization/weight_only.py) quantization
    - [Int4 weight-only](https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/native/cuda/int4mm.cu) quantization
    - [GPTQ](https://github.com/pytorch/ao/blob/main/torchao/quantization/GPTQ.py) and [Smoothquant](https://github.com/pytorch/ao/blob/main/torchao/quantization/smoothquant.py) for low-latency inference
    - High-level [torchao.autoquant API](https://github.com/pytorch/ao/blob/main/torchao/quantization/autoquant.py) and [kernel autotuner](https://github.com/pytorch/ao/blob/main/torchao/kernel/autotuner.py) targeting SOTA performance across varying model shapes on consumer and enterprise GPUs
2. [Sparsity algorithms](./torchao/sparsity) such as Wanda that help improve the accuracy of sparse networks
3. Support for lower-precision [dtypes](./torchao/dtypes) such as
    - [nf4](https://github.com/pytorch/ao/blob/main/torchao/dtypes/nf4tensor.py), which was used to [implement QLoRA](https://github.com/pytorch/torchtune/blob/main/docs/source/tutorials/qlora_finetune.rst) without writing custom Triton or CUDA code (see the sketch below)
    - [uint4](https://github.com/pytorch/ao/blob/main/torchao/dtypes/uint4.py)
4. [Bleeding-edge kernels](./torchao/prototype/): experimental kernels without backwards-compatibility guarantees
    - [GaLore](https://github.com/pytorch/ao/tree/main/torchao/prototype/galore) for memory-efficient finetuning
    - [fused HQQ Gemm kernel](https://github.com/pytorch/ao/tree/main/torchao/prototype/hqq) for compute-bound workloads
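
As a taste of the dtype APIs, here is a hedged sketch of round-tripping a weight through nf4 (based on the `to_nf4` helper in `torchao/dtypes/nf4tensor.py`; the block sizes and the `get_original_weight` dequantization call reflect the code at the time and may have changed since):

```python
import torch
from torchao.dtypes.nf4tensor import to_nf4

# a bf16 weight, e.g. one linear layer's weight matrix
weight = torch.randn(512, 512, dtype=torch.bfloat16)

# quantize to 4-bit NormalFloat; block_size / scaler_block_size control how
# many elements share each quantization scale (assumed defaults: 64 / 256)
nf4_weight = to_nf4(weight, block_size=64, scaler_block_size=256)

# NF4Tensor is a tensor subclass; dequantize to measure quantization error
restored = nf4_weight.get_original_weight()
print("max abs error:", (weight - restored).abs().max().item())
```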

## Our Goals
torchao embodies PyTorch's design philosophy ([details](https://pytorch.org/docs/stable/community/design.html)), especially "usability over everything else". Our vision for this repository is the following:

* Composability with `torch.compile`: We rely heavily on `torch.compile` to write pure PyTorch code and codegen efficient kernels. There are, however, limits to what a compiler can do, so we don't shy away from writing custom CUDA/Triton kernels
* Composability with `FSDP`: The new support for FSDP per-parameter sharding means engineers and researchers alike can experiment with different quantization and distributed strategies concurrently
* Performance: We measure our performance on every commit using an A10G. We also regularly run performance benchmarks on the [torchbench](https://github.com/pytorch/benchmark) suite
* Heterogeneous Hardware: Efficient kernels that can run on CPU/GPU based servers (with `torch.compile`) and on mobile backends (with ExecuTorch)
* Packaging kernels should be easy: We support custom [CUDA and Triton extensions](./torchao/csrc/) so you can focus on writing your kernels and we'll ensure they work on most operating systems and devices

## Integrations

torchao has been integrated with other libraries, including:

* [torchtune](https://github.com/pytorch/torchtune/blob/main/recipes/quantization.md) leverages our 8- and 4-bit weight-only quantization techniques, with optional support for GPTQ
* [ExecuTorch](https://github.com/pytorch/executorch/tree/main/examples/models/llama2#quantization) leverages our GPTQ implementation for both 8da4w (int8 dynamic activation with int4 weight) and int4 weight-only quantization
* [HQQ](https://github.com/mobiusml/hqq/blob/master/hqq/backends/torchao.py) leverages our int4mm kernel for low-latency inference

## Success stories
Our kernels have been used to achieve SOTA inference performance on:

* Image segmentation models with [sam-fast](https://pytorch.org/blog/accelerating-generative-ai)
* Language models with [gpt-fast](https://pytorch.org/blog/accelerating-generative-ai-2)
* Diffusion models with [sd-fast](https://pytorch.org/blog/accelerating-generative-ai-3)

## License

> Review comment: We also have nightlies (https://pypi.org/project/torchao-nightly/)