Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update README.md #140

Merged
merged 2 commits into from
Apr 16, 2024
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
52 changes: 43 additions & 9 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -2,17 +2,36 @@

**Note: This repository is currently under heavy development - if you have suggestions on the API or use-cases you'd like to be covered, please open an github issue**

The `torchao` package allows you to quantize and prune your models using native PyTorch.
## Introduction

The repo hosts both
1. lower precision [dtypes](./torchao/dtypes) such as nf4, uint4
2. Quantization [algorithms](./torchao/quantization) such as dynamic quant, smoothquant
3. Sparsity [algorithms](./torchao/sparsity) such as Wanda
torchao is a PyTorch native library for optimizing your models using lower precision dtypes, techniques like quantization and sparsity and performant kernels.

The library provides
1. Support for lower precision [dtypes](./torchao/dtypes) such as nf4, uint4 that are torch.compile friendly
2. Quantization [algorithms](./torchao/quantization) such as dynamic quant, smoothquant, GPTQ that run on CPU/GPU and Mobile.
3. Sparsity [algorithms](./torchao/sparsity) such as Wanda that help improve accuracy of sparse networks
4. Integration with other PyTorch native libraries like torchtune and ExecuTorch

## Key Features
* Native PyTorch techniques, composable with torch.compile
* High level `autoquant` API and kernel auto tuner targeting SOTA performance across varying model shapes on consumer/enterprise GPUs.
* Quantization techniques and kernels that work with both eager and torch.compile
* Int8 dynamic activation quantization
* Int8 and int4 weight-only quantization
* Int8 dynamic activation quantization with int4 weight quantization
* [GPTQ](https://arxiv.org/abs/2210.17323) and [Smoothquant](https://arxiv.org/abs/2211.10438)

## Interoperability with PyTorch Libraries

torchao has been integrated with other repositories to ease usage

* [torchtune](https://github.com/pytorch/torchtune/blob/main/recipes/quantization.md) is integrated with 8 and 4 bit weight-only quantization techniques with and without GPTQ.
* [Executorch](https://github.com/pytorch/executorch/tree/main/examples/models/llama2#quantization) is integrated with GPTQ for both 8da4w (int8 dynamic activation, with int4 weight) and int4 weight only quantization.

## Success stories
Our kernels have has been used to achieve SOTA inference performance on

1. Image segmentation modelss with [sam-fast](pytorch.org/blog/accelerating-generative-ai)
1. Image segmentation models with [sam-fast](pytorch.org/blog/accelerating-generative-ai)
2. Language models with [gpt-fast](pytorch.org/blog/accelerating-generative-ai-2)
3. Diffusion models with [sd-fast](pytorch.org/blog/accelerating-generative-ai-3)

Expand All @@ -34,10 +53,23 @@ cd ao
pip install -e .
```

## Our Goals
torchao embodies PyTorch’s design philosophy [details](https://pytorch.org/docs/stable/community/design.html), especially "usability over everything else". Our vision for this repository is the following:

* Composability: Native solutions for optimization techniques that compose with both `torch.compile` and `FSDP`
* For example, for QLoRA for new dtypes support
* Interoperability: Work with the rest of the PyTorch ecosystem such as torchtune, gpt-fast and ExecuTorch
* Transparent Benchmarks: Regularly run performance benchmarking of our APIs across a suite of Torchbench models and across hardware backends
* Heterogeneous Hardware: Efficient kernels that can run on CPU/GPU based server (w/ torch.compile) and mobile backends (w/ ExecuTorch).
* Infrastructure Support: Release packaging solution for kernels and a CI/CD setup that runs these kernels on different backends.



## Examples

Typically quantization algorithms will have different schemes for how the activation and weights are quantized so A16W8 for instance means the activations are quantized to 16 bits wheras the weights are quantized to 8 bits. Trying out different quantization schemes in `torchao` is generally a 1 line change.


### Autoquantization

The `autoquant` api can be used to quickly and accurately quantize your model. When used as in the example below, the api first identifies the shapes
Expand Down Expand Up @@ -133,10 +165,12 @@ model = torch.compile(model, mode='max-autotune')
model(input)
```

## Sharp edges

1. While these techniques are designed to improve model performance, in some cases the opposite can occur. This is because quantization adds additional overhead to the model that is hopefully made up for by faster matmuls (dynamic quantization) or loading weights faster (weight-only quantization). If your matmuls are small enough or your non-quantized perf isn't bottlenecked by weight load time, these techniques may reduce performance.
2. Use the PyTorch nightlies so you can leverage [tensor subclasses](https://pytorch.org/docs/stable/notes/extending.html#subclassing-torch-tensor) which is preferred over older module swap based methods because it doesn't modify the graph and is generally more composable and flexible.
## Notes

1. APIs have been hardware tested on A100 and T4(colab)
2. While these techniques are designed to improve model performance, in some cases the opposite can occur. This is because quantization adds additional overhead to the model that is hopefully made up for by faster matmuls (dynamic quantization) or loading weights faster (weight-only quantization). If your matmuls are small enough or your non-quantized perf isn't bottlenecked by weight load time, these techniques may reduce performance.
3. Use the PyTorch nightlies so you can leverage [tensor subclasses](https://pytorch.org/docs/stable/notes/extending.html#subclassing-torch-tensor) which is preferred over older module swap based methods because it doesn't modify the graph and is generally more composable and flexible.


## License
Expand Down
Loading