
New README #392

Merged Jun 19, 2024 (20 commits). README.md: 101 additions and 109 deletions.

[![](https://dcbadge.vercel.app/api/server/cudamode?style=flat)](https://discord.gg/cudamode)

This repository is currently under heavy development - if you have suggestions on the API or use-cases you'd like to be covered, please open an [issue](https://github.com/pytorch/ao/issues)

## Introduction
`torchao` is a PyTorch library for quantization and sparsity.

torchao is a library that makes it easy to integrate and create high-performance kernels with custom data types and layouts, delivering up to
* **30% speedups** for training
* **2x speedups** with **65%** less VRAM for inference

All with no intrusive code changes and minimal accuracy degradation.

## Benchmarks

### Inference

#### Without intrusive code changes

Quantizing your own model is as simple as the snippet below, and it should work on any model containing `nn.Linear` layers. You can find a more comprehensive usage example [here](torchao/quantization/).

```python
from torchao.quantization.quant_api import quantize
m = quantize(m, "int4wo")
```
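For context, a minimal end-to-end sketch: the `quantize` call mirrors the snippet above, while the toy model, shapes, and dtypes are illustrative assumptions (int4 weight-only generally expects a bfloat16 CUDA model):

```python
import torch
from torchao.quantization.quant_api import quantize

# Illustrative toy model; any model containing nn.Linear layers should work
model = torch.nn.Sequential(torch.nn.Linear(1024, 1024)).cuda().to(torch.bfloat16)

# Same call as above: swap the Linear weights for int4 weight-only quantized versions
model = quantize(model, "int4wo")

x = torch.randn(1, 1024, dtype=torch.bfloat16, device="cuda")
with torch.no_grad():
    out = model(x)
```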

Benchmarks are run on a machine with a single A100 GPU using the script in `_models/llama`, which generates text in a latency-optimized way (batch size 1).

The models used were `meta-llama/Llama-2-7b-chat-hf` and `meta-llama/Meta-Llama-3-8B`.

| Model | Technique | wikitext-perplexity | Tokens/Second | Memory Bandwidth (GB/s) | Peak Memory (GB) | Model Size (GB) |
| ----------- | ------------------ | ------------------- | ------------- | ----------------------- | ---------------- | --------------- |
| Llama-2-7B | Base (bfloat16) | 12.212 | 105.02 | 1387.78 | 13.21 | 13.90 |
| | int8dq | 12.262 | 9.40 | 62.26 | 6.62 | 8.61 |
| | int8wo | 12.204 | 147.03 | 973.54 | 6.62 | 8.95 |
| | int4wo-64 | 12.843 | 199.81 | 746.45 | 3.74 | 4.75 |
| | int4wo-64-GPTQ | 12.489 | 199.81 | 746.45 | 3.74 | 4.75 |
| Llama-3-8B | Base (bfloat16) | N/A | 94.91 | 1424.58 | 15.01 | 16.43 |
| | int8dq | N/A | 8.41 | 63.23 | 7.52 | 9.24 |
| | int8wo | N/A | 136.75 | 1028.38 | 7.52 | 10.42 |
| | int4wo-64 | N/A | 179.41 | 757.45 | 4.22 | 6.88 |

Note: int8 dynamic quantization works best on compute-bound models like [SAM](https://github.com/pytorch-labs/segment-anything-fast), whereas Llama with batch size 1 tends to be memory bound, hence the rather low performance here.

And a quick crash course on inference quantization to help parse the table above. "Int4 quantization" is an ambiguous term, because there is both the dtype in which a layer is represented and the dtype in which the computation is done. For example, with weight-only (wo) int4 quantization, the weight is upcast to a larger dtype like fp16 at compute time, so an int4 matrix multiplication is defined as `F.linear(input, weight.to(input.dtype))`. If the hardware vendor supports performing the computation in the smaller dtype directly, you can instead run `F.linear(input, weight)` as-is, which is what we refer to as dynamic quantization (dq). Naive quantization algorithms are also notoriously sensitive to outliers, so we typically set a group size that applies a scale factor per group of 64 elements in the case of `int4wo-64`.
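To make the distinction concrete, here is a minimal sketch of the weight-only path, not the torchao implementation; the helper, shapes, and per-channel scale are assumptions for illustration:

```python
import torch
import torch.nn.functional as F

def weight_only_int_linear(x: torch.Tensor, w_int: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    # Weight-only (wo) path: dequantize/upcast the weight to the activation dtype,
    # then run a regular matmul, i.e. F.linear(input, weight.to(input.dtype))
    w = w_int.to(x.dtype) * scale.to(x.dtype)
    return F.linear(x, w)

x = torch.randn(2, 8, dtype=torch.bfloat16)
w_int = torch.randint(-8, 8, (4, 8), dtype=torch.int8)  # stand-in for int4 weight values
scale = torch.rand(4, 1)                                 # per-channel scale; int4wo-64 uses one scale per group of 64
print(weight_only_int_linear(x, w_int, scale).shape)     # torch.Size([2, 4])
```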


#### With intrusive code changes

In some cases we rewrote popular GenAI models to be significantly faster in native PyTorch, with no custom C++/CUDA, achieving what was at the time SOTA inference performance. These rewrites involve more intrusive code changes.

* 8x speedups for Image segmentation models with [sam-fast](https://pytorch.org/blog/accelerating-generative-ai)
* 10x speedups for Language models with [gpt-fast](https://pytorch.org/blog/accelerating-generative-ai-2)
* 3x speedup for Diffusion models with [sd-fast](https://pytorch.org/blog/accelerating-generative-ai-3)

### Training

We've added support for semi-structured 2:4 sparsity, with over 30% speedups on ViT-L.
> **Review comment (Contributor):** To clarify, the speedup here is not for the entire model. I believe it's specifically for the MLP blocks in ViT. cc @jcaip
>
> **Reply (Member, Author):** Can we put together anything from an end-to-end perspective?

The code change is a one-liner, with the full example available [here](torchao/sparsity/training/); a fuller usage sketch follows the benchmark table below.


```python
from torchao.sparsity.training import SemiSparseLinear, swap_linear_with_semi_sparse_linear

swap_linear_with_semi_sparse_linear(model, {"seq.0": SemiSparseLinear})
```

For ViT-L MLP shapes on an NVIDIA A100 we see the following results:

|                            | act24   | dense   | w24    | s24_inp_sparsify24 | s24_inp_clone |
|----------------------------|---------|---------|--------|--------------------|---------------|
| f16 (44160,1024,4096,1024) | 11881.0 | 11534.3 | 9204.7 | 255.1              | 125.8         |

Times are in microseconds (us).

> **Review comment (Contributor):** Discussed this with @jcaip offline, but it would be good to make the columns here more clear.
>
> **Reply (Member, Author):** @jcaip please send me a clearer table when you have it. Keep in mind I want to make some claim about end-to-end performance.
>
> **Reply (Contributor):** Yes, I'm making some changes to the benchmarking script for e2e ViT. Will send something by EOD.
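A slightly fuller sketch of the one-liner above. The import path follows the linked example directory, and the toy module with Linear layers under the FQNs `seq.0` and `seq.2` is an assumption for illustration; running it requires an Ampere-or-newer GPU:

```python
import torch
from torchao.sparsity.training import SemiSparseLinear, swap_linear_with_semi_sparse_linear

class ToyMLP(torch.nn.Module):
    # Illustrative module whose Linear layers live under the FQNs "seq.0" and "seq.2"
    def __init__(self):
        super().__init__()
        self.seq = torch.nn.Sequential(
            torch.nn.Linear(1024, 4096),
            torch.nn.GELU(),
            torch.nn.Linear(4096, 1024),
        )

    def forward(self, x):
        return self.seq(x)

model = ToyMLP().half().cuda()

# Swap the named Linear layers for runtime-sparsified 2:4 training versions, in place
swap_linear_with_semi_sparse_linear(model, {"seq.0": SemiSparseLinear, "seq.2": SemiSparseLinear})

x = torch.randn(4096, 1024, dtype=torch.float16, device="cuda")
model(x).sum().backward()
```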

## Newer dtypes

[MX](https://github.com/pytorch/ao/blob/main/torchao/prototype/mx_formats) implements training and inference support with tensors using the [OCP MX spec](https://www.opencompute.org/documents/ocp-microscaling-formats-mx-v1-0-spec-final-pdf) data types, which can be described as groupwise scaled float8/float6/float4/int8, with the scales constrained to powers of two. This work is a prototype, as native hardware support is not available yet.
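As a rough, pure-PyTorch illustration of the groupwise power-of-two scaling idea described above (not the MX implementation; the group size and helper are assumptions):

```python
import torch

def groupwise_pow2_scale(x: torch.Tensor, group_size: int = 32):
    # Each group of `group_size` elements shares a single scale constrained to a power of two
    groups = x.reshape(-1, group_size)
    amax = groups.abs().amax(dim=1, keepdim=True)
    scale = torch.exp2(torch.floor(torch.log2(amax.clamp(min=1e-12))))
    scaled = groups / scale  # in MX these values would then be cast to float8/float6/float4/int8
    return scaled.reshape(x.shape), scale

x = torch.randn(4, 64)
scaled, scales = groupwise_pow2_scale(x)
print(scaled.shape, scales.shape)  # torch.Size([4, 64]) torch.Size([8, 1])
```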

[nf4](https://github.com/pytorch/ao/blob/main/torchao/dtypes/nf4tensor.py) was used to [implement QLoRA](https://github.com/pytorch/torchtune/blob/main/docs/source/tutorials/qlora_finetune.rst), one of the most popular finetuning algorithms, without writing custom Triton or CUDA code. An accessible talk on it is available [here](https://x.com/HamelHusain/status/1800315287574847701).

## Composability

A key design principle for us is composability: any new dtype or layout we provide needs to work with `torch.compile()`, and it needs to work with `FSDP`. It shouldn't matter whether the kernels are written in pure PyTorch, CUDA, C++, or Triton; things should just work! Here is our current strategy:
1. Write the dtype, layout, or bit-packing logic in pure PyTorch and codegenerate efficient kernels with `torch.compile`. You can inspect those kernels with `TORCH_LOGS="output_code" python your_code.py` and check whether a single kernel is being generated and whether any unnecessary buffers are being created (see the sketch after this list).
2. However, once you have a kernel, how do you know how good it is? The best way is to benchmark the codegenerated kernel against the best kernel on the market. Packaging custom C++/CUDA kernels that work on multiple devices is tedious, but we've abstracted all of that tedium away with our [custom ops support](./torchao/csrc/), so if you love writing kernels but hate packaging, we'd love to accept contributions for your custom ops. One key benefit is that a kernel written as a custom op will just work with `torch.compile()` with no graph breaks. Compilers are great at optimizations like fusion and overhead reduction, but it's challenging for a compiler to rewrite the math of an algorithm so that it's both faster and numerically stable, so we are betting on both compilers and custom ops.
3. Finally, while historically most quantization has been done for inference, there is now a thriving area of research combining lower-precision dtypes and sharding. One popular example is [NF4](torchao/dtypes/nf4tensor.py), which is used to implement the QLoRA algorithm; you can define the semantics for how custom tensors should be sharded over multiple devices. We gave an accessible talk on [how to do this](https://x.com/HamelHusain/status/1800315287574847701).
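As a minimal illustration of step 1, not torchao's actual packing code: bit packing for 4-bit values written in pure PyTorch, which you can compile and then inspect with `TORCH_LOGS="output_code" python your_code.py`:

```python
import torch

def pack_uint4(x: torch.Tensor) -> torch.Tensor:
    # Pack pairs of 4-bit values (stored one-per-uint8) into a single uint8
    return (x[..., ::2] << 4) | (x[..., 1::2] & 0xF)

def unpack_uint4(packed: torch.Tensor) -> torch.Tensor:
    # Recover the original interleaved 4-bit values
    hi = packed >> 4
    lo = packed & 0xF
    return torch.stack([hi, lo], dim=-1).flatten(start_dim=-2)

compiled_unpack = torch.compile(unpack_uint4)

vals = torch.randint(0, 16, (8, 64), dtype=torch.uint8)
packed = pack_uint4(vals)
assert torch.equal(compiled_unpack(packed), vals)
```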

## Get Started


Stable Release
```Shell
pip install torchao # stable release from PyPI
pip install torchao --extra-index-url https://download.pytorch.org/whl/test/cu121 # full options are cpu/cu118/cu121/cu124
```

Nightly Release
```Shell
pip install --pre torchao-nightly --index-url https://download.pytorch.org/whl/nightly/cpu   # CPU-only builds
pip install --pre torchao-nightly --index-url https://download.pytorch.org/whl/nightly/cu121 # CUDA builds; full options are cpu/cu118/cu121/cu124
```

From source

```Shell
git clone https://github.com/pytorch/ao
cd ao
python setup.py install
```

If you're contributing a feature to ao, install the dev requirements and build in develop mode:

```Shell
pip install -r dev-requirements.txt
python setup.py develop
```

Note: if you run into issues building the `ao` C++ extensions, or simply want faster iteration cycles (as *most* developers do), you can skip building the custom C++/CUDA extensions entirely:

```Shell
USE_CPP=0 python setup.py install
```

## Community Contributions

* [jeromeku](https://github.com/jeromeku) has implemented
  * [GaLore](torchao/prototype/galore/), a drop-in for the Adam optimizer that allows you to finetune Llama 7B on a single 4090 card with up to 70% speedups relative to eager PyTorch
  * [DoRA](torchao/prototype/dora), a newer replacement for QLoRA with more promising convergence characteristics
  * [Fused int4/fp16 Quant Matmul](torchao/prototype/hqq), which is particularly useful for compute-bound kernels, showing 4x speedups over tinygemm for larger batch sizes such as 512
* [gau-nernst](https://github.com/gau-nernst) has implemented fp6 kernels that are 4x faster than fp16: [torchao/prototype/fp6_llm](torchao/prototype/fp6_llm)
* [vayuda](https://github.com/vayuda) has implemented generic bitpacking kernels that were codegenerated using pure PyTorch: [prototype/common](torchao/prototype/common)

## How to contribute

This repository is currently under heavy development.
* If you have suggestions on the API or use-cases you'd like to be covered, please open an [issue](https://github.com/pytorch/ao/issues)
* If you'd like to co-develop the library with us, please join us in #torchao on [discord.gg/cudamode](https://discord.gg/cudamode) - there are a lot of dtypes out there and we could use a lot more hands to make them go brrr

### Quantization

```python
import torch
import torchao

# inductor settings which improve torch.compile performance for quantized modules
torch._inductor.config.force_fuse_int_mm_with_mul = True
torch._inductor.config.use_mixed_mm = True

# Plug in your model and example input
model = torch.nn.Sequential(torch.nn.Linear(32, 64)).cuda().to(torch.bfloat16)
input = torch.randn(32,32, dtype=torch.bfloat16, device='cuda')

# perform autoquantization and compilation
q_model = torchao.autoquant(torch.compile(model, mode='max-autotune'))
q_model(input)
```

### Sparsity

```python
import torch
from torch.sparse import to_sparse_semi_structured, SparseSemiStructuredTensor
from torch.ao.pruning import WeightNormSparsifier

# bfloat16 CUDA model
model = torch.nn.Sequential(torch.nn.Linear(64, 64)).cuda().to(torch.bfloat16)

# Accuracy: Finding a sparse subnetwork
sparse_config = []
for name, mod in model.named_modules():
    if isinstance(mod, torch.nn.Linear):
        sparse_config.append({"tensor_fqn": f"{name}.weight"})

sparsifier = WeightNormSparsifier(sparsity_level=1.0,
                                  sparse_block_shape=(1,4),
                                  zeros_per_block=2)

# attach FakeSparsity
sparsifier.prepare(model, sparse_config)
sparsifier.step()
sparsifier.squash_mask()
# now we have dense model with sparse weights

# Performance: Accelerated sparse inference
for name, mod in model.named_modules():
    if isinstance(mod, torch.nn.Linear):
        mod.weight = torch.nn.Parameter(to_sparse_semi_structured(mod.weight))
```

To learn more and try out our APIs, you can check out the API examples in
* [quantization](./torchao/quantization)
* [sparsity](./torchao/sparsity)
* [dtypes](./torchao/dtypes)


## Supported Features
1. [Quantization algorithms](./torchao/quantization)
    - [Int8 weight-only](https://github.com/pytorch/ao/blob/main/torchao/quantization/weight_only.py) quantization
    - [Int4 weight-only](https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/native/cuda/int4mm.cu) quantization
    - [GPTQ](https://github.com/pytorch/ao/blob/main/torchao/quantization/GPTQ.py) and [SmoothQuant](https://github.com/pytorch/ao/blob/main/torchao/quantization/smoothquant.py) for low latency inference
    - High level [torchao.autoquant API](https://github.com/pytorch/ao/blob/main/torchao/quantization/autoquant.py) and [kernel autotuner](https://github.com/pytorch/ao/blob/main/torchao/kernel/autotuner.py) targeting SOTA performance across varying model shapes on consumer and enterprise GPUs
2. [Sparsity algorithms](./torchao/sparsity) such as Wanda that help improve the accuracy of sparse networks
3. Support for lower precision [dtypes](./torchao/dtypes) such as
    - [nf4](https://github.com/pytorch/ao/blob/main/torchao/dtypes/nf4tensor.py), which was used to [implement QLoRA](https://github.com/pytorch/torchtune/blob/main/docs/source/tutorials/qlora_finetune.rst) without writing custom Triton or CUDA code
    - [uint4](https://github.com/pytorch/ao/blob/main/torchao/dtypes/uint4.py)
    - [MX](https://github.com/pytorch/ao/blob/main/torchao/prototype/mx_formats), implementing training and inference support with tensors using the [OCP MX spec](https://www.opencompute.org/documents/ocp-microscaling-formats-mx-v1-0-spec-final-pdf) data types, which can be described as groupwise scaled float8/float6/float4/int8 with the scales constrained to powers of two. This work is a prototype, as hardware support is not available yet.
4. [Bleeding Edge Kernels](./torchao/prototype/) for experimental kernels without backwards compatibility guarantees
    - [GaLore](https://github.com/pytorch/ao/tree/main/torchao/prototype/galore) for memory efficient finetuning
    - [fused HQQ Gemm Kernel](https://github.com/pytorch/ao/tree/main/torchao/prototype/hqq) for compute-bound workloads
    - [FP6-LLM](torchao/prototype/fp6_llm), a mixed FP16 x FP6 matmul kernel for IO-bound workloads

## Our Goals

* Composability with `torch.compile`: We rely heavily on `torch.compile` to write pure PyTorch code and codegen efficient kernels. There are, however, limits to what a compiler can do, so we don't shy away from writing custom CUDA/Triton kernels
* Composability with `FSDP`: The new support for FSDP per-parameter sharding means engineers and researchers alike can experiment with different quantization and distributed strategies concurrently.
* Performance: We measure our performance on every commit using an A10G. We also regularly run performance benchmarks on the [torchbench](https://github.com/pytorch/benchmark) suite
* Heterogeneous Hardware: Efficient kernels that can run on CPU/GPU based server (w/ torch.compile) and mobile backends (w/ ExecuTorch).
* Packaging kernels should be easy: We support custom [CUDA and Triton extensions](./torchao/csrc/) so you can focus on writing your kernels and we'll ensure that they work on most operating systems and devices

## Integrations

torchao has been integrated with other libraries including

* [torchtune](https://github.com/pytorch/torchtune/blob/main/recipes/quantization.md) leverages our 8-bit and 4-bit weight-only quantization techniques with optional support for GPTQ
* [ExecuTorch](https://github.com/pytorch/executorch/tree/main/examples/models/llama2#quantization) leverages our GPTQ implementation for both 8da4w (int8 dynamic activation with int4 weight) and int4 weight-only quantization.
* [HQQ](https://github.com/mobiusml/hqq/blob/master/hqq/backends/torchao.py) leverages our int4mm kernel for low latency inference

## Success stories
Our kernels have been used to achieve SOTA inference performance on

* Image segmentation models with [sam-fast](https://pytorch.org/blog/accelerating-generative-ai)
* Language models with [gpt-fast](https://pytorch.org/blog/accelerating-generative-ai-2)
* Diffusion models with [sd-fast](https://pytorch.org/blog/accelerating-generative-ai-3)

## License

`torchao` is released under the [BSD 3](https://github.com/pytorch-labs/ao/blob/main/LICENSE) license.