Update README #187

Merged (8 commits, Apr 29, 2024)
README.md: 133 changes (100 additions, 33 deletions)
# torchao: PyTorch Architecture Optimization

[![](https://dcbadge.vercel.app/api/server/cudamode?style=flat)](https://discord.gg/cudamode)

This repository is currently under heavy development. If you have suggestions on the API or use cases you'd like covered, please open an [issue](https://github.com/pytorch/ao/issues)

## Introduction
`torchao` is a PyTorch library for quantization and sparsity.

## Get Started

### Installation
`torchao` makes liberal use of several new features in PyTorch, so it's recommended to use it with the current nightly or the latest stable release of PyTorch.

Stable Release

```Shell
pip install torchao
```

Nightly Release
```Shell
pip install torchao-nightly
```

From source

```Shell
git clone https://github.com/pytorch/ao
cd ao
python setup.py develop
```
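
`python setup.py develop` installs torchao in editable mode, so local changes to the checkout take effect without reinstalling.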

### Quantization
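
The `autoquant` API profiles candidate quantized kernels for each layer against your example input and picks the best performing option; the snippet below shows the end-to-end flow: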

```python
import torch
import torchao

# inductor settings which improve torch.compile performance for quantized modules
torch._inductor.config.force_fuse_int_mm_with_mul = True
torch._inductor.config.use_mixed_mm = True

# Plug in your model and example input
model = torch.nn.Sequential(torch.nn.Linear(32, 64)).cuda().to(torch.bfloat16)
input = torch.randn(32,32, dtype=torch.bfloat16, device='cuda')

# perform autoquantization
torchao.autoquant(model, (input))

# compile the model to recover performance
model = torch.compile(model, mode='max-autotune')
model(input)
```

### Sparsity
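
The sketch below prunes every `nn.Linear` to the 2:4 semi-structured sparsity pattern with `WeightNormSparsifier`, then converts the pruned weights to `SparseSemiStructuredTensor` for accelerated inference: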

```python
import torch
from torch.sparse import to_sparse_semi_structured, SparseSemiStructuredTensor
from torch.ao.pruning import WeightNormSparsifier

# bfloat16 CUDA model
model = torch.nn.Sequential(torch.nn.Linear(64, 64)).cuda().to(torch.bfloat16)

# Accuracy: Finding a sparse subnetwork
sparse_config = []
for name, mod in model.named_modules():
    if isinstance(mod, torch.nn.Linear):
        sparse_config.append({"tensor_fqn": f"{name}.weight"})

sparsifier = WeightNormSparsifier(sparsity_level=1.0,
                                  sparse_block_shape=(1,4),
                                  zeros_per_block=2)

# attach FakeSparsity
sparsifier.prepare(model, sparse_config)
sparsifier.step()
sparsifier.squash_mask()
# now we have dense model with sparse weights

# Performance: Accelerated sparse inference
for name, mod in model.named_modules():
    if isinstance(mod, torch.nn.Linear):
        mod.weight = torch.nn.Parameter(to_sparse_semi_structured(mod.weight))
```
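
Note that `sparse_block_shape=(1,4)` with `zeros_per_block=2` encodes the 2:4 semi-structured pattern (two zeros in every contiguous block of four weights) that recent NVIDIA GPUs can accelerate with sparse tensor cores.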

To learn more and try out our APIs, check out the examples in
* [quantization](./torchao/quantization)
* [sparsity](./torchao/sparsity)
* [dtypes](./torchao/dtypes)


## Supported Features
1. [Quantization algorithms](./torchao/quantization)
   - [Int8 weight-only](https://github.com/pytorch/ao/blob/main/torchao/quantization/weight_only.py) quantization (a minimal usage sketch follows this list)
   - [Int4 weight-only](https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/native/cuda/int4mm.cu) quantization
   - [GPTQ](https://github.com/pytorch/ao/blob/main/torchao/quantization/GPTQ.py) and [SmoothQuant](https://github.com/pytorch/ao/blob/main/torchao/quantization/smoothquant.py) for low latency inference
   - High level [torchao.autoquant API](https://github.com/pytorch/ao/blob/main/torchao/quantization/autoquant.py) and [kernel autotuner](https://github.com/pytorch/ao/blob/main/torchao/kernel/autotuner.py) targeting SOTA performance across varying model shapes on consumer and enterprise GPUs
2. [Sparsity algorithms](./torchao/sparsity) such as Wanda that help improve accuracy of sparse networks
3. Support for lower precision [dtypes](./torchao/dtypes) such as
   - [nf4](https://github.com/pytorch/ao/blob/main/torchao/dtypes/nf4tensor.py) which was used to [implement QLoRA](https://github.com/pytorch/torchtune/blob/main/docs/source/tutorials/qlora_finetune.rst) without writing custom Triton or CUDA code
   - [uint4](https://github.com/pytorch/ao/blob/main/torchao/dtypes/uint4.py)
4. [Bleeding edge kernels](./torchao/prototype/) that are experimental, with no backwards compatibility guarantees
   - [GaLore](https://github.com/pytorch/ao/tree/main/torchao/prototype/galore) for memory efficient finetuning
   - [fused HQQ Gemm Kernel](https://github.com/pytorch/ao/tree/main/torchao/prototype/hqq) for compute bound workloads
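
For a quick taste of the int8 weight-only path above, here is a minimal sketch. It assumes the `change_linear_weights_to_int8_woqtensors` helper exposed by `torchao.quantization.quant_api` at the time of writing; see [quantization](./torchao/quantization) for the current entry points.

```python
import torch
from torchao.quantization import quant_api

# toy bfloat16 CUDA model, mirroring the autoquant example above
model = torch.nn.Sequential(torch.nn.Linear(32, 64)).cuda().to(torch.bfloat16)
example_input = torch.randn(32, 32, dtype=torch.bfloat16, device='cuda')

# swap each nn.Linear weight for an int8 weight-only quantized tensor subclass
# (assumed helper name; check torchao/quantization if it has moved)
quant_api.change_linear_weights_to_int8_woqtensors(model)

# the quantized tensor subclass composes with torch.compile
model = torch.compile(model, mode='max-autotune')
model(example_input)
```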

## Our Goals
torchao embodies PyTorch’s [design philosophy](https://pytorch.org/docs/stable/community/design.html), especially "usability over everything else". Our vision for this repository is the following:

* Composability: Native solutions for optimization techniques that compose with both `torch.compile` and `FSDP`
  * For example, the nf4 dtype was added to support QLoRA
* Interoperability: Work with the rest of the PyTorch ecosystem such as torchtune, gpt-fast and ExecuTorch
* Transparent Benchmarks: Regularly run performance benchmarking of our APIs across a suite of Torchbench models and across hardware backends
* Composability with `torch.compile`: We rely heavily on `torch.compile` to write pure PyTorch code and codegen efficient kernels. There are, however, limits to what a compiler can do, so we don't shy away from writing custom CUDA/Triton kernels.
* Composability with `FSDP`: The new support for FSDP per-parameter sharding means engineers and researchers alike can experiment with different quantization and distributed strategies concurrently.
* Performance: We measure our performance on every commit using an A10G. We also regularly run performance benchmarks on the [torchbench](https://github.com/pytorch/benchmark) suite
* Heterogeneous Hardware: Efficient kernels that can run on CPU/GPU based servers (with `torch.compile`) and mobile backends (with ExecuTorch).
* Infrastructure Support: A release packaging solution for kernels and a CI/CD setup that runs these kernels on different backends.
* Packaging kernels should be easy: We support custom [CUDA and Triton extensions](./torchao/csrc/) so you can focus on writing your kernels and we'll ensure that they work on most operating systems and devices

## Integrations

torchao has been integrated with other libraries, including:

* [torchtune](https://github.com/pytorch/torchtune/blob/main/recipes/quantization.md) leverages our 8 and 4 bit weight-only quantization techniques with optional support for GPTQ
* [ExecuTorch](https://github.com/pytorch/executorch/tree/main/examples/models/llama2#quantization) leverages our GPTQ implementation for both 8da4w (int8 dynamic activation with int4 weight) and int4 weight-only quantization.
* [HQQ](https://github.com/mobiusml/hqq/blob/master/hqq/backends/torchao.py) leverages our int4mm kernel for low latency inference

## Success stories
Our kernels have been used to achieve SOTA inference performance on

* Image segmentation models with [sam-fast](https://pytorch.org/blog/accelerating-generative-ai)
* Language models with [gpt-fast](https://pytorch.org/blog/accelerating-generative-ai-2)
* Diffusion models with [sd-fast](https://pytorch.org/blog/accelerating-generative-ai-3)

## License
