From e1873430fb9d645bad0083a83b65455cc21d1645 Mon Sep 17 00:00:00 2001
From: Mark Saroufim
Date: Mon, 29 Apr 2024 10:48:33 -0700
Subject: [PATCH 1/7] Update README

---
 README.md | 128 ++++++++++++++++++++++++++++++++++++++++--------------
 1 file changed, 95 insertions(+), 33 deletions(-)

diff --git a/README.md b/README.md
index 275f9a5887..334f4b89d7 100644
--- a/README.md
+++ b/README.md
@@ -1,65 +1,127 @@
 # torchao: PyTorch Architecture Optimization

-**Note: This repository is currently under heavy development - if you have suggestions on the API or use-cases you'd like to be covered, please open an GitHub issue**
+[https://dcbadge.vercel.app/api/server/cudamode?style=flat](discord.gg/cudamode)
+
+This repository is currently under heavy development - if you have suggestions on the API or use-cases you'd like to be covered, please open an [issue](https://github.com/pytorch/ao/issues)

 ## Introduction
-torchao is a PyTorch native library for optimizing your models using lower precision dtypes, techniques like quantization and sparsity and performant kernels.
+`torchao` is a PyTorch library for quantization and sparsity that compose out of the box with `torch.compile` and `FSDP`

 ## Get Started
-To try out our APIs, you can check out API examples in [quantization](./torchao/quantization) (including `autoquant`), [sparsity](./torchao/sparsity), [dtypes](./torchao/dtypes).

-## Installation
-**Note: this library makes liberal use of several new features in pytorch, it's recommended to use it with the current nightly or latest stable version of PyTorch.**
+### Installation
+`torchao` makes liberal use of several new features in pytorch, it's recommended to use it with the current nightly or latest stable version of PyTorch.

-1. From PyPI:
+Stable Release
 ```Shell
 pip install torchao
 ```

-2. From Source:
+From source

 ```Shell
-git clone https://github.com/pytorch-labs/ao
+git clone https://github.com/pytorch/ao
 cd ao
-pip install -e .
+python setup.py develop
 ```

-## Key Features
-The library provides
-1. Support for lower precision [dtypes](./torchao/dtypes) such as nf4, uint4 that are torch.compile friendly
-2. [Quantization algorithms](./torchao/quantization) such as dynamic quant, smoothquant, GPTQ that run on CPU/GPU and Mobile.
-    * Int8 dynamic activation quantization
-    * Int8 and int4 weight-only quantization
-    * Int8 dynamic activation quantization with int4 weight quantization
-    * [GPTQ](https://arxiv.org/abs/2210.17323) and [Smoothquant](https://arxiv.org/abs/2211.10438)
-    * High level `autoquant` API and kernel auto tuner targeting SOTA performance across varying model shapes on consumer/enterprise GPUs.
-3. [Sparsity algorithms](./torchao/sparsity) such as Wanda that help improve accuracy of sparse networks
-4. Integration with other PyTorch native libraries like [torchtune](https://github.com/pytorch/torchtune) and [ExecuTorch](https://github.com/pytorch/executorch)
+### Quantization
+
+```python
+import torch
+import torchao
+
+# inductor settings which improve torch.compile performance for quantized modules
+torch._inductor.config.force_fuse_int_mm_with_mul = True
+torch._inductor.config.use_mixed_mm = True
+
+# Plug in your model and example input
+model = torch.nn.Sequential(torch.nn.Linear(32, 64)).cuda().to(torch.bfloat16)
+input = torch.randn(32,32, dtype=torch.bfloat16, device='cuda')
+
+# perform autoquantization
+torchao.autoquant(model, (input))
+
+# compile the model to improve performance
+model = torch.compile(model, mode='max-autotune')
+model(input)
+```
+
+### Sparsity
+
+```python
+import torch
+from torch.sparse import to_sparse_semi_structured, SparseSemiStructuredTensor
+from torch.ao.pruning import WeightNormSparsifier
+# bfloat16 CUDA model
+model = model.half().cuda()
+
+# Accuracy: Finding a sparse subnetwork
+sparse_config = []
+for name, mod in model.named_modules():
+    if isinstance(mod, torch.nn.Linear):
+        sparse_config.append({"tensor_fqn": f"{name}.weight"})
+
+sparsifier = WeightNormSparsifier(sparsity_level=1.0,
+                                  sparse_block_shape=(1,4),
+                                  zeros_per_block=2)
+
+# attach FakeSparsity
+sparsifier.prepare(model, sparse_config)
+sparsifier.step()
+sparsifier.squash_mask()
+# now we have dense model with sparse weights
+
+# Performance: Accelerated sparse inference
+for name, mod in model.named_modules():
+    if isinstance(mod, torch.nn.Linear):
+        mod.weight = torch.nn.Parameter(to_sparse_semi_structured(mod.weight))
+```
+
+To learn more try out our APIs, you can check out API examples in
+* [quantization](./torchao/quantization)
+* [sparsity](./torchao/sparsity)
+* [dtypes](./torchao/dtypes)
+
+
+## Supported Features
+1. [Quantization algorithms](./torchao/quantization)
+    * [Int8 weight-only](https://github.com/pytorch/ao/blob/main/torchao/quantization/weight_only.py) quantization
+    * Int8 dynamic activation quantization with int4 weight quantization
+    * [GPTQ](https://github.com/pytorch/ao/blob/main/torchao/quantization/GPTQ.py)
+    * [Smoothquant](https://github.com/pytorch/ao/blob/main/torchao/quantization/smoothquant.py)
+    * High level [torchao.autoquant API](https://github.com/pytorch/ao/blob/main/torchao/quantization/autoquant.py) and [kernel autotuner](https://github.com/pytorch/ao/blob/main/torchao/kernel/autotuner.py) targeting SOTA performance across varying model shapes on consumer and enterprise GPUs
+2. [Sparsity algorithms](./torchao/sparsity) such as Wanda that help improve accuracy of sparse networks
+3. Support for lower precision [dtypes](./torchao/dtypes) such as
+    * [nf4](https://github.com/pytorch/ao/blob/main/torchao/dtypes/nf4tensor.py) which was used to [implement QLoRA](https://github.com/pytorch/torchtune/blob/main/docs/source/tutorials/qlora_finetune.rst) without writing custom Triton or CUDA code
+    * [uint4](https://github.com/pytorch/ao/blob/main/torchao/dtypes/uint4.py)
+4. [Bleeding Edge Kernels](./torchao/prototype/)
+    * [GaLore](https://github.com/pytorch/ao/tree/main/torchao/prototype/galore) for memory efficient finetuning
+    * [fused HQQ Gemm Kernel](https://github.com/pytorch/ao/tree/main/torchao/prototype/hqq) for compute bound workloads

 ## Our Goals
-torchao embodies PyTorch’s design philosophy [details](https://pytorch.org/docs/stable/community/design.html), especially "usability over everything else". Our vision for this repository is the following:

-* Composability: Native solutions for optimization techniques that compose with both `torch.compile` and `FSDP`
-    * For example, for QLoRA for new dtypes support
-* Interoperability: Work with the rest of the PyTorch ecosystem such as torchtune, gpt-fast and ExecuTorch
-* Transparent Benchmarks: Regularly run performance benchmarking of our APIs across a suite of Torchbench models and across hardware backends
+* Composability with `torch.compile`: We rely heavily on `torch.compile` to write pure PyTorch code and codegen efficient kernels. There are however limits to what a compiler can do so we don't shy away from writing our custom CUDA/Triton kernels
+* Composability with `FSDP`: The new support for FSDP per parameter sharding means engineers and researchers alike can experiment with different quantization and distributed strategies concurrently.
+* Performance: We measure our performance on every commit using an A10G. We also regularly run performance benchmarks on the [torchbench](https://github.com/pytorch/benchmark) suite
 * Heterogeneous Hardware: Efficient kernels that can run on CPU/GPU based server (w/ torch.compile) and mobile backends (w/ ExecuTorch).
-* Infrastructure Support: Release packaging solution for kernels and a CI/CD setup that runs these kernels on different backends.
+* Packaging kernels should be easy: We support custom [CUDA and Triton extensions](./torchao/csrc/) so you can focus on writing your kernels and we'll ensure that they work on most operating systems and devices

-## Interoperability with PyTorch Libraries
+## Integrations

-torchao has been integrated with other repositories to ease usage
+torchao has been integrated with other libraries including

-* [torchtune](https://github.com/pytorch/torchtune/blob/main/recipes/quantization.md) is integrated with 8 and 4 bit weight-only quantization techniques with and without GPTQ.
-* [Executorch](https://github.com/pytorch/executorch/tree/main/examples/models/llama2#quantization) is integrated with GPTQ for both 8da4w (int8 dynamic activation, with int4 weight) and int4 weight only quantization.
+* [torchtune](https://github.com/pytorch/torchtune/blob/main/recipes/quantization.md) leverages our 8 and 4 bit weight-only quantization techniques with optional support for GPTQ
+* [Executorch](https://github.com/pytorch/executorch/tree/main/examples/models/llama2#quantization) leverages our GPTQ implementation for both 8da4w (int8 dynamic activation with int4 weight) and int4 weight-only quantization.
+* [HQQ](https://github.com/mobiusml/hqq/blob/master/hqq/backends/torchao.py) leverages our int4mm kernel for low latency inference

 ## Success stories
 Our kernels have been used to achieve SOTA inference performance on

-1. Image segmentation models with [sam-fast](pytorch.org/blog/accelerating-generative-ai)
-2. Language models with [gpt-fast](pytorch.org/blog/accelerating-generative-ai-2)
-3. Diffusion models with [sd-fast](pytorch.org/blog/accelerating-generative-ai-3)
+* Image segmentation models with [sam-fast](pytorch.org/blog/accelerating-generative-ai)
+* Language models with [gpt-fast](pytorch.org/blog/accelerating-generative-ai-2)
+* Diffusion models with [sd-fast](pytorch.org/blog/accelerating-generative-ai-3)

 ## License

From c7981387b99dd82f8b364f7f7b07390c8c5e2f74 Mon Sep 17 00:00:00 2001
From: Mark Saroufim
Date: Mon, 29 Apr 2024 10:56:33 -0700
Subject: [PATCH 2/7] yolo

---
 README.md | 21 +++++++++++----------
 1 file changed, 11 insertions(+), 10 deletions(-)

diff --git a/README.md b/README.md
index 334f4b89d7..d88819d406 100644
--- a/README.md
+++ b/README.md
@@ -87,18 +87,19 @@ To learn more try out our APIs, you can check out API examples in

 ## Supported Features
 1. [Quantization algorithms](./torchao/quantization)
-    * [Int8 weight-only](https://github.com/pytorch/ao/blob/main/torchao/quantization/weight_only.py) quantization
-    * Int8 dynamic activation quantization with int4 weight quantization
-    * [GPTQ](https://github.com/pytorch/ao/blob/main/torchao/quantization/GPTQ.py)
-    * [Smoothquant](https://github.com/pytorch/ao/blob/main/torchao/quantization/smoothquant.py)
-    * High level [torchao.autoquant API](https://github.com/pytorch/ao/blob/main/torchao/quantization/autoquant.py) and [kernel autotuner](https://github.com/pytorch/ao/blob/main/torchao/kernel/autotuner.py) targeting SOTA performance across varying model shapes on consumer and enterprise GPUs
+    - Int4 weight-only quantization TODO: Where is this?
+
+    - [Int8 weight-only](https://github.com/pytorch/ao/blob/main/torchao/quantization/weight_only.py) quantization
+    - Int8 dynamic activation quantization with int4 weight quantization TODO: Where is this?
+    - [GPTQ](https://github.com/pytorch/ao/blob/main/torchao/quantization/GPTQ.py) and [Smoothquant](https://github.com/pytorch/ao/blob/main/torchao/quantization/smoothquant.py) for low latency inference
+    - High level [torchao.autoquant API](https://github.com/pytorch/ao/blob/main/torchao/quantization/autoquant.py) and [kernel autotuner](https://github.com/pytorch/ao/blob/main/torchao/kernel/autotuner.py) targeting SOTA performance across varying model shapes on consumer and enterprise GPUs
 2. [Sparsity algorithms](./torchao/sparsity) such as Wanda that help improve accuracy of sparse networks
 3. Support for lower precision [dtypes](./torchao/dtypes) such as
-    * [nf4](https://github.com/pytorch/ao/blob/main/torchao/dtypes/nf4tensor.py) which was used to [implement QLoRA](https://github.com/pytorch/torchtune/blob/main/docs/source/tutorials/qlora_finetune.rst) without writing custom Triton or CUDA code
-    * [uint4](https://github.com/pytorch/ao/blob/main/torchao/dtypes/uint4.py)
-4. [Bleeding Edge Kernels](./torchao/prototype/)
-    * [GaLore](https://github.com/pytorch/ao/tree/main/torchao/prototype/galore) for memory efficient finetuning
-    * [fused HQQ Gemm Kernel](https://github.com/pytorch/ao/tree/main/torchao/prototype/hqq) for compute bound workloads
+    - [nf4](https://github.com/pytorch/ao/blob/main/torchao/dtypes/nf4tensor.py) which was used to [implement QLoRA](https://github.com/pytorch/torchtune/blob/main/docs/source/tutorials/qlora_finetune.rst) without writing custom Triton or CUDA code
+    - [uint4](https://github.com/pytorch/ao/blob/main/torchao/dtypes/uint4.py) TODO: What is this useful for?
+4. [Bleeding Edge Kernels](./torchao/prototype/) for experimental kernels without backwards compatibility guarantees
+    - [GaLore](https://github.com/pytorch/ao/tree/main/torchao/prototype/galore) for memory efficient finetuning
+    - [fused HQQ Gemm Kernel](https://github.com/pytorch/ao/tree/main/torchao/prototype/hqq) for compute bound workloads

 ## Our Goals

From 96ba7184b9ba395860c2a79911e2ab53ab3d0dc4 Mon Sep 17 00:00:00 2001
From: Mark Saroufim
Date: Mon, 29 Apr 2024 10:57:54 -0700
Subject: [PATCH 3/7] yolo

---
 README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README.md b/README.md
index d88819d406..0e35abfa98 100644
--- a/README.md
+++ b/README.md
@@ -5,7 +5,7 @@
 This repository is currently under heavy development - if you have suggestions on the API or use-cases you'd like to be covered, please open an [issue](https://github.com/pytorch/ao/issues)

 ## Introduction
-`torchao` is a PyTorch library for quantization and sparsity that compose out of the box with `torch.compile` and `FSDP`
+`torchao` is a PyTorch library for quantization and sparsity.

 ## Get Started

From b36adf3cde9e8d0d49cc18bc0057483b52c8a67b Mon Sep 17 00:00:00 2001
From: Mark Saroufim
Date: Mon, 29 Apr 2024 11:04:10 -0700
Subject: [PATCH 4/7] Update README.md

---
 README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README.md b/README.md
index 0e35abfa98..22e72d774d 100644
--- a/README.md
+++ b/README.md
@@ -1,6 +1,6 @@
 # torchao: PyTorch Architecture Optimization

-[https://dcbadge.vercel.app/api/server/cudamode?style=flat](discord.gg/cudamode)
+[![](https://dcbadge.vercel.app/api/server/cudamode?style=flat)](discord.gg/cudamode)

 This repository is currently under heavy development - if you have suggestions on the API or use-cases you'd like to be covered, please open an [issue](https://github.com/pytorch/ao/issues)

From 30627b5655605ce833b805d977b7a23ef44bf084 Mon Sep 17 00:00:00 2001
From: Mark Saroufim
Date: Mon, 29 Apr 2024 11:35:50 -0700
Subject: [PATCH 5/7] yolo

---
 README.md | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/README.md b/README.md
index 22e72d774d..257e846f1e 100644
--- a/README.md
+++ b/README.md
@@ -17,6 +17,11 @@ Stable Release
 pip install torchao
 ```

+Nightly Release
+```Shell
+pip install torchao-nightly
+```
+
 From source

 ```Shell
@@ -42,7 +47,7 @@ input = torch.randn(32,32, dtype=torch.bfloat16, device='cuda')
 # perform autoquantization
 torchao.autoquant(model, (input))

-# compile the model to improve performance
+# compile the model to recover performance
 model = torch.compile(model, mode='max-autotune')
 model(input)
 ```

From 815c9ad519f81e3ce3ec5a588472bd09735aa399 Mon Sep 17 00:00:00 2001
From: Mark Saroufim
Date: Mon, 29 Apr 2024 11:41:51 -0700
Subject: [PATCH 6/7] yolo

---
 README.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/README.md b/README.md
index 257e846f1e..10b082e2ae 100644
--- a/README.md
+++ b/README.md
@@ -95,13 +95,13 @@ To learn more try out our APIs, you can check out API examples in
     - Int4 weight-only quantization TODO: Where is this?

     - [Int8 weight-only](https://github.com/pytorch/ao/blob/main/torchao/quantization/weight_only.py) quantization
-    - Int8 dynamic activation quantization with int4 weight quantization TODO: Where is this?
+    - [Int4 weight-only](https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/native/cuda/int4mm.cu) quantization
     - [GPTQ](https://github.com/pytorch/ao/blob/main/torchao/quantization/GPTQ.py) and [Smoothquant](https://github.com/pytorch/ao/blob/main/torchao/quantization/smoothquant.py) for low latency inference
     - High level [torchao.autoquant API](https://github.com/pytorch/ao/blob/main/torchao/quantization/autoquant.py) and [kernel autotuner](https://github.com/pytorch/ao/blob/main/torchao/kernel/autotuner.py) targeting SOTA performance across varying model shapes on consumer and enterprise GPUs
 2. [Sparsity algorithms](./torchao/sparsity) such as Wanda that help improve accuracy of sparse networks
 3. Support for lower precision [dtypes](./torchao/dtypes) such as
     - [nf4](https://github.com/pytorch/ao/blob/main/torchao/dtypes/nf4tensor.py) which was used to [implement QLoRA](https://github.com/pytorch/torchtune/blob/main/docs/source/tutorials/qlora_finetune.rst) without writing custom Triton or CUDA code
-    - [uint4](https://github.com/pytorch/ao/blob/main/torchao/dtypes/uint4.py) TODO: What is this useful for?
+    - [uint4](https://github.com/pytorch/ao/blob/main/torchao/dtypes/uint4.py)
 4. [Bleeding Edge Kernels](./torchao/prototype/) for experimental kernels without backwards compatibility guarantees
     - [GaLore](https://github.com/pytorch/ao/tree/main/torchao/prototype/galore) for memory efficient finetuning
     - [fused HQQ Gemm Kernel](https://github.com/pytorch/ao/tree/main/torchao/prototype/hqq) for compute bound workloads

From 5ca057434aa902131c080f61c7bf64933a4d23ab Mon Sep 17 00:00:00 2001
From: Mark Saroufim
Date: Mon, 29 Apr 2024 13:07:55 -0700
Subject: [PATCH 7/7] yolo

---
 README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README.md b/README.md
index 10b082e2ae..dc8fe1f8b3 100644
--- a/README.md
+++ b/README.md
@@ -60,7 +60,7 @@ from torch.sparse import to_sparse_semi_structured, SparseSemiStructuredTensor
 from torch.ao.pruning import WeightNormSparsifier
 # bfloat16 CUDA model
-model = model.half().cuda()
+model = torch.nn.Sequential(torch.nn.Linear(64, 64)).cuda().to(torch.bfloat16)

 # Accuracy: Finding a sparse subnetwork
 sparse_config = []
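
Review note on the Sparsity example: the README snippet configures `WeightNormSparsifier` with `sparse_block_shape=(1,4)` and `zeros_per_block=2`, which describes a 2:4 semi-structured pattern (two zeros in every run of four weights). The pure-Python sketch below only illustrates that pattern on a single row. The helper `prune_2_of_4` is a hypothetical name, not part of `torch` or `torchao`, and magnitude-based selection with first-index tie-breaking is an assumption here; the sparsifier's own selection may differ in detail.

```python
def prune_2_of_4(row):
    """Zero the 2 smallest-magnitude values in every consecutive block of 4.

    Hypothetical illustration of a 2:4 pattern; not the torch.ao.pruning code.
    """
    if len(row) % 4 != 0:
        raise ValueError("row length must be a multiple of 4")
    pruned = []
    for start in range(0, len(row), 4):
        block = row[start:start + 4]
        # Indices of the two largest-magnitude entries in this block.
        # sorted() is stable, so ties keep the earlier index.
        keep = sorted(range(4), key=lambda i: abs(block[i]), reverse=True)[:2]
        pruned.extend(v if i in keep else 0.0 for i, v in enumerate(block))
    return pruned


row = [0.1, -0.9, 0.4, 0.05, 2.0, -0.3, 0.0, 1.5]
sparse_row = prune_2_of_4(row)
print(sparse_row)
```

Each block of four keeps only its two largest-magnitude entries, which is the layout `to_sparse_semi_structured` expects before it compresses the weight.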