diff --git a/docs/source/backends/arm-ethos-u/arm-ethos-u-quantization.md b/docs/source/backends/arm-ethos-u/arm-ethos-u-quantization.md
index 3a8700a63e7..8427f540f77 100644
--- a/docs/source/backends/arm-ethos-u/arm-ethos-u-quantization.md
+++ b/docs/source/backends/arm-ethos-u/arm-ethos-u-quantization.md
@@ -10,6 +10,7 @@ The Arm Ethos-U delegate supports the following quantization schemes:
 
 - 8-bit symmetric weights with 8-bit asymmetric activations (via the PT2E quantization flow).
 - Limited support for 16-bit quantization with 16-bit activations and 8-bit weights (a.k.a 16x8 quantization). This is under development.
+- Partial quantization is *not* supported on the Ethos-U backend. The entire model must be quantized.
 
 ### Quantization API
 
diff --git a/docs/source/backends/arm-vgf/arm-vgf-overview.md b/docs/source/backends/arm-vgf/arm-vgf-overview.md
index 4d693354dbc..daf0e08648f 100644
--- a/docs/source/backends/arm-vgf/arm-vgf-overview.md
+++ b/docs/source/backends/arm-vgf/arm-vgf-overview.md
@@ -84,6 +84,8 @@ See [Partitioner API](arm-vgf-partitioner.md) for more information of the Partit
 
 The VGF quantizer supports [Post Training Quantization (PT2E)](https://docs.pytorch.org/ao/main/tutorials_source/pt2e_quant_ptq.html) and [Quantization-Aware Training (QAT)](https://docs.pytorch.org/ao/main/tutorials_source/pt2e_quant_qat.html).
 
+Partial quantization is supported, allowing users to quantize only specific parts of the model while leaving others in floating-point.
+
 For more information on quantization, see [Quantization](arm-vgf-quantization.md).
 
 ## Runtime Integration
diff --git a/docs/source/backends/arm-vgf/arm-vgf-quantization.md b/docs/source/backends/arm-vgf/arm-vgf-quantization.md
index 23f3246eb6b..1abf2f10f76 100644
--- a/docs/source/backends/arm-vgf/arm-vgf-quantization.md
+++ b/docs/source/backends/arm-vgf/arm-vgf-quantization.md
@@ -13,6 +13,50 @@ The quantization schemes supported by the VGF Backend are:
 
 Weight-only quantization is not currently supported on the VGF backend.
 
+### Partial Quantization
+
+The VGF backend supports partial quantization, where only parts of the model
+are quantized while others remain in floating-point. This can be useful for
+models where certain layers are not well-suited for quantization, or when a
+balance between performance and accuracy is desired.
+
+For every node (op) in the graph, the quantizer looks at the *quantization
+configuration* set for that specific node. If the configuration is set to
+`None`, the node is left in floating-point; if it is provided (not `None`), the
+node is quantized according to that configuration.
+
+With the [Quantization API](#quantization-api), users can specify the
+quantization configurations for specific layers or submodules of the model. The
+`set_global` method is first used to set a default quantization configuration
+(which may be `None`, as explained above) for all nodes in the model. Then,
+configurations for specific layers or submodules can override the global
+setting using the `set_module_name` or `set_module_type` methods.
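+
+The sketch below illustrates this flow. It is a minimal example rather than a
+full export pipeline: the import paths are assumed to match the tutorials, and
+the `"decoder"` module name is a hypothetical placeholder for a submodule in
+your model.
+
+```python
+import torch
+
+from executorch.backends.arm.quantizer import (
+    VgfQuantizer,
+    get_symmetric_quantization_config,
+)
+from executorch.backends.arm.vgf import VgfCompileSpec
+
+quantizer = VgfQuantizer(VgfCompileSpec())
+
+# Default: quantize every node with a symmetric config.
+quantizer.set_global(get_symmetric_quantization_config())
+
+# Overrides: keep the hypothetical "decoder" submodule and all Sigmoid modules
+# in floating-point.
+quantizer.set_module_name("decoder", None)
+quantizer.set_module_type(torch.nn.Sigmoid, None)
+```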
+
 ### Quantization API
 
 ```python
diff --git a/docs/source/backends/arm-vgf/tutorials/vgf-getting-started.md b/docs/source/backends/arm-vgf/tutorials/vgf-getting-started.md
index fe4a019528d..f977e1e67d3 100644
--- a/docs/source/backends/arm-vgf/tutorials/vgf-getting-started.md
+++ b/docs/source/backends/arm-vgf/tutorials/vgf-getting-started.md
@@ -78,13 +78,17 @@ The example below shows how to quantize a model consisting of a single addition,
 
 ```python
 import torch
 
-class Add(torch.nn.Module):
+class AddSigmoid(torch.nn.Module):
+    def __init__(self):
+        super().__init__()
+        self.sigmoid = torch.nn.Sigmoid()
+
     def forward(self, x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
-        return x + y
+        return self.sigmoid(x + y)
 
 example_inputs = (torch.ones(1,1,1,1),torch.ones(1,1,1,1))
-model = Add()
+model = AddSigmoid()
 model = model.eval()
 exported_program = torch.export.export(model, example_inputs)
 graph_module = exported_program.graph_module
@@ -98,13 +102,19 @@ from executorch.backends.arm.vgf import VgfCompileSpec
 from torchao.quantization.pt2e.quantize_pt2e import convert_pt2e, prepare_pt2e
 
 # Create a compilation spec describing the target for configuring the quantizer
-compile_spec = VgfCompileSpec("TOSA-1.0+INT")
+compile_spec = VgfCompileSpec()
 
 # Create and configure quantizer to use a symmetric quantization config globally on all nodes
 quantizer = VgfQuantizer(compile_spec)
 operator_config = get_symmetric_quantization_config(is_per_channel=False)
+
+# Set the default quantization config for all layers in the model.
+# It can also be set to `None` to leave layers in FP by default.
 quantizer.set_global(operator_config)
 
+# OPTIONAL: skip quantizing all sigmoid ops (only one in this model); let it run in FP
+quantizer.set_module_type(torch.nn.Sigmoid, None)
+
 # Post training quantization
 quantized_graph_module = prepare_pt2e(graph_module, quantizer)
 quantized_graph_module(*example_inputs) # Calibrate the graph module with the example input