
Commit b4734c8

Authored by martinlsm (Martin Lindström)
Arm backend: Update docs to mention partial quantization (#16291)
VGF now supports partial quantization, i.e., having the model run in mixed numerical precision. Update the markdown documentation to include and explain this feature.

Signed-off-by: Martin Lindström <Martin.Lindstroem@arm.com>
Co-authored-by: Martin Lindström <Martin.Lindstroem@arm.com>

1 parent: 51d9c75 · commit: b4734c8

File tree

4 files changed: 36 additions, 4 deletions

- docs/source/backends/arm-ethos-u/arm-ethos-u-quantization.md
- docs/source/backends/arm-vgf/arm-vgf-overview.md
- docs/source/backends/arm-vgf/arm-vgf-quantization.md
- docs/source/backends/arm-vgf/tutorials/vgf-getting-started.md


docs/source/backends/arm-ethos-u/arm-ethos-u-quantization.md

Lines changed: 1 addition & 0 deletions

````diff
@@ -10,6 +10,7 @@ The Arm Ethos-U delegate supports the following quantization schemes:
 
 - 8-bit symmetric weights with 8-bit asymmetric activations (via the PT2E quantization flow).
 - Limited support for 16-bit quantization with 16-bit activations and 8-bit weights (a.k.a 16x8 quantization). This is under development.
+- Partial quantization is *not* supported on the Ethos-U backend. The entire model must be quantized.
 
 ### Quantization API
 
````

docs/source/backends/arm-vgf/arm-vgf-overview.md

Lines changed: 2 additions & 0 deletions

````diff
@@ -84,6 +84,8 @@ See [Partitioner API](arm-vgf-partitioner.md) for more information of the Partitioner API.
 The VGF quantizer supports [Post Training Quantization (PT2E)](https://docs.pytorch.org/ao/main/tutorials_source/pt2e_quant_ptq.html)
 and [Quantization-Aware Training (QAT)](https://docs.pytorch.org/ao/main/tutorials_source/pt2e_quant_qat.html).
 
+Partial quantization is supported, allowing users to quantize only specific parts of the model while leaving others in floating-point.
+
 For more information on quantization, see [Quantization](arm-vgf-quantization.md).
 
 ## Runtime Integration
````

docs/source/backends/arm-vgf/arm-vgf-quantization.md

Lines changed: 19 additions & 0 deletions

````diff
@@ -13,6 +13,25 @@ The quantization schemes supported by the VGF Backend are:
 
 Weight-only quantization is not currently supported on the VGF backend.
 
+### Partial Quantization
+
+The VGF backend supports partial quantization, where only parts of the model
+are quantized while others remain in floating-point. This can be useful for
+models where certain layers are not well-suited for quantization or when a
+balance between performance and accuracy is desired.
+
+For every node (op) in the graph, the quantizer looks at the *quantization
+configuration* set for that specific node. If the configuration is set to
+`None`, the node is left in floating-point; if it is provided (not `None`), the
+node is quantized according to that configuration.
+
+With the [Quantization API](#quantization-api), users can specify the
+quantization configurations for specific layers or submodules of the model. The
+`set_global` method is first used to set a default quantization configuration
+(could be `None` as explained above) for all nodes in the model. Then,
+configurations for specific layers or submodules can override the global
+setting using the `set_module_name` or `set_module_type` methods.
+
 ### Quantization API
 
 ```python
````
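Taken together, the added section describes a configuration pattern rather than showing it. Below is a minimal sketch of that pattern, assuming the import paths for `VgfQuantizer` and `get_symmetric_quantization_config` (only the `VgfCompileSpec` import appears in this commit) and using a hypothetical module name for the `set_module_name` override:

```python
import torch

# NOTE: the import paths below are assumptions for illustration; this commit's
# hunks only show the VgfCompileSpec import. Adjust to your ExecuTorch version.
from executorch.backends.arm.quantizer import (
    VgfQuantizer,
    get_symmetric_quantization_config,
)
from executorch.backends.arm.vgf import VgfCompileSpec

compile_spec = VgfCompileSpec()
quantizer = VgfQuantizer(compile_spec)

# Global default: quantize every node with a symmetric 8-bit config.
# Passing None here instead would leave all nodes in floating-point by default.
quantizer.set_global(get_symmetric_quantization_config(is_per_channel=False))

# Override by module type: keep every torch.nn.Sigmoid in floating-point.
quantizer.set_module_type(torch.nn.Sigmoid, None)

# Override by module name (hypothetical submodule name, for illustration only).
quantizer.set_module_name("decoder.out_proj", None)
```

Per the prose above, the per-type and per-name settings override the global default, so the sigmoid layers and the named submodule would stay in floating-point while the rest of the model is quantized.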

docs/source/backends/arm-vgf/tutorials/vgf-getting-started.md

Lines changed: 14 additions & 4 deletions

````diff
@@ -78,13 +78,17 @@ The example below shows how to quantize a model consisting of a single addition,
 ```python
 import torch
 
-class Add(torch.nn.Module):
+class AddSigmoid(torch.nn.Module):
+    def __init__(self):
+        super().__init__()
+        self.sigmoid = torch.nn.Sigmoid()
+
     def forward(self, x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
-        return x + y
+        return self.sigmoid(x + y)
 
 example_inputs = (torch.ones(1,1,1,1),torch.ones(1,1,1,1))
 
-model = Add()
+model = AddSigmoid()
 model = model.eval()
 exported_program = torch.export.export(model, example_inputs)
 graph_module = exported_program.graph_module
@@ -98,13 +102,19 @@ from executorch.backends.arm.vgf import VgfCompileSpec
 from torchao.quantization.pt2e.quantize_pt2e import convert_pt2e, prepare_pt2e
 
 # Create a compilation spec describing the target for configuring the quantizer
-compile_spec = VgfCompileSpec("TOSA-1.0+INT")
+compile_spec = VgfCompileSpec()
 
 # Create and configure quantizer to use a symmetric quantization config globally on all nodes
 quantizer = VgfQuantizer(compile_spec)
 operator_config = get_symmetric_quantization_config(is_per_channel=False)
+
+# Set default quantization config for the layers in the model.
+# Can also be set to `None` to let layers run in FP by default.
 quantizer.set_global(operator_config)
 
+# OPTIONAL: skip quantizing all sigmoid ops (only one for this model); let it run in FP
+quantizer.set_module_type(torch.nn.Sigmoid, None)
+
 # Post training quantization
 quantized_graph_module = prepare_pt2e(graph_module, quantizer)
 quantized_graph_module(*example_inputs) # Calibrate the graph module with the example input
````
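The hunk ends at calibration. As a sketch of what typically follows in the PT2E flow (an assumption based on the standard prepare/calibrate/convert sequence, not part of this diff; `convert_pt2e` is already imported in the snippet above):

```python
# Convert the calibrated graph module into its quantized form.
quantized_graph_module = convert_pt2e(quantized_graph_module)

# Ops whose quantization config was None (the sigmoid above) remain in
# floating-point, so the converted module runs in mixed numerical precision.

# Assumed next step per the usual ExecuTorch flow: re-export the quantized
# module before lowering to the VGF backend.
exported_program = torch.export.export(quantized_graph_module, example_inputs)
```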
