@@ -10,6 +10,7 @@ The Arm Ethos-U delegate supports the following quantization schemes:

- 8-bit symmetric weights with 8-bit asymmetric activations (via the PT2E quantization flow).
- Limited support for 16-bit quantization with 16-bit activations and 8-bit weights (a.k.a. 16x8 quantization). This is under development.
- Partial quantization is *not* supported on the Ethos-U backend. The entire model must be quantized (see the sketch below).
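
As an illustration of the full-model requirement, here is a minimal sketch of the PT2E flow for Ethos-U. The `EthosUQuantizer` and `EthosUCompileSpec` names, their import paths, and the `ethos-u55-128` target string are assumptions made by analogy with the VGF flow elsewhere in this change; check the Quantization API section below for the exact interface.

```python
import torch
from torchao.quantization.pt2e.quantize_pt2e import convert_pt2e, prepare_pt2e

# Assumed import paths/names -- check the Quantization API section below.
from executorch.backends.arm.ethosu import EthosUCompileSpec
from executorch.backends.arm.quantizer import (
    EthosUQuantizer,
    get_symmetric_quantization_config,
)


class Add(torch.nn.Module):
    def forward(self, x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        return x + y


example_inputs = (torch.ones(1, 1, 1, 1), torch.ones(1, 1, 1, 1))
exported_program = torch.export.export(Add().eval(), example_inputs)

# Hypothetical target string; use the Ethos-U configuration you compile for.
compile_spec = EthosUCompileSpec("ethos-u55-128")
quantizer = EthosUQuantizer(compile_spec)

# A non-None config applies to every node: the whole model gets quantized.
quantizer.set_global(get_symmetric_quantization_config(is_per_channel=False))

prepared = prepare_pt2e(exported_program.graph_module, quantizer)
prepared(*example_inputs)             # calibrate with representative inputs
quantized_module = convert_pt2e(prepared)
```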

### Quantization API

2 changes: 2 additions & 0 deletions docs/source/backends/arm-vgf/arm-vgf-overview.md
@@ -84,6 +84,8 @@ See [Partitioner API](arm-vgf-partitioner.md) for more information of the Partitioner API.
The VGF quantizer supports [Post Training Quantization (PT2E)](https://docs.pytorch.org/ao/main/tutorials_source/pt2e_quant_ptq.html)
and [Quantization-Aware Training (QAT)](https://docs.pytorch.org/ao/main/tutorials_source/pt2e_quant_qat.html).

Partial quantization is supported, allowing users to quantize only specific parts of the model while leaving others in floating-point.

For more information on quantization, see [Quantization](arm-vgf-quantization.md).

## Runtime Integration
19 changes: 19 additions & 0 deletions docs/source/backends/arm-vgf/arm-vgf-quantization.md
@@ -13,6 +13,25 @@ The quantization schemes supported by the VGF Backend are:

Weight-only quantization is not currently supported on the VGF backend.

### Partial Quantization

The VGF backend supports partial quantization, where only parts of the model
are quantized while others remain in floating-point. This can be useful for
models where certain layers are not well-suited for quantization or when a
balance between performance and accuracy is desired.

For every node (op) in the graph, the quantizer looks at the *quantization
configuration* set for that specific node. If the configuration is set to
`None`, the node is left in floating-point; if it is provided (not `None`), the
node is quantized according to that configuration.

With the [Quantization API](#quantization-api), users can specify quantization
configurations for specific layers or submodules of the model. The `set_global`
method is first used to set a default quantization configuration (which may be
`None`, as explained above) for all nodes in the model. Configurations for
specific layers or submodules can then override the global setting via the
`set_module_name` or `set_module_type` methods, as shown in the sketch below.
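
A minimal sketch of this flow follows. The model, the submodule name `conv2`, and the import paths for `VgfQuantizer` and `get_symmetric_quantization_config` are illustrative assumptions; see the [Quantization API](#quantization-api) below and the getting-started tutorial for the exact interface.

```python
import torch
from torchao.quantization.pt2e.quantize_pt2e import convert_pt2e, prepare_pt2e

# Assumed import paths -- check the Quantization API section below.
from executorch.backends.arm.quantizer import get_symmetric_quantization_config
from executorch.backends.arm.vgf import VgfCompileSpec, VgfQuantizer


class SmallModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = torch.nn.Conv2d(3, 8, 3)
        self.conv2 = torch.nn.Conv2d(8, 8, 3)
        self.sigmoid = torch.nn.Sigmoid()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.sigmoid(self.conv2(self.conv1(x)))


example_inputs = (torch.randn(1, 3, 16, 16),)
exported_program = torch.export.export(SmallModel().eval(), example_inputs)

quantizer = VgfQuantizer(VgfCompileSpec())

# Global default: quantize every node with a symmetric INT8 config.
# Passing None here instead would leave all nodes in floating-point by default.
quantizer.set_global(get_symmetric_quantization_config(is_per_channel=False))

# Per-module overrides: keep every Sigmoid and the submodule named "conv2" in FP.
quantizer.set_module_type(torch.nn.Sigmoid, None)
quantizer.set_module_name("conv2", None)

prepared = prepare_pt2e(exported_program.graph_module, quantizer)
prepared(*example_inputs)             # calibrate with representative inputs
quantized_module = convert_pt2e(prepared)
```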

### Quantization API

```python
Expand Down
18 changes: 14 additions & 4 deletions docs/source/backends/arm-vgf/tutorials/vgf-getting-started.md
Original file line number Diff line number Diff line change
@@ -78,13 +78,17 @@ The example below shows how to quantize a model consisting of a single addition,
```python
import torch

-class Add(torch.nn.Module):
+class AddSigmoid(torch.nn.Module):
+    def __init__(self):
+        super().__init__()
+        self.sigmoid = torch.nn.Sigmoid()

    def forward(self, x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
-        return x + y
+        return self.sigmoid(x + y)

example_inputs = (torch.ones(1,1,1,1),torch.ones(1,1,1,1))

-model = Add()
+model = AddSigmoid()
model = model.eval()
exported_program = torch.export.export(model, example_inputs)
graph_module = exported_program.graph_module
```

@@ -98,13 +102,19 @@ from executorch.backends.arm.vgf import VgfCompileSpec
```python
from torchao.quantization.pt2e.quantize_pt2e import convert_pt2e, prepare_pt2e

# Create a compilation spec describing the target for configuring the quantizer
compile_spec = VgfCompileSpec("TOSA-1.0+INT")
compile_spec = VgfCompileSpec()

# Create and configure quantizer to use a symmetric quantization config globally on all nodes
quantizer = VgfQuantizer(compile_spec)
operator_config = get_symmetric_quantization_config(is_per_channel=False)

+# Set default quantization config for the layers in the model.
+# Can also be set to `None` to let layers run in FP by default.
quantizer.set_global(operator_config)

+# OPTIONAL: skip quantizing all sigmoid ops (only one for this model); let it run in FP
+quantizer.set_module_type(torch.nn.Sigmoid, None)

# Post training quantization
quantized_graph_module = prepare_pt2e(graph_module, quantizer)
quantized_graph_module(*example_inputs) # Calibrate the graph module with the example input
```