Feat (export): Remove QOP Export (Xilinx#917)
costigt-dev authored and Giuseppe5 committed Apr 12, 2024
1 parent ea1cbc8 commit a5a0822
Showing 24 changed files with 14 additions and 2,340 deletions.
398 changes: 0 additions & 398 deletions docs/tutorials/onnx_export.ipynb

Large diffs are not rendered by default.

175 changes: 1 addition & 174 deletions docs/tutorials/tvmcon2021.ipynb
@@ -1882,93 +1882,6 @@
" return IFrame(src=f\"http://localhost:{port}/\", width=\"100%\", height=400)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Export to ONNX QOps\n",
"\n",
"Say we want to export a QuantConv1d with 4b symmetric weights, 8b symmetric inputs and outputs, and 16 biases. \n",
"We can export it to a ONNX's `QLinearConv`, but some information will be lost. In particular, weights will be represented as 8b and bias as 32b, even though they are respectively 4b and 16b. This is because ONNX does not provide a standardized way to represent them as such:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"torch.manual_seed(0)\n",
"\n",
"from brevitas.nn import QuantConv1d\n",
"from brevitas.quant import Int8WeightPerTensorFloat, Int8ActPerTensorFloat, Int16Bias\n",
"from brevitas.export import export_onnx_qop\n",
"\n",
"float_inp = torch.randn(1, 2, 5)\n",
"\n",
"quant_conv_4b8b = QuantConv1d(\n",
" 2, 4, 3, bias=True, weight_bit_width=4,\n",
" input_quant=Int8ActPerTensorFloat,\n",
" output_quant=Int8ActPerTensorFloat,\n",
" bias_quant=Int16Bias)\n",
"\n",
"output_path = 'qop_onnx_conv_4b8b.onnx'\n",
"export_onnx_qop(quant_conv_4b8b, input_t=float_inp, export_path=output_path)"
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {
"tags": [
"skip-execution"
]
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Serving 'qop_onnx_conv_4b8b.onnx' at http://localhost:8082\n"
]
},
{
"data": {
"text/html": [
"\n",
" <iframe\n",
" width=\"100%\"\n",
" height=\"400\"\n",
" src=\"http://localhost:8082/\"\n",
" frameborder=\"0\"\n",
" allowfullscreen\n",
" \n",
" ></iframe>\n",
" "
],
"text/plain": [
"<IPython.lib.display.IFrame at 0x1720d689b38>"
]
},
"execution_count": 39,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"show_netron(output_path, 8082)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"In general the standard ONNX opset doesn't support representing quantization below 8b. Additionally, ONNX QOp representation requires an output quantizer to be set at part of of the layer. \n",
"\n",
"The constraint of always having an output quantizer is relaxed in the more recently introduced QDQ style of representation (for which there is support in Brevitas starting from version 0.8), which uses only `QuantizeLinear` and `DequantizeLinear` to represent quantization, but even with that support is still limited to 8b quantization."
]
},
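For completeness, here is a minimal, hedged sketch of the QDQ-style export mentioned above. It assumes the `export_onnx_qcdq` entry point available in recent Brevitas releases (Brevitas' QDQ-style path inserts a Clip node, hence the QCDQ name); the exact signature may differ across versions.

```python
# Hypothetical sketch: QDQ-style (QCDQ) export of the same layer.
# Entry point and signature assumed from recent Brevitas releases.
import torch

from brevitas.nn import QuantConv1d
from brevitas.quant import Int8ActPerTensorFloat, Int16Bias
from brevitas.export import export_onnx_qcdq  # QuantizeLinear-Clip-DequantizeLinear export

torch.manual_seed(0)
float_inp = torch.randn(1, 2, 5)

quant_conv_4b8b = QuantConv1d(
    2, 4, 3, bias=True, weight_bit_width=4,
    input_quant=Int8ActPerTensorFloat,
    output_quant=Int8ActPerTensorFloat,
    bias_quant=Int16Bias)

# No output quantizer is strictly required in the QDQ-style representation,
# but quantization is still expressed through 8b QuantizeLinear/DequantizeLinear pairs.
export_onnx_qcdq(quant_conv_4b8b, float_inp, export_path='qcdq_onnx_conv_4b8b.onnx')
```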
{
"cell_type": "markdown",
"metadata": {},
@@ -2112,93 +2025,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"The custom format shown above can integrated into ONNX-based toolchains, e.g. it's supported by our own FINN toolchain for low-precision dataflow style custom FPGAs implementations, and would be a starting point for direct integration with TVM.\n",
"\n",
"## Export to TorchScript quantization backend\n",
"\n",
"It's also possible to export to TorchScript own quantized functional operators, which come with their own set of restrictions. In particular, weights should be 7b and unsigned, which requires a zero-point. We can model that with appropriate quantizers:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from brevitas.quant import ShiftedUint8ActPerTensorFloat\n",
"from brevitas.export import export_torch_qop\n",
"\n",
"\n",
"quant_conv_8b7b = QuantConv1d(\n",
" 2, 4, 3, bias=True,\n",
" input_quant=ShiftedUint8ActPerTensorFloat,\n",
" output_quant=ShiftedUint8ActPerTensorFloat,\n",
" weight_bit_width=7,\n",
" bias_quant=Int16Bias)\n",
"\n",
"output_path = 'pytorch_qf_conv_8b7b.pt'\n",
"export_torch_qop(quant_conv_8b7b, input_t=float_inp, export_path=output_path)"
]
},
{
"cell_type": "code",
"execution_count": 42,
"metadata": {
"tags": [
"skip-execution"
]
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"c:\\users\\alessandro\\documenti\\brevitas_tvmcon\\src\\brevitas\\quant_tensor\\__init__.py:74: TracerWarning: torch.tensor results are registered as constants in the trace. You can safely ignore this warning if you use this function to create tensors out of constant variables that would be the same every time you call this function. In any other case, this might cause the trace to be incorrect.\n",
" training = torch.tensor(training, dtype=torch.bool)\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Serving 'pytorch_qf_conv_8b7b.pt' at http://localhost:8085\n"
]
},
{
"data": {
"text/html": [
"\n",
" <iframe\n",
" width=\"100%\"\n",
" height=\"400\"\n",
" src=\"http://localhost:8085/\"\n",
" frameborder=\"0\"\n",
" allowfullscreen\n",
" \n",
" ></iframe>\n",
" "
],
"text/plain": [
"<IPython.lib.display.IFrame at 0x1720e87a438>"
]
},
"execution_count": 42,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"show_netron(output_path, 8085)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As we can see though information on the fact that activations are 7b is lost, and they simply marked as 8b.\n",
"\n",
"Additionally, because bias quantization is not represented explicitly (although it is performed implicitly at 32b at runtime in the backend), any information around that is lost.\n",
"As with standard ONNX, representing precisions below 8b is not possible."
"The custom format shown above can integrated into ONNX-based toolchains, e.g. it's supported by our own FINN toolchain for low-precision dataflow style custom FPGAs implementations, and would be a starting point for direct integration with TVM."
]
},
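As a hedged pointer to what remains available after the QOp path is removed, the sketch below uses the `export_qonnx` entry point assumed from recent Brevitas releases to produce the FINN-friendly custom format discussed above; the function name may differ in older versions.

```python
# Hedged sketch: exporting the custom (QONNX) format consumed by FINN.
# export_qonnx is assumed from recent Brevitas releases; adjust to your version.
import torch

from brevitas.nn import QuantConv1d
from brevitas.quant import Int8ActPerTensorFloat, Int16Bias
from brevitas.export import export_qonnx

torch.manual_seed(0)
float_inp = torch.randn(1, 2, 5)

quant_conv_4b8b = QuantConv1d(
    2, 4, 3, bias=True, weight_bit_width=4,
    input_quant=Int8ActPerTensorFloat,
    output_quant=Int8ActPerTensorFloat,
    bias_quant=Int16Bias)

# Unlike the QOp export, the custom format keeps sub-8b bit widths in the graph.
export_qonnx(quant_conv_4b8b, float_inp, export_path='qonnx_conv_4b8b.onnx')
```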
{
176 changes: 1 addition & 175 deletions notebooks/Brevitas_TVMCon2021.ipynb
@@ -1903,102 +1903,6 @@
" return IFrame(src=f\"http://localhost:{port}/\", width=\"100%\", height=400)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Export to ONNX QOps\n",
"\n",
"Say we want to export a QuantConv1d with 4b symmetric weights, 8b symmetric inputs and outputs, and 16 biases. \n",
"We can export it to a ONNX's `QLinearConv`, but some information will be lost. In particular, weights will be represented as 8b and bias as 32b, even though they are respectively 4b and 16b. This is because ONNX does not provide a standardized way to represent them as such:"
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/scratch/fabian/brevitas/src/brevitas/export/onnx/standard/manager.py:26: UserWarning: ONNX opset version set to 13, override with opset_version=\n",
" warnings.warn(f\"ONNX opset version set to {DEFAULT_OPSET}, override with {ka}=\")\n"
]
}
],
"source": [
"torch.manual_seed(0)\n",
"\n",
"from brevitas.nn import QuantConv1d\n",
"from brevitas.quant import Int8WeightPerTensorFloat, Int8ActPerTensorFloat, Int16Bias\n",
"from brevitas.export import export_onnx_qop\n",
"\n",
"float_inp = torch.randn(1, 2, 5)\n",
"\n",
"quant_conv_4b8b = QuantConv1d(\n",
" 2, 4, 3, bias=True, weight_bit_width=4,\n",
" input_quant=Int8ActPerTensorFloat,\n",
" output_quant=Int8ActPerTensorFloat,\n",
" bias_quant=Int16Bias)\n",
"\n",
"output_path = 'qop_onnx_conv_4b8b.onnx'\n",
"exported_model = export_onnx_qop(quant_conv_4b8b, input_t=float_inp, export_path=output_path)"
]
},
{
"cell_type": "code",
"execution_count": 40,
"metadata": {
"tags": [
"skip-execution"
]
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Serving 'qop_onnx_conv_4b8b.onnx' at http://localhost:8082\n"
]
},
{
"data": {
"text/html": [
"\n",
" <iframe\n",
" width=\"100%\"\n",
" height=\"400\"\n",
" src=\"http://localhost:8082/\"\n",
" frameborder=\"0\"\n",
" allowfullscreen\n",
" \n",
" ></iframe>\n",
" "
],
"text/plain": [
"<IPython.lib.display.IFrame at 0x7f92ca3e1a10>"
]
},
"execution_count": 40,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"show_netron(output_path, 8082)"
]
},
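To check the precision claims above concretely, a small inspection sketch (assuming the `onnx` package is installed) can list the element types of the initializers stored in the exported model:

```python
# Sketch: inspect the QLinearConv initializers to confirm that the 4b weights are
# stored as 8b integers and the 16b bias as 32b integers.
import onnx
from onnx import TensorProto

model = onnx.load('qop_onnx_conv_4b8b.onnx')
for init in model.graph.initializer:
    # data_type is a protobuf enum value, e.g. TensorProto.INT8 or TensorProto.INT32
    print(init.name, TensorProto.DataType.Name(init.data_type))
```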
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"In general the standard ONNX opset doesn't support representing quantization below 8b. Additionally, ONNX QOp representation requires an output quantizer to be set at part of of the layer. \n",
"\n",
"The constraint of always having an output quantizer is relaxed in the more recently introduced QDQ style of representation (for which there is support in Brevitas starting from version 0.8), which uses only `QuantizeLinear` and `DequantizeLinear` to represent quantization, but even with that support is still limited to 8b quantization."
]
},
{
"cell_type": "markdown",
"metadata": {},
@@ -2142,85 +2046,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"The custom format shown above can integrated into ONNX-based toolchains, e.g. it's supported by our own FINN toolchain for low-precision dataflow style custom FPGAs implementations, and would be a starting point for direct integration with TVM.\n",
"\n",
"## Export to TorchScript quantization backend\n",
"\n",
"It's also possible to export to TorchScript own quantized functional operators, which come with their own set of restrictions. In particular, weights should be 7b and unsigned, which requires a zero-point. We can model that with appropriate quantizers:"
]
},
{
"cell_type": "code",
"execution_count": 45,
"metadata": {},
"outputs": [],
"source": [
"from brevitas.quant import ShiftedUint8ActPerTensorFloat\n",
"from brevitas.export import export_torch_qop\n",
"\n",
"\n",
"quant_conv_8b7b = QuantConv1d(\n",
" 2, 4, 3, bias=True,\n",
" input_quant=ShiftedUint8ActPerTensorFloat,\n",
" output_quant=ShiftedUint8ActPerTensorFloat,\n",
" weight_bit_width=7,\n",
" bias_quant=Int16Bias)\n",
"\n",
"output_path = 'pytorch_qf_conv_8b7b.pt'\n",
"exported_model = export_torch_qop(quant_conv_8b7b, input_t=float_inp, export_path=output_path)"
]
},
{
"cell_type": "code",
"execution_count": 46,
"metadata": {
"tags": [
"skip-execution"
]
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Serving 'pytorch_qf_conv_8b7b.pt' at http://localhost:8085\n"
]
},
{
"data": {
"text/html": [
"\n",
" <iframe\n",
" width=\"100%\"\n",
" height=\"400\"\n",
" src=\"http://localhost:8085/\"\n",
" frameborder=\"0\"\n",
" allowfullscreen\n",
" \n",
" ></iframe>\n",
" "
],
"text/plain": [
"<IPython.lib.display.IFrame at 0x7f92ca4a9550>"
]
},
"execution_count": 46,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"show_netron(output_path, 8085)"
]
},
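As a quick sanity check on what the TorchScript export records, the saved module can be reloaded and inspected; this is a sketch assuming the export cell above completed and wrote `pytorch_qf_conv_8b7b.pt`:

```python
# Sketch: reload the TorchScript export and print its code/graph to see how
# quantization is represented (expect quantize_per_tensor and quantized conv calls).
import torch

loaded = torch.jit.load('pytorch_qf_conv_8b7b.pt')
print(loaded.code)   # TorchScript source of the traced module
print(loaded.graph)  # lower-level graph view of the same module
```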
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As we can see though information on the fact that activations are 7b is lost, and they simply marked as 8b.\n",
"\n",
"Additionally, because bias quantization is not represented explicitly (although it is performed implicitly at 32b at runtime in the backend), any information around that is lost.\n",
"As with standard ONNX, representing precisions below 8b is not possible."
"The custom format shown above can integrated into ONNX-based toolchains, e.g. it's supported by our own FINN toolchain for low-precision dataflow style custom FPGAs implementations, and would be a starting point for direct integration with TVM."
]
},
{

