From 0a543e63604fdfbe823036e87d7486be86a83a02 Mon Sep 17 00:00:00 2001 From: Ella Charlaix Date: Thu, 13 Jun 2024 17:50:03 +0200 Subject: [PATCH 01/27] clean documentation --- docs/source/_toctree.yml | 12 ++++++------ .../{ => neural_compressor}/distributed_training.mdx | 0 .../optimization.mdx} | 0 .../reference.mdx} | 0 docs/source/{ => openvino}/inference.mdx | 0 .../optimization.mdx} | 0 .../{reference_ov.mdx => openvino/reference.mdx} | 0 7 files changed, 6 insertions(+), 6 deletions(-) rename docs/source/{ => neural_compressor}/distributed_training.mdx (100%) rename docs/source/{optimization_inc.mdx => neural_compressor/optimization.mdx} (100%) rename docs/source/{reference_inc.mdx => neural_compressor/reference.mdx} (100%) rename docs/source/{ => openvino}/inference.mdx (100%) rename docs/source/{optimization_ov.mdx => openvino/optimization.mdx} (100%) rename docs/source/{reference_ov.mdx => openvino/reference.mdx} (100%) diff --git a/docs/source/_toctree.yml b/docs/source/_toctree.yml index 66e4afd86..be5b31077 100644 --- a/docs/source/_toctree.yml +++ b/docs/source/_toctree.yml @@ -4,19 +4,19 @@ - local: installation title: Installation - sections: - - local: optimization_inc + - local: neural_compressor/optimization title: Optimization - - local: distributed_training + - local: neural_compressor/distributed_training title: Distributed Training - - local: reference_inc + - local: neural_compressor/reference title: Reference title: Neural Compressor - sections: - - local: inference + - local: openvino/inference title: Models for inference - - local: optimization_ov + - local: openvino/optimization title: Optimization - - local: reference_ov + - local: openvino/reference title: Reference title: OpenVINO title: Optimum Intel diff --git a/docs/source/distributed_training.mdx b/docs/source/neural_compressor/distributed_training.mdx similarity index 100% rename from docs/source/distributed_training.mdx rename to docs/source/neural_compressor/distributed_training.mdx diff --git a/docs/source/optimization_inc.mdx b/docs/source/neural_compressor/optimization.mdx similarity index 100% rename from docs/source/optimization_inc.mdx rename to docs/source/neural_compressor/optimization.mdx diff --git a/docs/source/reference_inc.mdx b/docs/source/neural_compressor/reference.mdx similarity index 100% rename from docs/source/reference_inc.mdx rename to docs/source/neural_compressor/reference.mdx diff --git a/docs/source/inference.mdx b/docs/source/openvino/inference.mdx similarity index 100% rename from docs/source/inference.mdx rename to docs/source/openvino/inference.mdx diff --git a/docs/source/optimization_ov.mdx b/docs/source/openvino/optimization.mdx similarity index 100% rename from docs/source/optimization_ov.mdx rename to docs/source/openvino/optimization.mdx diff --git a/docs/source/reference_ov.mdx b/docs/source/openvino/reference.mdx similarity index 100% rename from docs/source/reference_ov.mdx rename to docs/source/openvino/reference.mdx From 38d6b27aea7cf9fbcaa39109c93f981b161b38b1 Mon Sep 17 00:00:00 2001 From: Ella Charlaix Date: Thu, 13 Jun 2024 18:18:02 +0200 Subject: [PATCH 02/27] fix link --- docs/source/index.mdx | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/source/index.mdx b/docs/source/index.mdx index 643b9be04..99b25411a 100644 --- a/docs/source/index.mdx +++ b/docs/source/index.mdx @@ -25,11 +25,11 @@ limitations under the License.
-
Neural Compressor

Learn how to apply compression techniques such as quantization, pruning and knowledge distillation to speed up inference with Intel Neural Compressor.

-
OpenVINO

Learn how to run inference with OpenVINO Runtime and to apply quantization, pruning and knowledge distillation on your model to further speed up inference.

From f5fff7a3c4459c0161efb5886fb1e9998ad07bd4 Mon Sep 17 00:00:00 2001 From: Ella Charlaix Date: Thu, 13 Jun 2024 18:31:59 +0200 Subject: [PATCH 03/27] add export section --- docs/source/_toctree.yml | 4 +- docs/source/openvino/export.mdx | 65 ++++++++++++++++++++++++++++++ docs/source/openvino/inference.mdx | 2 +- 3 files changed, 69 insertions(+), 2 deletions(-) create mode 100644 docs/source/openvino/export.mdx diff --git a/docs/source/_toctree.yml b/docs/source/_toctree.yml index be5b31077..6b6c80f8a 100644 --- a/docs/source/_toctree.yml +++ b/docs/source/_toctree.yml @@ -12,8 +12,10 @@ title: Reference title: Neural Compressor - sections: + - local: openvino/export + title: Export - local: openvino/inference - title: Models for inference + title: Inference - local: openvino/optimization title: Optimization - local: openvino/reference diff --git a/docs/source/openvino/export.mdx b/docs/source/openvino/export.mdx new file mode 100644 index 000000000..c19d07405 --- /dev/null +++ b/docs/source/openvino/export.mdx @@ -0,0 +1,65 @@ + + +# Export your model + +## Exporting a model using the CLI + +It is possible to export your model to the [OpenVINO IR](https://docs.openvino.ai/2024/documentation/openvino-ir-format.html) format with the CLI : + +```bash +optimum-cli export openvino --model gpt2 openvino_model/ +``` + +The model argument can either be the model ID of a model hosted on the [HF hub](https://huggingface.co/models) or a path to a model hosted locally. + +Check out the help for more options: + +```bash +optimum-cli export openvino --help +``` + +### Weight-only quantization + +You can also apply fp16, 8-bit or 4-bit weight compression on the Linear, Convolutional and Embedding layers when exporting your model by setting `--weight-format` to respectively `fp16`, `int8` or `int4`: + +```bash +optimum-cli export openvino --model gpt2 --weight-format int8 openvino_model/ +``` + +This type of optimization allows to reduce the memory footprint and inference latency. + +By default the quantization scheme will be [asymmetric](https://github.com/openvinotoolkit/nncf/blob/develop/docs/compression_algorithms/Quantization.md#asymmetric-quantization), to make it [symmetric](https://github.com/openvinotoolkit/nncf/blob/develop/docs/compression_algorithms/Quantization.md#symmetric-quantization) you can add `--sym`. + +For INT4 quantization you can also specify the following arguments : +* The `--group-size` parameter will define the group size to use for quantization, `-1` it will results in per-column quantization. +* The `--ratio` parameter controls the ratio between 4-bit and 8-bit quantization. If set to 0.9, it means that 90% of the layers will be quantized to `int4` while 10% will be quantized to `int8`. + +Smaller `group_size` and `ratio` values usually improve accuracy at the sacrifice of the model size and inference latency. + + + +Models larger than 1 billion parameters are exported to the OpenVINO format with 8-bit weights by default. You can disable it with `--weight-format fp32`. + + + +Once the model is exported, you can now [load your OpenVINO model](https://huggingface.co/docs/optimum/main/en/intel/inference) + + +## Exporting when loading your model + +You can also load your PyTorch checkpoint and convert it to the OpenVINO format on-the-fly, by setting `export=True` when loading your model. 
+ +```python +from optimum.intel import OVModelForCausalLM + +model = OVModelForCausalLM.from_pretrained("gpt2", export=True) +model.save_pretrained("openvino_model") +tokenizer.save_pretrained("openvino_model") +``` diff --git a/docs/source/openvino/inference.mdx b/docs/source/openvino/inference.mdx index 305beac3c..0d42fe566 100644 --- a/docs/source/openvino/inference.mdx +++ b/docs/source/openvino/inference.mdx @@ -7,7 +7,7 @@ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express o specific language governing permissions and limitations under the License. --> -# Optimum Inference with OpenVINO +# Inference Optimum Intel can be used to load optimized models from the [Hugging Face Hub](https://huggingface.co/models?library=openvino&sort=downloads) and create pipelines to run inference with OpenVINO Runtime without rewriting your APIs. From cebeabb24d59588e601b680e2890034caf9154b4 Mon Sep 17 00:00:00 2001 From: Ella Charlaix Date: Thu, 13 Jun 2024 18:57:58 +0200 Subject: [PATCH 04/27] add section --- docs/source/openvino/export.mdx | 23 ++++++++++++++++++----- 1 file changed, 18 insertions(+), 5 deletions(-) diff --git a/docs/source/openvino/export.mdx b/docs/source/openvino/export.mdx index c19d07405..1ca366b44 100644 --- a/docs/source/openvino/export.mdx +++ b/docs/source/openvino/export.mdx @@ -17,7 +17,8 @@ It is possible to export your model to the [OpenVINO IR](https://docs.openvino.a optimum-cli export openvino --model gpt2 openvino_model/ ``` -The model argument can either be the model ID of a model hosted on the [HF hub](https://huggingface.co/models) or a path to a model hosted locally. +The model argument can either be the model ID of a model hosted on the [Hub](https://huggingface.co/models) or a path to a model hosted locally. + Check out the help for more options: @@ -25,9 +26,22 @@ Check out the help for more options: optimum-cli export openvino --help ``` -### Weight-only quantization +#### Task + +If the task argument is not provided, it will be automatically inferred. + +For model hosted locally, you need to specify it among the list of the [supported tasks](https://huggingface.co/docs/optimum/exporters/task_manager): + +```bash +optimum-cli export openvino --model local_model_dir --task text-generation-with-past openvino_model +``` + +The `-with-past` suffix enable the re-use of the pre-computed key/values hidden-states and is the recommended option. To export the model without (equivalent to `use_cache=False`), you will need to remove this suffix. 
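For example, reusing the command above (with the same hypothetical `local_model_dir` checkpoint), dropping the suffix exports the variant without key/values reuse:

```bash
optimum-cli export openvino --model local_model_dir --task text-generation openvino_model
```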
+ -You can also apply fp16, 8-bit or 4-bit weight compression on the Linear, Convolutional and Embedding layers when exporting your model by setting `--weight-format` to respectively `fp16`, `int8` or `int4`: +#### Quantization + +You can also apply fp16, 8-bit or 4-bit weight-only quantization on the Linear, Convolutional and Embedding layers when exporting your model by setting `--weight-format` to respectively `fp16`, `int8` or `int4`: ```bash optimum-cli export openvino --model gpt2 --weight-format int8 openvino_model/ @@ -49,8 +63,7 @@ Models larger than 1 billion parameters are exported to the OpenVINO format with -Once the model is exported, you can now [load your OpenVINO model](https://huggingface.co/docs/optimum/main/en/intel/inference) - +Once the model is exported, you can now [load your OpenVINO model](https://huggingface.co/docs/optimum/main/en/intel/openvino/inference) ## Exporting when loading your model From aea9cebfa6e3fd926a096b28ca2c85c87ecd6ab7 Mon Sep 17 00:00:00 2001 From: Ella Charlaix Date: Thu, 13 Jun 2024 19:03:11 +0200 Subject: [PATCH 05/27] remove export subsection in inference --- docs/source/openvino/export.mdx | 2 +- docs/source/openvino/inference.mdx | 76 ++---------------------------- 2 files changed, 4 insertions(+), 74 deletions(-) diff --git a/docs/source/openvino/export.mdx b/docs/source/openvino/export.mdx index 1ca366b44..a9967bfbf 100644 --- a/docs/source/openvino/export.mdx +++ b/docs/source/openvino/export.mdx @@ -63,7 +63,7 @@ Models larger than 1 billion parameters are exported to the OpenVINO format with -Once the model is exported, you can now [load your OpenVINO model](https://huggingface.co/docs/optimum/main/en/intel/openvino/inference) +Once the model is exported, you can now [load your OpenVINO model](inference) ## Exporting when loading your model diff --git a/docs/source/openvino/inference.mdx b/docs/source/openvino/inference.mdx index 0d42fe566..dbb913530 100644 --- a/docs/source/openvino/inference.mdx +++ b/docs/source/openvino/inference.mdx @@ -11,77 +11,7 @@ specific language governing permissions and limitations under the License. Optimum Intel can be used to load optimized models from the [Hugging Face Hub](https://huggingface.co/models?library=openvino&sort=downloads) and create pipelines to run inference with OpenVINO Runtime without rewriting your APIs. -## Transformers models - -You can now easily perform inference with OpenVINO Runtime on a variety of Intel processors -([see](https://docs.openvino.ai/2024/about-openvino/compatibility-and-support/supported-devices.html) the full list of supported devices). -For that, just replace the `AutoModelForXxx` class with the corresponding `OVModelForXxx` class. - -As shown in the table below, each task is associated with a class enabling to automatically load your model. 
- -| Task | Auto Class | -|--------------------------------------|--------------------------------------| -| `text-classification` | `OVModelForSequenceClassification` | -| `token-classification` | `OVModelForTokenClassification` | -| `question-answering` | `OVModelForQuestionAnswering` | -| `audio-classification` | `OVModelForAudioClassification` | -| `image-classification` | `OVModelForImageClassification` | -| `feature-extraction` | `OVModelForFeatureExtraction` | -| `fill-mask` | `OVModelForMaskedLM` | -| `image-classification` | `OVModelForImageClassification` | -| `audio-classification` | `OVModelForAudioClassification` | -| `text-generation-with-past` | `OVModelForCausalLM` | -| `text2text-generation-with-past` | `OVModelForSeq2SeqLM` | -| `automatic-speech-recognition` | `OVModelForSpeechSeq2Seq` | -| `image-to-text` | `OVModelForVision2Seq` | - - -### Export - -It is possible to export your model to the [OpenVINO IR](https://docs.openvino.ai/2024/documentation/openvino-ir-format.html) format with the CLI : - -```bash -optimum-cli export openvino --model gpt2 ov_model -``` - -The example above illustrates exporting a checkpoint from the πŸ€— Hub. When exporting a local model, first make sure that you saved both the model’s weights and tokenizer files in the same directory (`local_path`). -When using CLI, pass the `local_path` to the model argument instead of the checkpoint name of the model hosted on the Hub and provide the `--task` argument. You can review the list of supported tasks in the πŸ€— [Optimum documentation](https://huggingface.co/docs/optimum/exporters/task_manager). If task argument is not provided, it will default to the model architecture without any task specific head. -The `-with-past` suffix enable the re-use of the pre-computed key/values hidden-states and is the recommended option, to export the model without (equivalent to `use_cache=False`), you will need to remove this suffix. - -```bash -optimum-cli export openvino --model local_path --task text-generation-with-past ov_model -``` - -To export your model in fp16, you can add `--weight-format fp16` when exporting your model. - - - -Models larger than 1 billion parameters are exported to the OpenVINO format with 8-bit weights by default. You can disable it with `--weight-format fp32`. - - - -Once the model is exported, you can load the OpenVINO model using : - -```python -from optimum.intel import OVModelForCausalLM - -model_id = "ov_model" -model = OVModelForCausalLM.from_pretrained(model_id) -``` - -You can also load your PyTorch checkpoint and convert it to the OpenVINO format on-the-fly, by setting `export=True` when loading your model. - -```python -from optimum.intel import OVModelForCausalLM - -model_id = "gpt2" -model = OVModelForCausalLM.from_pretrained(model_id, export=True) -model.save_pretrained("ov_model") -``` - -### Inference - -You can load an OpenVINO hosted on the hub and perform inference, no need to adapt your code to get it to work with `OVModelForXxx` classes: +Once [your model was exported](export), you can load it by replacing the `AutoModelForXxx` class with the corresponding `OVModelForXxx` class. ```diff - from transformers import AutoModelForCausalLM @@ -96,7 +26,7 @@ You can load an OpenVINO hosted on the hub and perform inference, no need to ada results = pipe("He's a dreadful magician and") ``` -See the [reference documentation](reference_ov) for more information about parameters, and examples for different tasks. 
+See the [reference documentation](reference) for more information about parameters, and examples for different tasks. To easily save the resulting model, you can use the `save_pretrained()` method, which will save both the BIN and XML files describing the graph. It is useful to save the tokenizer to the same directory, to enable easy loading of the tokenizer for the model. @@ -140,7 +70,7 @@ If not specified, `load_in_8bit` will be set to `True` by default when models la -To apply quantization on both weights and activations, you can use the `OVQuantizer`, more information in the [documentation](https://huggingface.co/docs/optimum/main/en/intel/optimization_ov#optimization). +To apply quantization on both weights and activations, you can use the `OVQuantizer`, more information in the [documentation](optimization). ### Static shape From 06b50f3afc1e3012507aeceea20961aabd474e40 Mon Sep 17 00:00:00 2001 From: Ella Charlaix Date: Thu, 13 Jun 2024 19:04:57 +0200 Subject: [PATCH 06/27] rephrase --- docs/source/index.mdx | 2 +- docs/source/openvino/export.mdx | 4 ++-- 2 files changed, 3 insertions(+), 3 deletions(-) diff --git a/docs/source/index.mdx b/docs/source/index.mdx index 99b25411a..75e99d868 100644 --- a/docs/source/index.mdx +++ b/docs/source/index.mdx @@ -29,7 +29,7 @@ limitations under the License. >
Neural Compressor

Learn how to apply compression techniques such as quantization, pruning and knowledge distillation to speed up inference with Intel Neural Compressor.

-
OpenVINO

Learn how to run inference with OpenVINO Runtime and to apply quantization, pruning and knowledge distillation on your model to further speed up inference.

diff --git a/docs/source/openvino/export.mdx b/docs/source/openvino/export.mdx index a9967bfbf..0bce0df42 100644 --- a/docs/source/openvino/export.mdx +++ b/docs/source/openvino/export.mdx @@ -9,7 +9,7 @@ specific language governing permissions and limitations under the License. # Export your model -## Exporting a model using the CLI +## Using the CLI It is possible to export your model to the [OpenVINO IR](https://docs.openvino.ai/2024/documentation/openvino-ir-format.html) format with the CLI : @@ -65,7 +65,7 @@ Models larger than 1 billion parameters are exported to the OpenVINO format with Once the model is exported, you can now [load your OpenVINO model](inference) -## Exporting when loading your model +## During loading You can also load your PyTorch checkpoint and convert it to the OpenVINO format on-the-fly, by setting `export=True` when loading your model. From fcafa4ceee5a9fa542b9106fb82f5615e6da2624 Mon Sep 17 00:00:00 2001 From: Ella Charlaix Date: Thu, 13 Jun 2024 19:07:01 +0200 Subject: [PATCH 07/27] add back suvsection --- docs/source/openvino/inference.mdx | 11 +++-------- 1 file changed, 3 insertions(+), 8 deletions(-) diff --git a/docs/source/openvino/inference.mdx b/docs/source/openvino/inference.mdx index dbb913530..018d9c9a7 100644 --- a/docs/source/openvino/inference.mdx +++ b/docs/source/openvino/inference.mdx @@ -11,6 +11,8 @@ specific language governing permissions and limitations under the License. Optimum Intel can be used to load optimized models from the [Hugging Face Hub](https://huggingface.co/models?library=openvino&sort=downloads) and create pipelines to run inference with OpenVINO Runtime without rewriting your APIs. +## Transformers models + Once [your model was exported](export), you can load it by replacing the `AutoModelForXxx` class with the corresponding `OVModelForXxx` class. ```diff @@ -39,14 +41,7 @@ tokenizer.save_pretrained(save_directory) ### Weight-only quantization -You can also apply fp16, 8-bit or 4-bit weight compression on the Linear, Convolutional and Embedding layers when exporting your model with the CLI by setting `--weight-format` to respectively `fp16`, `int8` or `int4`: - -```bash -optimum-cli export openvino --model gpt2 --weight-format int8 ov_model -``` - -This type of optimization allows to reduce the memory footprint and inference latency. - +You can also apply fp16, 8-bit or 4-bit weight compression on the Linear, Convolutional and Embedding layers when loading your model to reduce the memory footprint and inference latency. By default the quantization scheme will be [asymmetric](https://github.com/openvinotoolkit/nncf/blob/develop/docs/compression_algorithms/Quantization.md#asymmetric-quantization), to make it [symmetric](https://github.com/openvinotoolkit/nncf/blob/develop/docs/compression_algorithms/Quantization.md#symmetric-quantization) you can add `--sym`. From 550ee2b8bb291c9a1613ed7848a8e4a2af27b9d1 Mon Sep 17 00:00:00 2001 From: Ella Charlaix Date: Thu, 13 Jun 2024 19:11:12 +0200 Subject: [PATCH 08/27] fix link --- docs/source/openvino/export.mdx | 2 +- docs/source/openvino/inference.mdx | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/source/openvino/export.mdx b/docs/source/openvino/export.mdx index 0bce0df42..d8ef36533 100644 --- a/docs/source/openvino/export.mdx +++ b/docs/source/openvino/export.mdx @@ -30,7 +30,7 @@ optimum-cli export openvino --help If the task argument is not provided, it will be automatically inferred. 
-For model hosted locally, you need to specify it among the list of the [supported tasks](https://huggingface.co/docs/optimum/exporters/task_manager): +For model hosted locally, you need to specify it among the list of the [supported tasks](https://huggingface.co/docs/optimum/main/en/exporters/task_manager): ```bash optimum-cli export openvino --model local_model_dir --task text-generation-with-past openvino_model diff --git a/docs/source/openvino/inference.mdx b/docs/source/openvino/inference.mdx index 018d9c9a7..6d1892ff4 100644 --- a/docs/source/openvino/inference.mdx +++ b/docs/source/openvino/inference.mdx @@ -9,7 +9,7 @@ specific language governing permissions and limitations under the License. # Inference -Optimum Intel can be used to load optimized models from the [Hugging Face Hub](https://huggingface.co/models?library=openvino&sort=downloads) and create pipelines to run inference with OpenVINO Runtime without rewriting your APIs. +Optimum Intel can be used to load optimized models from the [Hub](https://huggingface.co/models?library=openvino&sort=downloads) and create pipelines to run inference with OpenVINO Runtime without rewriting your APIs. ## Transformers models From 46aeef5c9c97a8c98083726a26f29a4a25e65a5b Mon Sep 17 00:00:00 2001 From: Ella Charlaix Date: Fri, 14 Jun 2024 11:43:06 +0200 Subject: [PATCH 09/27] rephrase --- .../source/neural_compressor/optimization.mdx | 4 +- docs/source/openvino/export.mdx | 37 +++++++++---------- 2 files changed, 20 insertions(+), 21 deletions(-) diff --git a/docs/source/neural_compressor/optimization.mdx b/docs/source/neural_compressor/optimization.mdx index 086710766..833e2aeb5 100644 --- a/docs/source/neural_compressor/optimization.mdx +++ b/docs/source/neural_compressor/optimization.mdx @@ -16,7 +16,7 @@ Optimum Intel can be used to apply popular compression techniques such as quanti ## Post-training optimization -Post-training compression techniques such as dynamic and static quantization can be easily applied on your model using our [`INCQuantizer`](https://huggingface.co/docs/optimum/intel_optimization#optimum.intel.neural_compressor.IncQuantizer). +Post-training compression techniques such as dynamic and static quantization can be easily applied on your model using our [`INCQuantizer`](optimization). Note that quantization is currently only supported for CPUs (only CPU backends are available), so we will not be utilizing GPUs / CUDA in the following examples. ### Dynamic quantization @@ -252,7 +252,7 @@ To know more about the different supported methodologies, you can refer to the N ## Loading a quantized model -To load a quantized model hosted locally or on the πŸ€— hub, you must instantiate you model using our [`INCModelForXxx`](https://huggingface.co/docs/optimum/main/intel/reference_inc#optimum.intel.neural_compressor.quantization.INCModel) classes. +To load a quantized model hosted locally or on the πŸ€— hub, you must instantiate you model using our [`INCModelForXxx`](reference) classes. ```python from optimum.intel import INCModelForSequenceClassification diff --git a/docs/source/openvino/export.mdx b/docs/source/openvino/export.mdx index d8ef36533..6387ac9a5 100644 --- a/docs/source/openvino/export.mdx +++ b/docs/source/openvino/export.mdx @@ -11,10 +11,10 @@ specific language governing permissions and limitations under the License. 
## Using the CLI -It is possible to export your model to the [OpenVINO IR](https://docs.openvino.ai/2024/documentation/openvino-ir-format.html) format with the CLI : +To export your model to the [OpenVINO IR](https://docs.openvino.ai/2024/documentation/openvino-ir-format.html) format with the CLI : ```bash -optimum-cli export openvino --model gpt2 openvino_model/ +optimum-cli export openvino --model gpt2 ov_model/ ``` The model argument can either be the model ID of a model hosted on the [Hub](https://huggingface.co/models) or a path to a model hosted locally. @@ -30,13 +30,17 @@ optimum-cli export openvino --help If the task argument is not provided, it will be automatically inferred. -For model hosted locally, you need to specify it among the list of the [supported tasks](https://huggingface.co/docs/optimum/main/en/exporters/task_manager): +For local models, you need to specify it among the list of the [supported tasks](https://huggingface.co/docs/optimum/main/en/exporters/task_manager): ```bash -optimum-cli export openvino --model local_model_dir --task text-generation-with-past openvino_model +optimum-cli export openvino --model local_model_dir --task text-generation-with-past ov_model/ ``` -The `-with-past` suffix enable the re-use of the pre-computed key/values hidden-states and is the recommended option. To export the model without (equivalent to `use_cache=False`), you will need to remove this suffix. +#### Exporting a model using past keys/values in the decoder + +When exporting a decoder model used for generation, it can be useful to encapsulate in the exported model the [reuse of past keys and values](https://discuss.huggingface.co/t/what-is-the-purpose-of-use-cache-in-decoder/958/2). This allows to avoid recomputing the same intermediate activations during the generation. + +This behavior corresponds to `--task text-geeneration-with-past`, `--task text2text-generation-with-past`, or `--task automatic-speech-recognition-with-past`. If for any purpose you would like to disable the export with past keys/values reuse, passing explicitly to `optimum-cli export openvino` the task `text2text-generation`, `text-generation` or `automatic-speech-recognition` is required. #### Quantization @@ -44,18 +48,11 @@ The `-with-past` suffix enable the re-use of the pre-computed key/values hidden- You can also apply fp16, 8-bit or 4-bit weight-only quantization on the Linear, Convolutional and Embedding layers when exporting your model by setting `--weight-format` to respectively `fp16`, `int8` or `int4`: ```bash -optimum-cli export openvino --model gpt2 --weight-format int8 openvino_model/ +optimum-cli export openvino --model gpt2 --weight-format int8 ov_model/ ``` -This type of optimization allows to reduce the memory footprint and inference latency. - -By default the quantization scheme will be [asymmetric](https://github.com/openvinotoolkit/nncf/blob/develop/docs/compression_algorithms/Quantization.md#asymmetric-quantization), to make it [symmetric](https://github.com/openvinotoolkit/nncf/blob/develop/docs/compression_algorithms/Quantization.md#symmetric-quantization) you can add `--sym`. +For more information on the quantization parameters checkout the [documentation](inference#weight-only-quantization) -For INT4 quantization you can also specify the following arguments : -* The `--group-size` parameter will define the group size to use for quantization, `-1` it will results in per-column quantization. -* The `--ratio` parameter controls the ratio between 4-bit and 8-bit quantization. 
If set to 0.9, it means that 90% of the layers will be quantized to `int4` while 10% will be quantized to `int8`. - -Smaller `group_size` and `ratio` values usually improve accuracy at the sacrifice of the model size and inference latency. @@ -65,14 +62,16 @@ Models larger than 1 billion parameters are exported to the OpenVINO format with Once the model is exported, you can now [load your OpenVINO model](inference) -## During loading +#### Custom export + + You can also load your PyTorch checkpoint and convert it to the OpenVINO format on-the-fly, by setting `export=True` when loading your model. ```python -from optimum.intel import OVModelForCausalLM - model = OVModelForCausalLM.from_pretrained("gpt2", export=True) -model.save_pretrained("openvino_model") -tokenizer.save_pretrained("openvino_model") +model.save_pretrained("ov_model") + ``` + + From cf900b990b45f247cd054c5fc13c57da42800960 Mon Sep 17 00:00:00 2001 From: Ella Charlaix Date: Fri, 14 Jun 2024 16:49:43 +0200 Subject: [PATCH 10/27] add --- docs/source/openvino/export.mdx | 2 ++ 1 file changed, 2 insertions(+) diff --git a/docs/source/openvino/export.mdx b/docs/source/openvino/export.mdx index 6387ac9a5..a4cd05d3b 100644 --- a/docs/source/openvino/export.mdx +++ b/docs/source/openvino/export.mdx @@ -28,6 +28,8 @@ optimum-cli export openvino --help #### Task +Specifying a --task should not be necessary in most cases when exporting from a model on the Hugging Face Hub. + If the task argument is not provided, it will be automatically inferred. For local models, you need to specify it among the list of the [supported tasks](https://huggingface.co/docs/optimum/main/en/exporters/task_manager): From 33dc3864281d9b2ac3754248f96fa6f0eb00815e Mon Sep 17 00:00:00 2001 From: Ella Charlaix Date: Mon, 24 Jun 2024 15:40:28 +0200 Subject: [PATCH 11/27] Add section --- docs/source/openvino/export.mdx | 100 ++++++++++++++++++------ optimum/commands/export/openvino.py | 49 ++++++------ optimum/intel/openvino/configuration.py | 4 +- 3 files changed, 105 insertions(+), 48 deletions(-) diff --git a/docs/source/openvino/export.mdx b/docs/source/openvino/export.mdx index a4cd05d3b..8f7c4c285 100644 --- a/docs/source/openvino/export.mdx +++ b/docs/source/openvino/export.mdx @@ -17,33 +17,82 @@ To export your model to the [OpenVINO IR](https://docs.openvino.ai/2024/document optimum-cli export openvino --model gpt2 ov_model/ ``` -The model argument can either be the model ID of a model hosted on the [Hub](https://huggingface.co/models) or a path to a model hosted locally. +The model argument can either be the model ID of a model hosted on the [Hub](https://huggingface.co/models) or a path to a model hosted locally. For local models, you need to specify the task for which the model should be loaded before export, among the list of the [supported tasks](https://huggingface.co/docs/optimum/main/en/exporters/task_manager). -Check out the help for more options: - ```bash -optimum-cli export openvino --help +optimum-cli export openvino --model local_model_dir --task text-generation-with-past ov_model/ ``` -#### Task +The `-with-past` suffix enable the re-use of the pre-computed key/values hidden-states and is the recommended option, to export the model without, you will need to remove this suffix. -Specifying a --task should not be necessary in most cases when exporting from a model on the Hugging Face Hub. 
+| With K-V cache | Without K-V cache | +|------------------------------------------|--------------------------------------| +| `text-generation-with-past` | `text-generation` | +| `text2text-generation-with-past` | `text2text-generation` | +| `automatic-speech-recognition-with-past` | `automatic-speech-recognition` | -If the task argument is not provided, it will be automatically inferred. -For local models, you need to specify it among the list of the [supported tasks](https://huggingface.co/docs/optimum/main/en/exporters/task_manager): +Check out the help for more options: ```bash -optimum-cli export openvino --model local_model_dir --task text-generation-with-past ov_model/ -``` - -#### Exporting a model using past keys/values in the decoder - -When exporting a decoder model used for generation, it can be useful to encapsulate in the exported model the [reuse of past keys and values](https://discuss.huggingface.co/t/what-is-the-purpose-of-use-cache-in-decoder/958/2). This allows to avoid recomputing the same intermediate activations during the generation. - -This behavior corresponds to `--task text-geeneration-with-past`, `--task text2text-generation-with-past`, or `--task automatic-speech-recognition-with-past`. If for any purpose you would like to disable the export with past keys/values reuse, passing explicitly to `optimum-cli export openvino` the task `text2text-generation`, `text-generation` or `automatic-speech-recognition` is required. +optimum-cli export openvino --help +usage: optimum-cli export openvino [-h] -m MODEL [--task TASK] [--framework {pt,tf}] [--trust-remote-code] [--weight-format {fp32,fp16,int8,int4}] + [--library {transformers,diffusers,timm,sentence_transformers}] [--cache_dir CACHE_DIR] [--pad-token-id PAD_TOKEN_ID] [--ratio RATIO] [--sym] + [--group-size GROUP_SIZE] [--dataset DATASET] [--all-layers] [--awq] [--scale-estimation] [--sensitivity-metric SENSITIVITY_METRIC] [--num-samples NUM_SAMPLES] + [--disable-stateful] [--disable-convert-tokenizer] + output + +optional arguments: + -h, --help show this help message and exit + +Required arguments: + --model MODEL Model ID on huggingface.co or path on disk to load model from. + + output Path indicating the directory where to store the generated OV model. + +Optional arguments: + --task TASK The task to export the model for. If not specified, the task will be auto-inferred based on the model. Available tasks depend on the model, but are among: ['image-segmentation', + 'feature-extraction', 'mask-generation', 'audio-classification', 'conversational', 'stable-diffusion-xl', 'question-answering', 'sentence-similarity', 'text2text-generation', + 'masked-im', 'automatic-speech-recognition', 'fill-mask', 'image-to-text', 'text-generation', 'zero-shot-object-detection', 'multiple-choice', 'object-detection', 'stable- + diffusion', 'audio-xvector', 'text-to-audio', 'zero-shot-image-classification', 'token-classification', 'image-classification', 'depth-estimation', 'image-to-image', 'audio- + frame-classification', 'semantic-segmentation', 'text-classification']. For decoder models, use `xxx-with-past` to export the model using past key values in the decoder. + --framework {pt,tf} The framework to use for the export. If not provided, will attempt to use the local checkpoint's original framework or what is available in the environment. + --trust-remote-code Allows to use custom code for the modeling hosted in the model repository. 
This option should only be set for repositories you trust and in which you have read the code, as it + will execute on your local machine arbitrary code present in the model repository. + --weight-format {fp32,fp16,int8,int4} + The weight format of the exported model. + --library {transformers,diffusers,timm,sentence_transformers} + The library used to load the model before export. If not provided, will attempt to infer the local checkpoint's library. + --cache_dir CACHE_DIR + The path to a directory in which the downloaded model should be cached if the standard cache should not be used. + --pad-token-id PAD_TOKEN_ID + This is needed by some models, for some tasks. If not provided, will attempt to use the tokenizer to guess it. + --ratio RATIO A parameter used when applying 4-bit quantization to control the ratio between 4-bit and 8-bit quantization. If set to 0.8, 80% of the layers will be quantized to int4 while + 20% will be quantized to int8. This helps to achieve better accuracy at the sacrifice of the model size and inference latency. + --sym Whether to apply symmetric quantization + --group-size GROUP_SIZE + The group size to use for int4 quantization. Recommended value is 128 and -1 will results in per-column quantization. + --dataset DATASET The dataset used for data-aware compression or quantization with NNCF. You can use the one from the list ['wikitext2','c4','c4-new'] for language models or + ['conceptual_captions','laion/220k-GPT4Vision-captions-from-LIVIS','laion/filtered-wit'] for diffusion models. + --all-layers Whether embeddings and last MatMul layers should be compressed to INT4. If not provided an weight compression is applied, they are compressed to INT8. + --awq Whether to apply AWQ algorithm. AWQ improves generation quality of INT4-compressed LLMs, but requires additional time for tuning weights on a calibration dataset. To run AWQ, + please also provide a dataset argument. Note: it's possible that there will be no matching patterns in the model to apply AWQ, in such case it will be skipped. + --scale-estimation Indicates whether to apply a scale estimation algorithm that minimizes the L2 error between the original and compressed layers. Providing a dataset is required to run scale + estimation. Please note, that applying scale estimation takes additional memory and time. + --sensitivity-metric SENSITIVITY_METRIC + The sensitivity metric for assigning quantization precision to layers. Can be one of the following: ['weight_quantization_error', 'hessian_input_activation', + 'mean_activation_variance', 'max_activation_variance', 'mean_activation_magnitude']. + --num-samples NUM_SAMPLES + The maximum number of samples to take from the dataset for quantization. + --disable-stateful Disable stateful converted models, stateless models will be generated instead. Stateful models are produced by default when this key is not used. In stateful models all kv-cache + inputs and outputs are hidden in the model and are not exposed as model inputs and outputs. If --disable-stateful option is used, it may result in sub-optimal inference + performance. Use it when you intentionally want to use a stateless model, for example, to be compatible with existing OpenVINO native inference code that expects kv-cache inputs + and outputs in the model. + --disable-convert-tokenizer + Do not add converted tokenizer and detokenizer OpenVINO models. 
+``` #### Quantization @@ -62,18 +111,25 @@ Models larger than 1 billion parameters are exported to the OpenVINO format with -Once the model is exported, you can now [load your OpenVINO model](inference) - -#### Custom export +Once the model is exported, you can now [load your OpenVINO model](inference) by replacing the `AutoModelForXxx` class with the corresponding `OVModelForXxx` class. - +#### When loading your model You can also load your PyTorch checkpoint and convert it to the OpenVINO format on-the-fly, by setting `export=True` when loading your model. ```python +from optimum.intel import OVModelForCausalLM + model = OVModelForCausalLM.from_pretrained("gpt2", export=True) model.save_pretrained("ov_model") - ``` - +#### After loading your model + +```python +from transfomers import AutoModelForCausalLM +from optimum.exporters.openvino import export_from_model + +model = AutoModelForCausalLM.from_pretrained("gpt2") +export_from_model(model, output="ov_model", task="text-generation-with-past") +``` diff --git a/optimum/commands/export/openvino.py b/optimum/commands/export/openvino.py index 17bcea965..a6c988121 100644 --- a/optimum/commands/export/openvino.py +++ b/optimum/commands/export/openvino.py @@ -51,9 +51,6 @@ def parse_args_openvino(parser: "ArgumentParser"): f" {str(TasksManager.get_all_tasks())}. For decoder models, use `xxx-with-past` to export the model using past key values in the decoder." ), ) - optional_group.add_argument( - "--cache_dir", type=str, default=HUGGINGFACE_HUB_CACHE, help="Path indicating where to store cache." - ) optional_group.add_argument( "--framework", type=str, @@ -72,22 +69,31 @@ def parse_args_openvino(parser: "ArgumentParser"): ), ) optional_group.add_argument( - "--pad-token-id", - type=int, + "--weight-format", + type=str, + choices=["fp32", "fp16", "int8", "int4", "int4_sym_g128", "int4_asym_g128", "int4_sym_g64", "int4_asym_g64"], default=None, - help=( - "This is needed by some models, for some tasks. If not provided, will attempt to use the tokenizer to guess it." - ), + help="he weight format of the exported model.", ) - optional_group.add_argument("--fp16", action="store_true", help="Compress weights to fp16") - optional_group.add_argument("--int8", action="store_true", help="Compress weights to int8") optional_group.add_argument( - "--weight-format", + "--library", type=str, - choices=["fp32", "fp16", "int8", "int4", "int4_sym_g128", "int4_asym_g128", "int4_sym_g64", "int4_asym_g64"], + choices=["transformers", "diffusers", "timm", "sentence_transformers"], + default=None, + help="The library used to laod the model before export. If not provided, will attempt to infer the local checkpoint's library", + ) + optional_group.add_argument( + "--cache_dir", + type=str, + default=HUGGINGFACE_HUB_CACHE, + help="The path to a directory in which the downloaded model should be cached if the standard cache should not be used.", + ) + optional_group.add_argument( + "--pad-token-id", + type=int, default=None, help=( - "The weight format of the exporting model, e.g. f32 stands for float32 weights, f16 - for float16 weights, i8 - INT8 weights, int4_* - for INT4 compressed weights." + "This is needed by some models, for some tasks. If not provided, will attempt to use the tokenizer to guess it." ), ) optional_group.add_argument( @@ -95,8 +101,8 @@ def parse_args_openvino(parser: "ArgumentParser"): type=float, default=None, help=( - "Compression ratio between primary and backup precision. 
In the case of INT4, NNCF evaluates layer sensitivity and keeps the most impactful layers in INT8" - "precision (by default 20%% in INT8). This helps to achieve better accuracy after weight compression." + "A parameter used when applying 4-bit quantization to control the ratio between 4-bit and 8-bit quantization. If set to 0.8, 80%% of the layers will be quantized to int4 " + "while 20%% will be quantized to int8. This helps to achieve better accuracy at the sacrifice of the model size and inference latency." ), ) optional_group.add_argument( @@ -117,7 +123,7 @@ def parse_args_openvino(parser: "ArgumentParser"): default=None, help=( "The dataset used for data-aware compression or quantization with NNCF. " - "You can use the one from the list ['wikitext2','c4','c4-new','ptb','ptb-new'] for LLLMs " + "You can use the one from the list ['wikitext2','c4','c4-new'] for language models " "or ['conceptual_captions','laion/220k-GPT4Vision-captions-from-LIVIS','laion/filtered-wit'] for diffusion models." ), ) @@ -183,20 +189,15 @@ def parse_args_openvino(parser: "ArgumentParser"): action="store_true", help="Do not add converted tokenizer and detokenizer OpenVINO models.", ) + #TODO : deprecated + optional_group.add_argument("--fp16", action="store_true", help="Compress weights to fp16") + optional_group.add_argument("--int8", action="store_true", help="Compress weights to int8") optional_group.add_argument( "--convert-tokenizer", action="store_true", help="[Deprecated] Add converted tokenizer and detokenizer with OpenVINO Tokenizers.", ) - optional_group.add_argument( - "--library", - type=str, - choices=["transformers", "diffusers", "timm", "sentence_transformers"], - default=None, - help=("The library on the model. If not provided, will attempt to infer the local checkpoint's library"), - ) - class OVExportCommand(BaseOptimumCLICommand): COMMAND = CommandInfo(name="openvino", help="Export PyTorch models to OpenVINO IR.") diff --git a/optimum/intel/openvino/configuration.py b/optimum/intel/openvino/configuration.py index 30e550b54..6fea06d11 100644 --- a/optimum/intel/openvino/configuration.py +++ b/optimum/intel/openvino/configuration.py @@ -155,7 +155,7 @@ class OVWeightQuantizationConfig(OVQuantizationConfigBase): using the [`~PreTrainedTokenizer.save_pretrained`] method, e.g., `./my_model_directory/`. dataset (`str or List[str]`, *optional*): The dataset used for data-aware compression or quantization with NNCF. You can provide your own dataset - in a list of strings or just use the one from the list ['wikitext2','c4','c4-new','ptb','ptb-new'] for LLLMs + in a list of strings or just use the one from the list ['wikitext2','c4','c4-new'] for language models or ['conceptual_captions','laion/220k-GPT4Vision-captions-from-LIVIS','laion/filtered-wit'] for diffusion models. Alternatively, you can provide data objects via `calibration_dataset` argument of `OVQuantizer.quantize()` method. @@ -230,7 +230,7 @@ def post_init(self): f"If you wish to provide a custom dataset, please use the `OVQuantizer` instead." 
) if self.dataset is not None and isinstance(self.dataset, str): - llm_datasets = ["wikitext2", "c4", "c4-new", "ptb", "ptb-new"] + llm_datasets = ["wikitext2", "c4", "c4-new"] stable_diffusion_datasets = [ "conceptual_captions", "laion/220k-GPT4Vision-captions-from-LIVIS", From d2af44853c65fa5d33a8f129603fa5dbdae0f74a Mon Sep 17 00:00:00 2001 From: Ella Charlaix Date: Mon, 24 Jun 2024 17:45:00 +0200 Subject: [PATCH 12/27] remove redundant section --- docs/source/openvino/export.mdx | 6 +++--- docs/source/openvino/inference.mdx | 18 ++---------------- 2 files changed, 5 insertions(+), 19 deletions(-) diff --git a/docs/source/openvino/export.mdx b/docs/source/openvino/export.mdx index 8f7c4c285..c6e228b7d 100644 --- a/docs/source/openvino/export.mdx +++ b/docs/source/openvino/export.mdx @@ -24,7 +24,7 @@ The model argument can either be the model ID of a model hosted on the [Hub](htt optimum-cli export openvino --model local_model_dir --task text-generation-with-past ov_model/ ``` -The `-with-past` suffix enable the re-use of the pre-computed key/values hidden-states and is the recommended option, to export the model without, you will need to remove this suffix. +The `-with-past` suffix enable the re-use of past keys and values. This allows to avoid recomputing the same intermediate activations during the generation. to export the model without, you will need to remove this suffix. | With K-V cache | Without K-V cache | |------------------------------------------|--------------------------------------| @@ -113,7 +113,7 @@ Models larger than 1 billion parameters are exported to the OpenVINO format with Once the model is exported, you can now [load your OpenVINO model](inference) by replacing the `AutoModelForXxx` class with the corresponding `OVModelForXxx` class. -#### When loading your model +### When loading your model You can also load your PyTorch checkpoint and convert it to the OpenVINO format on-the-fly, by setting `export=True` when loading your model. @@ -124,7 +124,7 @@ model = OVModelForCausalLM.from_pretrained("gpt2", export=True) model.save_pretrained("ov_model") ``` -#### After loading your model +### After loading your model ```python from transfomers import AutoModelForCausalLM diff --git a/docs/source/openvino/inference.mdx b/docs/source/openvino/inference.mdx index 6d1892ff4..0ab9e6caf 100644 --- a/docs/source/openvino/inference.mdx +++ b/docs/source/openvino/inference.mdx @@ -33,7 +33,7 @@ See the [reference documentation](reference) for more information about paramete To easily save the resulting model, you can use the `save_pretrained()` method, which will save both the BIN and XML files describing the graph. It is useful to save the tokenizer to the same directory, to enable easy loading of the tokenizer for the model. ```python -# Save the exported model +# Save your model save_directory = "openvino_distilbert" model.save_pretrained(save_directory) tokenizer.save_pretrained(save_directory) @@ -43,21 +43,7 @@ tokenizer.save_pretrained(save_directory) You can also apply fp16, 8-bit or 4-bit weight compression on the Linear, Convolutional and Embedding layers when loading your model to reduce the memory footprint and inference latency. 
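As a minimal sketch (assuming a hypothetical `gpt2` checkpoint converted on the fly), 8-bit weight compression can be requested directly at loading time:

```python
from optimum.intel import OVModelForCausalLM

# export=True converts the PyTorch checkpoint on the fly,
# load_in_8bit compresses its weights to 8-bit during loading
model = OVModelForCausalLM.from_pretrained("gpt2", export=True, load_in_8bit=True)
```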
-By default the quantization scheme will be [asymmetric](https://github.com/openvinotoolkit/nncf/blob/develop/docs/compression_algorithms/Quantization.md#asymmetric-quantization), to make it [symmetric](https://github.com/openvinotoolkit/nncf/blob/develop/docs/compression_algorithms/Quantization.md#symmetric-quantization) you can add `--sym`. - -For INT4 quantization you can also specify the following arguments : -* The `--group-size` parameter will define the group size to use for quantization, `-1` it will results in per-column quantization. -* The `--ratio` parameter controls the ratio between 4-bit and 8-bit quantization. If set to 0.9, it means that 90% of the layers will be quantized to `int4` while 10% will be quantized to `int8`. - -Smaller `group_size` and `ratio` values usually improve accuracy at the sacrifice of the model size and inference latency. - -You can also apply 8-bit quantization on your model's weight when loading your model by setting the `load_in_8bit=True` argument when calling the `from_pretrained()` method. - -```python -from optimum.intel import OVModelForCausalLM - -model = OVModelForCausalLM.from_pretrained(model_id, load_in_8bit=True) -``` +For more information on the quantization parameters checkout the [documentation](inference#weight-only-quantization) From d78e82246ba805bb8dbad98ad3f25056499cacdd Mon Sep 17 00:00:00 2001 From: Ella Charlaix Date: Mon, 24 Jun 2024 17:56:04 +0200 Subject: [PATCH 13/27] format --- optimum/commands/export/openvino.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/optimum/commands/export/openvino.py b/optimum/commands/export/openvino.py index a6c988121..97c3d6368 100644 --- a/optimum/commands/export/openvino.py +++ b/optimum/commands/export/openvino.py @@ -189,7 +189,7 @@ def parse_args_openvino(parser: "ArgumentParser"): action="store_true", help="Do not add converted tokenizer and detokenizer OpenVINO models.", ) - #TODO : deprecated + # TODO : deprecated optional_group.add_argument("--fp16", action="store_true", help="Compress weights to fp16") optional_group.add_argument("--int8", action="store_true", help="Compress weights to int8") optional_group.add_argument( From d74c2299dd3116af7d05f7125a02fc0cd942e6a7 Mon Sep 17 00:00:00 2001 From: Ella Charlaix Date: Mon, 24 Jun 2024 18:00:15 +0200 Subject: [PATCH 14/27] fix link --- docs/source/openvino/inference.mdx | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/source/openvino/inference.mdx b/docs/source/openvino/inference.mdx index 0ab9e6caf..f84e59ae4 100644 --- a/docs/source/openvino/inference.mdx +++ b/docs/source/openvino/inference.mdx @@ -43,7 +43,7 @@ tokenizer.save_pretrained(save_directory) You can also apply fp16, 8-bit or 4-bit weight compression on the Linear, Convolutional and Embedding layers when loading your model to reduce the memory footprint and inference latency. -For more information on the quantization parameters checkout the [documentation](inference#weight-only-quantization) +For more information on the quantization parameters checkout the [documentation](optimziation#weight-only-quantization). @@ -51,7 +51,7 @@ If not specified, `load_in_8bit` will be set to `True` by default when models la -To apply quantization on both weights and activations, you can use the `OVQuantizer`, more information in the [documentation](optimization). +It's also possible to apply quantization on both weights and activations using the `OVQuantizer`, more information in the [documentation](optimization#static-quantization). 
### Static shape From 50e590da98444d14114e157223135c794fad060a Mon Sep 17 00:00:00 2001 From: Ella Charlaix Date: Mon, 24 Jun 2024 18:46:06 +0200 Subject: [PATCH 15/27] add architecture section --- docs/source/_toctree.yml | 2 + docs/source/openvino/architectures.mdx | 129 +++++++++++++++++++++++++ tests/openvino/test_modeling.py | 62 +++++++----- tests/openvino/utils_tests.py | 6 ++ 4 files changed, 173 insertions(+), 26 deletions(-) create mode 100644 docs/source/openvino/architectures.mdx diff --git a/docs/source/_toctree.yml b/docs/source/_toctree.yml index 6b6c80f8a..2be5ba126 100644 --- a/docs/source/_toctree.yml +++ b/docs/source/_toctree.yml @@ -14,6 +14,8 @@ - sections: - local: openvino/export title: Export + - local: openvino/architectures + title: Architectures - local: openvino/inference title: Inference - local: openvino/optimization diff --git a/docs/source/openvino/architectures.mdx b/docs/source/openvino/architectures.mdx new file mode 100644 index 000000000..c59b92e47 --- /dev/null +++ b/docs/source/openvino/architectures.mdx @@ -0,0 +1,129 @@ + + +# Overview + +πŸ€— Optimum handles the export of models to OpenVINO in the `exporters.openvino` module. It provides classes, functions, and a command line interface to perform the export easily. +Here is the list of the supported architectures : + +## [Transformers](https://huggingface.co/docs/transformers/index): + +- Albert +- Albert +- Aquila +- Aquila +- Arctic +- Audio Spectrogram Transformer +- Baichuan 2 +- Bart +- Beit +- Bert +- BioGPT +- BlenderBot +- BlenderBotSmall +- Bloom +- CLIP +- Camembert +- ChatGLM +- CodeGen +- CodeGen2 +- Cohere +- ConvBert +- ConvNext +- ConvNextV2 +- DBRX +- Data2VecAudio +- Data2VecText +- Data2VecVision +- Deberta +- Deberta-v2 +- Deit +- DistilBert +- Donut-Swin +- Electra +- Encoder Decoder +- Falcon +- Flaubert +- GLM-4 +- GPT-2 +- GPT-BigCode +- GPT-J +- GPT-Neo +- GPT-NeoX +- GPT-NeoX-Japanese +- Gemma +- Hubert +- IBert +- InternLM +- InternLM2 +- Levit +- Llama +- LongT5 +- M2-M100 +- MBart +- MPNet +- MPT +- MT5 +- Marian +- MiniCPM +- Mistral +- Mixtral +- MobileBert +- MobileNet v1 +- MobileNet v2 +- MobileVit +- Nystromformer +- OLMo +- OPT +- Orion +- Pegasus +- Perceiver +- Persimmon +- Phi +- Phi3 +- Pix2Struct +- PoolFormer +- Qwen +- Qwen2(Qwen1.5) +- ResNet +- Roberta +- Roformer +- SEW +- SEW-D +- Segformer +- SqueezeBert +- StableLM +- StarCoder2 +- Swin +- T5 +- TROCR +- UniSpeech +- UniSpeech SAT +- Vision Encoder Decoder +- Vit +- Wav2Vec2 +- Wav2Vec2 Conformer +- WavLM +- Whisper +- XGLM +- XLM +- XLM-Roberta +- XVERSE + +## [Diffusers](https://huggingface.co/docs/diffusers/index): +- Stable Diffusion +- Stable Diffusion XL +- Latent Consistency + +## [Timm](https://huggingface.co/docs/timm/index): +- PiT +- ViT + +## [Sentence Transformers](https://github.com/UKPLab/sentence-transformers): +- All Transformer and CLIP-based models. 
\ No newline at end of file diff --git a/tests/openvino/test_modeling.py b/tests/openvino/test_modeling.py index b52d2094c..59d642fcf 100644 --- a/tests/openvino/test_modeling.py +++ b/tests/openvino/test_modeling.py @@ -307,21 +307,16 @@ class OVModelForSequenceClassificationIntegrationTest(unittest.TestCase): SUPPORTED_ARCHITECTURES = ( "albert", "bert", - # "camembert", "convbert", - # "data2vec_text", - # "deberta_v2", "distilbert", "electra", "flaubert", "ibert", - # "mobilebert", - # "nystromformer", + "nystromformer", "roberta", "roformer", "squeezebert", "xlm", - # "xlm_roberta", ) @parameterized.expand(SUPPORTED_ARCHITECTURES) @@ -348,6 +343,8 @@ def test_compare_to_transformers(self, model_arch): gc.collect() @parameterized.expand(SUPPORTED_ARCHITECTURES) + @pytest.mark.run_slow + @slow def test_pipeline(self, model_arch): set_seed(SEED) model_id = MODEL_NAMES[model_arch] @@ -1013,16 +1010,18 @@ class OVModelForMaskedLMIntegrationTest(unittest.TestCase): SUPPORTED_ARCHITECTURES = ( "albert", "bert", - # "camembert", - # "convbert", - # "data2vec_text", + "camembert", + "convbert", + "data2vec_text", "deberta", - # "deberta_v2", + "deberta_v2", "distilbert", "electra", "flaubert", "ibert", - # "mobilebert", + "mobilebert", + "mpnet", + "perceiver_text", "roberta", "roformer", "squeezebert", @@ -1079,16 +1078,19 @@ class OVModelForImageClassificationIntegrationTest(unittest.TestCase): SUPPORTED_ARCHITECTURES = ( "beit", "convnext", - # "data2vec_vision", - # "deit", + "convnextv2", + "data2vec_vision", + "deit", "levit", "mobilenet_v1", "mobilenet_v2", "mobilevit", - # "poolformer", + "poolformer", + "perceiver_vision", "resnet", - # "segformer", - # "swin", + "segformer", + "swin", + "donut-swin", "vit", ) @@ -1182,7 +1184,7 @@ class OVModelForSeq2SeqLMIntegrationTest(unittest.TestCase): # "bigbird_pegasus", "blenderbot", "blenderbot-small", - # "longt5", + "longt5", "m2m_100", "marian", "mbart", @@ -1225,6 +1227,8 @@ def test_compare_to_transformers(self, model_arch): gc.collect() @parameterized.expand(SUPPORTED_ARCHITECTURES) + @pytest.mark.run_slow + @slow def test_pipeline(self, model_arch): set_seed(SEED) model_id = MODEL_NAMES[model_arch] @@ -1320,17 +1324,17 @@ def test_compare_with_and_without_past_key_values(self): class OVModelForAudioClassificationIntegrationTest(unittest.TestCase): SUPPORTED_ARCHITECTURES = ( - # "audio_spectrogram_transformer", - # "data2vec_audio", - # "hubert", - # "sew", - # "sew_d", - # "wav2vec2-conformer", + "audio_spectrogram_transformer", + "data2vec_audio", + "hubert", + "sew", + "sew_d", "unispeech", - # "unispeech_sat", - # "wavlm", + "unispeech_sat", + "wavlm", "wav2vec2", - # "wav2vec2-conformer", + "wav2vec2-conformer", + "whisper", ) def _generate_random_audio_data(self): @@ -1366,6 +1370,8 @@ def test_compare_to_transformers(self, model_arch): gc.collect() @parameterized.expand(SUPPORTED_ARCHITECTURES) + @pytest.mark.run_slow + @slow def test_pipeline(self, model_arch): set_seed(SEED) model_id = MODEL_NAMES[model_arch] @@ -1684,6 +1690,8 @@ def test_compare_to_transformers(self, model_arch): gc.collect() @parameterized.expand(SUPPORTED_ARCHITECTURES) + @pytest.mark.run_slow + @slow def test_pipeline(self, model_arch): set_seed(SEED) model_id = MODEL_NAMES[model_arch] @@ -1790,6 +1798,8 @@ def test_compare_to_transformers(self, model_arch: str): gc.collect() @parameterized.expand(SUPPORTED_ARCHITECTURES) + @pytest.mark.run_slow + @slow def test_pipeline(self, model_arch: str): set_seed(SEED) model_id = MODEL_NAMES[model_arch] diff --git 
a/tests/openvino/utils_tests.py b/tests/openvino/utils_tests.py index 09919047c..760a9bfb6 100644 --- a/tests/openvino/utils_tests.py +++ b/tests/openvino/utils_tests.py @@ -46,8 +46,11 @@ "deberta_v2": "hf-internal-testing/tiny-random-DebertaV2Model", "deit": "hf-internal-testing/tiny-random-deit", "convnext": "hf-internal-testing/tiny-random-convnext", + "convnextv2": "hf-internal-testing/tiny-random-ConvNextV2Model", "distilbert": "hf-internal-testing/tiny-random-distilbert", "donut": "fxmarty/tiny-doc-qa-vision-encoder-decoder", + "donut-swin": "hf-internal-testing/tiny-random-DonutSwinModel", + "detr": "hf-internal-testing/tiny-random-DetrModel", "electra": "hf-internal-testing/tiny-random-electra", "gemma": "fxmarty/tiny-random-GemmaForCausalLM", "falcon": "fxmarty/really-tiny-falcon-testing", @@ -82,11 +85,14 @@ "mobilenet_v2": "hf-internal-testing/tiny-random-MobileNetV2Model", "mobilevit": "hf-internal-testing/tiny-random-mobilevit", "mpt": "hf-internal-testing/tiny-random-MptForCausalLM", + "mpnet": "hf-internal-testing/tiny-random-MPNetModel", "mt5": "stas/mt5-tiny-random", "nystromformer": "hf-internal-testing/tiny-random-NystromformerModel", "olmo": "katuni4ka/tiny-random-olmo-hf", "orion": "katuni4ka/tiny-random-orion", "pegasus": "hf-internal-testing/tiny-random-pegasus", + "perceiver_text": "hf-internal-testing/tiny-random-language_perceiver", + "perceiver_vision": "hf-internal-testing/tiny-random-vision_perceiver_conv", "persimmon": "hf-internal-testing/tiny-random-PersimmonForCausalLM", "pix2struct": "fxmarty/pix2struct-tiny-random", "phi": "echarlaix/tiny-random-PhiForCausalLM", From 1c5f98c8e0beed7a52bb9d8fc2ebf9ee421b8b2b Mon Sep 17 00:00:00 2001 From: Ella Charlaix Date: Mon, 24 Jun 2024 19:07:12 +0200 Subject: [PATCH 16/27] rename section --- docs/source/_toctree.yml | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/_toctree.yml b/docs/source/_toctree.yml index 2be5ba126..a83f6fc00 100644 --- a/docs/source/_toctree.yml +++ b/docs/source/_toctree.yml @@ -15,7 +15,7 @@ - local: openvino/export title: Export - local: openvino/architectures - title: Architectures + title: Supported Models - local: openvino/inference title: Inference - local: openvino/optimization From 8d6fd2e22d73675eecd560c78249050ff4564d9f Mon Sep 17 00:00:00 2001 From: Ella Charlaix <80481427+echarlaix@users.noreply.github.com> Date: Tue, 25 Jun 2024 10:49:49 +0200 Subject: [PATCH 17/27] Update docs/source/openvino/architectures.mdx Co-authored-by: Ekaterina Aidova --- docs/source/openvino/architectures.mdx | 2 -- 1 file changed, 2 deletions(-) diff --git a/docs/source/openvino/architectures.mdx b/docs/source/openvino/architectures.mdx index c59b92e47..6e80e2a7d 100644 --- a/docs/source/openvino/architectures.mdx +++ b/docs/source/openvino/architectures.mdx @@ -15,8 +15,6 @@ Here is the list of the supported architectures : ## [Transformers](https://huggingface.co/docs/transformers/index): - Albert -- Albert -- Aquila - Aquila - Arctic - Audio Spectrogram Transformer From 891dc6721cd761b3b16263f95bcdb4d9f3b61695 Mon Sep 17 00:00:00 2001 From: Ella Charlaix <80481427+echarlaix@users.noreply.github.com> Date: Tue, 25 Jun 2024 10:50:52 +0200 Subject: [PATCH 18/27] Update optimum/commands/export/openvino.py Co-authored-by: Helena Kloosterman --- optimum/commands/export/openvino.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/optimum/commands/export/openvino.py b/optimum/commands/export/openvino.py index 97c3d6368..88b37d9ef 100644 --- 
a/optimum/commands/export/openvino.py +++ b/optimum/commands/export/openvino.py @@ -80,7 +80,7 @@ def parse_args_openvino(parser: "ArgumentParser"): type=str, choices=["transformers", "diffusers", "timm", "sentence_transformers"], default=None, - help="The library used to laod the model before export. If not provided, will attempt to infer the local checkpoint's library", + help="The library used to load the model before export. If not provided, will attempt to infer the local checkpoint's library", ) optional_group.add_argument( "--cache_dir", From e709d2563f621c51bedf6b0ddd20d31a8d27396b Mon Sep 17 00:00:00 2001 From: Ella Charlaix Date: Tue, 25 Jun 2024 14:27:49 +0200 Subject: [PATCH 19/27] fix sections --- docs/source/_toctree.yml | 4 ++-- docs/source/openvino/export.mdx | 8 ++++---- docs/source/openvino/{architectures.mdx => models.mdx} | 10 ++++------ 3 files changed, 10 insertions(+), 12 deletions(-) rename docs/source/openvino/{architectures.mdx => models.mdx} (93%) diff --git a/docs/source/_toctree.yml b/docs/source/_toctree.yml index a83f6fc00..1fd3fe6b7 100644 --- a/docs/source/_toctree.yml +++ b/docs/source/_toctree.yml @@ -14,12 +14,12 @@ - sections: - local: openvino/export title: Export - - local: openvino/architectures - title: Supported Models - local: openvino/inference title: Inference - local: openvino/optimization title: Optimization + - local: openvino/models + title: Supported Models - local: openvino/reference title: Reference title: OpenVINO diff --git a/docs/source/openvino/export.mdx b/docs/source/openvino/export.mdx index c6e228b7d..4101b0c98 100644 --- a/docs/source/openvino/export.mdx +++ b/docs/source/openvino/export.mdx @@ -70,7 +70,7 @@ Optional arguments: --pad-token-id PAD_TOKEN_ID This is needed by some models, for some tasks. If not provided, will attempt to use the tokenizer to guess it. --ratio RATIO A parameter used when applying 4-bit quantization to control the ratio between 4-bit and 8-bit quantization. If set to 0.8, 80% of the layers will be quantized to int4 while - 20% will be quantized to int8. This helps to achieve better accuracy at the sacrifice of the model size and inference latency. + 20% will be quantized to int8. This helps to achieve better accuracy at the sacrifice of the model size and inference latency. Default value is 0.8. --sym Whether to apply symmetric quantization --group-size GROUP_SIZE The group size to use for int4 quantization. Recommended value is 128 and -1 will results in per-column quantization. @@ -94,7 +94,7 @@ Optional arguments: Do not add converted tokenizer and detokenizer OpenVINO models. ``` -#### Quantization +### Quantization You can also apply fp16, 8-bit or 4-bit weight-only quantization on the Linear, Convolutional and Embedding layers when exporting your model by setting `--weight-format` to respectively `fp16`, `int8` or `int4`: @@ -113,7 +113,7 @@ Models larger than 1 billion parameters are exported to the OpenVINO format with Once the model is exported, you can now [load your OpenVINO model](inference) by replacing the `AutoModelForXxx` class with the corresponding `OVModelForXxx` class. -### When loading your model +## When loading your model You can also load your PyTorch checkpoint and convert it to the OpenVINO format on-the-fly, by setting `export=True` when loading your model. 
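A minimal sketch of this on-the-fly conversion, assuming the `gpt2` checkpoint used in the surrounding examples and running generation through a standard `transformers` pipeline (the pipeline call here is illustrative):

```python
from transformers import AutoTokenizer, pipeline

from optimum.intel import OVModelForCausalLM

model_id = "gpt2"  # assumption: any causal LM checkpoint supported by the exporter
# export=True converts the PyTorch checkpoint to OpenVINO IR on the fly
model = OVModelForCausalLM.from_pretrained(model_id, export=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# the resulting OVModel instance plugs directly into a transformers pipeline
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
print(pipe("OpenVINO is")[0]["generated_text"])
```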
@@ -124,7 +124,7 @@ model = OVModelForCausalLM.from_pretrained("gpt2", export=True) model.save_pretrained("ov_model") ``` -### After loading your model +## After loading your model ```python from transfomers import AutoModelForCausalLM diff --git a/docs/source/openvino/architectures.mdx b/docs/source/openvino/models.mdx similarity index 93% rename from docs/source/openvino/architectures.mdx rename to docs/source/openvino/models.mdx index 6e80e2a7d..2eafa5aa5 100644 --- a/docs/source/openvino/architectures.mdx +++ b/docs/source/openvino/models.mdx @@ -7,12 +7,10 @@ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express o specific language governing permissions and limitations under the License. --> -# Overview - πŸ€— Optimum handles the export of models to OpenVINO in the `exporters.openvino` module. It provides classes, functions, and a command line interface to perform the export easily. Here is the list of the supported architectures : -## [Transformers](https://huggingface.co/docs/transformers/index): +## [Transformers](https://huggingface.co/docs/transformers/index) - Albert - Aquila @@ -114,14 +112,14 @@ Here is the list of the supported architectures : - XLM-Roberta - XVERSE -## [Diffusers](https://huggingface.co/docs/diffusers/index): +## [Diffusers](https://huggingface.co/docs/diffusers/index) - Stable Diffusion - Stable Diffusion XL - Latent Consistency -## [Timm](https://huggingface.co/docs/timm/index): +## [Timm](https://huggingface.co/docs/timm/index) - PiT - ViT -## [Sentence Transformers](https://github.com/UKPLab/sentence-transformers): +## [Sentence Transformers](https://github.com/UKPLab/sentence-transformers) - All Transformer and CLIP-based models. \ No newline at end of file From 6896d1eaa5289145072bade81568b183cbc06500 Mon Sep 17 00:00:00 2001 From: Ella Charlaix Date: Tue, 25 Jun 2024 14:28:00 +0200 Subject: [PATCH 20/27] add back table --- docs/source/openvino/inference.mdx | 21 ++++++++++++++++++++- 1 file changed, 20 insertions(+), 1 deletion(-) diff --git a/docs/source/openvino/inference.mdx b/docs/source/openvino/inference.mdx index f84e59ae4..0afdb315b 100644 --- a/docs/source/openvino/inference.mdx +++ b/docs/source/openvino/inference.mdx @@ -9,7 +9,7 @@ specific language governing permissions and limitations under the License. # Inference -Optimum Intel can be used to load optimized models from the [Hub](https://huggingface.co/models?library=openvino&sort=downloads) and create pipelines to run inference with OpenVINO Runtime without rewriting your APIs. +Optimum Intel can be used to load optimized models from the [Hub](https://huggingface.co/models?library=openvino&sort=downloads) and create pipelines to run inference with OpenVINO Runtime on a variety of Intel processors ([see](https://docs.openvino.ai/2024/about-openvino/compatibility-and-support/supported-devices.html) the full list of supported devices) ## Transformers models @@ -39,6 +39,25 @@ model.save_pretrained(save_directory) tokenizer.save_pretrained(save_directory) ``` +As shown in the table below, each task is associated with a class enabling to automatically load your model. 
+ +| Task | Auto Class | +|--------------------------------------|--------------------------------------| +| `text-classification` | `OVModelForSequenceClassification` | +| `token-classification` | `OVModelForTokenClassification` | +| `question-answering` | `OVModelForQuestionAnswering` | +| `audio-classification` | `OVModelForAudioClassification` | +| `image-classification` | `OVModelForImageClassification` | +| `feature-extraction` | `OVModelForFeatureExtraction` | +| `fill-mask` | `OVModelForMaskedLM` | +| `image-classification` | `OVModelForImageClassification` | +| `audio-classification` | `OVModelForAudioClassification` | +| `text-generation-with-past` | `OVModelForCausalLM` | +| `text2text-generation-with-past` | `OVModelForSeq2SeqLM` | +| `automatic-speech-recognition` | `OVModelForSpeechSeq2Seq` | +| `image-to-text` | `OVModelForVision2Seq` | + + ### Weight-only quantization You can also apply fp16, 8-bit or 4-bit weight compression on the Linear, Convolutional and Embedding layers when loading your model to reduce the memory footprint and inference latency. From d60b8ed947867866226d9d879cb56f575e236276 Mon Sep 17 00:00:00 2001 From: Ella Charlaix Date: Tue, 25 Jun 2024 14:28:15 +0200 Subject: [PATCH 21/27] add default ratio value message --- optimum/commands/export/openvino.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/optimum/commands/export/openvino.py b/optimum/commands/export/openvino.py index 88b37d9ef..5adcb3649 100644 --- a/optimum/commands/export/openvino.py +++ b/optimum/commands/export/openvino.py @@ -102,7 +102,7 @@ def parse_args_openvino(parser: "ArgumentParser"): default=None, help=( "A parameter used when applying 4-bit quantization to control the ratio between 4-bit and 8-bit quantization. If set to 0.8, 80%% of the layers will be quantized to int4 " - "while 20%% will be quantized to int8. This helps to achieve better accuracy at the sacrifice of the model size and inference latency." + "while 20%% will be quantized to int8. This helps to achieve better accuracy at the sacrifice of the model size and inference latency. Default value is 0.8." 
), ) optional_group.add_argument( From c4385ac17a86fec60bddf65b19dba2b8a6d781c0 Mon Sep 17 00:00:00 2001 From: Ella Charlaix Date: Tue, 25 Jun 2024 14:48:02 +0200 Subject: [PATCH 22/27] update list supported models --- docs/source/openvino/models.mdx | 3 --- tests/openvino/test_modeling.py | 18 +++++++++++------- tests/openvino/test_quantization.py | 2 +- 3 files changed, 12 insertions(+), 11 deletions(-) diff --git a/docs/source/openvino/models.mdx b/docs/source/openvino/models.mdx index 2eafa5aa5..523928b8d 100644 --- a/docs/source/openvino/models.mdx +++ b/docs/source/openvino/models.mdx @@ -32,7 +32,6 @@ Here is the list of the supported architectures : - Cohere - ConvBert - ConvNext -- ConvNextV2 - DBRX - Data2VecAudio - Data2VecText @@ -41,7 +40,6 @@ Here is the list of the supported architectures : - Deberta-v2 - Deit - DistilBert -- Donut-Swin - Electra - Encoder Decoder - Falcon @@ -60,7 +58,6 @@ Here is the list of the supported architectures : - InternLM2 - Levit - Llama -- LongT5 - M2-M100 - MBart - MPNet diff --git a/tests/openvino/test_modeling.py b/tests/openvino/test_modeling.py index 59d642fcf..831817218 100644 --- a/tests/openvino/test_modeling.py +++ b/tests/openvino/test_modeling.py @@ -312,7 +312,6 @@ class OVModelForSequenceClassificationIntegrationTest(unittest.TestCase): "electra", "flaubert", "ibert", - "nystromformer", "roberta", "roformer", "squeezebert", @@ -1021,6 +1020,7 @@ class OVModelForMaskedLMIntegrationTest(unittest.TestCase): "ibert", "mobilebert", "mpnet", + "nystromformer", "perceiver_text", "roberta", "roformer", @@ -1035,6 +1035,7 @@ def test_compare_to_transformers(self, model_arch): set_seed(SEED) ov_model = OVModelForMaskedLM.from_pretrained(model_id, export=True, ov_config=F32_CONFIG) self.assertIsInstance(ov_model.config, PretrainedConfig) + set_seed(SEED) transformers_model = AutoModelForMaskedLM.from_pretrained(model_id) tokenizer = AutoTokenizer.from_pretrained(model_id) inputs = f"This is a sample {tokenizer.mask_token}" @@ -1054,17 +1055,18 @@ def test_compare_to_transformers(self, model_arch): @parameterized.expand(SUPPORTED_ARCHITECTURES) def test_pipeline(self, model_arch): - set_seed(SEED) model_id = MODEL_NAMES[model_arch] + set_seed(SEED) model = OVModelForMaskedLM.from_pretrained(model_id, export=True) model.eval() tokenizer = AutoTokenizer.from_pretrained(model_id) pipe = pipeline("fill-mask", model=model, tokenizer=tokenizer) inputs = f"This is a {tokenizer.mask_token}." 
+ set_seed(SEED) outputs = pipe(inputs) self.assertEqual(pipe.device, model.device) self.assertTrue(all(item["score"] > 0.0 for item in outputs)) - + set_seed(SEED) ov_pipe = optimum_pipeline("fill-mask", model_id, accelerator="openvino") ov_outputs = ov_pipe(inputs) self.assertEqual(outputs[-1]["score"], ov_outputs[-1]["score"]) @@ -1078,7 +1080,7 @@ class OVModelForImageClassificationIntegrationTest(unittest.TestCase): SUPPORTED_ARCHITECTURES = ( "beit", "convnext", - "convnextv2", + # "convnextv2", "data2vec_vision", "deit", "levit", @@ -1090,7 +1092,6 @@ class OVModelForImageClassificationIntegrationTest(unittest.TestCase): "resnet", "segformer", "swin", - "donut-swin", "vit", ) @@ -1102,6 +1103,7 @@ def test_compare_to_transformers(self, model_arch): set_seed(SEED) ov_model = OVModelForImageClassification.from_pretrained(model_id, export=True, ov_config=F32_CONFIG) self.assertIsInstance(ov_model.config, PretrainedConfig) + set_seed(SEED) transformers_model = AutoModelForImageClassification.from_pretrained(model_id) preprocessor = AutoFeatureExtractor.from_pretrained(model_id) url = "http://images.cocodataset.org/val2017/000000039769.jpg" @@ -1184,7 +1186,7 @@ class OVModelForSeq2SeqLMIntegrationTest(unittest.TestCase): # "bigbird_pegasus", "blenderbot", "blenderbot-small", - "longt5", + # "longt5", "m2m_100", "marian", "mbart", @@ -1334,7 +1336,6 @@ class OVModelForAudioClassificationIntegrationTest(unittest.TestCase): "wavlm", "wav2vec2", "wav2vec2-conformer", - "whisper", ) def _generate_random_audio_data(self): @@ -1350,6 +1351,7 @@ def test_compare_to_transformers(self, model_arch): set_seed(SEED) ov_model = OVModelForAudioClassification.from_pretrained(model_id, export=True, ov_config=F32_CONFIG) self.assertIsInstance(ov_model.config, PretrainedConfig) + set_seed(SEED) transformers_model = AutoModelForAudioClassification.from_pretrained(model_id) preprocessor = AutoFeatureExtractor.from_pretrained(model_id) inputs = preprocessor(self._generate_random_audio_data(), return_tensors="pt") @@ -1380,11 +1382,13 @@ def test_pipeline(self, model_arch): preprocessor = AutoFeatureExtractor.from_pretrained(model_id) pipe = pipeline("audio-classification", model=model, feature_extractor=preprocessor) inputs = [np.random.random(16000)] + set_seed(SEED) outputs = pipe(inputs) self.assertEqual(pipe.device, model.device) self.assertTrue(all(item["score"] > 0.0 for item in outputs[0])) ov_pipe = optimum_pipeline("audio-classification", model_id, accelerator="openvino") + set_seed(SEED) ov_outputs = ov_pipe(inputs) self.assertEqual(outputs[-1][-1]["score"], ov_outputs[-1][-1]["score"]) del ov_pipe diff --git a/tests/openvino/test_quantization.py b/tests/openvino/test_quantization.py index df727eb10..67970fbbc 100644 --- a/tests/openvino/test_quantization.py +++ b/tests/openvino/test_quantization.py @@ -428,7 +428,7 @@ def test_ovmodel_hybrid_quantization_with_custom_dataset( model = model_cls.from_pretrained(model_id, export=True) quantizer = OVQuantizer(model) quantization_config = OVWeightQuantizationConfig(bits=8, num_samples=3, quant_method="hybrid") - self.assertIsInstance(quantization_config.quant_method, OVQuantizationMethod.HYBRID) + self.assertEqual(quantization_config.quant_method, OVQuantizationMethod.HYBRID) quantizer.quantize(ov_config=OVConfig(quantization_config=quantization_config), calibration_dataset=dataset) num_fake_quantize, num_int8, num_int4 = get_num_quantized_nodes(model.unet) From af16d860c0ea54c5785425cac8bb6a97748d2978 Mon Sep 17 00:00:00 2001 From: Ella Charlaix Date: 
Tue, 25 Jun 2024 15:47:52 +0200 Subject: [PATCH 23/27] set seed --- tests/openvino/test_modeling.py | 9 ++------- 1 file changed, 2 insertions(+), 7 deletions(-) diff --git a/tests/openvino/test_modeling.py b/tests/openvino/test_modeling.py index 831817218..883282e7d 100644 --- a/tests/openvino/test_modeling.py +++ b/tests/openvino/test_modeling.py @@ -1062,7 +1062,6 @@ def test_pipeline(self, model_arch): tokenizer = AutoTokenizer.from_pretrained(model_id) pipe = pipeline("fill-mask", model=model, tokenizer=tokenizer) inputs = f"This is a {tokenizer.mask_token}." - set_seed(SEED) outputs = pipe(inputs) self.assertEqual(pipe.device, model.device) self.assertTrue(all(item["score"] > 0.0 for item in outputs)) @@ -1094,7 +1093,6 @@ class OVModelForImageClassificationIntegrationTest(unittest.TestCase): "swin", "vit", ) - TIMM_MODELS = ("timm/pit_s_distilled_224.in1k", "timm/vit_tiny_patch16_224.augreg_in21k") @parameterized.expand(SUPPORTED_ARCHITECTURES) @@ -1133,13 +1131,12 @@ def test_pipeline(self, model_arch): preprocessor = AutoFeatureExtractor.from_pretrained(model_id) pipe = pipeline("image-classification", model=model, feature_extractor=preprocessor) inputs = "http://images.cocodataset.org/val2017/000000039769.jpg" - set_seed(SEED) outputs = pipe(inputs) self.assertEqual(pipe.device, model.device) self.assertGreaterEqual(outputs[0]["score"], 0.0) self.assertTrue(isinstance(outputs[0]["label"], str)) - ov_pipe = optimum_pipeline("image-classification", model_id, accelerator="openvino") set_seed(SEED) + ov_pipe = optimum_pipeline("image-classification", model_id, accelerator="openvino") ov_outputs = ov_pipe(inputs) self.assertEqual(outputs[-1]["score"], ov_outputs[-1]["score"]) del ov_pipe @@ -1382,13 +1379,11 @@ def test_pipeline(self, model_arch): preprocessor = AutoFeatureExtractor.from_pretrained(model_id) pipe = pipeline("audio-classification", model=model, feature_extractor=preprocessor) inputs = [np.random.random(16000)] - set_seed(SEED) outputs = pipe(inputs) self.assertEqual(pipe.device, model.device) self.assertTrue(all(item["score"] > 0.0 for item in outputs[0])) - - ov_pipe = optimum_pipeline("audio-classification", model_id, accelerator="openvino") set_seed(SEED) + ov_pipe = optimum_pipeline("audio-classification", model_id, accelerator="openvino") ov_outputs = ov_pipe(inputs) self.assertEqual(outputs[-1][-1]["score"], ov_outputs[-1][-1]["score"]) del ov_pipe From fd535021369c6017aeb3da7e2951463b3a397286 Mon Sep 17 00:00:00 2001 From: Ella Charlaix Date: Tue, 25 Jun 2024 15:48:01 +0200 Subject: [PATCH 24/27] move to export section --- docs/source/openvino/export.mdx | 13 ++++++++----- docs/source/openvino/inference.mdx | 9 --------- 2 files changed, 8 insertions(+), 14 deletions(-) diff --git a/docs/source/openvino/export.mdx b/docs/source/openvino/export.mdx index 4101b0c98..59683d2e6 100644 --- a/docs/source/openvino/export.mdx +++ b/docs/source/openvino/export.mdx @@ -94,8 +94,6 @@ Optional arguments: Do not add converted tokenizer and detokenizer OpenVINO models. ``` -### Quantization - You can also apply fp16, 8-bit or 4-bit weight-only quantization on the Linear, Convolutional and Embedding layers when exporting your model by setting `--weight-format` to respectively `fp16`, `int8` or `int4`: ```bash @@ -111,17 +109,20 @@ Models larger than 1 billion parameters are exported to the OpenVINO format with -Once the model is exported, you can now [load your OpenVINO model](inference) by replacing the `AutoModelForXxx` class with the corresponding `OVModelForXxx` class. 
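A minimal sketch of that loading step, assuming a model previously exported to a local `ov_model/` directory with its tokenizer files saved alongside:

```python
from transformers import AutoTokenizer

from optimum.intel import OVModelForCausalLM

# assumption: `ov_model/` was produced by a previous export
# (e.g. `optimum-cli export openvino --model gpt2 ov_model/`)
model = OVModelForCausalLM.from_pretrained("ov_model")
tokenizer = AutoTokenizer.from_pretrained("ov_model")

inputs = tokenizer("The weather today is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=10)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```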
- ## When loading your model You can also load your PyTorch checkpoint and convert it to the OpenVINO format on-the-fly, by setting `export=True` when loading your model. +To easily save the resulting model, you can use the `save_pretrained()` method, which will save both the BIN and XML files describing the graph. It is useful to save the tokenizer to the same directory, to enable easy loading of the tokenizer for the model. + ```python from optimum.intel import OVModelForCausalLM model = OVModelForCausalLM.from_pretrained("gpt2", export=True) -model.save_pretrained("ov_model") + +save_directory = "ov_model" +model.save_pretrained(save_directory) +tokenizer.save_pretrained(save_directory) ``` ## After loading your model @@ -133,3 +134,5 @@ from optimum.exporters.openvino import export_from_model model = AutoModelForCausalLM.from_pretrained("gpt2") export_from_model(model, output="ov_model", task="text-generation-with-past") ``` + +Once the model is exported, you can now [load your OpenVINO model](inference) by replacing the `AutoModelForXxx` class with the corresponding `OVModelForXxx` class. diff --git a/docs/source/openvino/inference.mdx b/docs/source/openvino/inference.mdx index 0afdb315b..b4de6f947 100644 --- a/docs/source/openvino/inference.mdx +++ b/docs/source/openvino/inference.mdx @@ -30,15 +30,6 @@ Once [your model was exported](export), you can load it by replacing the `AutoMo See the [reference documentation](reference) for more information about parameters, and examples for different tasks. -To easily save the resulting model, you can use the `save_pretrained()` method, which will save both the BIN and XML files describing the graph. It is useful to save the tokenizer to the same directory, to enable easy loading of the tokenizer for the model. - -```python -# Save your model -save_directory = "openvino_distilbert" -model.save_pretrained(save_directory) -tokenizer.save_pretrained(save_directory) -``` - As shown in the table below, each task is associated with a class enabling to automatically load your model. 
| Task | Auto Class | From e087df4daacdad6a2a0f482abbc9c126014905f1 Mon Sep 17 00:00:00 2001 From: Ella Charlaix Date: Tue, 25 Jun 2024 15:51:57 +0200 Subject: [PATCH 25/27] udpate model test --- tests/openvino/utils_tests.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/tests/openvino/utils_tests.py b/tests/openvino/utils_tests.py index 760a9bfb6..590beefb3 100644 --- a/tests/openvino/utils_tests.py +++ b/tests/openvino/utils_tests.py @@ -44,7 +44,7 @@ "dbrx": "katuni4ka/tiny-random-dbrx", "deberta": "hf-internal-testing/tiny-random-deberta", "deberta_v2": "hf-internal-testing/tiny-random-DebertaV2Model", - "deit": "hf-internal-testing/tiny-random-deit", + "deit": "hf-internal-testing/tiny-random-DeiTModel", "convnext": "hf-internal-testing/tiny-random-convnext", "convnextv2": "hf-internal-testing/tiny-random-ConvNextV2Model", "distilbert": "hf-internal-testing/tiny-random-distilbert", From d433fafca202b6ac6e1bb0a354e7764ecad33299 Mon Sep 17 00:00:00 2001 From: Ella Charlaix Date: Tue, 25 Jun 2024 16:51:03 +0200 Subject: [PATCH 26/27] fix beam search test for glm4 --- tests/openvino/test_modeling.py | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/tests/openvino/test_modeling.py b/tests/openvino/test_modeling.py index 883282e7d..c7a381a0e 100644 --- a/tests/openvino/test_modeling.py +++ b/tests/openvino/test_modeling.py @@ -991,13 +991,15 @@ def test_beam_search(self, model_arch): for gen_config in gen_configs: if gen_config.do_sample and model_arch in ["baichuan2-13b", "olmo"]: continue - + set_seed(SEED) transformers_outputs = transformers_model.generate(**tokens, generation_config=gen_config) + set_seed(SEED) ov_stateful_outputs = ov_model_stateful.generate(**tokens, generation_config=gen_config) self.assertTrue( torch.equal(ov_stateful_outputs, transformers_outputs), f"generation config : {gen_config}, transformers output {transformers_outputs}, ov_model_stateful output {ov_stateful_outputs}", ) + set_seed(SEED) ov_stateless_outputs = ov_model_stateless.generate(**tokens, generation_config=gen_config) self.assertTrue( torch.equal(ov_stateless_outputs, transformers_outputs), From 003902eac91e54a3d0c3412cef824b7e52e28b50 Mon Sep 17 00:00:00 2001 From: Ella Charlaix Date: Tue, 25 Jun 2024 16:51:25 +0200 Subject: [PATCH 27/27] update code snippet --- docs/source/openvino/export.mdx | 26 ++++++++++++++++---------- docs/source/openvino/inference.mdx | 4 +--- 2 files changed, 17 insertions(+), 13 deletions(-) diff --git a/docs/source/openvino/export.mdx b/docs/source/openvino/export.mdx index 59683d2e6..8cffd0612 100644 --- a/docs/source/openvino/export.mdx +++ b/docs/source/openvino/export.mdx @@ -58,13 +58,13 @@ Optional arguments: 'masked-im', 'automatic-speech-recognition', 'fill-mask', 'image-to-text', 'text-generation', 'zero-shot-object-detection', 'multiple-choice', 'object-detection', 'stable- diffusion', 'audio-xvector', 'text-to-audio', 'zero-shot-image-classification', 'token-classification', 'image-classification', 'depth-estimation', 'image-to-image', 'audio- frame-classification', 'semantic-segmentation', 'text-classification']. For decoder models, use `xxx-with-past` to export the model using past key values in the decoder. - --framework {pt,tf} The framework to use for the export. If not provided, will attempt to use the local checkpoint's original framework or what is available in the environment. + --framework {pt,tf} The framework to use for the export. 
If not provided, will attempt to use the local checkpoints original framework or what is available in the environment. --trust-remote-code Allows to use custom code for the modeling hosted in the model repository. This option should only be set for repositories you trust and in which you have read the code, as it will execute on your local machine arbitrary code present in the model repository. --weight-format {fp32,fp16,int8,int4} The weight format of the exported model. --library {transformers,diffusers,timm,sentence_transformers} - The library used to load the model before export. If not provided, will attempt to infer the local checkpoint's library. + The library used to load the model before export. If not provided, will attempt to infer the local checkpoints library. --cache_dir CACHE_DIR The path to a directory in which the downloaded model should be cached if the standard cache should not be used. --pad-token-id PAD_TOKEN_ID @@ -78,7 +78,7 @@ Optional arguments: ['conceptual_captions','laion/220k-GPT4Vision-captions-from-LIVIS','laion/filtered-wit'] for diffusion models. --all-layers Whether embeddings and last MatMul layers should be compressed to INT4. If not provided an weight compression is applied, they are compressed to INT8. --awq Whether to apply AWQ algorithm. AWQ improves generation quality of INT4-compressed LLMs, but requires additional time for tuning weights on a calibration dataset. To run AWQ, - please also provide a dataset argument. Note: it's possible that there will be no matching patterns in the model to apply AWQ, in such case it will be skipped. + please also provide a dataset argument. Note: it is possible that there will be no matching patterns in the model to apply AWQ, in such case it will be skipped. --scale-estimation Indicates whether to apply a scale estimation algorithm that minimizes the L2 error between the original and compressed layers. Providing a dataset is required to run scale estimation. Please note, that applying scale estimation takes additional memory and time. --sensitivity-metric SENSITIVITY_METRIC @@ -115,20 +115,26 @@ You can also load your PyTorch checkpoint and convert it to the OpenVINO format To easily save the resulting model, you can use the `save_pretrained()` method, which will save both the BIN and XML files describing the graph. It is useful to save the tokenizer to the same directory, to enable easy loading of the tokenizer for the model. 
-```python -from optimum.intel import OVModelForCausalLM -model = OVModelForCausalLM.from_pretrained("gpt2", export=True) +```diff +- from transformers import AutoModelForCausalLM ++ from optimum.intel import OVModelForCausalLM + from transformers import AutoTokenizer + + model_id = "gpt2" +- model = AutoModelForCausalLM.from_pretrained(model_id) ++ model = OVModelForCausalLM.from_pretrained(model_id, export=True) + tokenizer = AutoTokenizer.from_pretrained(model_id) -save_directory = "ov_model" -model.save_pretrained(save_directory) -tokenizer.save_pretrained(save_directory) + save_directory = "ov_model" + model.save_pretrained(save_directory) + tokenizer.save_pretrained(save_directory) ``` ## After loading your model ```python -from transfomers import AutoModelForCausalLM +from transformers import AutoModelForCausalLM from optimum.exporters.openvino import export_from_model model = AutoModelForCausalLM.from_pretrained("gpt2") diff --git a/docs/source/openvino/inference.mdx b/docs/source/openvino/inference.mdx index b4de6f947..822d6e2f9 100644 --- a/docs/source/openvino/inference.mdx +++ b/docs/source/openvino/inference.mdx @@ -170,14 +170,12 @@ tokenizer.save_pretrained(save_directory) ## Diffusers models -Make sure you have πŸ€— Diffusers installed. +Make sure you have πŸ€— Diffusers installed. To install `diffusers`: -To install `diffusers`: ```bash pip install optimum[diffusers] ``` - ### Stable Diffusion Stable Diffusion models can also be used when running inference with OpenVINO. When Stable Diffusion models