huggingface · echarlaix · Jun 25, 2024 · Jun 13, 2024
diff --git a/docs/source/_toctree.yml b/docs/source/_toctree.yml
@@ -4,19 +4,23 @@
  - local: installation
  title: Installation
  - sections:
- - local: optimization_inc
+ - local: neural_compressor/optimization
  title: Optimization
- - local: distributed_training
+ - local: neural_compressor/distributed_training
  title: Distributed Training
- - local: reference_inc
+ - local: neural_compressor/reference
  title: Reference
  title: Neural Compressor
  - sections:
- - local: inference
- title: Models for inference
- - local: optimization_ov
+ - local: openvino/export
+ title: Export
+ - local: openvino/inference
+ title: Inference
+ - local: openvino/optimization
  title: Optimization
- - local: reference_ov
+ - local: openvino/models
+ title: Supported Models
+ - local: openvino/reference
  title: Reference
  title: OpenVINO
  title: Optimum Intel

diff --git a/docs/source/index.mdx b/docs/source/index.mdx
@@ -25,11 +25,11 @@ limitations under the License.
 
 <div class="mt-10">
  <div class="w-full flex flex-col space-x-4 md:grid md:grid-cols-2 md:gap-x-5">
- <a class="!no-underline border dark:border-gray-700 p-5 rounded-lg shadow hover:shadow-lg" href="./optimization_inc"
+ <a class="!no-underline border dark:border-gray-700 p-5 rounded-lg shadow hover:shadow-lg" href="neural_compressor/optimization"
  ><div class="w-full text-center bg-gradient-to-br from-blue-400 to-blue-500 rounded-lg py-1.5 font-semibold mb-5 text-white text-lg leading-relaxed">Neural Compressor</div>
  <p class="text-gray-700">Learn how to apply compression techniques such as quantization, pruning and knowledge distillation to speed up inference with Intel Neural Compressor.</p>
  </a>
- <a class="!no-underline border dark:border-gray-700 p-5 rounded-lg shadow hover:shadow-lg" href="./inference"
+ <a class="!no-underline border dark:border-gray-700 p-5 rounded-lg shadow hover:shadow-lg" href="openvino/export"
  ><div class="w-full text-center bg-gradient-to-br from-purple-400 to-purple-500 rounded-lg py-1.5 font-semibold mb-5 text-white text-lg leading-relaxed">OpenVINO</div>
  <p class="text-gray-700">Learn how to run inference with OpenVINO Runtime and to apply quantization, pruning and knowledge distillation on your model to further speed up inference.</p>
  </a>

diff --git a/docs/source/distributed_training.mdx → ...eural_compressor/distributed_training.mdx b/docs/source/distributed_training.mdx → ...eural_compressor/distributed_training.mdx
diff --git a/docs/source/optimization_inc.mdx → ...source/neural_compressor/optimization.mdx b/docs/source/optimization_inc.mdx → ...source/neural_compressor/optimization.mdx
@@ -16,7 +16,7 @@ Optimum Intel can be used to apply popular compression techniques such as quanti
 
 ## Post-training optimization
 
-Post-training compression techniques such as dynamic and static quantization can be easily applied on your model using our [`INCQuantizer`](https://huggingface.co/docs/optimum/intel_optimization#optimum.intel.neural_compressor.IncQuantizer).
+Post-training compression techniques such as dynamic and static quantization can be easily applied on your model using our [`INCQuantizer`](optimization).
 Note that quantization is currently only supported for CPUs (only CPU backends are available), so we will not be utilizing GPUs / CUDA in the following examples.
 
 ### Dynamic quantization
@@ -252,7 +252,7 @@ To know more about the different supported methodologies, you can refer to the N
 
 ## Loading a quantized model
 
-To load a quantized model hosted locally or on the 🤗 hub, you must instantiate you model using our [`INCModelForXxx`](https://huggingface.co/docs/optimum/main/intel/reference_inc#optimum.intel.neural_compressor.quantization.INCModel) classes.
+To load a quantized model hosted locally or on the 🤗 hub, you must instantiate your model using our [`INCModelForXxx`](reference) classes.
 
 ```python
 from optimum.intel import INCModelForSequenceClassification

diff --git a/docs/source/reference_inc.mdx → docs/source/neural_compressor/reference.mdx b/docs/source/reference_inc.mdx → docs/source/neural_compressor/reference.mdx
diff --git a/docs/source/openvino/export.mdx b/docs/source/openvino/export.mdx
@@ -0,0 +1,144 @@
+<!--Copyright 2024 The HuggingFace Team. All rights reserved.
+Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+http://www.apache.org/licenses/LICENSE-2.0
+Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
+an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
+specific language governing permissions and limitations under the License.
+-->
+
+# Export your model
+
+## Using the CLI
+
+To export your model to the [OpenVINO IR](https://docs.openvino.ai/2024/documentation/openvino-ir-format.html) format with the CLI:
+
+```bash
+optimum-cli export openvino --model gpt2 ov_model/
+```
+
+The model argument can either be the model ID of a model hosted on the [Hub](https://huggingface.co/models) or a path to a model hosted locally. For local models, you need to specify the task for which the model should be loaded before export, chosen from the list of [supported tasks](https://huggingface.co/docs/optimum/main/en/exporters/task_manager).
+
+
+```bash
+optimum-cli export openvino --model local_model_dir --task text-generation-with-past ov_model/
+```
+
+The `-with-past` suffix enables the re-use of past keys and values, which avoids recomputing the same intermediate activations during generation. To export the model without it, remove this suffix from the task name.
+
+| With K-V cache | Without K-V cache |
+|------------------------------------------|--------------------------------------|
+| `text-generation-with-past` | `text-generation` |
+| `text2text-generation-with-past` | `text2text-generation` |
+| `automatic-speech-recognition-with-past` | `automatic-speech-recognition` |
+
+
+Check out the help for more options:
+
+```bash
+optimum-cli export openvino --help
+
+usage: optimum-cli export openvino [-h] -m MODEL [--task TASK] [--framework {pt,tf}] [--trust-remote-code] [--weight-format {fp32,fp16,int8,int4}]
+ [--library {transformers,diffusers,timm,sentence_transformers}] [--cache_dir CACHE_DIR] [--pad-token-id PAD_TOKEN_ID] [--ratio RATIO] [--sym]
+ [--group-size GROUP_SIZE] [--dataset DATASET] [--all-layers] [--awq] [--scale-estimation] [--sensitivity-metric SENSITIVITY_METRIC] [--num-samples NUM_SAMPLES]
+ [--disable-stateful] [--disable-convert-tokenizer]
+ output
+
+optional arguments:
+ -h, --help show this help message and exit
+
+Required arguments:
+ --model MODEL Model ID on huggingface.co or path on disk to load model from.
+
+ output Path indicating the directory where to store the generated OV model.
+
+Optional arguments:
+ --task TASK The task to export the model for. If not specified, the task will be auto-inferred based on the model. Available tasks depend on the model, but are among: ['image-segmentation',
+ 'feature-extraction', 'mask-generation', 'audio-classification', 'conversational', 'stable-diffusion-xl', 'question-answering', 'sentence-similarity', 'text2text-generation',
+ 'masked-im', 'automatic-speech-recognition', 'fill-mask', 'image-to-text', 'text-generation', 'zero-shot-object-detection', 'multiple-choice', 'object-detection', 'stable-
+ diffusion', 'audio-xvector', 'text-to-audio', 'zero-shot-image-classification', 'token-classification', 'image-classification', 'depth-estimation', 'image-to-image', 'audio-
+ frame-classification', 'semantic-segmentation', 'text-classification']. For decoder models, use `xxx-with-past` to export the model using past key values in the decoder.
+ --framework {pt,tf} The framework to use for the export. If not provided, will attempt to use the local checkpoints original framework or what is available in the environment.
+ --trust-remote-code Allows to use custom code for the modeling hosted in the model repository. This option should only be set for repositories you trust and in which you have read the code, as it
+ will execute on your local machine arbitrary code present in the model repository.
+ --weight-format {fp32,fp16,int8,int4}
+ The weight format of the exported model.
+ --library {transformers,diffusers,timm,sentence_transformers}
+ The library used to load the model before export. If not provided, will attempt to infer the local checkpoints library.
+ --cache_dir CACHE_DIR
+ The path to a directory in which the downloaded model should be cached if the standard cache should not be used.
+ --pad-token-id PAD_TOKEN_ID
+ This is needed by some models, for some tasks. If not provided, will attempt to use the tokenizer to guess it.
+ --ratio RATIO A parameter used when applying 4-bit quantization to control the ratio between 4-bit and 8-bit quantization. If set to 0.8, 80% of the layers will be quantized to int4 while
+ 20% will be quantized to int8. This helps to achieve better accuracy at the sacrifice of the model size and inference latency. Default value is 0.8.
+ --sym Whether to apply symmetric quantization
+ --group-size GROUP_SIZE
+ The group size to use for int4 quantization. Recommended value is 128 and -1 will results in per-column quantization.
+ --dataset DATASET The dataset used for data-aware compression or quantization with NNCF. You can use the one from the list ['wikitext2','c4','c4-new'] for language models or
+ ['conceptual_captions','laion/220k-GPT4Vision-captions-from-LIVIS','laion/filtered-wit'] for diffusion models.
+ --all-layers Whether embeddings and last MatMul layers should be compressed to INT4. If not provided an weight compression is applied, they are compressed to INT8.
+ --awq Whether to apply AWQ algorithm. AWQ improves generation quality of INT4-compressed LLMs, but requires additional time for tuning weights on a calibration dataset. To run AWQ,
+ please also provide a dataset argument. Note: it is possible that there will be no matching patterns in the model to apply AWQ, in such case it will be skipped.
+ --scale-estimation Indicates whether to apply a scale estimation algorithm that minimizes the L2 error between the original and compressed layers. Providing a dataset is required to run scale
+ estimation. Please note, that applying scale estimation takes additional memory and time.
+ --sensitivity-metric SENSITIVITY_METRIC
+ The sensitivity metric for assigning quantization precision to layers. Can be one of the following: ['weight_quantization_error', 'hessian_input_activation',
+ 'mean_activation_variance', 'max_activation_variance', 'mean_activation_magnitude'].
+ --num-samples NUM_SAMPLES
+ The maximum number of samples to take from the dataset for quantization.
+ --disable-stateful Disable stateful converted models, stateless models will be generated instead. Stateful models are produced by default when this key is not used. In stateful models all kv-cache
+ inputs and outputs are hidden in the model and are not exposed as model inputs and outputs. If --disable-stateful option is used, it may result in sub-optimal inference
+ performance. Use it when you intentionally want to use a stateless model, for example, to be compatible with existing OpenVINO native inference code that expects kv-cache inputs
+ and outputs in the model.
+ --disable-convert-tokenizer
+ Do not add converted tokenizer and detokenizer OpenVINO models.
+```
+
+You can also apply fp16, 8-bit or 4-bit weight-only quantization to the Linear, Convolutional and Embedding layers when exporting your model by setting `--weight-format` to `fp16`, `int8` or `int4`, respectively:
+
+```bash
+optimum-cli export openvino --model gpt2 --weight-format int8 ov_model/
+```
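
When the export is driven from a script, the same command line can be assembled programmatically. The sketch below is a minimal example under stated assumptions: `build_export_command` is a hypothetical helper, it only uses flags listed in the help output above, and it builds the argument list without running the export.

```python
import shlex

# Values accepted by --weight-format, copied from the help output above.
WEIGHT_FORMATS = ("fp32", "fp16", "int8", "int4")


def build_export_command(model, output, weight_format=None, task=None):
    """Assemble the argument list for `optimum-cli export openvino`.

    Hypothetical helper for illustration; it only uses flags that appear in
    the help output above and does not run the export itself.
    """
    cmd = ["optimum-cli", "export", "openvino", "--model", model]
    if task is not None:
        cmd += ["--task", task]
    if weight_format is not None:
        if weight_format not in WEIGHT_FORMATS:
            raise ValueError(f"unsupported weight format: {weight_format!r}")
        cmd += ["--weight-format", weight_format]
    cmd.append(output)  # the positional output directory comes last
    return cmd


cmd = build_export_command("gpt2", "ov_model/", weight_format="int8")
print(shlex.join(cmd))
# To actually run the export: subprocess.run(cmd, check=True)
```

Validating the weight format before launching the process gives an early, readable error instead of a failed CLI call.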
+
+For more information on the quantization parameters, check out the [documentation](inference#weight-only-quantization).
+
+
+<Tip warning={true}>
+
+Models larger than 1 billion parameters are exported to the OpenVINO format with 8-bit weights by default. You can disable it with `--weight-format fp32`.
+
+</Tip>
+
+## When loading your model
+
+You can also load your PyTorch checkpoint and convert it to the OpenVINO format on the fly by setting `export=True` when loading your model.
+
+To easily save the resulting model, you can use the `save_pretrained()` method, which will save both the BIN and XML files describing the graph. It is also useful to save the tokenizer to the same directory so that it can be loaded alongside the model.
+
+
+```diff
+- from transformers import AutoModelForCausalLM
++ from optimum.intel import OVModelForCausalLM
+ from transformers import AutoTokenizer
+
+ model_id = "gpt2"
+- model = AutoModelForCausalLM.from_pretrained(model_id)
++ model = OVModelForCausalLM.from_pretrained(model_id, export=True)
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
+
+ save_directory = "ov_model"
+ model.save_pretrained(save_directory)
+ tokenizer.save_pretrained(save_directory)
+```
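
A quick sanity check after saving is to confirm that the directory really contains the XML and BIN files mentioned above. The following is a minimal sketch: `list_ir_files` and `has_ir_pair` are hypothetical helpers, they match on file extension only, and they assume the XML and BIN files of one model share the same base name.

```python
from pathlib import Path


def list_ir_files(save_directory):
    """Return the names of the .xml and .bin files in an export directory.

    Hypothetical helper: it matches on file extension only and makes no
    assumption about the exact file names written by `save_pretrained()`.
    """
    directory = Path(save_directory)
    xml_files = sorted(p.name for p in directory.glob("*.xml"))
    bin_files = sorted(p.name for p in directory.glob("*.bin"))
    return xml_files, bin_files


def has_ir_pair(save_directory):
    """True when at least one XML file has a BIN file with the same stem."""
    xml_files, bin_files = list_ir_files(save_directory)
    bin_stems = {Path(name).stem for name in bin_files}
    return any(Path(name).stem in bin_stems for name in xml_files)


# Only inspect the directory if the export above has already been run.
if Path("ov_model").is_dir():
    print(list_ir_files("ov_model"), has_ir_pair("ov_model"))
```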
+
+## After loading your model
+
+You can also export a model that has already been loaded, using `export_from_model`:
+
+```python
+from transformers import AutoModelForCausalLM
+from optimum.exporters.openvino import export_from_model
+
+model = AutoModelForCausalLM.from_pretrained("gpt2")
+export_from_model(model, output="ov_model", task="text-generation-with-past")
+```
+
+Once the model is exported, you can [load your OpenVINO model](inference) by replacing the `AutoModelForXxx` class with the corresponding `OVModelForXxx` class.