Releases: huggingface/optimum

v1.5.1: Patch release

24 Nov 14:36

Deprecate PyTorch 1.12 for BetterTransformer, with a better error message (#513)

v1.5.0: BetterTransformer Integration, IOBinding, Optimum Exporters, and Whisper with ONNX Runtime

17 Nov 16:40

BetterTransformer

Convert your model to its PyTorch BetterTransformer format with a one-liner, thanks to the new BetterTransformer integration, for faster inference on CPU and GPU!

from transformers import AutoModel
from optimum.bettertransformer import BetterTransformer

model = AutoModel.from_pretrained("bert-base-uncased")
model = BetterTransformer.transform(model)

Check the full list of supported models in the documentation, and check out the Google Colab demo.
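Once transformed, the model is used exactly as before. Continuing the snippet above, a minimal sketch of a forward pass (bert-base-uncased is only an example of a supported architecture):

from transformers import AutoTokenizer

# Tokenize an example sentence and run inference with the transformed model
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
inputs = tokenizer("BetterTransformer speeds up inference.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)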

Contributions

  • BetterTransformer integration (#423)
  • ViT and Wav2Vec2 support (#470)

ONNX Runtime IOBinding support

ORT models (except for ORTModelForCustomTasks) now support IOBinding to avoid data copying overhead between the host and the device, bringing a significant inference speedup during the decoding process on GPU.

By default, use_io_binding is set to True when using CUDA. You can turn off IOBinding in case of memory issues:

from optimum.onnxruntime import ORTModelForSeq2SeqLM

model = ORTModelForSeq2SeqLM.from_pretrained("optimum/t5-small", use_io_binding=False)
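Conversely, here is a minimal sketch of GPU generation with IOBinding left at its default of True. It assumes onnxruntime-gpu is installed and a CUDA device is available; passing the execution provider explicitly as shown relies on the provider argument introduced in v1.4.1:

from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("optimum/t5-small")
# With the CUDA execution provider, use_io_binding defaults to True
model = ORTModelForSeq2SeqLM.from_pretrained("optimum/t5-small", provider="CUDAExecutionProvider")

inputs = tokenizer("translate English to French: Hello, how are you?", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))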

Contributions

  • Add IOBinding support to ONNX Runtime module (#421)

Optimum Exporters

optimum.exporters is a new module that handles the export of PyTorch and TensorFlow models to several backends. Only ONNX is supported for now, and more than 50 architectures can already be exported, among which BERT, GPT-Neo, BLOOM, T5, ViT, Whisper and CLIP.

The export can be done via the CLI:

python -m optimum.exporters.onnx --model openai/whisper-tiny.en whisper_onnx/
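The task is detected automatically from the model whenever possible (see #445). Assuming the --task option described in the exporter documentation, it can also be set explicitly, for example:

python -m optimum.exporters.onnx --model bert-base-uncased --task sequence-classification bert_onnx/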

For more information, check the documentation.

Contributions

  • optimum.exporters creation (#403)
  • Automatic task detection (#445)

Whisper

  • Whisper can be exported to ONNX using optimum.exporters.
  • Whisper can also be exported and run using optimum.onnxruntime; IOBinding is supported as well.

Note: for now, the export from optimum.exporters is not usable by ORTModelForSpeechSeq2Seq. To run inference, export Whisper directly using ORTModelForSpeechSeq2Seq, as shown below. This will be solved in the next release.
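A minimal sketch of this direct export, using the same from_transformers flow as the other ORTModel classes:

from optimum.onnxruntime import ORTModelForSpeechSeq2Seq

# Export Whisper to ONNX and save it so it can be reloaded for inference
model = ORTModelForSpeechSeq2Seq.from_pretrained("openai/whisper-tiny.en", from_transformers=True)
model.save_pretrained("whisper_onnx/")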

Contributions

  • Whisper support with optimum.onnxruntime and optimum.exporters (#420)

Other contributions

  • ONNX Runtime training now supports ORT 1.13.1 and transformers 4.23.1 (#434)
  • ORTModel can load models from subfolders in a similar fashion as in transformers (#443)
  • ORTOptimizer has been refactored, and a factory class has been added to create common OptimizationConfigs (#457)
  • Fixes and updates in the documentation (#411, #432, #437, #441)
  • Fixes IOBinding (#454, #461)

v1.4.1: Patch release

26 Oct 08:00
  • Add inference with ORTModel to ORTTrainer and ORTSeq2SeqTrainer #189
  • Add InferenceSession options and provider to ORTModel #271
  • Add mT5 (#341) and Marian (#393) support to ORTOptimizer
  • Add batchnorm folding torch.fx transformations #348
  • The torch.fx transformations now use the marking methods mark_as_transformed, mark_as_restored, get_transformed_nodes #385
  • Update BaseConfig for transformers 4.22.0 release #386
  • Update ORTTrainer for transformers 4.22.1 release #388
  • Add extra ONNX Runtime quantization options #398
  • Add possibility to pass provider_options to ORTModel #401 (see the sketch after this list)
  • Add support to pass a specific device for ORTModel, as transformers does for pipelines #427
  • Fixes to support onnxruntime 1.13.1 #430
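As an illustration of these provider-related options, a sketch (the provider_options keys are standard ONNX Runtime CUDA execution provider options, not optimum-specific):

from optimum.onnxruntime import ORTModelForSequenceClassification

# Choose the execution provider and pass provider-specific options (see #271 and #401)
model = ORTModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased-finetuned-sst-2-english",
    from_transformers=True,
    provider="CUDAExecutionProvider",
    provider_options={"device_id": 0},
)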

v1.4.0: ORTQuantizer and ORTOptimizer refactoring

08 Sep 17:56

ONNX Runtime

  • Refactoring of ORTQuantizer (#270) and ORTOptimizer (#294)
  • Add ONNX Runtime fused Adam Optimizer (#295)
  • Add ORTModelForCustomTasks allowing ONNX Runtime inference support for custom tasks (#303)
  • Add ORTModelForMultipleChoice allowing ONNX Runtime inference for models with multiple choice classification head (#358)

Torch FX

  • Add FuseBiasInLinear, a transformation that fuses the weight and the bias of linear modules (#253)

Improvements and bugfixes

  • Enable the possibility to disregard the precomputed past_key_values during ONNX Runtime inference of Seq2Seq models (#241)
  • Enable node exclusion from quantization for benchmark suite (#284)
  • Enable the possibility to use token authentication when loading a calibration dataset (#289)
  • Fix optimum pipeline when no model is given (#301)

v1.3.0: Torch FX transformations, ORTModelForSeq2SeqLM and ORTModelForImageClassification

12 Jul 12:32

Torch FX

The optimum.fx.optimization module (#232) provides a set of torch.fx graph transformations, along with classes and functions to write your own transformations and compose them.

  • The Transformation and ReversibleTransformation classes represent non-reversible and reversible transformations; you can write your own transformations by inheriting from them
  • The compose utility function enables transformation composition (a sketch follows this list)
  • Two reversible transformations were added:
    • MergeLinears: merges linear layers that have the same input
    • ChangeTrueDivToMulByInverse: changes a division by a static value to a multiplication of its inverse
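Below is a sketch of how these pieces fit together, assuming a model that can be symbolically traced with transformers.utils.fx (bert-base-uncased here):

from transformers import BertModel
from transformers.utils.fx import symbolic_trace
from optimum.fx.optimization import ChangeTrueDivToMulByInverse, MergeLinears, compose

model = BertModel.from_pretrained("bert-base-uncased")
traced = symbolic_trace(model, input_names=["input_ids", "attention_mask", "token_type_ids"])

# Compose two reversible transformations and apply them to the traced graph module
transformation = compose(MergeLinears(), ChangeTrueDivToMulByInverse())
transformed_model = transformation(traced)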

ORTModelForSeq2SeqLM

ORTModelForSeq2SeqLM (#199) allows ONNX export and ONNX Runtime inference for Seq2Seq models.

  • When exported, Seq2Seq models are decomposed into three parts: the encoder, the decoder (consisting of the decoder with the language modeling head), and the decoder with pre-computed key/values as additional inputs.
  • This specific export comes from the fact that during the first pass the decoder has no pre-computed key/values hidden states, while for the rest of the generation past key/values are used to speed up sequential decoding.

Below is an example that downloads a T5 model from the Hugging Face Hub, exports it through the ONNX format and saves it:

from optimum.onnxruntime import ORTModelForSeq2SeqLM

# Load model from hub and export it through the ONNX format
model = ORTModelForSeq2SeqLM.from_pretrained("t5-small", from_transformers=True)

# Save the exported model in the given directory
model.save_pretrained(output_dir)

ORTModelForImageClassification

ORTModelForImageClassification (#226) allows ONNX Runtime inference for models with an image classification head.

Below is an example that downloads a ViT model from the Hugging Face Hub, exports it through the ONNX format and saves it:

from optimum.onnxruntime import ORTModelForImageClassification

# Load model from hub and export it through the ONNX format
model = ORTModelForImageClassification.from_pretrained("google/vit-base-patch16-224", from_transformers=True)

# Save the exported model in the given directory
model.save_pretrained(output_dir)

ORTOptimizer

Adds support for converting model weights from fp32 to fp16 by adding a new optimization parameter (fp16) to OptimizationConfig (#273).
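For instance, a minimal sketch of enabling it (optimization_level is an existing OptimizationConfig parameter, kept at a typical value here):

from optimum.onnxruntime.configuration import OptimizationConfig

# Convert model weights from fp32 to fp16 as part of the graph optimization
optimization_config = OptimizationConfig(optimization_level=1, fp16=True)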

Pipelines

Additional pipeline tasks are now supported, such as translation and image classification; check the documentation for the full list of supported tasks along with the default model for each.

Below is an example that downloads a T5 small model from the Hub and loads it with the transformers pipeline for translation:

from transformers import AutoTokenizer, pipeline
from optimum.onnxruntime import ORTModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("optimum/t5-small")
model = ORTModelForSeq2SeqLM.from_pretrained("optimum/t5-small")
onnx_translation = pipeline("translation_en_to_fr", model=model, tokenizer=tokenizer)

text = "What a beautiful day !"
pred = onnx_translation(text)
# [{'translation_text': "C'est une belle journée !"}]

Breaking change

The ORTModelForXXX execution provider default value is now set to CPUExecutionProvider (#203). Previously, if no execution provider was specified, it was set to CUDAExecutionProvider when a GPU was detected, and to CPUExecutionProvider otherwise.

v1.2.3: Patch release

15 Jun 12:36
  • Remove intel sub-package, migrating to optimum-intel (#212)
  • Fix the loading and saving of ORTModel optimized and quantized models (#214)

v1.2.2: Patch release

02 Jun 13:27
  • Extend QuantizationPreprocessor to dynamic quantization (#196)
  • Introduce a unified approach to benchmark transformers models against their optimized counterparts (#194)
  • Bump huggingface_hub version and protobuf fix (#205)

v1.2.1: Patch release

13 May 10:04

Add support for Python 3.7 (#176)

v1.2.0: pipeline and AutoModelForXxx classes to run ONNX Runtime inference

10 May 15:04

ORTModel

ORTModelForXXX classes such as ORTModelForSequenceClassification are now integrated with the Hugging Face Hub. They make it easy to export models through the ONNX format, load ONNX models, and save the resulting model or push it to the 🤗 Hub using the save_pretrained and push_to_hub methods, respectively. An already optimized and/or quantized ONNX model can also be loaded using the from_pretrained method of the ORTModelForXXX classes.

Below is an example that downloads a DistilBERT model from the Hub, exports it through the ONNX format and saves it:

from optimum.onnxruntime import ORTModelForSequenceClassification

# Load model from hub and export it through the ONNX format 
model = ORTModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased-finetuned-sst-2-english", 
    from_transformers=True
)

# Save the exported model
model.save_pretrained("a_local_path_for_convert_onnx_model")

Pipelines

Built-in support for transformers pipelines was added. This allows you to leverage the same API as in Transformers, with the power of accelerated runtimes such as ONNX Runtime.

The currently supported tasks, with the default model for each, are the following:

  • Text Classification (DistilBERT model fine-tuned on SST-2)
  • Question Answering (DistilBERT model fine-tuned on SQuAD v1.1)
  • Token Classification (BERT large fine-tuned on CoNLL2003)
  • Feature Extraction (DistilBERT)
  • Zero Shot Classification (BART model fine-tuned on MNLI)
  • Text Generation (DistilGPT2)

Below is an example that downloads a RoBERTa model from the Hub, exports it through the ONNX format and loads it with transformers pipeline for question-answering.

from transformers import AutoTokenizer, pipeline
from optimum.onnxruntime import ORTModelForQuestionAnswering

# load vanilla transformers and convert to onnx
model = ORTModelForQuestionAnswering.from_pretrained("deepset/roberta-base-squad2", from_transformers=True)
tokenizer = AutoTokenizer.from_pretrained("deepset/roberta-base-squad2")

# test the model using the transformers pipeline, with handle_impossible_answer for squad_v2
optimum_qa = pipeline("question-answering", model=model, tokenizer=tokenizer, handle_impossible_answer=True)
prediction = optimum_qa(
  question="What's my name?", context="My name is Philipp and I live in Nuremberg."
)

print(prediction)
# {'score': 0.9041663408279419, 'start': 11, 'end': 18, 'answer': 'Philipp'}

Improvements

  • Add the loss to the evaluation step of ORTTrainer, which was previously missing when inference was performed with ONNX Runtime (#152)

v1.1.1: Patch release

26 Apr 15:47

Habana

ONNX Runtime

  • Add the possibility to specify the execution provider in ORTModel.
  • Add the IncludeFullyConnectedNodes class to find the nodes composing fully connected layers, so that quantization can target only those nodes and limit the accuracy drop.
  • Update QuantizationPreprocessor so that the set of nodes to quantize and the set of nodes to exclude from quantization do not intersect.
  • Rename Seq2SeqORTTrainer to ORTSeq2SeqTrainer for clarity and consistency.
  • Add ORTOptimizer support for ELECTRA models.
  • Fix the loading of pretrained ORTConfig which contains optimization and quantization config.