Releases: huggingface/optimum

v1.8: extended BetterTransformer support, ONNX merged seq2seq models

17 Apr 13:30

Extended BetterTransformer support

Various improvements in the PyTorch BetterTransformer integration.

ONNX merged seq2seq models

Instead of using two separate files, decoder_model.onnx and decoder_with_past_model.onnx, a single decoder can now be used for encoder-decoder models: decoder_model_merged.onnx. This avoids duplicating the weights between the without-past and with-past ONNX models.

By default, if available, decoder_model_merged.onnx is used in the ORTModel integration. This can be disabled with the --no-post-process option in the ONNX export CLI, and with use_merged=False in the ORTModel.from_pretrained method.

Example:

optimum-cli export onnx --model t5-small t5_onnx

will give:

└── t5_onnx
    ├── config.json
    ├── decoder_model_merged.onnx
    ├── decoder_model.onnx
    ├── decoder_with_past_model.onnx
    ├── encoder_model.onnx
    ├── generation_config.json
    ├── special_tokens_map.json
    ├── spiece.model
    ├── tokenizer_config.json
    └── tokenizer.json

decoder_model_merged.onnx alone is then enough for inference. We strongly recommend inspecting the subgraphs with netron to understand the inputs and outputs, in case the exported model is to be used with an engine other than ONNX Runtime in the Optimum integration.
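
For example, the exported folder above can be loaded and used for generation along these lines (a minimal sketch, assuming the ORTModelForSeq2SeqLM class from optimum.onnxruntime; the notes above refer to it generically as ORTModel):

from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("t5_onnx")
# use_merged=True is the default when decoder_model_merged.onnx is present;
# pass use_merged=False to fall back to the separate without/with past decoders
model = ORTModelForSeq2SeqLM.from_pretrained("t5_onnx", use_merged=True)

inputs = tokenizer("translate English to French: Hello, how are you?", return_tensors="pt")
gen_tokens = model.generate(**inputs)
print(tokenizer.batch_decode(gen_tokens, skip_special_tokens=True))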

  • Fix encoder-decoder ONNX merge by @fxmarty in #924
  • Support the merge of decoder without/with past for encoder-decoder models in the ONNX export by @fxmarty in #926
  • Support merged seq2seq models in ORTModel by @fxmarty in #930

New models in the ONNX export

  • Add llama onnx export & onnxruntime support by @nenkoru in #975

Major bugfixes

  • Remove constant output in encoder-decoder ONNX models decoder with past by @fxmarty in #920
  • Hash tensor data during deduplication by @VikParuchuri in #932

Potentially breaking changes

The TasksManager replaces legacy task names with the canonical ones used on the Hub and in the transformers metadata:

  • sequence-classification becomes text-classification,
  • causal-lm becomes text-generation,
  • seq2seq-lm becomes text2text-generation,
  • speech2seq-lm and audio-ctc become automatic-speech-recognition,
  • default becomes feature-extraction,
  • masked-lm becomes fill-mask,
  • vision2seq-lm becomes image-to-text

This should not break anything, unless you rely on private methods and attributes from TasksManager.
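
For example, an export that previously passed --task sequence-classification now uses the canonical task name (the model and output directory below are placeholders):

optimum-cli export onnx --model bert-base-uncased --task text-classification bert_onnx/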

  • Allow to use a custom class in TasksManager & use canonical tasks names by @fxmarty in #967

Full Changelog: v1.7.3...v1.8.2

v1.7.3: Patch release for PyTorch 2.0 and transformers 4.27.0

23 Mar 16:37

This patch release fixes a few bugs with the PyTorch 2.0 release, and includes a few new features as well.

Breaking change: constant outputs removed from ONNX encoder-decoder models

We removed some constant past key value outputs from encoder-decoder models in the ONNX export. Beware that this could potentially break your existing code, but we recommend using the newly exported models, as this removes unnecessary Identity nodes.

  • Remove constant outputs from decoder with past ONNX model for encoder-decoder architectures by @fxmarty in #872

torch.nn.functional.scaled_dot_product_attention support for decoders in BetterTransformer

PyTorch 2.0 introduces torch.nn.functional.scaled_dot_product_attention in beta, a fastpath for attention extending its accelerated transformer features. This is included in optimum.bettertransformer and can be used with the following architectures: Bart, Blenderbot, GPT2, GPT-J, M2M100, Marian, Mbart, OPT, Pegasus, T5.

Beware that this is still experimental and speedups have yet to be validated on all architectures.

PyTorch's scaled_dot_product_attention enables using flash attention and memory-efficient attention natively in PyTorch.

Usage is as follows:

from transformers import AutoTokenizer, AutoModelForCausalLM
from optimum.bettertransformer import BetterTransformer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

model = BetterTransformer.transform(model)  # modify transformers modeling to use native scaled_dot_product_attention

# do your inference or training here

model = BetterTransformer.reverse(model)  # go back to using canonical transformers modeling
model.save_pretrained("gpt2_model")

Inference benchmark (on fp16):

| Model | Batch size | Input sequence length | Generated tokens | Latency eager (s) | Latency BT (s) | Speedup | Peak memory eager (MB) | Peak memory BT (MB) | Memory savings |
|---|---|---|---|---|---|---|---|---|---|
| gpt2 | 1 | 64 | 256 | 1.800 | 1.607 | 12.0% | 569.90 | 569.89 | 0% |
| gpt2 | 64 | 64 | 256 | 2.159 | 1.617 | 33.5% | 2067.45 | 2093.80 | 0% |
| opt-1.3b | 1 | 64 | 256 | 3.010 | 2.667 | 12.9% | 5408.238 | 5408.238 | 0% |
| gpt-neox-20b | 1 | 64 | 256 | 10.869 | 9.937 | 9.4% | 83670.67 | 83673.53 | 0% |

Training benchmark (on fp16):

| Model | Batch size | Sequence length | Time/epoch eager (s) | Time/epoch BT (s) | Speedup | Peak memory eager (MB) | Peak memory BT (MB) | Memory savings |
|---|---|---|---|---|---|---|---|---|
| gpt2 | 8 | 1024 | 17.732 | 14.037 | 26.3% | 13291.16 | 10191.52 | 30.4% |
| gpt2 | 32 | 1024 | 17.336 | 13.309 | 30.3% | 52834.83 | 38858.56 | 36.0% |
| gpt2 | 64 | 1024 | OOM | 14.067 | / | OOM | 75600.08 | / |

Benchmarks can be reproduced using the inference script and training script:

python benchmark_bettertransformer.py --model-name gpt2 --use-half --use-cuda --is_decoder --num-batches 5 --max_token 256
python benchmark_bettertransformer.py --model-name gpt2 --use-half --use-cuda --is_decoder --num-batches 5 --max_token 256 --seqlen-stdev 0

New architectures in the ONNX export

Three additional architectures are supported in the ONNX export: ImageGPT, RegNet, OPT.

(WIP) TFLite export with quantization support

Continued progress in the TFLite export with quantization support. This is work in progress and not documented yet.

Bugfixes and improvements


v1.7.1: Patch release

03 Mar 13:41

Temporarily fix a critical bug in BetterTransformer #849

Full Changelog: v1.7.0...v1.7.1

v1.7.0: ONNX export extension, TFLite export, single-ONNX decoding, ONNX Runtime extension for audio, vision tasks, stable diffusion

02 Mar 12:32

New models supported in the ONNX export

Additional architectures are supported in the ONNX export: PoolFormer, Pegasus, Audio Spectrogram Transformer, Hubert, SEW, Speech2Text, UniSpeech, UniSpeech-SAT, Wav2Vec2, Wav2Vec2-Conformer, WavLM, Data2Vec Audio, MPNet, stable diffusion VAE encoder, vision encoder decoder, Nystromformer, Splinter, GPT NeoX.

New models supported in BetterTransformer

A few additional architectures are supported in BetterTransformer: RoCBERT, RoFormer, Marian

Additional tasks supported in the ONNX Runtime integration

New classes are available: ORTModelForMaskedLM, ORTModelForVision2Seq, ORTModelForAudioClassification, ORTModelForCTC, ORTModelForAudioXVector, ORTModelForAudioFrameClassification and ORTStableDiffusionPipeline.

Reference: https://huggingface.co/docs/optimum/main/en/onnxruntime/package_reference/modeling_ort and https://huggingface.co/docs/optimum/main/en/onnxruntime/usage_guides/models#export-and-inference-of-stable-diffusion-models
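
For instance, the stable diffusion pipeline can be used along these lines (a minimal sketch based on the documentation linked above; exporting on the fly with export=True is an assumption here):

from optimum.onnxruntime import ORTStableDiffusionPipeline

# export the diffusers model to ONNX on the fly and run inference with ONNX Runtime
pipeline = ORTStableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", export=True)
image = pipeline("sailing ship in storm by Leonardo da Vinci").images[0]
image.save("ship.png")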

Support of the ONNX export from PyTorch on float16

In the ONNX export, the options --fp16 --device cuda can be passed to export in float16 when a GPU is available, relying directly on the native torch.onnx.export.

Example: optimum-cli export onnx --model gpt2 --fp16 --device cuda gpt2_onnx/

  • Support ONNX export on torch.float16 type by @fxmarty in #749

TFLite export

TFLite export is now supported, with static shapes:

optimum-cli export tflite --help
optimum-cli export tflite --model bert-base-uncased --sequence_length 128 bert_tflite/

ONNX Runtime optimization and quantization directly in the CLI

  • Add optimize and quantize command CLI by @jplu in #700
  • Support ONNX Runtime optimizations in exporters.onnx by @fxmarty in #807

The ONNX export optionally supports ONNX Runtime optimizations directly during the export, by passing the --optimize option, from --optimize O1 up to --optimize O4:

optimum-cli export onnx --help
optimum-cli export onnx --model t5-small --optimize O3 t5small_onnx/

ONNX Runtime quantization is supported directly in command line, using optimum-cli onnxruntime quantize:

optimum-cli onnxruntime quantize --help
optimum-cli onnxruntime quantize --onnx_model distilbert_onnx --avx512

ONNX Runtime optimization is supported directly in command line, using optimum-cli onnxruntime optimize:

optimum-cli onnxruntime optimize --help
optimum-cli onnxruntime optimize --onnx_model distilbert_onnx -O3

ORTModelForCausalLM supports decoding with a single ONNX

Up to now, two ONNX files were used for decoders:

  • One handling the first forward pass where no past key values have been cached yet - thus not taking them as input.
  • One handling the following forward pass where past key values have been cached, thus taking them as input.

This release introduces support, in the ONNX export and in ORTModelForCausalLM, for a single ONNX handling both steps of the decoding. This reduces memory usage, as weights are not duplicated between two separate models during inference.

A single ONNX for decoders can be used by passing use_merged=True to ORTModelForCausalLM.from_pretrained, loading directly from a PyTorch model:

from optimum.onnxruntime import ORTModelForCausalLM

model = ORTModelForCausalLM.from_pretrained("gpt2", export=True, use_merged=True)

Alternatively, using a single ONNX for decoders is the default behavior in the ONNX export; the result can later be used, for example, with ORTModelForCausalLM. The command optimum-cli export onnx --model gpt2 gpt2_onnx/ will produce:

└── gpt2_onnx
    ├── config.json
    ├── decoder_model_merged.onnx
    ├── decoder_model.onnx
    ├── decoder_with_past_model.onnx
    ├── merges.txt
    ├── special_tokens_map.json
    ├── tokenizer_config.json
    ├── tokenizer.json
    └── vocab.json

decoder_model.onnx and decoder_with_past_model.onnx are kept separate for backward compatibility, but during inference decoder_model_merged.onnx alone is enough.
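
For example, the exported folder can then be reloaded for generation (a minimal sketch mirroring the use_cache example further below):

from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2_onnx")
# decoder_model_merged.onnx is used when use_merged=True
model = ORTModelForCausalLM.from_pretrained("gpt2_onnx", use_merged=True)

inputs = tokenizer("My name is Arthur and I live in", return_tensors="pt")
gen_tokens = model.generate(**inputs)
print(tokenizer.batch_decode(gen_tokens))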

  • Enable inference with a merged decoder in ORTModelForCausalLM by @JingyaHuang in #647

Single-file ORTModels accept numpy arrays

ORTModels accept numpy arrays as inputs, in addition to PyTorch tensors. This is only the case for models that use a single ONNX file.
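
A minimal sketch (the checkpoint is just an example; any single-ONNX task class works the same way):

from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = ORTModelForSequenceClassification.from_pretrained(checkpoint, export=True)
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

# numpy inputs instead of PyTorch tensors
inputs = tokenizer("I love the new numpy support!", return_tensors="np")
outputs = model(**inputs)
print(outputs.logits)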

  • Accept numpy.ndarray as input and output to ORTModel by @fxmarty in #790

ORTOptimizer support for ORTModelForCausalLM

  • ORTOptimizer support ORTModelForCausalLM by @fxmarty in #794
  • Support IO Binding for merged decoder by @fxmarty in #797
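
A short sketch of what this enables, assuming the OptimizationConfig class from optimum.onnxruntime.configuration:

from optimum.onnxruntime import ORTModelForCausalLM, ORTOptimizer
from optimum.onnxruntime.configuration import OptimizationConfig

model = ORTModelForCausalLM.from_pretrained("gpt2", export=True)
optimizer = ORTOptimizer.from_pretrained(model)

# apply ONNX Runtime graph optimizations and save the optimized model
optimization_config = OptimizationConfig(optimization_level=2)
optimizer.optimize(save_dir="gpt2_optimized", optimization_config=optimization_config)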

Breaking changes

  • In the ONNX export, exporting models as several ONNX files (encoder, decoder) is now the default behavior: #747. The old behavior is still accessible with --monolith.
  • In decoders, reusing past key values is now the default in the ONNX export: #748. The old behavior is still accessible by explicitly passing, for example, --task causal-lm instead of --task causal-lm-with-past.
  • BigBird support in the ONNX export is removed, due to the block_sparse attention type being written in pure numpy in Transformers, and hence not exportable to ONNX: #778
  • The parameter from_transformers of ORTModel.from_pretrained will be deprecated in favor of export (see the sketch below).
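
For the last point, the migration is a one-keyword change; both forms appear in these release notes:

from optimum.onnxruntime import ORTModelForCausalLM

# soon-to-be-deprecated spelling
model = ORTModelForCausalLM.from_pretrained("gpt2", from_transformers=True)
# preferred spelling
model = ORTModelForCausalLM.from_pretrained("gpt2", export=True)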

Bugfixes and improvements

  • Fix disable shape inference for optimization by @regisss in #652
  • Fix uninformative message when passing use_cache=True to ORTModel and no ONNX with cache is available by @fxmarty in #650
  • Fix provider options when several providers are passed by @fxmarty in #653
  • Add TensorRT engine to ONNX Runtime GPU documentation by @fxmarty in #657
  • Improve documentation around ONNX export by @fxmarty in #666
  • minor updates on ONNX config guide by @mszsorondo in #662
  • Fix FlaubertOnnxConfig by @michaelbenayoun in #669
  • Use nvcr.io/nvidia/tensorrt image for GPU tests by @fxmarty in #660
  • Better Transformer doc fix by @HamidShojanazeri in #670
  • Add support for LongT5 optimization using ORT transformer optimizer script by @kunal-vaishnavi in #683
  • Add test for missing execution providers error messages by @fxmarty in #659
  • ONNX transformation to cast int64 constants to int32 when possible by @fxmarty in #655
  • Add missing normalized configs by @fxmarty in #694
  • Remove code duplication in ORTModel's load_model by @fxmarty in #695
  • Test more architectures in ORTModel by @fxmarty in #675
  • Avoid initializing unwanted attributes for ORTModel's having several inference sessions by @fxmarty in #696
  • Fix the ORTQuantizer loading from specific file by @echarlaix in #701
  • Add saving of diffusion model additional components ...

v1.6.4: Patch release

13 Feb 16:54

Bugfix

  • Fix past key/value reuse in decoders following transformers 4.26.0 release and renaming: b9211d6
  • ONNX Runtime 1.14 support: #772

Full Changelog: v1.6.3...v1.6.4

v1.6.3: Patch release

25 Jan 17:28

Fixes ORTTrainer for inference with the ONNX Runtime backend.

v1.6.2: Patch release

25 Jan 11:38

Hotfixes

Regressions

The export of speech-to-text architectures as a single ONNX file (handling both encoding and decoding) fails due to a regression with the latest transformers version: #721

Full Changelog: v1.6.1...v1.6.2

v1.6.1: Patch release

23 Dec 20:32

Hotfixes

  • Revert breaking removal of EncoderOnnxConfig, DecoderOnnxConfig, _DecoderWithLMhead by @fxmarty in #643
  • Fix item access of some _TASKS_TO_AUTOMODELS by @fxmarty in #642

Full Changelog: v1.6.0...v1.6.1

v1.6.0: Optimum CLI, Stable Diffusion ONNX export, BetterTransformer & ONNX support for more architectures

23 Dec 15:30

Optimum CLI

The Optimum command line interface is introduced, and is now the official entrypoint for the ONNX export. Example commands:

optimum-cli --help
optimum-cli export onnx --help
optimum-cli export onnx --model bert-base-uncased --task sequence-classification bert_onnx/

Stable Diffusion ONNX export

Optimum now supports the ONNX export of stable diffusion models from the diffusers library:

optimum-cli export onnx --model runwayml/stable-diffusion-v1-5 sd_v15_onnx/

BetterTransformer support for more architectures

The BetterTransformer integration includes new models in this release: CLIP, RemBERT, mBART, ViLT, FSMT.

The complete list of supported models is available in the documentation.

ONNX export for more architectures

The ONNX export now supports Swin, MobileNet-v1, MobileNet-v2.

Extended ONNX export for encoder-decoder and decoder models

Encoder-decoder and decoder-only models that normally make use of the generate() method in transformers can now be exported as several files using the --for-ort argument:

optimum-cli export onnx --model t5-small --task seq2seq-lm-with-past --for-ort t5_small_onnx

yielding:

.
└── t5_small_onnx
    ├── config.json
    ├── decoder_model.onnx
    ├── decoder_with_past_model.onnx
    ├── encoder_model.onnx
    ├── special_tokens_map.json
    ├── spiece.model
    ├── tokenizer_config.json
    └── tokenizer.json

When passing --for-ort, the exported models are expected to be loadable directly into an ORTModel.
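
For instance (a short sketch; ORTModelForSeq2SeqLM is assumed as the concrete ORTModel class for a seq2seq model like t5-small):

from optimum.onnxruntime import ORTModelForSeq2SeqLM

# the exported directory can be loaded as-is, without any file name argument
model = ORTModelForSeq2SeqLM.from_pretrained("t5_small_onnx")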

  • Add ort export in exporters for encoder-decoder models by @mht-sharma in #497
  • Support decoder generated with --for-ort from optimum.exporters.onnx in ORTDecoder by @fxmarty in #554

Support for ONNX models with external data at export, optimization, quantization

The ONNX export from PyTorch normally creates external data when the exported model is larger than 2 GB. This release introduces better support for exporting and using large models, writing all external data into a .onnx_data file if necessary.

  • Handling ONNX models with external data by @NouamaneTazi in #586
  • Improve the compatibility dealing with large ONNX proto in ORTOptimizer and ORTQuantizer by @JingyaHuang in #332

ONNX Runtime API improvement

Various improvements to allow for a better user experience in the ONNX Runtime integration:

  • ORTModel, ORTModelDecoder and ORTModelForConditionalGeneration can now load any ONNX model file regardless of its name, making it possible to load optimized and quantized models without having to specify a file name argument.

  • ORTModel.from_pretrained() with from_transformers=True now downloads and loads the model in a temporary directory instead of the cache, which was not the right place to store it.

  • ORTQuantizer.save_pretrained() now saves the model configuration and the preprocessor, making the exported directory usable end-to-end.

  • ORTOptimizer.save_pretrained() now saves the preprocessor, making the exported directory usable end-to-end.

  • ONNX Runtime integration API improvement by @michaelbenayoun in #515

Custom shapes support at ONNX export

The shapes of the example inputs provided for the ONNX export can be overridden, in case the validity of the ONNX model is sensitive to the shapes used during the export.

Read more: optimum-cli export onnx --help

  • Support custom shapes for dummy inputs by @fxmarty in #522
  • Support for custom input shapes in exporters onnx by @fxmarty in #575

Enable use_cache=True for ORTModelForCausalLM

Reusing past key values for models using ORTModelForCausalLM (e.g. gpt2) is now possible with use_cache=True, avoiding recomputing them at each iteration of the decoding:

from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForCausalLM
import torch

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = ORTModelForCausalLM.from_pretrained("gpt2", from_transformers=True, use_cache=True)

inputs = tokenizer("My name is Arthur and I live in", return_tensors="pt")

gen_tokens = model.generate(**inputs)
tokenizer.batch_decode(gen_tokens)

  • Enable past_key_values for ORTModelForCausalLM by @echarlaix in #326

IO binding support for ORTModelForCustomTasks

ORTModelForCustomTasks now supports IO Binding when using CUDAExecutionProvider.

Experimental support to merge ONNX decoder with/without past key values

Along with --for-ort, passing --task causal-lm-with-past, --task seq2seq-lm-with-past or --task speech2seq-lm-with-past during the ONNX export exports two models: one not using the previously computed keys/values, and one using them.

Experimental support is introduced to merge the two models into one. Example:

optimum-cli export onnx --model t5-small --task seq2seq-lm-with-past --for-ort t5_onnx/

import onnx
from optimum.onnx import merge_decoders

decoder = onnx.load("t5_onnx/decoder_model.onnx")
decoder_with_past = onnx.load("t5_onnx/decoder_with_past_model.onnx")

merged_model = merge_decoders(decoder, decoder_with_past)
onnx.save(merged_model, "t5_onnx/decoder_merged_model.onnx")

Major bugs fixed

Other changes, bugfixes and improvements


v1.5.2: Patch release

19 Dec 16:26

Temporarily constrain numpy<1.24.0 (#614)