- furiosa-libcompiler >= 0.9.0 (for detailed instructions, see https://www.notion.so/furiosa/K8s-Pod-SDK-27680e93c9e9484e9b6f49ad11989c82?pvs=4)
$ python3 -m venv env
$ . env/bin/activate
$ pip3 install --upgrade pip setuptools wheel
$ pip3 install -e .
https://huggingface.co/docs/transformers/model_doc/gpt_neo
$ python3 -m optimum.litmus.nlp.gpt-neo --help
usage: FuriosaAI litmus GPT Neo using HF Optimum API. [-h] [--model-size {125m,1.3b,2.7b}] [--batch-size BATCH_SIZE] [--input-len INPUT_LEN] [--gen-step GEN_STEP]
[--task {text-generation-with-past}]
output_dir
positional arguments:
output_dir path to directory to save outputs
optional arguments:
-h, --help show this help message and exit
--model-size {125m,1.3b,2.7b}, -s {125m,1.3b,2.7b}
available model sizes
--batch-size BATCH_SIZE, -b BATCH_SIZE
Batch size for model inputs
--input-len INPUT_LEN
Length of input prommpt
--gen-step GEN_STEP Generation step to simplify onnx graph
--task {text-generation-with-past}
Task to export model for
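The help text above describes a small command-line surface: one positional argument and five options. As a rough illustration, the sketch below rebuilds that surface with `argparse`. It is not the module's actual source: the `prog` name, the integer types, and every default value are assumptions made for the example, and only the option names, short flags, and choices are taken from the help output.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Rebuild the CLI surface shown in the GPT Neo help text (illustrative only)."""
    parser = argparse.ArgumentParser(
        prog="gpt-neo-sketch",  # hypothetical name, not the real entry point
        description="Sketch of the GPT Neo litmus command-line interface.",
    )
    parser.add_argument("output_dir", help="path to directory to save outputs")
    parser.add_argument(
        "--model-size", "-s",
        choices=["125m", "1.3b", "2.7b"],
        default="125m",  # assumed default
        help="available model sizes",
    )
    parser.add_argument(
        "--batch-size", "-b", type=int, default=1,  # assumed type and default
        help="Batch size for model inputs",
    )
    parser.add_argument(
        "--input-len", type=int, default=128,  # assumed type and default
        help="Length of input prompt",
    )
    parser.add_argument(
        "--gen-step", type=int, default=0,  # assumed type and default
        help="Generation step to simplify onnx graph",
    )
    parser.add_argument(
        "--task",
        choices=["text-generation-with-past"],
        default="text-generation-with-past",  # assumed default
        help="Task to export model for",
    )
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(["out/gpt-neo", "-s", "1.3b", "-b", "2", "--input-len", "64"])
print(args)
```

Passing a value outside the listed choices (for example `-s 6b`) makes `argparse` exit with a usage error, which matches how the `{125m,1.3b,2.7b}` choice list in the help text constrains the real option.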
https://huggingface.co/docs/transformers/model_doc/gpt2
$ python3 -m optimum.litmus.nlp.gpt2 --help
usage: FuriosaAI litmus GPT2 using HF Optimum API. [-h] [--model-size {s,m,l,xl}] [--batch-size BATCH_SIZE] [--input-len INPUT_LEN] [--gen-step GEN_STEP] [--task {text-generation-with-past}]
output_dir
positional arguments:
output_dir path to directory to save outputs
optional arguments:
-h, --help show this help message and exit
--model-size {s,m,l,xl}, -s {s,m,l,xl}
available model sizes
--batch-size BATCH_SIZE, -b BATCH_SIZE
Batch size for model inputs
--input-len INPUT_LEN
Length of input prommpt
--gen-step GEN_STEP Generation step to simplify onnx graph
--task {text-generation-with-past}
Task to export model for
https://huggingface.co/docs/transformers/model_doc/opt
usage: FuriosaAI litmus OPT using HF Optimum API. [-h] [--model-size {125m,350m,1.3b,2.7b,6.7b,30b,66b}] [--batch-size BATCH_SIZE] [--input-len INPUT_LEN] [--gen-step GEN_STEP]
[--task {text-generation-with-past}]
output_dir
positional arguments:
output_dir path to directory to save outputs
options:
-h, --help show this help message and exit
--model-size {125m,350m,1.3b,2.7b,6.7b,30b,66b}, -s {125m,350m,1.3b,2.7b,6.7b,30b,66b}
available model sizes
--batch-size BATCH_SIZE, -b BATCH_SIZE
Batch size for model inputs
--input-len INPUT_LEN
Length of input prommpt
--gen-step GEN_STEP Generation step to simplify onnx graph
--task {text-generation-with-past}
Task to export model for
https://huggingface.co/docs/transformers/model_doc/llama
$ python3 -m optimum.litmus.nlp.llama --help
usage: FuriosaAI litmus LLaMA using HF Optimum API. [-h] [--model-size {7b,13b,30b,65b}] [--batch-size BATCH_SIZE] [--input-len INPUT_LEN] [--gen-step GEN_STEP]
[--task {text-generation-with-past}]
output_dir
positional arguments:
output_dir path to directory to save outputs
options:
-h, --help show this help message and exit
--model-size {7b,13b,30b,65b}, -s {7b,13b,30b,65b}
available model sizes
--batch-size BATCH_SIZE, -b BATCH_SIZE
Batch size for model inputs
--input-len INPUT_LEN
Length of input prommpt
--gen-step GEN_STEP Generation step to simplify onnx graph
--task {text-generation-with-past}
Task to export model for
$ python3 -m optimum.litmus.nlp.toy_model --help
usage: FuriosaAI litmus exporting toy model(w/o pretrained weights) using HF Optimum API. [-h] [--config-path CONFIG_PATH] [--batch-size BATCH_SIZE]
[--input-len INPUT_LEN] [--gen-step GEN_STEP]
[--task {text-generation-with-past}]
output_dir
positional arguments:
output_dir path to directory to save outputs
options:
-h, --help show this help message and exit
--config-path CONFIG_PATH, -c CONFIG_PATH
path to model config saved in json format
--batch-size BATCH_SIZE, -b BATCH_SIZE
Batch size for model inputs
--input-len INPUT_LEN
Length of input prommpt
--gen-step GEN_STEP Generation step to simplify onnx graph
--task {text-generation-with-past}
Task to export model for
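The `--config-path` option expects a model config saved as JSON. The sketch below writes such a file with the standard library. The values are copied from the GPT-2 toy config printed in the example run in this section. Whether the tool accepts a reduced set of keys like this one is an assumption, and `write_config` is a hypothetical helper, not part of the package.

```python
import json
import tempfile
from pathlib import Path

# Subset of the toy GPT-2 config shown in the example run; trimming the
# remaining keys is an assumption made to keep the sketch short.
toy_config = {
    "architectures": ["GPT2LMHeadModel"],
    "model_type": "gpt2",
    "n_embd": 128,
    "n_head": 4,
    "n_layer": 3,
    "n_positions": 1024,
    "vocab_size": 1024,
    "bos_token_id": 1023,
    "eos_token_id": 1023,
}


def write_config(config: dict, path: Path) -> Path:
    """Sanity-check a toy config and save it as JSON (hypothetical helper)."""
    # GPT-2 splits the embedding evenly across attention heads.
    if config["n_embd"] % config["n_head"] != 0:
        raise ValueError("n_embd must be divisible by n_head")
    # Special token ids must be valid indices into the vocabulary.
    for key in ("bos_token_id", "eos_token_id"):
        if not 0 <= config[key] < config["vocab_size"]:
            raise ValueError(f"{key} must be smaller than vocab_size")
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(config, indent=2))
    return path


# Write into a temporary directory so the sketch leaves the working tree alone.
config_path = write_config(toy_config, Path(tempfile.mkdtemp()) / "configs" / "gpt2-toy.json")
loaded = json.loads(config_path.read_text())
print(config_path.name, "->", loaded["n_layer"], "layers")
```

The resulting file can then be passed with `-c`, as in the example invocation that uses `configs/gpt2-toy.json`.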
Example:
$ python3 -m optimum.litmus.nlp.toy_model toy/gpt2 -c configs/gpt2-toy.json -b 1 --input-len 128 --gen-step 0
Proceeding model exporting and optimization based given model config:
{
"activation_function": "gelu_new",
"architectures": [
"GPT2LMHeadModel"
],
"attn_pdrop": 0.1,
"bos_token_id": 1023,
"embd_pdrop": 0.1,
"eos_token_id": 1023,
"initializer_range": 0.02,
"layer_norm_epsilon": 1e-05,
"model_type": "gpt2",
"n_ctx": 1024,
"n_embd": 128,
"n_head": 4,
"n_layer": 3,
"n_positions": 1024,
"resid_pdrop": 0.1,
"summary_activation": null,
"summary_first_dropout": 0.1,
"summary_proj_to_labels": true,
"summary_type": "cls_index",
"summary_use_proj": true,
"task_specific_params": {
"text-generation": {
"do_sample": true,
"max_length": 50
}
},
"vocab_size": 1024,
"_reference": "https://huggingface.co/docs/transformers/model_doc/gpt2#transformers.GPT2Config"
}
Exporting ONNX Model...
use_past = False is different than use_present_in_outputs = True, the value of use_present_in_outputs value will be used for the outputs.
Using framework PyTorch: 2.0.1+cu117
Overriding 2 configuration item(s)
- use_cache -> True
- pad_token_id -> 0
/root/miniconda3/envs/optimum/lib/python3.10/site-packages/transformers/models/gpt2/modeling_gpt2.py:810: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
if batch_size <= 0:
============= Diagnostic Run torch.onnx.export version 2.0.1+cu117 =============
verbose: False, log level: Level.ERROR
======================= 0 NONE 0 NOTE 0 WARNING 0 ERROR ========================
Using framework PyTorch: 2.0.1+cu117
Overriding 2 configuration item(s)
- use_cache -> True
- pad_token_id -> 0
Asked a sequence length of 16, but a sequence length of 1 will be used with use_past == True for `input_ids`.
============= Diagnostic Run torch.onnx.export version 2.0.1+cu117 =============
verbose: False, log level: Level.ERROR
======================= 0 NONE 0 NOTE 0 WARNING 0 ERROR ========================
Asked a sequence length of 16, but a sequence length of 1 will be used with use_past == True for `input_ids`.
Post-processing the exported models...
Validating ONNX model toy/gpt2/decoder_model_merged.onnx...
-[✓] ONNX model output names match reference model (present.0.key, present.0.value, present.2.value, present.1.key, present.1.value, present.2.key, logits)
- Validating ONNX Model output "logits":
-[✓] (2, 16, 1024) matches (2, 16, 1024)
-[✓] all values close (atol: 1e-05)
- Validating ONNX Model output "present.0.key":
-[✓] (2, 4, 16, 32) matches (2, 4, 16, 32)
-[✓] all values close (atol: 1e-05)
- Validating ONNX Model output "present.0.value":
-[✓] (2, 4, 16, 32) matches (2, 4, 16, 32)
-[✓] all values close (atol: 1e-05)
- Validating ONNX Model output "present.1.key":
-[✓] (2, 4, 16, 32) matches (2, 4, 16, 32)
-[✓] all values close (atol: 1e-05)
- Validating ONNX Model output "present.1.value":
-[✓] (2, 4, 16, 32) matches (2, 4, 16, 32)
-[✓] all values close (atol: 1e-05)
- Validating ONNX Model output "present.2.key":
-[✓] (2, 4, 16, 32) matches (2, 4, 16, 32)
-[✓] all values close (atol: 1e-05)
- Validating ONNX Model output "present.2.value":
-[✓] (2, 4, 16, 32) matches (2, 4, 16, 32)
-[✓] all values close (atol: 1e-05)
Validating ONNX model toy/gpt2/decoder_model_merged.onnx...
Asked a sequence length of 16, but a sequence length of 1 will be used with use_past == True for `input_ids`.
-[✓] ONNX model output names match reference model (present.0.key, present.0.value, present.2.value, present.1.key, present.1.value, present.2.key, logits)
- Validating ONNX Model output "logits":
-[✓] (2, 1, 1024) matches (2, 1, 1024)
-[✓] all values close (atol: 1e-05)
- Validating ONNX Model output "present.0.key":
-[✓] (2, 4, 17, 32) matches (2, 4, 17, 32)
-[✓] all values close (atol: 1e-05)
- Validating ONNX Model output "present.0.value":
-[✓] (2, 4, 17, 32) matches (2, 4, 17, 32)
-[✓] all values close (atol: 1e-05)
- Validating ONNX Model output "present.1.key":
-[✓] (2, 4, 17, 32) matches (2, 4, 17, 32)
-[✓] all values close (atol: 1e-05)
- Validating ONNX Model output "present.1.value":
-[✓] (2, 4, 17, 32) matches (2, 4, 17, 32)
-[✓] all values close (atol: 1e-05)
- Validating ONNX Model output "present.2.key":
-[✓] (2, 4, 17, 32) matches (2, 4, 17, 32)
-[✓] all values close (atol: 1e-05)
- Validating ONNX Model output "present.2.value":
-[✓] (2, 4, 17, 32) matches (2, 4, 17, 32)
-[✓] all values close (atol: 1e-05)
The ONNX export succeeded and the exported model was saved at: toy/gpt2
Simplifying ONNX Model...
Checking 1/5...
Checking 2/5...
Checking 3/5...
Checking 4/5...
Checking 5/5...
┏━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┓
┃                 ┃ Original Model ┃ Simplified Model ┃
┡━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━┩
│ Add             │ 33             │ 30               │
│ Cast            │ 11             │ 1                │
│ Concat          │ 40             │ 0                │
│ Constant        │ 343            │ 42               │
│ ConstantOfShape │ 3              │ 0                │
│ Div             │ 10             │ 10               │
│ Gather          │ 53             │ 1                │
│ Gemm            │ 12             │ 12               │
│ Identity        │ 22             │ 0                │
│ MatMul          │ 7              │ 7                │
│ Mul             │ 20             │ 20               │
│ Pow             │ 13             │ 10               │
│ Range           │ 1              │ 0                │
│ ReduceMean      │ 14             │ 14               │
│ Reshape         │ 40             │ 39               │
│ Shape           │ 73             │ 0                │
│ Slice           │ 28             │ 0                │
│ Softmax         │ 3              │ 3                │
│ Split           │ 3              │ 3                │
│ Sqrt            │ 7              │ 7                │
│ Squeeze         │ 22             │ 0                │
│ Sub             │ 11             │ 8                │
│ Tanh            │ 3              │ 3                │
│ Transpose       │ 15             │ 15               │
│ Unsqueeze       │ 78             │ 2                │
│ Where           │ 3              │ 3                │
│ Model Size      │ 4.9MiB         │ 3.4MiB           │
└─────────────────┴────────────────┴──────────────────┘
[1/1] 🔍 Compiling from onnx to dfg
Done in 0.01256042s
✨ Finished in 0.01283372s
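The tensor shapes reported during validation in the log above follow directly from the model config. The sketch below reproduces them, assuming the usual GPT-2 key/value cache layout of (batch, heads, sequence, head dimension). The batch size of 2 and sequence length of 16 are read off the log, and the function names are hypothetical.

```python
def present_shape(batch: int, n_head: int, n_embd: int, seq_len: int, past_len: int = 0) -> tuple:
    """Shape of one present.N.key or present.N.value tensor."""
    head_dim = n_embd // n_head  # 128 // 4 = 32 for the toy config
    # The cache covers the cached positions plus the tokens fed in this step.
    return (batch, n_head, past_len + seq_len, head_dim)


def logits_shape(batch: int, seq_len: int, vocab_size: int) -> tuple:
    """Shape of the logits output: one vocabulary distribution per input token."""
    return (batch, seq_len, vocab_size)


# First validation pass: no past, all 16 tokens fed at once.
no_past = present_shape(batch=2, n_head=4, n_embd=128, seq_len=16)
# Second pass: 16 cached positions plus a single new token.
with_past = present_shape(batch=2, n_head=4, n_embd=128, seq_len=1, past_len=16)

print(no_past, logits_shape(2, 16, 1024))
print(with_past, logits_shape(2, 1, 1024))
```

This also explains the repeated log message about a sequence length of 1 being used with `use_past == True`: once a cache is present, only the newest token is fed to the model, so the logits shrink to one position while the cache grows from 16 to 17.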
https://huggingface.co/docs/diffusers/api/pipelines/stable_diffusion/text2img
https://huggingface.co/docs/diffusers/api/pipelines/stable_diffusion/stable_diffusion_2
$ python3 -m optimum.litmus.multimodal.stable-diffusion -h
usage: FuriosaAI litmus Stable Diffusion using HF Optimum API. [-h] --version {1.5,2.1} [--batch-size BATCH_SIZE] [--latent_shape latent_height latent_width] [--input-len INPUT_LEN] output_dir
positional arguments:
output_dir path to directory to save outputs
options:
-h, --help show this help message and exit
--version {1.5,2.1}, -v {1.5,2.1}
Available model versions
--batch-size BATCH_SIZE, -b BATCH_SIZE
Batch size for latent and prompt inputs
--latent_shape latent_height latent_width
Shape of latent input. Note it is 1/8 of output image sizes
--input-len INPUT_LEN
Length of input prompt
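The help text notes that the latent shape is 1/8 of the output image size. A minimal sketch of that arithmetic follows. The helper name is hypothetical, and the 512x512 and 768x512 image sizes are illustrative choices rather than values taken from the tool.

```python
def latent_shape_for(image_height: int, image_width: int, factor: int = 8) -> tuple:
    """Return (latent_height, latent_width) for a desired output image size."""
    # The latent grid is the image size divided by the downscaling factor,
    # so both dimensions must be exact multiples of it.
    if image_height % factor or image_width % factor:
        raise ValueError(f"image size must be a multiple of {factor}")
    return image_height // factor, image_width // factor


latent_h, latent_w = latent_shape_for(512, 512)
print(f"--latent_shape {latent_h} {latent_w}")

# Non-square outputs work the same way.
print(latent_shape_for(768, 512))
```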
🤗 Optimum is an extension of 🤗 Transformers and Diffusers, providing a set of optimization tools enabling maximum efficiency to train and run models on targeted hardware, while keeping things easy to use.
🤗 Optimum can be installed using pip as follows:
python -m pip install optimum
If you'd like to use the accelerator-specific features of 🤗 Optimum, you can install the required dependencies according to the table below:
Accelerator | Installation |
---|---|
ONNX Runtime | python -m pip install optimum[onnxruntime] |
Intel Neural Compressor | python -m pip install optimum[neural-compressor] |
OpenVINO | python -m pip install optimum[openvino,nncf] |
Habana Gaudi Processor (HPU) | python -m pip install optimum[habana] |
To install from source:
python -m pip install git+https://github.com/huggingface/optimum.git
For the accelerator-specific features, append #egg=optimum[accelerator_type] to the above command:
python -m pip install git+https://github.com/huggingface/optimum.git#egg=optimum[onnxruntime]
🤗 Optimum provides multiple tools to export and run optimized models on various ecosystems:
- ONNX / ONNX Runtime
- TensorFlow Lite
- OpenVINO
- Habana first-gen Gaudi / Gaudi2, more details here
The export and optimizations can be done both programmatically and with a command line.
Features | ONNX Runtime | Neural Compressor | OpenVINO | TensorFlow Lite |
---|---|---|---|---|
Graph optimization | ✔️ | N/A | ✔️ | N/A |
Post-training dynamic quantization | ✔️ | ✔️ | N/A | ✔️ |
Post-training static quantization | ✔️ | ✔️ | ✔️ | ✔️ |
Quantization Aware Training (QAT) | N/A | ✔️ | ✔️ | N/A |
FP16 (half precision) | ✔️ | N/A | ✔️ | ✔️ |
Pruning | N/A | ✔️ | ✔️ | N/A |
Knowledge Distillation | N/A | ✔️ | ✔️ | N/A |
This requires installing the OpenVINO extra: pip install optimum[openvino,nncf]
To load a model and run inference with OpenVINO Runtime, replace your AutoModelForXxx class with the corresponding OVModelForXxx class. To load a PyTorch checkpoint and convert it to the OpenVINO format on the fly, set export=True when loading your model.
- from transformers import AutoModelForSequenceClassification
+ from optimum.intel import OVModelForSequenceClassification
from transformers import AutoTokenizer, pipeline
model_id = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(model_id)
- model = AutoModelForSequenceClassification.from_pretrained(model_id)
+ model = OVModelForSequenceClassification.from_pretrained(model_id, export=True)
model.save_pretrained("./distilbert")
classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)
results = classifier("He's a dreadful magician.")
You can find more examples in the documentation and in the examples.
This requires installing the Neural Compressor extra: pip install optimum[neural-compressor]
Dynamic quantization can be applied on your model:
optimum-cli inc quantize --model distilbert-base-cased-distilled-squad --output ./quantized_distilbert
To load a model quantized with Intel Neural Compressor, hosted locally or on the 🤗 hub, proceed as follows:
from optimum.intel import INCModelForSequenceClassification
model_id = "Intel/distilbert-base-uncased-finetuned-sst-2-english-int8-dynamic"
model = INCModelForSequenceClassification.from_pretrained(model_id)
You can find more examples in the documentation and in the examples.
This requires installing the ONNX Runtime extra: pip install optimum[exporters,onnxruntime]
🤗 Transformers models can be exported to the ONNX format, then graph-optimized and quantized:
optimum-cli export onnx -m deepset/roberta-base-squad2 --optimize O2 roberta_base_qa_onnx
The model can then be quantized using onnxruntime:
optimum-cli onnxruntime quantize \
--avx512 \
--onnx_model roberta_base_qa_onnx \
-o quantized_roberta_base_qa_onnx
These commands export deepset/roberta-base-squad2, perform O2 graph optimization on the exported model, and finally quantize it with the avx512 configuration.
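When the export and quantization steps need to run from a script rather than by hand, the two commands can be assembled as argument lists. The sketch below only builds and prints them. The helper names are hypothetical, the flags are the ones shown in the commands above, and running them for real would require `optimum-cli` to be installed.

```python
import shlex


def export_cmd(model_id: str, output_dir: str, optimize: str = "O2") -> list:
    """Argument list for the ONNX export step with graph optimization."""
    return ["optimum-cli", "export", "onnx", "-m", model_id, "--optimize", optimize, output_dir]


def quantize_cmd(onnx_dir: str, output_dir: str) -> list:
    """Argument list for the ONNX Runtime quantization step (avx512 configuration)."""
    return ["optimum-cli", "onnxruntime", "quantize", "--avx512", "--onnx_model", onnx_dir, "-o", output_dir]


steps = [
    export_cmd("deepset/roberta-base-squad2", "roberta_base_qa_onnx"),
    quantize_cmd("roberta_base_qa_onnx", "quantized_roberta_base_qa_onnx"),
]

for argv in steps:
    # Dry run: print the shell-quoted command instead of executing it.
    # To execute, pass argv to subprocess.run(argv, check=True).
    print(shlex.join(argv))
```

Keeping each command as a list avoids shell quoting problems when a model id or path contains spaces, and the quantize step reuses the export step's output directory as its input.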
For more information on the ONNX export, please check the documentation.
Once the model is exported to the ONNX format, we provide Python classes that let you run the exported ONNX model seamlessly, using ONNX Runtime as the backend:
- from transformers import AutoModelForQuestionAnswering
+ from optimum.onnxruntime import ORTModelForQuestionAnswering
from transformers import AutoTokenizer, pipeline
model_id = "deepset/roberta-base-squad2"
tokenizer = AutoTokenizer.from_pretrained(model_id)
- model = AutoModelForQuestionAnswering.from_pretrained(model_id)
+ model = ORTModelForQuestionAnswering.from_pretrained("roberta_base_qa_onnx")
qa_pipe = pipeline("question-answering", model=model, tokenizer=tokenizer)
question = "What's Optimum?"
context = "Optimum is an awesome library everyone should use!"
results = qa_pipe(question=question, context=context)
More details on how to run ONNX models with ORTModelForXXX classes here.
This requires installing the Exporters extra: pip install optimum[exporters-tf]
Just as for ONNX, it is possible to export models to TensorFlow Lite and quantize them:
optimum-cli export tflite \
-m deepset/roberta-base-squad2 \
--sequence_length 384 \
--quantize int8-dynamic roberta_tflite_model
🤗 Optimum provides wrappers around the original 🤗 Transformers Trainer to enable training on powerful hardware easily. The following providers are supported:
- Habana's Gaudi processors
- ONNX Runtime (optimized for GPUs)
This requires installing the Habana extra: pip install optimum[habana]
- from transformers import Trainer, TrainingArguments
+ from optimum.habana import GaudiTrainer, GaudiTrainingArguments
# Download a pretrained model from the Hub
model = AutoModelForXxx.from_pretrained("bert-base-uncased")
# Define the training arguments
- training_args = TrainingArguments(
+ training_args = GaudiTrainingArguments(
output_dir="path/to/save/folder/",
+ use_habana=True,
+ use_lazy_mode=True,
+ gaudi_config_name="Habana/bert-base-uncased",
...
)
# Initialize the trainer
- trainer = Trainer(
+ trainer = GaudiTrainer(
model=model,
args=training_args,
train_dataset=train_dataset,
...
)
# Use Habana Gaudi processor for training!
trainer.train()
You can find more examples in the documentation and in the examples.
- from transformers import Trainer, TrainingArguments
+ from optimum.onnxruntime import ORTTrainer, ORTTrainingArguments
# Download a pretrained model from the Hub
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")
# Define the training arguments
- training_args = TrainingArguments(
+ training_args = ORTTrainingArguments(
output_dir="path/to/save/folder/",
optim="adamw_ort_fused",
...
)
# Create an ONNX Runtime Trainer
- trainer = Trainer(
+ trainer = ORTTrainer(
model=model,
args=training_args,
train_dataset=train_dataset,
+ feature="sequence-classification", # The model type to export to ONNX
...
)
# Use ONNX Runtime for training!
trainer.train()
You can find more examples in the documentation and in the examples.