# Examples

| Scenario | Model | Examples | Hardware: Targeted Optimization |
| --- | --- | --- | --- |
| NLP | llama2 | Link | CPU: with ONNX Runtime optimizations for an optimized FP32 ONNX model<br>CPU: with ONNX Runtime optimizations for an optimized INT8 ONNX model<br>CPU: with ONNX Runtime optimizations for an optimized INT4 ONNX model<br>GPU: with ONNX Runtime optimizations for an optimized FP16 ONNX model<br>GPU: with ONNX Runtime optimizations for an optimized INT4 ONNX model<br>GPU: with QLoRA for model fine-tuning and ONNX Runtime optimizations for an optimized ONNX model<br>AzureML compute: with AzureML compute to fine-tune and optimize for your local GPUs |
| | mistral | Link | CPU: with Optimum conversion, ONNX Runtime optimizations, and Intel® Neural Compressor static quantization for an optimized INT8 ONNX model<br>GPU: with ONNX Runtime optimizations for an optimized FP16 ONNX model |
| | open llama | Link | GPU: with Optimum conversion and merging and ONNX Runtime optimizations for an optimized ONNX model<br>GPU: with SparseGPT and TorchTRT conversion for an optimized PyTorch model with sparsity<br>AzureML compute: with Optimum conversion and merging and ONNX Runtime optimizations in AzureML<br>CPU: with Optimum conversion and merging, ONNX Runtime optimizations, and Intel® Neural Compressor 4-bit weight-only quantization for an optimized INT4 ONNX model |
| | phi2 | Link | CPU: with ONNX Runtime optimizations for FP32/INT4<br>GPU: with ONNX Runtime optimizations for FP16/INT4, and PyTorch QLoRA for model fine-tuning<br>GPU: with SliceGPT for an optimized PyTorch model with sparsity |
| | falcon | Link | GPU: with ONNX Runtime optimizations for an optimized FP16 ONNX model |
| | red pajama | Link | CPU: with Optimum conversion and merging and ONNX Runtime optimizations for a single optimized ONNX model |
| | bert | Link | CPU: with ONNX Runtime optimizations and quantization for an optimized INT8 ONNX model<br>CPU: with ONNX Runtime optimizations and Intel® Neural Compressor quantization for an optimized INT8 ONNX model<br>CPU: with PyTorch QAT (customized training loop) and ONNX Runtime optimizations for an optimized INT8 ONNX model<br>GPU: with ONNX Runtime optimizations for the CUDA EP<br>GPU: with ONNX Runtime optimizations for the TensorRT EP |
| | deberta | Link | GPU: optimize an AzureML Registry model with ONNX Runtime optimizations and quantization |
| | gptj | Link | CPU: with Intel® Neural Compressor static/dynamic quantization for an INT8 ONNX model |
| Audio | whisper | Link | CPU: with ONNX Runtime optimizations for an all-in-one ONNX model in FP32<br>CPU: with ONNX Runtime optimizations for an all-in-one ONNX model in INT8<br>CPU: with ONNX Runtime optimizations and Intel® Neural Compressor dynamic quantization for an all-in-one ONNX model in INT8<br>GPU: with ONNX Runtime optimizations for an all-in-one ONNX model in FP32<br>GPU: with ONNX Runtime optimizations for an all-in-one ONNX model in FP16<br>GPU: with ONNX Runtime optimizations for an all-in-one ONNX model in INT8 |
| | audio spectrogram transformer | Link | CPU: with ONNX Runtime optimizations and quantization for an optimized INT8 ONNX model |
| Vision | stable diffusion<br>stable diffusion XL | Link | GPU: with ONNX Runtime optimizations for the DirectML EP<br>GPU: with ONNX Runtime optimizations for the CUDA EP<br>Intel CPU: with the OpenVINO toolkit |
| | squeezenet | Link | GPU: with ONNX Runtime optimizations for the DirectML EP |
| | mobilenet | Link | Qualcomm NPU: with ONNX Runtime static QDQ quantization for the ONNX Runtime QNN EP |
| | resnet | Link | CPU: with ONNX Runtime static/dynamic quantization for an INT8 ONNX model<br>CPU: with PyTorch QAT (default training loop) and ONNX Runtime optimizations for an INT8 ONNX model<br>CPU: with PyTorch QAT (Lightning module) and ONNX Runtime optimizations for an INT8 ONNX model<br>AMD DPU: with AMD Vitis AI quantization<br>Intel GPU: with ONNX Runtime optimizations for multiple EPs |
| | VGG | Link | Qualcomm NPU: with the SNPE toolkit |
| | inception | Link | Qualcomm NPU: with the SNPE toolkit |
| | super resolution | Link | CPU: with ONNX Runtime pre/post-processing integration for a single ONNX model |