Microsoft Phi-3 has been optimized for ONNX Runtime and supports Windows DirectML. It works well across various hardware types, including GPUs, CPUs, and even mobile devices.
Specifically, the supported hardware includes:
- GPU SKU: RTX 4090 (DirectML)
- GPU SKU: 1 A100 80GB (CUDA)
- CPU SKU: Standard F64s v2 (64 vCPUs, 128 GiB memory)
- Android - Samsung Galaxy S21
- Apple iPhone 14 or higher with A16/A17 processor
Minimum configuration required:
- Windows: DirectX 12-capable GPU and a minimum of 4GB of combined RAM
- CUDA: NVIDIA GPU with Compute Capability >= 7.0
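As a quick sanity check of the CUDA requirement above, the sketch below parses the compute capability reported by `nvidia-smi` (the `--query-gpu=compute_cap` field is available in recent NVIDIA drivers). The helper names are illustrative and not part of any Phi-3 tooling:

```python
import subprocess

MIN_CAPABILITY = (7, 0)  # CUDA requirement from the list above

def meets_min_capability(smi_output: str, minimum=MIN_CAPABILITY) -> bool:
    """Return True if any reported GPU meets the minimum compute capability.

    `smi_output` is the text produced by:
        nvidia-smi --query-gpu=compute_cap --format=csv,noheader
    e.g. "8.6\n7.5\n" on a 2-GPU machine.
    """
    caps = []
    for line in smi_output.strip().splitlines():
        major, minor = line.strip().split(".")
        caps.append((int(major), int(minor)))
    return any(cap >= minimum for cap in caps)

def check_local_gpus() -> bool:
    """Query the local driver and check the requirement (needs nvidia-smi)."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=compute_cap", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    ).stdout
    return meets_min_capability(out)
```

Tuple comparison makes the version check correct for mixed fleets: `(8, 6) >= (7, 0)` holds GPU by GPU, and one qualifying device is enough.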
Currently available Phi-3 ONNX models target a single GPU. Multi-GPU support for Phi-3 is possible, but ONNX Runtime running across 2 GPUs doesn't guarantee more throughput than 2 separate ONNX Runtime instances.
At Build 2024, the GenAI ONNX team announced that they had enabled multi-instance rather than multi-GPU support for Phi models. At present, this lets you run one onnxruntime or onnxruntime-genai instance per GPU, selected with the CUDA_VISIBLE_DEVICES environment variable like this:
```bash
CUDA_VISIBLE_DEVICES=0 python infer.py
CUDA_VISIBLE_DEVICES=1 python infer.py
```
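The two commands above can also be scripted. Below is a minimal launcher sketch that pins each process to one GPU via CUDA_VISIBLE_DEVICES; the `infer.py` name is taken from the example above, and the helper functions are illustrative, not part of onnxruntime-genai:

```python
import os
import subprocess

def pinned_env(gpu_id: int) -> dict:
    """Copy the current environment with CUDA_VISIBLE_DEVICES set to one GPU."""
    env = dict(os.environ)
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
    return env

def launch_per_gpu(script: str, gpu_ids) -> list:
    """Start one inference process per GPU id; each process sees one device."""
    return [subprocess.Popen(["python", script], env=pinned_env(g))
            for g in gpu_ids]

# Example: two independent instances, one per GPU
# procs = launch_per_gpu("infer.py", [0, 1])
# for p in procs:
#     p.wait()
```

Because each process sees only its own device, the instances are fully independent, which is exactly the multi-instance pattern described above.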
Feel free to explore Phi-3 further in Azure AI Studio.