- Overview
- Models
- Training Accuracy Results
- Training Performance Results
- Inference Performance Results
- Configuration
- Additional Performance Data for Intel AI Data Center Products
## Overview

This document presents training and inference performance, as well as accuracy results, for several popular AI workloads benchmarked with Intel® Extension for TensorFlow* on Intel GPUs. You can reproduce these results by following the guidelines in the examples.
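Before reproducing the results, it is worth confirming that TensorFlow can see the GPU. The snippet below is a minimal sketch, assuming Intel® Extension for TensorFlow* and a matching GPU driver are already installed; importing the extension registers Intel GPUs as XPU devices.

```python
import tensorflow as tf
import intel_extension_for_tensorflow as itex  # registers Intel GPUs as 'XPU' devices

# Versions used for the benchmarks in this document: TF v2.14.0, ITEX v2.14.0.1.
print("TensorFlow:", tf.__version__)
print("Intel Extension for TensorFlow*:", itex.__version__)

# Each stack of an Intel Data Center GPU appears as a separate XPU device.
print("XPU devices:", tf.config.list_physical_devices("XPU"))

# Place an op explicitly on the first GPU stack as a smoke test.
with tf.device("/XPU:0"):
    x = tf.random.normal([1024, 1024])
    y = tf.matmul(x, x)
print("Computed on:", y.device)
```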
## Models

The following tables provide links to the original code repository and a step-by-step guide for running each model on Intel GPUs.
### Training

Model | Original Model Repo | ITEX Step-by-Step Guide |
---|---|---|
ResNet50v1.5 | TensorFlow-Models/ResNet50v1.5 | Resnet50 train on Intel GPU |
BERT-Large | DeepLearningExamples/BERT | Accelerate BERT-Large Pretraining on Intel GPU |
Mask-RCNN | DeepLearningExamples/Mask-RCNN | Accelerate Mask R-CNN Training on Intel GPU |
3D-UNet | DeepLearningExamples/3D-UNet | Accelerate 3D-UNet Training for medical image segmentation on Intel GPU |
### Inference

Model | Original Model Repo | ITEX Step-by-Step Guide |
---|---|---|
ResNet50v1.5 | Intel-Reference-Models/ResNet50v1.5 | ResNet50v1.5 Model Inference with Intel® Extension for TensorFlow* |
EfficientNet-B0 | Keras-Applications/EfficientNet | Use the same code and instructions as in the original model repo |
EfficientNet-B3 | Keras-Applications/EfficientNet | Use the same code and instructions as in the original model repo |
Mask-RCNN | DeepLearningExamples/Mask-RCNN | Use the same code and instructions as in the original model repo |
Stable Diffusion v1-4 | KerasCV/Stable-Diffusion | Stable Diffusion Inference for Text2Image on Intel GPU |
## Training Accuracy Results

The following table shows BERT-Large throughput, training loss, and time-to-train (TTT) results for both the pre-training and fine-tuning phases on one node with 4x Intel® Data Center GPU Max 1550 (600W OAM, 2 stacks per GPU).
| | Pre-training Phase 1 | Pre-training Phase 2 | Fine-Tuning |
---|---|---|---|
Dataset | Wikipedia and BookCorpus | Wikipedia and BookCorpus | SQuAD 1.1 |
Maximum Sequence Length | 128 | 512 | 384 |
Data Type | BF16 | BF16 | BF16 |
Throughput (sequences/sec) | 3265.35 | 699.25 | 523.55 |
Time to Train (hours) | 39.32 | 20.40 | 0.67 |
Loss | 1.6047 | 1.3870 | 0.6867 |
## Training Performance Results

The following tables show performance numbers for several popular training workloads on one node with 4x Intel® Data Center GPU Max 1550 (600W OAM, 2 stacks per GPU). For each workload, we benchmark both TF32 training and BF16 automatic mixed precision (AMP) training on 1 stack of a single Max 1550, on both stacks of a single Max 1550, and on 4x Max 1550 (8 stacks in total), to showcase the performance boost and scalability delivered by Intel® Extension for TensorFlow* and Intel® Optimization for Horovod*. The AMP speedup is the ratio of BF16 to TF32 throughput, and weak scaling is the ratio of a configuration's throughput to the corresponding single-stack throughput.
Note: For each workload below, the 1x Max 1550 w/ 1-Stack result is the minimum of the results measured on the two stacks of a single GPU, with two instances launched simultaneously and each stack executing the workload independently, without distributed training.
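The exact benchmark scripts are linked in the Models section above. Purely as an illustration, the following sketch shows how BF16 AMP and multi-stack scaling are typically enabled in a Keras training script, assuming Intel® Optimization for Horovod* provides the horovod.tensorflow.keras module and each GPU stack is exposed as one XPU device:

```python
import tensorflow as tf
import intel_extension_for_tensorflow as itex  # noqa: F401 -- registers 'XPU' devices
import horovod.tensorflow.keras as hvd  # provided by Intel Optimization for Horovod*

hvd.init()

# Pin each Horovod rank to one GPU stack; a 2-stack Max 1550 exposes two XPUs,
# so 4x Max 1550 yields the 8 ranks used in the tables below.
xpus = tf.config.list_physical_devices("XPU")
tf.config.set_visible_devices(xpus[hvd.local_rank()], "XPU")

# Enable BF16 automatic mixed precision for Keras layers.
tf.keras.mixed_precision.set_global_policy("mixed_bfloat16")

model = tf.keras.applications.ResNet50(weights=None)
# Horovod recommends the legacy Keras optimizers with recent TF 2.x releases.
optimizer = hvd.DistributedOptimizer(tf.keras.optimizers.legacy.SGD(0.1 * hvd.size()))
model.compile(optimizer=optimizer, loss="sparse_categorical_crossentropy")

# Broadcast initial weights from rank 0 so all ranks start identically;
# launch with, e.g., `mpirun -np 8 python train.py` for 4x Max 1550.
callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]
# model.fit(dataset, steps_per_epoch=5000, callbacks=callbacks)
```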
### ResNet50 v1.5

GPUs | Ranks | Local Batch Size (TF32, BF16) | Training Steps | Throughput w/ TF32 (images/sec) | Throughput w/ BF16 (images/sec) | Throughput Speedup w/ AMP | Weak Scaling w/ TF32 | Weak Scaling w/ BF16 |
---|---|---|---|---|---|---|---|---|
1x Max 1550 w/ 1-Stack | 1 | 256, 512 | 5000 | 918.96 | 1766.53 | 1.92x | 1.00 | 1.00 |
1x Max 1550 w/ 2-Stack | 2 | 256, 512 | 5000 | 1762.76 | 3461.86 | 1.96x | 1.92 | 1.96 |
4x Max 1550 | 8 | 256, 256 | 5000 | NA | 12278.32 | NA | NA | 6.95 |
### BERT-Large

Each rank accumulates gradients over 30 micro-batches of 32 sequences, i.e. an effective batch of 960 sequences per rank per training step.

GPUs | Ranks | Local Batch Size x Accumulation Steps | Training Steps | Throughput w/ TF32 (sequences/sec) | Throughput w/ BF16 (sequences/sec) | Throughput Speedup w/ AMP | Weak Scaling w/ TF32 | Weak Scaling w/ BF16 |
---|---|---|---|---|---|---|---|---|
1x Max 1550 w/ 1-Stack | 1 | 32 x 30 | 20 | 36.22 | 93.22 | 2.57x | 1.00 | 1.00 |
1x Max 1550 w/ 2-Stack | 2 | 32 x 30 | 20 | 74.40 | 182.57 | 2.45x | 2.05 | 1.96 |
4x Max 1550 | 8 | 32 x 30 | 20 | NA | 692.11 | NA | NA | 7.42 |
### Mask R-CNN

GPUs | Ranks | Local Batch Size | Training Steps | Throughput w/ BF16 (images/sec) | Weak Scaling w/ BF16 |
---|---|---|---|---|---|
1x Max 1550 w/ 1-Stack | 1 | 4 | 20 | 29.03 | 1.00 |
1x Max 1550 w/ 2-Stack | 2 | 4 | 20 | 55.51 | 1.91 |
### 3D-UNet

GPUs | Ranks | Local Batch Size | Training Steps | Throughput w/ BF16 (samples/sec) | Weak Scaling w/ BF16 |
---|---|---|---|---|---|
1x Max 1550 w/ 1-Stack | 1 | 1 | 1000 | 12.81 | 1.00 |
1x Max 1550 w/ 2-Stack | 2 | 1 | 1000 | 23.56 | 1.84 |
4x Max 1550 | 8 | 1 | 1000 | 87.07 | 6.80 |
## Inference Performance Results

The following tables show performance numbers for several popular inference workloads on 1x Intel® Data Center GPU Flex 170 (150W PCIe, 1 stack per GPU).
Note: Inference in online mode runs the workload with a batch size of 1, while batch mode uses a larger batch size.
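The distinction matters because small batches leave the GPU under-occupied. As a rough illustration only (not the benchmark methodology itself, which uses the linked model repositories, INT8/FP16 data types, and the step counts in the tables), a throughput measurement over dummy data at the two modes could look like this sketch:

```python
import time
import tensorflow as tf
import intel_extension_for_tensorflow as itex  # noqa: F401 -- registers 'XPU' devices

model = tf.keras.applications.ResNet50(weights=None)  # stand-in model, random weights
infer = tf.function(model)

def images_per_sec(batch_size, steps=50, warmup=10):
    """Time inference over dummy 224x224 images at a given batch size."""
    images = tf.random.normal([batch_size, 224, 224, 3])
    for _ in range(warmup):
        infer(images).numpy()  # exclude graph tracing/compilation from timing
    start = time.time()
    for _ in range(steps):
        infer(images).numpy()  # .numpy() blocks until the result is ready
    return batch_size * steps / (time.time() - start)

print("Online mode (batch size 1):   %.2f images/sec" % images_per_sec(1))
print("Batch mode  (batch size 128): %.2f images/sec" % images_per_sec(128))
```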
### ResNet50 v1.5

GPUs | Dataset | Image Size | Mode | Batch Size | Data Type | Inference Steps | Throughput (images/sec) |
---|---|---|---|---|---|---|---|
1x Flex 170 | Dummy | 224x224 | Online | 1 | INT8 | 5000 | 435.01 |
1x Flex 170 | Dummy | 224x224 | Batch | 1024 | INT8 | 5000 | 9842.75 |
### EfficientNet-B0

GPUs | Dataset | Image Size | Mode | Batch Size | Data Type | Inference Steps | Throughput (images/sec) |
---|---|---|---|---|---|---|---|
1x Flex 170 | Dummy | 224x224 | Batch | 64 | FP16 (AMP) | 50 | 3007.60 |
1x Flex 170 | Dummy | 224x224 | Batch | 128 | FP16 (AMP) | 50 | 3587.29 |
### EfficientNet-B3

GPUs | Dataset | Image Size | Mode | Batch Size | Data Type | Inference Steps | Throughput (images/sec) |
---|---|---|---|---|---|---|---|
1x Flex 170 | Dummy | 300x300 | Batch | 64 | FP16 (AMP) | 50 | 928.56 |
1x Flex 170 | Dummy | 300x300 | Batch | 128 | FP16 (AMP) | 50 | 968.83 |
### Mask R-CNN

GPUs | Dataset | Mode | Batch Size | Data Type | Inference Steps | Throughput (images/sec) |
---|---|---|---|---|---|---|
1x Flex 170 | COCO 2017 | Online | 1 | FP16 (AMP) | 5000 | 19.38 |
1x Flex 170 | COCO 2017 | Batch | 16 | FP16 (AMP) | 312 | 43.02 |
### Stable Diffusion v1-4

GPUs | Dataset | Output Image Size | Mode | Batch Size | Data Type | Diffusion Steps | Throughput (iterations/sec) | Throughput Speedup w/ FP16 |
---|---|---|---|---|---|---|---|---|
1x Flex 170 | Text Prompt | 512x512 | Online | 1 | FP32 | 50 | 2.91 | 1.00x |
1x Flex 170 | Text Prompt | 512x512 | Online | 1 | FP16 (pure) | 50 | 6.53 | 2.24x |
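The text-to-image results above use the KerasCV Stable Diffusion pipeline; the linked guide has the exact configuration. As a hedged sketch, assuming keras_cv is installed, an FP16 run along these lines generates one 512x512 image from a text prompt with 50 diffusion steps:

```python
from tensorflow import keras
import keras_cv
import intel_extension_for_tensorflow as itex  # noqa: F401 -- registers 'XPU' devices

# Enables FP16 compute; the "FP16 (pure)" row above additionally keeps weights
# in FP16 (see the linked guide). Remove this line for the FP32 baseline.
keras.mixed_precision.set_global_policy("mixed_float16")

model = keras_cv.models.StableDiffusion(img_width=512, img_height=512)

# One image per call at batch size 1 (online mode), 50 steps as in the table.
images = model.text_to_image(
    "a photograph of an astronaut riding a horse",
    batch_size=1,
    num_steps=50,
)
```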
## Configuration

Software configuration for training on Intel® Data Center GPU Max 1550:

Software Component | Version |
---|---|
GPU Driver | 736.25 |
Intel® oneAPI Base Toolkit | 2024.0 |
TensorFlow | v2.14.0 |
Intel® Extension for TensorFlow* | v2.14.0.1 |
Intel® Optimization for Horovod* | v0.28.1.2 |
Software configuration for inference on Intel® Data Center GPU Flex 170:

Software Component | Version |
---|---|
GPU Driver | 736.25 |
Intel® oneAPI Base Toolkit | 2024.0 |
TensorFlow | v2.14.0 |
Intel® Extension for TensorFlow* | v2.14.0.1 |
Hardware configuration for training:

GPU System | 4x Intel® Data Center GPU Max 1550 |
---|---|
Number of Nodes | 1 |
Xe®-Cores per GPU | 128 (2 stacks in total) |
Memory Size per GPU | 128 GB HBM2e (2 stacks in total) |
TDP per GPU | 600W |
GPU ECC Setting | OFF |
Server Board | Intel® Denali Pass D50DNP1SBB |
OS | SUSE Linux Enterprise Server 15 SP4 |
Kernel | 5.14.21-150400.24.69-default |
CPU Model | Intel® Xeon® Platinum 8480+ @ 2.00 GHz |
Number of Sockets | 2 |
CPU Cores per Socket | 56 |
Hyper Threading | ON |
Turbo Boost | ON |
Automatic NUMA Balancing | Enabled |
CPU Frequency Governor | Performance |
TDP per CPU | 350W |
Installed Memory | 1024GB (16x64GB 4800 MT/s DDR5) |
NIC | 1x Intel® Ethernet Controller X710 for 10GBASE-T |
Storage | 1x WD® WD_BLACK SN850X 2TB NVMe SSD |
Hardware configuration for inference:

GPU System | 1x Intel® Data Center GPU Flex 170 |
---|---|
Number of Nodes | 1 |
Xe®-Cores per GPU | 32 |
Memory Size per GPU | 16 GB GDDR6 |
TDP per GPU | 150W |
GPU ECC Setting | ON |
Server Board | Intel® Whitley |
OS | Ubuntu 22.04.3 LTS |
Kernel | 5.15.0-57-generic |
CPU Model | Intel® Xeon® Gold 6336Y CPU @ 2.40GHz |
Number of Sockets | 2 |
CPU Cores per Socket | 24 |
Hyper Threading | ON |
Turbo Boost | ON |
Automatic NUMA Balancing | Enabled |
CPU Frequency Governor | Performance |
TDP per CPU | 185W |
Installed Memory | 128GB (8x16GB 3200 MT/s DDR4) |
NIC | 2x Intel® Ethernet Controller X710 for 10GBASE-T, 1x Intel® 82574L Gigabit Ethernet Controller |
Storage | 1x Intel® SSDSC2KG960G8, 1x Samsung® 870 EVO 1TB SSD |
## Additional Performance Data for Intel AI Data Center Products

You can find the latest performance data for other Intel® AI Data Center Products, such as 3rd, 4th, and 5th Gen Intel® Xeon® Scalable processors, via Performance Data for Intel® AI Data Center Products.