Performance Data

Overview

This document presents training and inference performance, as well as accuracy results, for several popular AI workloads running with Intel® Extension for TensorFlow* on Intel GPUs. You can reproduce these results by following the guidelines in examples.

Models

The following tables link to the original code repository for each model and to the step-by-step guide for running it on Intel GPUs.

Training Workloads

| Model | Original Model Repo | ITEX Step-by-Step Guide |
| --- | --- | --- |
| ResNet50v1.5 | TensorFlow-Models/ResNet50v1.5 | Resnet50 train on Intel GPU |
| BERT-Large | DeepLearningExamples/BERT | Accelerate BERT-Large Pretraining on Intel GPU |
| Mask-RCNN | DeepLearningExamples/Mask-RCNN | Accelerate Mask R-CNN Training on Intel GPU |
| 3D-UNet | DeepLearningExamples/3D-UNet | Accelerate 3D-UNet Training for medical image segmentation on Intel GPU |

Inference Workloads

| Model | Original Model Repo | ITEX Step-by-Step Guide |
| --- | --- | --- |
| ResNet50v1.5 | Intel-Reference-Models/ResNet50v1.5 | ResNet50v1.5 Model Inference with Intel® Extension for TensorFlow* |
| EfficientNet-B0 | Keras-Applications/EfficientNet | Use the same code and instructions as in the original model repo |
| EfficientNet-B3 | Keras-Applications/EfficientNet | Use the same code and instructions as in the original model repo |
| Mask-RCNN | DeepLearningExamples/Mask-RCNN | Use the same code and instructions as in the original model repo |
| Stable Diffusion v1-4 | KerasCV/Stable-Diffusion | Stable Diffusion Inference for Text2Image on Intel GPU |

Training Accuracy Results

Training Accuracy on 1-node of 4x Intel Data Center GPU Max 1550

The following table shows the BERT-Large performance, training loss and time-to-train (TTT) results for both the pre-training and fine-tuning phases on 1-node of 4x Intel® Data Center GPU Max 1550 (600W OAM, 2-stack for each GPU).

| | Pre-training Phase1 | Pre-training Phase2 | Fine-Tuning |
| --- | --- | --- | --- |
| Dataset | Wikipedia and BookCorpus | Wikipedia and BookCorpus | SQuAD 1.1 |
| Maximum Sequence Length | 128 | 512 | 384 |
| Data Type | BF16 | BF16 | BF16 |
| Throughput (sequences/sec) | 3265.35 | 699.25 | 523.55 |
| Time to Train (hours) | 39.32 | 20.40 | 0.67 |
| Loss | 1.6047 | 1.3870 | 0.6867 |

Training Performance Results

Training Performance on 1-node of 4x Intel Data Center GPU Max 1550

The following tables show the performance numbers for several popular training workloads on 1-node of 4x Intel® Data Center GPU Max 1550 (600W OAM, 2-stack for each GPU). For these workloads, we benchmark both FP32 training and BF16 automatic mixed precision (AMP) training on 1 stack of 1x Max 1550, on 2 stacks of 1x Max 1550, and on 4x Max 1550 (8 stacks in total), to showcase the performance boost and scalability with Intel® Extension for TensorFlow* and Intel® Optimization for Horovod*.

Note: For each workload below, the result for 1x Max 1550 w/ 1-Stack is the lower of the two per-stack results on a single GPU. Two instances are launched simultaneously, and each stack of the GPU runs the workload separately, without distributed training.

ResNet50v1-5 Training Performance Results

| GPUs | Ranks | Local Batch Size (FP32, BF16) | Training Steps | Throughput w/ TF32 (images/sec) | Throughput w/ BF16 (images/sec) | Throughput Speedup w/ AMP | Weak Scaling w/ TF32 | Weak Scaling w/ BF16 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1x Max 1550 w/ 1-Stack | 1 | 256, 512 | 5000 | 918.96 | 1766.53 | 1.92x | 1.00 | 1.00 |
| 1x Max 1550 w/ 2-Stack | 2 | 256, 512 | 5000 | 1762.76 | 3461.86 | 1.96x | 1.92 | 1.96 |
| 4x Max 1550 | 8 | 256, 256 | 5000 | NA | 12278.32 | NA | NA | 6.95 |
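
The derived columns in the table above appear to be simple ratios of the measured throughputs. The sketch below assumes that "Throughput Speedup w/ AMP" is BF16 throughput divided by TF32 throughput on the same configuration, and that "Weak Scaling" is throughput divided by the 1-stack throughput at the same precision. These definitions are inferred from the published numbers, not taken from the benchmark scripts, and the function names are illustrative.

```python
def speedup(bf16_throughput: float, tf32_throughput: float) -> float:
    """BF16 (AMP) throughput relative to TF32 on the same configuration (assumed definition)."""
    return bf16_throughput / tf32_throughput


def weak_scaling(throughput: float, baseline_throughput: float) -> float:
    """Throughput relative to the 1-stack baseline at the same precision (assumed definition)."""
    return throughput / baseline_throughput


# ResNet50v1.5 throughputs copied from the table above (images/sec).
tf32 = {"1-stack": 918.96, "2-stack": 1762.76}
bf16 = {"1-stack": 1766.53, "2-stack": 3461.86, "4x GPUs": 12278.32}

amp_speedup_1stack = round(speedup(bf16["1-stack"], tf32["1-stack"]), 2)
scaling_bf16_2stack = round(weak_scaling(bf16["2-stack"], bf16["1-stack"]), 2)
scaling_bf16_4x = round(weak_scaling(bf16["4x GPUs"], bf16["1-stack"]), 2)

print(amp_speedup_1stack, scaling_bf16_2stack, scaling_bf16_4x)  # 1.92 1.96 6.95
```

Under these assumptions the computed values match the table, and the same two ratios reproduce the derived columns of the other training tables in this section.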

BERT-Large Phase2 Training Performance Results

| GPUs | Ranks | Local Batch Size x Accumulation Steps | Training Steps | Throughput w/ TF32 (sequences/sec) | Throughput w/ BF16 (sequences/sec) | Throughput Speedup w/ AMP | Weak Scaling w/ TF32 | Weak Scaling w/ BF16 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1x Max 1550 w/ 1-Stack | 1 | 32 x 30 | 20 | 36.22 | 93.22 | 2.57x | 1.00 | 1.00 |
| 1x Max 1550 w/ 2-Stack | 2 | 32 x 30 | 20 | 74.40 | 182.57 | 2.45x | 2.05 | 1.96 |
| 4x Max 1550 | 8 | 32 x 30 | 20 | NA | 692.11 | NA | NA | 7.42 |
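
The "Local Batch Size x Accumulation Steps" column implies a growing effective batch size as ranks are added. The sketch below assumes the common data-parallel convention in which each rank accumulates gradients over several local batches before one synchronized optimizer update. The source does not state the global batch size, so treat these figures as illustrative.

```python
def global_batch_size(local_batch: int, accumulation_steps: int, ranks: int) -> int:
    """Sequences consumed per optimizer update across all ranks (assumed convention)."""
    return local_batch * accumulation_steps * ranks


# Rank counts copied from the BERT-Large Phase2 table above.
configs = {
    "1x Max 1550 w/ 1-Stack": 1,
    "1x Max 1550 w/ 2-Stack": 2,
    "4x Max 1550": 8,
}

effective = {
    name: global_batch_size(local_batch=32, accumulation_steps=30, ranks=ranks)
    for name, ranks in configs.items()
}

for name, size in effective.items():
    print(f"{name}: {size} sequences per update")
```

Because the per-rank workload stays fixed at 32 x 30 while the rank count grows, this is a weak-scaling setup, which is consistent with the column names in the table.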

Mask-RCNN Training Performance Results

| GPUs | Ranks | Local Batch Size | Training Steps | Throughput w/ BF16 (images/sec) | Weak Scaling w/ BF16 |
| --- | --- | --- | --- | --- | --- |
| 1x Max 1550 w/ 1-Stack | 1 | 4 | 20 | 29.03 | 1.00 |
| 1x Max 1550 w/ 2-Stack | 2 | 4 | 20 | 55.51 | 1.91 |

Medical Image 3D U-Net Training Performance Results

| GPUs | Ranks | Local Batch Size | Training Steps | Throughput w/ BF16 (samples/sec) | Weak Scaling w/ BF16 |
| --- | --- | --- | --- | --- | --- |
| 1x Max 1550 w/ 1-Stack | 1 | 1 | 1000 | 12.81 | 1.00 |
| 1x Max 1550 w/ 2-Stack | 2 | 1 | 1000 | 23.56 | 1.84 |
| 4x Max 1550 | 8 | 1 | 1000 | 87.07 | 6.80 |

Inference Performance Results

Inference Performance on 1x Intel Data Center GPU Flex 170

The following tables show the performance numbers for several popular inference workloads on 1x Intel® Data Center GPU Flex 170 (150W PCIe, 1-stack for each GPU).

Note: Online mode refers to running inference with a batch size of 1, while batch mode uses a larger batch size.
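
To make the online versus batch distinction concrete, here is a minimal timing sketch. It is not the method used by the benchmark scripts: `measure_throughput` and `fake_step` are hypothetical names, and `fake_step` simply sleeps as a stand-in for a model forward pass. With a real GPU workload you would also need to make sure each step has finished executing before stopping the timer.

```python
import time
from typing import Callable


def measure_throughput(
    run_step: Callable[[int], None],
    batch_size: int,
    steps: int,
    warmup: int = 5,
) -> float:
    """Return samples/sec over `steps` timed calls of `run_step(batch_size)`."""
    # Warm-up calls are excluded from timing (e.g. graph tracing, cache fills).
    for _ in range(warmup):
        run_step(batch_size)
    start = time.perf_counter()
    for _ in range(steps):
        run_step(batch_size)
    elapsed = time.perf_counter() - start
    return batch_size * steps / elapsed


def fake_step(batch_size: int) -> None:
    """Hypothetical stand-in for one inference step; sleeps instead of computing."""
    time.sleep(0.001)


online = measure_throughput(fake_step, batch_size=1, steps=20)   # online mode
batch = measure_throughput(fake_step, batch_size=64, steps=20)   # batch mode

print(f"online: {online:.1f} samples/sec, batch: {batch:.1f} samples/sec")
```

In this toy setup the step time does not depend on batch size, so batch mode looks far faster. On real hardware the per-step time grows with batch size, and the gap between the two modes reflects how well the device is utilized.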

ResNet50v1-5 Inference Performance Results

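| GPUs | Dataset | Image Size | Mode | Batch Size | Data Type | Inference Steps | Throughput (images/sec) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1x Flex 170 | Dummy | 224x224 | Online | 1 | INT8 | 5000 | 435.01 |
| 1x Flex 170 | Dummy | 224x224 | Batch | 1024 | INT8 | 5000 | 9842.75 |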

EfficientNet-B0 Inference Performance Results

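| GPUs | Dataset | Image Size | Mode | Batch Size | Data Type | Inference Steps | Throughput (images/sec) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1x Flex 170 | Dummy | 224x224 | Batch | 64 | FP16 (AMP) | 50 | 3007.60 |
| 1x Flex 170 | Dummy | 224x224 | Batch | 128 | FP16 (AMP) | 50 | 3587.29 |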

EfficientNet-B3 Inference Performance Results

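| GPUs | Dataset | Image Size | Mode | Batch Size | Data Type | Inference Steps | Throughput (images/sec) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1x Flex 170 | Dummy | 300x300 | Batch | 64 | FP16 (AMP) | 50 | 928.56 |
| 1x Flex 170 | Dummy | 300x300 | Batch | 128 | FP16 (AMP) | 50 | 968.83 |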

Mask-RCNN Inference Performance Results

| GPUs | Dataset | Mode | Batch Size | Data Type | Inference Steps | Throughput (images/sec) |
| --- | --- | --- | --- | --- | --- | --- |
| 1x Flex 170 | COCO 2017 | Online | 1 | FP16 (AMP) | 5000 | 19.38 |
| 1x Flex 170 | COCO 2017 | Batch | 16 | FP16 (AMP) | 312 | 43.02 |

Stable Diffusion v1-4 Inference Performance Results

| GPUs | Dataset | Output Image Size | Mode | Batch Size | Data Type | Diffusion Steps | Throughput (iterations/sec) | Throughput Speedup w/ FP16 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1x Flex 170 | Text Prompt | 512x512 | Online | 1 | FP32 | 50 | 2.91 | 1.00x |
| 1x Flex 170 | Text Prompt | 512x512 | Online | 1 | FP16 (pure) | 50 | 6.53 | 2.24x |

Configuration

Software Configuration

Software Configuration for Intel Max 1550 GPU

| Software Component | Version |
| --- | --- |
| GPU Driver | 736.25 |
| Intel® oneAPI Base Toolkit | 2024.0 |
| TensorFlow | v2.14.0 |
| Intel® Extension for TensorFlow* | v2.14.0.1 |
| Intel® Optimization for Horovod* | v0.28.1.2 |

Software Configuration for Intel Flex 170 GPU

| Software Component | Version |
| --- | --- |
| GPU Driver | 736.25 |
| Intel® oneAPI Base Toolkit | 2024.0 |
| TensorFlow | v2.14.0 |
| Intel® Extension for TensorFlow* | v2.14.0.1 |
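
When reproducing these results, it can help to confirm that the installed Python packages match the versions listed above. The sketch below uses the standard-library `importlib.metadata` module. The distribution names in `EXPECTED` are assumed to be the pip package names for these components, and Horovod applies only to the Max 1550 configuration. The GPU driver and oneAPI toolkit are not Python packages and are not covered here.

```python
from importlib import metadata
from typing import Dict, Iterable, Optional

# Assumed pip distribution names mapped to the versions in the tables above.
EXPECTED = {
    "tensorflow": "2.14.0",
    "intel-extension-for-tensorflow": "2.14.0.1",
    "intel-optimization-for-horovod": "0.28.1.2",
}


def installed_versions(packages: Iterable[str]) -> Dict[str, Optional[str]]:
    """Map each distribution name to its installed version, or None if absent."""
    versions: Dict[str, Optional[str]] = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


found = installed_versions(EXPECTED)
for name, wanted in EXPECTED.items():
    have = found[name]
    if have is None:
        status = "not installed"
    elif have == wanted:
        status = "ok"
    else:
        status = f"mismatch (found {have})"
    print(f"{name}: expected {wanted}, {status}")
```

A mismatch does not necessarily mean the workloads will fail, but throughput measured on a different software stack may not be comparable to the numbers in this document.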

Hardware Configuration

Hardware Configuration for Intel Max 1550 GPU

| Component | Specification |
| --- | --- |
| GPU System | 4x Intel® Data Center GPU Max 1550 |
| Number of Nodes | 1 |
| Xe®-Cores per GPU | 128 in total 2-Stack |
| Memory Size per GPU | 128 GB HBM2e in total 2-Stack |
| TDP per GPU | 600W |
| GPU ECC Setting | OFF |
| Server Board | Intel® Denali Pass D50DNP1SBB |
| OS | SUSE Linux Enterprise Server 15 SP4 |
| Kernel | 5.14.21-150400.24.69-default |
| CPU Model | Intel® Xeon® Platinum 8480+ @ 2.00 GHz |
| Number of Sockets | 2 |
| CPU Cores per Socket | 56 |
| Hyper Threading | ON |
| Turbo Boost | ON |
| Automatic NUMA Balancing | Enabled |
| CPU Frequency Governor | Performance |
| TDP per CPU | 350W |
| Installed Memory | 1024GB (16x64GB 4800 MT/s DDR5) |
| NIC | 1x Intel® Ethernet Controller X710 for 10GBASE-T |
| Storage | 1x WD® WD_BLACK SN850X 2TB NVMe SSD |

Hardware Configuration for Intel Flex 170 GPU

| Component | Specification |
| --- | --- |
| GPU System | 1x Intel® Data Center GPU Flex 170 |
| Number of Nodes | 1 |
| Xe®-Cores per GPU | 32 |
| Memory Size per GPU | 16 GB GDDR6 |
| TDP per GPU | 150W |
| GPU ECC Setting | ON |
| Server Board | Intel® Whitley |
| OS | Ubuntu 22.04.3 LTS |
| Kernel | 5.15.0-57-generic |
| CPU Model | Intel® Xeon® Gold 6336Y CPU @ 2.40GHz |
| Number of Sockets | 2 |
| CPU Cores per Socket | 24 |
| Hyper Threading | ON |
| Turbo Boost | ON |
| Automatic NUMA Balancing | Enabled |
| CPU Frequency Governor | Performance |
| TDP per CPU | 185W |
| Installed Memory | 128GB (8x16GB 3200 MT/s DDR4) |
| NIC | 2x Intel® Ethernet Controller X710 for 10GBASE-T, 1x Intel® 82574L Gigabit Ethernet Controller |
| Storage | 1x Intel® SSDSC2KG960G8, 1x Samsung® 870 EVO 1TB SSD |

Additional Performance Data for Intel AI Data Center Products

You can find the latest performance data for other Intel® AI Data Center Products, such as 3rd, 4th, and 5th Gen Intel® Xeon® Scalable processors, at Performance Data for Intel® AI Data Center Products.