- Overview
- Models
- Training Accuracy Results
- Training Performance Results
- Inference Performance Results
- Configuration
- Additional Performance Data for Intel AI Data Center Products
## Overview

This document presents training and inference performance, as well as accuracy results, for several popular AI workloads benchmarked with Intel® Extension for TensorFlow* on Intel GPUs. You can reproduce these results by following the guidelines in the examples.
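Before reproducing the results, it is worth confirming that TensorFlow can see the GPU. The snippet below is a minimal sketch, assuming Intel® Extension for TensorFlow* and a matching GPU driver are already installed; importing the extension registers Intel GPUs as XPU devices.

```python
import tensorflow as tf
import intel_extension_for_tensorflow as itex  # registers Intel GPUs as 'XPU' devices

# Versions used for the benchmarks in this document: TF v2.14.0, ITEX v2.14.0.1.
print("TensorFlow:", tf.__version__)
print("Intel Extension for TensorFlow*:", itex.__version__)

# Each stack of an Intel Data Center GPU appears as a separate XPU device.
print("XPU devices:", tf.config.list_physical_devices("XPU"))

# Place an op explicitly on the first GPU stack as a smoke test.
with tf.device("/XPU:0"):
    x = tf.random.normal([1024, 1024])
    y = tf.matmul(x, x)
print("Computed on:", y.device)
```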
## Models

The following tables provide links to the original code repository and a step-by-step guide for running each model on Intel GPUs.
### Training

Model | Original Model Repo | ITEX Step-by-Step Guide |
---|---|---|
ResNet50v1.5 | TensorFlow-Models/ResNet50v1.5 | Resnet50 train on Intel GPU |
BERT-Large | DeepLearningExamples/BERT | Accelerate BERT-Large Pretraining on Intel GPU |
Mask-RCNN | DeepLearningExamples/Mask-RCNN | Accelerate Mask R-CNN Training on Intel GPU |
3D-UNet | DeepLearningExamples/3D-UNet | Accelerate 3D-UNet Training for medical image segmentation on Intel GPU |
### Inference

Model | Original Model Repo | ITEX Step-by-Step Guide |
---|---|---|
ResNet50v1.5 | Intel-Reference-Models/ResNet50v1.5 | ResNet50v1.5 Model Inference with Intel® Extension for TensorFlow* |
EfficientNet-B0 | Keras-Applications/EfficientNet | Use the same code and instructions as in the original model repo |
EfficientNet-B3 | Keras-Applications/EfficientNet | Use the same code and instructions as in the original model repo |
Mask-RCNN | DeepLearningExamples/Mask-RCNN | Use the same code and instructions as in the original model repo |
Stable Diffusion v1-4 | KerasCV/Stable-Diffusion | Stable Diffusion Inference for Text2Image on Intel GPU |
## Training Accuracy Results

The following table shows BERT-Large throughput, training loss, and time-to-train (TTT) results for both the pre-training and fine-tuning phases on one node with 4x Intel® Data Center GPU Max 1550 (600W OAM, 2 stacks per GPU).
| | Pre-training Phase 1 | Pre-training Phase 2 | Fine-Tuning |
---|---|---|---|
Dataset | Wikipedia and BookCorpus | Wikipedia and BookCorpus | SQuAD 1.1 |
Maximum Sequence Length | 128 | 512 | 384 |
Data Type | BF16 | BF16 | BF16 |
Throughput (sequences/sec) | 3265.35 | 699.25 | 523.55 |
Time to Train (hours) | 39.32 | 20.40 | 0.67 |
Loss | 1.6047 | 1.3870 | 0.6867 |
## Training Performance Results

The following tables show performance numbers for several popular training workloads on one node with 4x Intel® Data Center GPU Max 1550 (600W OAM, 2 stacks per GPU). For each workload, we benchmark both TF32 training and BF16 automatic mixed precision (AMP) training on 1 stack of a single Max 1550, on both stacks of a single Max 1550, and on 4x Max 1550 (8 stacks in total), to showcase the performance boost and scalability delivered by Intel® Extension for TensorFlow* and Intel® Optimization for Horovod*. The AMP speedup is the ratio of BF16 to TF32 throughput, and weak scaling is the ratio of a configuration's throughput to the corresponding single-stack throughput.
Note: For each workload below, the 1x Max 1550 w/ 1-Stack result is the minimum of the results measured on the two stacks of a single GPU, with two instances launched simultaneously and each stack executing the workload independently, without distributed training.
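The exact benchmark scripts are linked in the Models section above. Purely as an illustration, the following sketch shows how BF16 AMP and multi-stack scaling are typically enabled in a Keras training script, assuming Intel® Optimization for Horovod* provides the horovod.tensorflow.keras module and each GPU stack is exposed as one XPU device:

```python
import tensorflow as tf
import intel_extension_for_tensorflow as itex  # noqa: F401 -- registers 'XPU' devices
import horovod.tensorflow.keras as hvd  # provided by Intel Optimization for Horovod*

hvd.init()

# Pin each Horovod rank to one GPU stack; a 2-stack Max 1550 exposes two XPUs,
# so 4x Max 1550 yields the 8 ranks used in the tables below.
xpus = tf.config.list_physical_devices("XPU")
tf.config.set_visible_devices(xpus[hvd.local_rank()], "XPU")

# Enable BF16 automatic mixed precision for Keras layers.
tf.keras.mixed_precision.set_global_policy("mixed_bfloat16")

model = tf.keras.applications.ResNet50(weights=None)
# Horovod recommends the legacy Keras optimizers with recent TF 2.x releases.
optimizer = hvd.DistributedOptimizer(tf.keras.optimizers.legacy.SGD(0.1 * hvd.size()))
model.compile(optimizer=optimizer, loss="sparse_categorical_crossentropy")

# Broadcast initial weights from rank 0 so all ranks start identically;
# launch with, e.g., `mpirun -np 8 python train.py` for 4x Max 1550.
callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]
# model.fit(dataset, steps_per_epoch=5000, callbacks=callbacks)
```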
### ResNet50 v1.5

GPUs | Ranks | Local Batch Size (TF32, BF16) | Training Steps | Throughput w/ TF32 (images/sec) | Throughput w/ BF16 (images/sec) | Throughput Speedup w/ AMP | Weak Scaling w/ TF32 | Weak Scaling w/ BF16 |
---|---|---|---|---|---|---|---|---|
1x Max 1550 w/ 1-Stack | 1 | 256, 512 | 5000 | 918.96 | 1766.53 | 1.92x | 1.00 | 1.00 |
1x Max 1550 w/ 2-Stack | 2 | 256, 512 | 5000 | 1762.76 | 3461.86 | 1.96x | 1.92 | 1.96 |
4x Max 1550 | 8 | 256, 256 | 5000 | NA | 12278.32 | NA | NA | 6.95 |
### BERT-Large

Each rank accumulates gradients over 30 micro-batches of 32 sequences, i.e. an effective batch of 960 sequences per rank per training step.

GPUs | Ranks | Local Batch Size x Accumulation Steps | Training Steps | Throughput w/ TF32 (sequences/sec) | Throughput w/ BF16 (sequences/sec) | Throughput Speedup w/ AMP | Weak Scaling w/ TF32 | Weak Scaling w/ BF16 |
---|---|---|---|---|---|---|---|---|
1x Max 1550 w/ 1-Stack | 1 | 32 x 30 | 20 | 36.22 | 93.22 | 2.57x | 1.00 | 1.00 |
1x Max 1550 w/ 2-Stack | 2 | 32 x 30 | 20 | 74.40 | 182.57 | 2.45x | 2.05 | 1.96 |
4x Max 1550 | 8 | 32 x 30 | 20 | NA | 692.11 | NA | NA | 7.42 |
### Mask R-CNN

GPUs | Ranks | Local Batch Size | Training Steps | Throughput w/ BF16 (images/sec) | Weak Scaling w/ BF16 |
---|---|---|---|---|---|
1x Max 1550 w/ 1-Stack | 1 | 4 | 20 | 29.03 | 1.00 |
1x Max 1550 w/ 2-Stack | 2 | 4 | 20 | 55.51 | 1.91 |
### 3D-UNet

GPUs | Ranks | Local Batch Size | Training Steps | Throughput w/ BF16 (samples/sec) | Weak Scaling w/ BF16 |
---|---|---|---|---|---|
1x Max 1550 w/ 1-Stack | 1 | 1 | 1000 | 12.81 | 1.00 |
1x Max 1550 w/ 2-Stack | 2 | 1 | 1000 | 23.56 | 1.84 |
4x Max 1550 | 8 | 1 | 1000 | 87.07 | 6.80 |
## Inference Performance Results

The following tables show performance numbers for several popular inference workloads on 1x Intel® Data Center GPU Flex 170 (150W PCIe, 1 stack per GPU).
Note: Inference in online mode runs the workload with a batch size of 1, while batch mode uses a larger batch size.
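The distinction matters because small batches leave the GPU under-occupied. As a rough illustration only (not the benchmark methodology itself, which uses the linked model repositories, INT8/FP16 data types, and the step counts in the tables), a throughput measurement over dummy data at the two modes could look like this sketch:

```python
import time
import tensorflow as tf
import intel_extension_for_tensorflow as itex  # noqa: F401 -- registers 'XPU' devices

model = tf.keras.applications.ResNet50(weights=None)  # stand-in model, random weights
infer = tf.function(model)

def images_per_sec(batch_size, steps=50, warmup=10):
    """Time inference over dummy 224x224 images at a given batch size."""
    images = tf.random.normal([batch_size, 224, 224, 3])
    for _ in range(warmup):
        infer(images).numpy()  # exclude graph tracing/compilation from timing
    start = time.time()
    for _ in range(steps):
        infer(images).numpy()  # .numpy() blocks until the result is ready
    return batch_size * steps / (time.time() - start)

print("Online mode (batch size 1):   %.2f images/sec" % images_per_sec(1))
print("Batch mode  (batch size 128): %.2f images/sec" % images_per_sec(128))
```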
### ResNet50 v1.5

GPUs | Dataset | Image Size | Mode | Batch Size | Data Type | Inference Steps | Throughput (images/sec) |
---|---|---|---|---|---|---|---|
1x Flex 170 | Dummy | 224x224 | Online | 1 | INT8 | 5000 | 435.01 |
1x Flex 170 | Dummy | 224x224 | Batch | 1024 | INT8 | 5000 | 9842.75 |
### EfficientNet-B0

GPUs | Dataset | Image Size | Mode | Batch Size | Data Type | Inference Steps | Throughput (images/sec) |
---|---|---|---|---|---|---|---|
1x Flex 170 | Dummy | 224x224 | Batch | 64 | FP16 (AMP) | 50 | 3007.60 |
1x Flex 170 | Dummy | 224x224 | Batch | 128 | FP16 (AMP) | 50 | 3587.29 |
### EfficientNet-B3

GPUs | Dataset | Image Size | Mode | Batch Size | Data Type | Inference Steps | Throughput (images/sec) |
---|---|---|---|---|---|---|---|
1x Flex 170 | Dummy | 300x300 | Batch | 64 | FP16 (AMP) | 50 | 928.56 |
1x Flex 170 | Dummy | 300x300 | Batch | 128 | FP16 (AMP) | 50 | 968.83 |
### Mask R-CNN

GPUs | Dataset | Mode | Batch Size | Data Type | Inference Steps | Throughput (images/sec) |
---|---|---|---|---|---|---|
1x Flex 170 | COCO 2017 | Online | 1 | FP16 (AMP) | 5000 | 19.38 |
1x Flex 170 | COCO 2017 | Batch | 16 | FP16 (AMP) | 312 | 43.02 |
### Stable Diffusion v1-4

GPUs | Dataset | Output Image Size | Mode | Batch Size | Data Type | Diffusion Steps | Throughput (iterations/sec) | Throughput Speedup w/ FP16 |
---|---|---|---|---|---|---|---|---|
1x Flex 170 | Text Prompt | 512x512 | Online | 1 | FP32 | 50 | 2.91 | 1.00x |
1x Flex 170 | Text Prompt | 512x512 | Online | 1 | FP16 (pure) | 50 | 6.53 | 2.24x |
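The text-to-image results above use the KerasCV Stable Diffusion pipeline; the linked guide has the exact configuration. As a hedged sketch, assuming keras_cv is installed, an FP16 run along these lines generates one 512x512 image from a text prompt with 50 diffusion steps:

```python
from tensorflow import keras
import keras_cv
import intel_extension_for_tensorflow as itex  # noqa: F401 -- registers 'XPU' devices

# Enables FP16 compute; the "FP16 (pure)" row above additionally keeps weights
# in FP16 (see the linked guide). Remove this line for the FP32 baseline.
keras.mixed_precision.set_global_policy("mixed_float16")

model = keras_cv.models.StableDiffusion(img_width=512, img_height=512)

# One image per call at batch size 1 (online mode), 50 steps as in the table.
images = model.text_to_image(
    "a photograph of an astronaut riding a horse",
    batch_size=1,
    num_steps=50,
)
```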
## Configuration

Software configuration for training on Intel® Data Center GPU Max 1550:

Software Component | Version |
---|---|
GPU Driver | 736.25 |
Intel® oneAPI Base Toolkit | 2024.0 |
TensorFlow | v2.14.0 |
Intel® Extension for TensorFlow* | v2.14.0.1 |
Intel® Optimization for Horovod* | v0.28.1.2 |
Software configuration for inference on Intel® Data Center GPU Flex 170:

Software Component | Version |
---|---|
GPU Driver | 736.25 |
Intel® oneAPI Base Toolkit | 2024.0 |
TensorFlow | v2.14.0 |
Intel® Extension for TensorFlow* | v2.14.0.1 |
Hardware configuration for training:

GPU System | 4x Intel® Data Center GPU Max 1550 |
---|---|
Number of Nodes | 1 |
Xe®-Cores per GPU | 128 (2 stacks in total) |
Memory Size per GPU | 128 GB HBM2e (2 stacks in total) |
TDP per GPU | 600W |
GPU ECC Setting | OFF |
Server Board | Intel® Denali Pass D50DNP1SBB |
OS | SUSE Linux Enterprise Server 15 SP4 |
Kernel | 5.14.21-150400.24.69-default |
CPU Model | Intel® Xeon® Platinum 8480+ @ 2.00 GHz |
Number of Sockets | 2 |
CPU Cores per Socket | 56 |
Hyper Threading | ON |
Turbo Boost | ON |
Automatic NUMA Balancing | Enabled |
CPU Frequency Governor | Performance |
TDP per CPU | 350W |
Installed Memory | 1024GB (16x64GB 4800 MT/s DDR5) |
NIC | 1x Intel® Ethernet Controller X710 for 10GBASE-T |
Storage | 1x WD® WD_BLACK SN850X 2TB NVMe SSD |
Hardware configuration for inference:

GPU System | 1x Intel® Data Center GPU Flex 170 |
---|---|
Number of Nodes | 1 |
Xe®-Cores per GPU | 32 |
Memory Size per GPU | 16 GB GDDR6 |
TDP per GPU | 150W |
GPU ECC Setting | ON |
Server Board | Intel® Whitley |
OS | Ubuntu 22.04.3 LTS |
Kernel | 5.15.0-57-generic |
CPU Model | Intel® Xeon® Gold 6336Y CPU @ 2.40GHz |
Number of Sockets | 2 |
CPU Cores per Socket | 24 |
Hyper Threading | ON |
Turbo Boost | ON |
Automatic NUMA Balancing | Enabled |
CPU Frequency Governor | Performance |
TDP per CPU | 185W |
Installed Memory | 128GB (8x16GB 3200 MT/s DDR4) |
NIC | 2x Intel® Ethernet Controller X710 for 10GBASE-T, 1x Intel® 82574L Gigabit Ethernet Controller |
Storage | 1x Intel® SSDSC2KG960G8, 1x Samsung® 870 EVO 1TB SSD |
## Additional Performance Data for Intel AI Data Center Products

You can find the latest performance data for other Intel® AI Data Center Products, such as 3rd, 4th, and 5th Gen Intel® Xeon® Scalable processors, via Performance Data for Intel® AI Data Center Products.