Performance Benchmarks of CK #485
zjing14 announced in Announcements
Benchmarks of AIT+CK on AMD MI250 GPUs vs. TensorRT (TRT) and FasterTransformer (FT) on NVIDIA A100 GPUs
System Information
(2 sockets, 64 cores per socket, 2 threads per core),
configured with 1 NUMA node per socket
AIT repo: https://github.com/ROCmSoftwarePlatform/AITemplate.git,
commit: f940d9b7ac8b976fba127e2c269dc5b368f30e4e
CK repo: https://github.com/ROCmSoftwarePlatform/composable_kernel.git,
commit: 40942b9
(2 sockets, 64 cores per socket, 2 threads per core),
configured with 4 NUMA nodes per socket
(NVIDIA CUDA 11.8.0, TensorRT 8.5.0.12, PyTorch 1.13.0a0+d0d6b1f)
PyTorch-TensorRT repo: https://github.com/pytorch/TensorRT.git, release: v1.2.0
FasterTransformer repo: https://github.com/NVIDIA/FasterTransformer.git, release: v5.1.1 bug fix
Benchmark Results
ResNet50 - fp16, Average QPS (image per sec)
BERT-base - fp16, Average QPS (sequence per sec)
ViT (224x224, patch 16) - fp16, Average QPS (image per sec)
Stable Diffusion, Average Latency (ms)
UNet in Stable Diffusion - fp16, Average Latency (ms)
Instructions for benchmarking AIT+CK on AMD MI250
AIT+CK Setup
Clone the AITemplate and CK repos
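With the repo URLs and commits pinned in the System Information section above, the clone step can be sketched as:

```shell
# Clone both repos and check out the exact commits listed under
# System Information.
git clone https://github.com/ROCmSoftwarePlatform/AITemplate.git
git -C AITemplate checkout f940d9b7ac8b976fba127e2c269dc5b368f30e4e

git clone https://github.com/ROCmSoftwarePlatform/composable_kernel.git
git -C composable_kernel checkout 40942b9
```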
Build the docker image
Run the docker container
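A sketch of the build/run step, assuming the `docker/build.sh` helper shipped in the AITemplate repo and the usual ROCm device mappings; the script argument and image tag here are assumptions, so check the repo's `docker/` directory:

```shell
# Build the ROCm docker image (helper script and argument are assumptions;
# verify against the docker/ directory of the AITemplate repo).
cd AITemplate
bash docker/build.sh rocm

# Run the container with the ROCm devices exposed so the MI250 GPUs
# are visible inside it.
docker run -it \
  --device=/dev/kfd --device=/dev/dri \
  --group-add video \
  --security-opt seccomp=unconfined \
  --ipc=host --shm-size=8G \
  -v "$PWD":/workspace \
  ait:latest /bin/bash
```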
Set environment variables
Clean up and reinstall AIT
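The exact environment variables depend on the container setup; the clean-up/reinstall step typically follows the standard AITemplate from-source install flow (paths and wheel name are assumptions):

```shell
# Rebuild and reinstall the AITemplate Python package from the
# checked-out source.
cd AITemplate/python
rm -rf build dist                        # clean up previous build artifacts
python3 setup.py bdist_wheel
pip3 install --force-reinstall dist/aitemplate-*.whl
```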
Benchmarks of ResNet50
Benchmarks of ViT
Benchmarks of BERT
Benchmarks of Stable Diffusion
Benchmarks of UNet
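Example AIT benchmark invocations for the models above, run inside the container. The directory and script names follow the layout of the AITemplate `examples/` tree and are assumptions; verify them against the checked-out repo:

```shell
cd AITemplate

# One benchmark script per model (paths are assumptions).
python3 examples/01_resnet-50/benchmark_ait.py      # ResNet50
python3 examples/04_vit/benchmark_ait.py            # ViT (224x224, patch 16)
python3 examples/03_bert/benchmark_ait.py           # BERT-base

# Stable Diffusion: compile the pipeline once, then benchmark it;
# the UNet timing is assumed to come from the same example.
python3 examples/05_stable_diffusion/compile.py
python3 examples/05_stable_diffusion/benchmark.py
```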
Instructions for benchmarking TRT/FT on NVIDIA platforms
Benchmarks of ResNet50 and ViT with TensorRT on A100
Run docker
Clone PyTorch-TensorRT Repo
Install dependencies
Apply the following fix to tools/perf/perf_run.py
Run benchmarks
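Under the versions listed above (CUDA 11.8.0, TensorRT 8.5.0.12, PyTorch 1.13.0a0+d0d6b1f, which correspond to NVIDIA's 22.10 NGC PyTorch image), the TensorRT flow can be sketched as follows; the `perf_run.py` flags are assumptions, so check `tools/perf/README.md` in the repo:

```shell
# Start the NGC container matching the listed CUDA/TRT/PyTorch versions.
docker run --gpus all -it --ipc=host nvcr.io/nvidia/pytorch:22.10-py3

# Inside the container: clone the repo at the benchmarked release and
# install the perf-tool dependencies.
git clone -b v1.2.0 https://github.com/pytorch/TensorRT.git
cd TensorRT/tools/perf
pip install -r requirements.txt

# Example run (flag names are assumptions; see tools/perf/README.md).
python perf_run.py --backends tensorrt --model resnet50 --precision fp16 --batch_size 256
```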
Benchmarks of BERT with FasterTransformer on A100
Run docker
Clone FasterTransformer repo
Build FasterTransformer
Run benchmarks
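The FasterTransformer steps above can be sketched as follows; the release tag, CMake options, and benchmark arguments (batch size, sequence length, heads, head size, fp16 flag) are assumptions based on the FasterTransformer README, so check the repo's BERT docs:

```shell
# Same NGC container as the TensorRT runs.
docker run --gpus all -it --ipc=host nvcr.io/nvidia/pytorch:22.10-py3

# Clone FasterTransformer and check out the v5.1.1 bug-fix release
# (exact tag name is an assumption; see the repo's releases page).
git clone https://github.com/NVIDIA/FasterTransformer.git
cd FasterTransformer && git checkout v5.1.1

# Build for A100 (SM 80).
mkdir -p build && cd build
cmake -DSM=80 -DCMAKE_BUILD_TYPE=Release ..
make -j"$(nproc)"

# BERT benchmark: tune the GEMMs first, then time the model
# (positional arguments are assumptions; see the bert example docs).
./bin/bert_gemm 32 384 12 64 1 0
./bin/bert_example 32 12 384 12 64 1 0
```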
Benchmarks of UNet with TensorRT on A100
Source: https://www.photoroom.com/tech/stable-diffusion-25-percent-faster-and-save-seconds/
Run docker
Install dependencies
Create unet_export.py (change bs = 1 for different batch sizes; replace ACCESS_TOKEN with your Hugging Face token)
Build TensorRT engine for UNet
Run TensorRT engine for UNet
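Assuming the unet_export.py from the PhotoRoom post linked above (which exports the diffusers UNet to ONNX), the build/run steps map onto trtexec like this; the output file names are placeholders:

```shell
# Dependencies for the export script (versions unpinned here).
pip install diffusers transformers onnx

# Export the UNet to ONNX (set bs and your Hugging Face token in the
# script first, as noted above).
python unet_export.py                      # assumed to write unet.onnx

# Build an fp16 TensorRT engine from the ONNX file, then time it.
trtexec --onnx=unet.onnx --fp16 --saveEngine=unet.plan
trtexec --loadEngine=unet.plan --fp16
```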