Performance Benchmarks of CK #485
zjing14 announced in Announcements
Benchmarks of AIT+CK on AMD MI250 GPUs vs. TensorRT (TRT) and FasterTransformer (FT) on NVIDIA A100 GPUs
System Information
(2 sockets, 64 cores per socket, 2 threads per core),
configured with 1 NUMA node per socket
AIT repo: https://github.com/ROCmSoftwarePlatform/AITemplate.git,
commit: f940d9b7ac8b976fba127e2c269dc5b368f30e4e
CK repo: https://github.com/ROCmSoftwarePlatform/composable_kernel.git,
commit: 40942b9
(2 sockets, 64 cores per socket, 2 threads per core),
configured with 4 NUMA nodes per socket
(NVIDIA CUDA 11.8.0, TensorRT 8.5.0.12, PyTorch 1.13.0a0+d0d6b1f)
PyTorch-TensorRT repo: https://github.com/pytorch/TensorRT.git, release: v1.2.0
FasterTransformer repo: https://github.com/NVIDIA/FasterTransformer.git, release: v5.1.1 bug fix
Benchmark Results
ResNet50 - fp16, Average QPS (image per sec)
BERT-base - fp16, Average QPS (sequence per sec)
ViT (224x224, patch 16) - fp16, Average QPS (image per sec)
Stable Diffusion, Average Latency (ms)
UNet in Stable Diffusion - fp16, Average Latency (ms)
Instructions for benchmarking AIT+CK on AMD MI250
AIT+CK Setup
Clone the AITemplate and CK repos
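With the repo URLs and commits pinned in the System Information section above, the clone step can be sketched as:

```shell
# Clone both repos and check out the exact commits listed under
# System Information.
git clone https://github.com/ROCmSoftwarePlatform/AITemplate.git
git -C AITemplate checkout f940d9b7ac8b976fba127e2c269dc5b368f30e4e

git clone https://github.com/ROCmSoftwarePlatform/composable_kernel.git
git -C composable_kernel checkout 40942b9
```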
Build the docker image
Run the docker container
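A sketch of the build/run step, assuming the `docker/build.sh` helper shipped in the AITemplate repo and the usual ROCm device mappings; the script argument and image tag here are assumptions, so check the repo's `docker/` directory:

```shell
# Build the ROCm docker image (helper script and argument are assumptions;
# verify against the docker/ directory of the AITemplate repo).
cd AITemplate
bash docker/build.sh rocm

# Run the container with the ROCm devices exposed so the MI250 GPUs
# are visible inside it.
docker run -it \
  --device=/dev/kfd --device=/dev/dri \
  --group-add video \
  --security-opt seccomp=unconfined \
  --ipc=host --shm-size=8G \
  -v "$PWD":/workspace \
  ait:latest /bin/bash
```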
Set environment variables
Clean up and reinstall AIT
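The exact environment variables depend on the container setup; the clean-up/reinstall step typically follows the standard AITemplate from-source install flow (paths and wheel name are assumptions):

```shell
# Rebuild and reinstall the AITemplate Python package from the
# checked-out source.
cd AITemplate/python
rm -rf build dist                        # clean up previous build artifacts
python3 setup.py bdist_wheel
pip3 install --force-reinstall dist/aitemplate-*.whl
```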
Benchmarks of ResNet50
Benchmarks of ViT
Benchmarks of BERT
Benchmarks of Stable Diffusion
Benchmarks of UNet
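Example AIT benchmark invocations for the models above, run inside the container. The directory and script names follow the layout of the AITemplate `examples/` tree and are assumptions; verify them against the checked-out repo:

```shell
cd AITemplate

# One benchmark script per model (paths are assumptions).
python3 examples/01_resnet-50/benchmark_ait.py      # ResNet50
python3 examples/04_vit/benchmark_ait.py            # ViT (224x224, patch 16)
python3 examples/03_bert/benchmark_ait.py           # BERT-base

# Stable Diffusion: compile the pipeline once, then benchmark it;
# the UNet timing is assumed to come from the same example.
python3 examples/05_stable_diffusion/compile.py
python3 examples/05_stable_diffusion/benchmark.py
```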
Instructions for benchmarking TRT/FT on NVIDIA platforms
Benchmarks of ResNet50 and ViT with TensorRT on A100
Run docker
Clone PyTorch-TensorRT Repo
Install dependencies
Apply the following fix to tools/perf/perf_run.py
Run benchmarks
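Under the versions listed above (CUDA 11.8.0, TensorRT 8.5.0.12, PyTorch 1.13.0a0+d0d6b1f, which correspond to NVIDIA's 22.10 NGC PyTorch image), the TensorRT flow can be sketched as follows; the `perf_run.py` flags are assumptions, so check `tools/perf/README.md` in the repo:

```shell
# Start the NGC container matching the listed CUDA/TRT/PyTorch versions.
docker run --gpus all -it --ipc=host nvcr.io/nvidia/pytorch:22.10-py3

# Inside the container: clone the repo at the benchmarked release and
# install the perf-tool dependencies.
git clone -b v1.2.0 https://github.com/pytorch/TensorRT.git
cd TensorRT/tools/perf
pip install -r requirements.txt

# Example run (flag names are assumptions; see tools/perf/README.md).
python perf_run.py --backends tensorrt --model resnet50 --precision fp16 --batch_size 256
```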
Benchmarks of BERT with FasterTransformer on A100
Run docker
Clone FasterTransformer repo
Build FasterTransformer
Run benchmarks
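The FasterTransformer steps above can be sketched as follows; the release tag, CMake options, and benchmark arguments (batch size, sequence length, heads, head size, fp16 flag) are assumptions based on the FasterTransformer README, so check the repo's BERT docs:

```shell
# Same NGC container as the TensorRT runs.
docker run --gpus all -it --ipc=host nvcr.io/nvidia/pytorch:22.10-py3

# Clone FasterTransformer and check out the v5.1.1 bug-fix release
# (exact tag name is an assumption; see the repo's releases page).
git clone https://github.com/NVIDIA/FasterTransformer.git
cd FasterTransformer && git checkout v5.1.1

# Build for A100 (SM 80).
mkdir -p build && cd build
cmake -DSM=80 -DCMAKE_BUILD_TYPE=Release ..
make -j"$(nproc)"

# BERT benchmark: tune the GEMMs first, then time the model
# (positional arguments are assumptions; see the bert example docs).
./bin/bert_gemm 32 384 12 64 1 0
./bin/bert_example 32 12 384 12 64 1 0
```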
Benchmarks of UNet with TensorRT on A100
Source: https://www.photoroom.com/tech/stable-diffusion-25-percent-faster-and-save-seconds/
Run docker
Install dependencies
Create unet_export.py (change bs = 1 for different batch sizes; replace ACCESS_TOKEN with your Hugging Face token)
Build TensorRT engine for UNet
Run TensorRT engine for UNet
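Assuming the unet_export.py from the PhotoRoom post linked above (which exports the diffusers UNet to ONNX), the build/run steps map onto trtexec like this; the output file names are placeholders:

```shell
# Dependencies for the export script (versions unpinned here).
pip install diffusers transformers onnx

# Export the UNet to ONNX (set bs and your Hugging Face token in the
# script first, as noted above).
python unet_export.py                      # assumed to write unet.onnx

# Build an fp16 TensorRT engine from the ONNX file, then time it.
trtexec --onnx=unet.onnx --fp16 --saveEngine=unet.plan
trtexec --loadEngine=unet.plan --fp16
```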