Performance Benchmarks
Our design philosophy is to provide high-quality compact models that balance model capacity and adequate quality.
It is widely known (and confirmed by our own research) that you can get incremental improvements by scaling your model 2x, 5x, or 10x. But we firmly believe that performance gains should be achieved on a similar or lower computation budget.
There are novel techniques that might enable packing models close to EE performance into packages as small as 10-20 MB, but for now the lowest we could achieve was about 50 MB for EE models.
It depends on the particular model, but typically we offer the following approximate model sizes:
- `small`, 50-200 MB;
- `xsmall`, 50-100 MB, 20-30 MB quantized;
- `xxsmall`, 25-50 MB, 10-15 MB quantized;
- `large`, 300-500+ MB;
It is customary to publish Multiply-Adds or FLOPS as a measure of compute required, but we prefer just sharing model sizes and tests on commodity hardware.
All of the benchmarks and estimates below were run on 6 cores (12 threads) of an AMD Ryzen Threadripper 1920X 12-Core Processor (3500 MHz). Scale accordingly for your device. The tests are run as is using native PyTorch, without any clever batching or concurrency techniques. You are welcome to submit your test results!
Test procedure:
- We take 100 10-second audio files
- Split into batches of 1, 5, 10, 25 files
- Measure how long it takes to process a batch of a given size on CPU
- On GPU our models are so fast that batch size and audio length do not really matter (in practical cases)
- We measure how many seconds of audio one processor core can process per second. This is similar to `1 / RTF` per core
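The procedure above can be sketched as a small timing harness. This is a minimal illustration, not the script used for the published numbers: `fake_model`, the file names, and both helper functions are hypothetical placeholders, and a real run would call the actual model on real 10-second files.

```python
import time
from typing import Callable, Sequence


def per_core_throughput(num_files: int, seconds_per_file: float,
                        elapsed_seconds: float, num_cores: int) -> float:
    """Seconds of audio processed per wall-clock second per core."""
    total_audio_seconds = num_files * seconds_per_file
    return total_audio_seconds / elapsed_seconds / num_cores


def time_batch(process_batch: Callable[[Sequence], object],
               batch: Sequence) -> float:
    """Wall-clock seconds spent in one call of `process_batch` on `batch`."""
    start = time.perf_counter()
    process_batch(batch)
    return time.perf_counter() - start


if __name__ == "__main__":
    def fake_model(batch):
        # Placeholder workload (assumption): stands in for a real model call.
        time.sleep(0.01 * len(batch))

    files = ["file_%03d.wav" % i for i in range(100)]  # hypothetical names
    for batch_size in (1, 5, 10, 25):
        batch = files[:batch_size]
        elapsed = time_batch(fake_model, batch)
        # 10.0 s per file and 6 cores mirror the setup described above.
        print(batch_size, per_core_throughput(len(batch), 10.0, elapsed, 6))
```

Dividing by the number of cores is what makes the figure comparable across machines with different core counts.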
We report results for the following types of models:
- FP32 (baseline)
- FP32 + Fused (CE v1)
- FP32 + INT8
- FP32 Fused + INT8
- Full INT8 + Fused (EE, small)
- Best / `xsmall` (EE, xsmall, quantized, compiled, further improved and optimized)
- `xxsmall` - cutting edge model, used in EE distros
Seconds of audio per second per core (`1 / RTF` per core):
Batch size | FP32 | FP32 + Fused | FP32 + INT8 | FP32 Fused + INT8 | Full INT8 + Fused | New Best (xsmall) | xxsmall |
---|---|---|---|---|---|---|---|
1 | 7.7 | 8.8 | 8.8 | 9.1 | 11.0 | 22.6 | 33.8 |
5 | 11.8 | 13.6 | 13.6 | 15.6 | 17.5 | 29.8 | 50.4 |
10 | 12.8 | 14.6 | 14.6 | 16.7 | 18.0 | 29.3 | 53.3 |
25 | 12.9 | 14.9 | 14.9 | 17.9 | 18.7 | 29.8 | 49.2 |
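To read the table as relative gains, each column can be divided by the FP32 baseline of the same row. The sketch below only re-uses the numbers from the table; the function name and data layout are illustrative choices.

```python
# Column names and values copied from the table above.
COLUMNS = [
    "FP32", "FP32 + Fused", "FP32 + INT8", "FP32 Fused + INT8",
    "Full INT8 + Fused", "New Best (xsmall)", "xxsmall",
]
RESULTS = {
    1: [7.7, 8.8, 8.8, 9.1, 11.0, 22.6, 33.8],
    5: [11.8, 13.6, 13.6, 15.6, 17.5, 29.8, 50.4],
    10: [12.8, 14.6, 14.6, 16.7, 18.0, 29.3, 53.3],
    25: [12.9, 14.9, 14.9, 17.9, 18.7, 29.8, 49.2],
}


def speedup_over_baseline(batch_size: int) -> dict:
    """Throughput of every configuration relative to FP32 at this batch size."""
    row = RESULTS[batch_size]
    baseline = row[0]  # FP32 is the first column
    return {name: round(value / baseline, 2) for name, value in zip(COLUMNS, row)}


if __name__ == "__main__":
    for batch_size in RESULTS:
        print(batch_size, speedup_over_baseline(batch_size))
```

For example, at batch size 1 the `xxsmall` model processes about 4.4x as much audio per core as the FP32 baseline (33.8 versus 7.7).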
We have not yet decided which speed improvements should trickle down from EE to CE, and for which languages.
Sizing | Minimal | Recommended |
---|---|---|
Disk | NVME, 256+ GB | NVME, 256+ GB |
RAM | 16 GB | 16 GB |
CPU cores | 8+ | 12+ |
Core frequency | 3+ GHz | 3.5+ GHz |
Hyper threading | + | + |
AVX2 instructions | Not necessary | Not necessary |
Compatible GPUs | (*) | (*) |
GPU count | 1 | 1 |
Metrics | 8 "threads" | 16 "threads" |
Mean latency, ms | 280 | 320 |
95 percentile, ms | 430 | 476 |
99 percentile, ms | 520 | 592 |
Files per 1000 ms | 25.0 | 43.4 |
Files per 500 ms | 12.5 | 21.7 |
1 / RTF | 85.6 | 145.0 |
Billing / gRPC threads | 12 - 18 | 22 - 30 |
1 / RTF / CPU cores | 10.7 | 12.1 |
(*) Suitable GPUs:
- Any Nvidia GPU above the 1070 with 8+ GB RAM (blower fan);
- Any single-slot Nvidia Quadro with 8+ GB RAM (TDP 100 - 150 W) (blower fan or passive);
- Nvidia Tesla T4 (passive), TDP 75 W;
Sizing | Minimal | Recommended |
---|---|---|
Disk | SSD, 256+ GB | SSD, 256+ GB |
RAM | 16 GB | 16 GB |
CPU cores | 8+ | 12+ |
Core frequency | 3.5+ GHz | 3.5+ GHz |
Hyper threading | + | + |
AVX2 instructions | + | + |
Metrics | 8 "threads" | 16 "threads" |
Mean latency, ms | 320 | 470 |
95 percentile, ms | 580 | 760 |
99 percentile, ms | 720 | 890 |
Files per 1000 ms | 11.1 | 15.9 |
Files per 500 ms | 5.6 | 8.0 |
1 / RTF | 37.0 | 53.0 |
Billing / gRPC threads | 6 - 9 | 8 - 10 |
1 / RTF / CPU cores | 4.6 | 4.4 |
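The last row of each sizing table can be re-derived from the `1 / RTF` row. The sketch below assumes that the divisor is the listed minimum core count (8 for Minimal, 12 for Recommended); that assumption is not stated in the tables, but it matches the published figures to within rounding.

```python
# (1 / RTF, CPU cores, published "1 / RTF / CPU cores") copied from the
# two sizing tables above. Using 8 and 12 as divisors is an assumption.
SIZING_ROWS = {
    "first table, minimal": (85.6, 8, 10.7),
    "first table, recommended": (145.0, 12, 12.1),
    "second table, minimal": (37.0, 8, 4.6),
    "second table, recommended": (53.0, 12, 4.4),
}


def inverse_rtf_per_core(inverse_rtf: float, cpu_cores: int) -> float:
    """Seconds of audio processed per second, per CPU core."""
    return inverse_rtf / cpu_cores


if __name__ == "__main__":
    for name, (inverse_rtf, cores, published) in SIZING_ROWS.items():
        derived = inverse_rtf_per_core(inverse_rtf, cores)
        print(f"{name}: derived {derived:.3f}, published {published}")
```

The same kind of check works for the throughput rows: "Files per 500 ms" is half of "Files per 1000 ms" in both tables.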
Speed is the next important defining property of the model, and to measure the speed of synthesis we use the following simple metrics:
- RTF (Real Time Factor) - time the synthesis takes divided by audio duration;
- RTS = 1 / RTF (Real Time Speed) - how much the synthesis is "faster" than realtime;
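As a minimal sketch of the two metrics defined above (the function names are illustrative, not part of any library):

```python
def rtf(synthesis_seconds: float, audio_seconds: float) -> float:
    """Real Time Factor: synthesis time divided by audio duration."""
    if synthesis_seconds <= 0 or audio_seconds <= 0:
        raise ValueError("both durations must be positive")
    return synthesis_seconds / audio_seconds


def rts(synthesis_seconds: float, audio_seconds: float) -> float:
    """Real Time Speed: 1 / RTF, i.e. how many times faster than realtime."""
    return 1.0 / rtf(synthesis_seconds, audio_seconds)


if __name__ == "__main__":
    # Example: 0.5 s of compute to synthesize 10 s of audio.
    print(rtf(0.5, 10.0), rts(0.5, 10.0))
```

An RTF below 1 (equivalently, an RTS above 1) means the synthesis runs faster than realtime.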
We benchmarked the models on two devices using PyTorch 1.8 utils:
- CPU - Intel i7-6800K CPU @ 3.40GHz;
- GPU - 1080 Ti;
- When measuring CPU performance, we also limited the number of threads used;
For the 16 kHz models we got the following metrics:
BatchSize | Device | RTF | RTS |
---|---|---|---|
1 | CPU 1 thread | 0.7 | 1.4 |
1 | CPU 2 threads | 0.4 | 2.3 |
1 | CPU 4 threads | 0.3 | 3.1 |
4 | CPU 1 thread | 0.5 | 2.0 |
4 | CPU 2 threads | 0.3 | 3.2 |
4 | CPU 4 threads | 0.2 | 4.9 |
--- | ----------- | --- | --- |
1 | GPU | 0.06 | 16.9 |
4 | GPU | 0.02 | 51.7 |
8 | GPU | 0.01 | 79.4 |
16 | GPU | 0.008 | 122.9 |
32 | GPU | 0.006 | 161.2 |
--- | ----------- | --- | --- |
For the 8 kHz models we got the following metrics:
BatchSize | Device | RTF | RTS |
---|---|---|---|
1 | CPU 1 thread | 0.5 | 1.9 |
1 | CPU 2 threads | 0.3 | 3.0 |
1 | CPU 4 threads | 0.2 | 4.2 |
4 | CPU 1 thread | 0.4 | 2.8 |
4 | CPU 2 threads | 0.2 | 4.4 |
4 | CPU 4 threads | 0.1 | 6.6 |
--- | ----------- | --- | --- |
1 | GPU | 0.06 | 17.5 |
4 | GPU | 0.02 | 55.0 |
8 | GPU | 0.01 | 92.1 |
16 | GPU | 0.007 | 147.7 |
32 | GPU | 0.004 | 227.5 |
--- | ----------- | --- | --- |
A number of things surprised us during benchmarking:
- AMD processors performed much worse;
- The bottleneck in our case was Tacotron, not the vocoder (there is still a lot of potential to make the whole model 3-4x faster, maybe even 10x);
- Using more than 4 CPU threads or a batch size larger than 4 does not help;
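One way to see how quickly extra CPU threads stop paying off is to compute the parallel efficiency from the batch size 1 CPU rows of the tables above, that is, the speedup over one thread divided by the thread count. This is only a rough sketch on the published (rounded) RTS values; the function name is illustrative.

```python
# RTS at batch size 1, copied from the CPU rows of the two tables above.
RTS_BATCH_1 = {
    "16 kHz": {1: 1.4, 2: 2.3, 4: 3.1},
    "8 kHz": {1: 1.9, 2: 3.0, 4: 4.2},
}


def scaling_efficiency(rts_by_threads: dict) -> dict:
    """Speedup over the single-thread run, divided by the thread count."""
    single_thread = rts_by_threads[1]
    return {
        threads: round(value / single_thread / threads, 2)
        for threads, value in rts_by_threads.items()
    }


if __name__ == "__main__":
    for model, rts_by_threads in RTS_BATCH_1.items():
        print(model, scaling_efficiency(rts_by_threads))
```

On these numbers the efficiency is already down to roughly 0.55 at 4 threads for both sample rates, so each added thread contributes noticeably less than the first one.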