
whisper : add full CUDA and Metal offloading #1472

Merged 21 commits into master on Nov 12, 2023

Conversation

@ggerganov (Owner) commented Nov 10, 2023

Build with:

# make
WHISPER_CUBLAS=1 make -j

# cmake
cmake -DWHISPER_CUBLAS=1 ../

Also, the convolution ops are now offloaded with both CUDA and Metal, resulting in a speed-up in the Encoder (#1473)
Credits and huge thanks to @FSSRepo: ggerganov/ggml#564

If you want to have some fun, try this:

# get the models:
./models/download-ggml-model.sh medium.en
wget https://huggingface.co/TheBloke/Llama-2-13B-GGUF/resolve/main/llama-2-13b.Q4_K_M.gguf -O ./models/llama-13b-v2-q4_k_m.gguf

---

# NVIDIA (16GB VRAM required)
WHISPER_CUBLAS=1 make -j talk-llama
./talk-llama -mw ./models/ggml-medium.en.bin -ml ./models/llama-13b-v2-q4_k_m.gguf -p "John" -t 8

---

# Apple (Metal)
make -j talk-llama
./talk-llama -mw ./models/ggml-medium.en.bin -ml ./models/llama-13b-v2-q4_k_m.gguf -p "John" -t 8

---

# Apple (CoreML + Metal)
./models/generate-coreml-model.sh medium.en
WHISPER_COREML=1 make -j talk-llama
./talk-llama -mw ./models/ggml-medium.en.bin -ml ./models/llama-13b-v2-q4_k_m.gguf -p "John" -t 8

Bench on V100 and M2 Ultra

./extra/bench-all.sh 
Usage: ./bench.sh [n_threads] [encoder-only]

Running memcpy benchmark

memcpy: 9.55 GB/s (1 thread)
sum:    -536869898.000000

Running ggml_mul_mat benchmark with 4 threads

ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
  Device 0: Tesla V100-PCIE-16GB, compute capability 7.0
  64 x   64: Q4_0     2.6 GFLOPS (128 runs) | Q4_1     2.6 GFLOPS (128 runs)
  64 x   64: Q5_0     2.7 GFLOPS (128 runs) | Q5_1     2.7 GFLOPS (128 runs) | Q8_0     2.7 GFLOPS (128 runs)
  64 x   64: F16      2.8 GFLOPS (128 runs) | F32      2.8 GFLOPS (128 runs)
 128 x  128: Q4_0    19.6 GFLOPS (128 runs) | Q4_1    19.0 GFLOPS (128 runs)
 128 x  128: Q5_0    20.0 GFLOPS (128 runs) | Q5_1    19.4 GFLOPS (128 runs) | Q8_0    20.1 GFLOPS (128 runs)
 128 x  128: F16     20.1 GFLOPS (128 runs) | F32     20.4 GFLOPS (128 runs)
 256 x  256: Q4_0   100.4 GFLOPS (128 runs) | Q4_1   125.4 GFLOPS (128 runs)
 256 x  256: Q5_0   126.1 GFLOPS (128 runs) | Q5_1   124.7 GFLOPS (128 runs) | Q8_0   125.8 GFLOPS (128 runs)
 256 x  256: F16    126.3 GFLOPS (128 runs) | F32     83.7 GFLOPS (128 runs)
 512 x  512: Q4_0   418.4 GFLOPS (128 runs) | Q4_1   508.4 GFLOPS (128 runs)
 512 x  512: Q5_0   508.8 GFLOPS (128 runs) | Q5_1   481.6 GFLOPS (128 runs) | Q8_0   505.6 GFLOPS (128 runs)
 512 x  512: F16    493.2 GFLOPS (128 runs) | F32    432.7 GFLOPS (128 runs)
1024 x 1024: Q4_0  1824.2 GFLOPS (128 runs) | Q4_1  1828.2 GFLOPS (128 runs)
1024 x 1024: Q5_0  1782.9 GFLOPS (128 runs) | Q5_1  1658.3 GFLOPS (128 runs) | Q8_0  1627.1 GFLOPS (128 runs)
1024 x 1024: F16   1570.5 GFLOPS (128 runs) | F32   1326.5 GFLOPS (128 runs)
2048 x 2048: Q4_0  4511.6 GFLOPS (128 runs) | Q4_1  4620.4 GFLOPS (128 runs)
2048 x 2048: Q5_0  4580.0 GFLOPS (128 runs) | Q5_1  4445.6 GFLOPS (128 runs) | Q8_0  4302.6 GFLOPS (128 runs)
2048 x 2048: F16   3860.2 GFLOPS (128 runs) | F32   2686.4 GFLOPS (128 runs)
4096 x 4096: Q4_0  8142.4 GFLOPS ( 60 runs) | Q4_1  8071.0 GFLOPS ( 59 runs)
4096 x 4096: Q5_0  8094.0 GFLOPS ( 59 runs) | Q5_1  8068.9 GFLOPS ( 59 runs) | Q8_0  7546.4 GFLOPS ( 55 runs)
4096 x 4096: F16   6807.1 GFLOPS ( 50 runs) | F32   4301.2 GFLOPS ( 32 runs)
GPU OS Config Model Th Enc. Dec. PP Commit
NVIDIA V100 Ubuntu AVX2 BLAS CUDA tiny 1 8.85 1.86 4.31 9c1ddc7
NVIDIA V100 Ubuntu AVX2 BLAS CUDA tiny-q5_0 1 8.54 1.37 4.19 9c1ddc7
NVIDIA V100 Ubuntu AVX2 BLAS CUDA tiny-q5_1 1 8.46 1.33 4.22 9c1ddc7
NVIDIA V100 Ubuntu AVX2 BLAS CUDA base 1 14.90 2.55 5.87 9c1ddc7
NVIDIA V100 Ubuntu AVX2 BLAS CUDA base-q5_0 1 15.56 1.82 6.37 9c1ddc7
NVIDIA V100 Ubuntu AVX2 BLAS CUDA base-q5_1 1 15.16 1.78 5.94 9c1ddc7
NVIDIA V100 Ubuntu AVX2 BLAS CUDA small 1 40.54 4.77 12.61 9c1ddc7
NVIDIA V100 Ubuntu AVX2 BLAS CUDA small-q5_0 1 41.37 3.32 13.87 9c1ddc7
NVIDIA V100 Ubuntu AVX2 BLAS CUDA small-q5_1 1 41.32 3.34 13.31 9c1ddc7
NVIDIA V100 Ubuntu AVX2 BLAS CUDA medium 1 105.45 10.40 28.88 9c1ddc7
NVIDIA V100 Ubuntu AVX2 BLAS CUDA medium-q5_0 1 107.67 6.46 30.69 9c1ddc7
NVIDIA V100 Ubuntu AVX2 BLAS CUDA medium-q5_1 1 108.00 6.89 30.81 9c1ddc7
NVIDIA V100 Ubuntu AVX2 BLAS CUDA large 1 172.67 16.00 45.24 9c1ddc7
NVIDIA V100 Ubuntu AVX2 BLAS CUDA large-q5_0 1 177.31 8.93 49.94 9c1ddc7
NVIDIA V100 Ubuntu AVX2 BLAS CUDA large-q5_1 1 177.64 8.81 49.76 9c1ddc7
CPU OS Config Model Th Enc. Dec. PP Commit
M2 Ultra MacOS 14.1 COREML METAL tiny 4 7.74 1.38 3.40 997f7cb
M2 Ultra MacOS 14.1 COREML METAL tiny-q5_0 4 6.61 1.37 3.19 997f7cb
M2 Ultra MacOS 14.1 COREML METAL tiny-q5_1 4 7.32 1.39 3.03 997f7cb
M2 Ultra MacOS 14.1 COREML METAL base 4 12.51 2.00 4.61 997f7cb
M2 Ultra MacOS 14.1 COREML METAL base-q5_0 4 11.82 1.91 4.73 997f7cb
M2 Ultra MacOS 14.1 COREML METAL base-q5_1 4 11.62 1.94 4.79 997f7cb
M2 Ultra MacOS 14.1 COREML METAL small 4 32.00 3.92 12.12 997f7cb
M2 Ultra MacOS 14.1 COREML METAL small-q5_0 4 33.15 3.89 13.73 997f7cb
M2 Ultra MacOS 14.1 COREML METAL small-q5_1 4 33.28 3.91 13.64 997f7cb
M2 Ultra MacOS 14.1 COREML METAL medium 4 93.84 8.26 30.16 997f7cb
M2 Ultra MacOS 14.1 COREML METAL medium-q5_0 4 96.74 7.99 33.90 997f7cb
M2 Ultra MacOS 14.1 COREML METAL medium-q5_1 4 96.46 8.12 33.67 997f7cb
M2 Ultra MacOS 14.1 COREML METAL large 4 179.61 11.72 53.73 997f7cb
M2 Ultra MacOS 14.1 COREML METAL large-q5_0 4 185.15 11.77 62.17 997f7cb
M2 Ultra MacOS 14.1 COREML METAL large-q5_1 4 185.08 11.69 61.98 997f7cb
CPU OS Config Model Th Enc. Dec. PP Commit
M2 Ultra MacOS 14.1 METAL tiny 4 12.47 1.37 3.08 997f7cb
M2 Ultra MacOS 14.1 METAL tiny-q5_0 4 12.16 1.34 2.91 997f7cb
M2 Ultra MacOS 14.1 METAL tiny-q5_1 4 12.46 1.37 2.93 997f7cb
M2 Ultra MacOS 14.1 METAL tiny-q8_0 4 10.84 1.32 2.81 997f7cb
M2 Ultra MacOS 14.1 METAL base 4 17.90 1.93 4.53 997f7cb
M2 Ultra MacOS 14.1 METAL base-q5_0 4 19.77 1.93 4.71 997f7cb
M2 Ultra MacOS 14.1 METAL base-q5_1 4 19.73 1.91 4.69 997f7cb
M2 Ultra MacOS 14.1 METAL base-q8_0 4 18.83 1.89 4.63 997f7cb
M2 Ultra MacOS 14.1 METAL small 4 50.79 3.97 12.13 997f7cb
M2 Ultra MacOS 14.1 METAL small-q4_0 4 53.50 3.69 12.88 997f7cb
M2 Ultra MacOS 14.1 METAL small-q4_1 4 53.41 3.66 12.88 997f7cb
M2 Ultra MacOS 14.1 METAL small-q5_0 4 57.16 3.95 13.70 997f7cb
M2 Ultra MacOS 14.1 METAL small-q5_1 4 56.82 3.97 13.62 997f7cb
M2 Ultra MacOS 14.1 METAL small-q8_0 4 53.14 3.73 12.97 997f7cb
M2 Ultra MacOS 14.1 METAL medium 4 138.55 8.28 30.04 997f7cb
M2 Ultra MacOS 14.1 METAL medium-q4_0 4 147.26 7.26 31.62 997f7cb
M2 Ultra MacOS 14.1 METAL medium-q4_1 4 147.48 7.52 31.76 997f7cb
M2 Ultra MacOS 14.1 METAL medium-q5_0 4 159.11 8.02 33.83 997f7cb
M2 Ultra MacOS 14.1 METAL medium-q5_1 4 158.79 8.14 33.66 997f7cb
M2 Ultra MacOS 14.1 METAL medium-q8_0 4 146.50 7.82 32.16 997f7cb
M2 Ultra MacOS 14.1 METAL large 4 247.72 11.71 53.67 997f7cb
M2 Ultra MacOS 14.1 METAL large-q4_0 4 263.48 10.62 57.08 997f7cb
M2 Ultra MacOS 14.1 METAL large-q4_1 4 262.32 10.56 57.09 997f7cb
M2 Ultra MacOS 14.1 METAL large-q5_0 4 285.42 11.84 62.21 997f7cb
M2 Ultra MacOS 14.1 METAL large-q5_1 4 284.08 11.65 62.00 997f7cb
M2 Ultra MacOS 14.1 METAL large-q8_0 4 262.82 11.29 57.51 997f7cb

* ggml : add CUDA support for ggml_conv

* whisper : remove ggml_repeat for conv bias + single backend

* cuda : fix im2col kernel

* metal : add im2col support + mul mat-vec f16 x f16

* bench-all : add q4 models

@ggerganov (Owner, author)

Looking for feedback with both CUDA and Metal - the performance should be significantly improved.

@ggerganov changed the title from "whisper : add full CUDA offloading" to "whisper : add full CUDA and Metal offloading" on Nov 10, 2023
@slaren (Collaborator) commented Nov 10, 2023

I am not very familiar with whisper.cpp, but these are my results using bench with a 3090 Ti under WSL. Let me know if you want me to run any other test.

PR
whisper_print_timings:     load time =   424.84 ms
whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:      mel time =     0.00 ms
whisper_print_timings:   sample time =     0.00 ms /     1 runs (    0.00 ms per run)
whisper_print_timings:   encode time =     9.48 ms /     1 runs (    9.48 ms per run)
whisper_print_timings:   decode time =   447.13 ms /   256 runs (    1.75 ms per run)
whisper_print_timings:   prompt time =   168.82 ms /    16 runs (   10.55 ms per run)
whisper_print_timings:    total time =   625.45 ms

ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090 Ti, compute capability 8.6
  64 x   64: Q4_0     3.9 GFLOPS (128 runs) | Q4_1     4.2 GFLOPS (128 runs)
  64 x   64: Q5_0     3.6 GFLOPS (128 runs) | Q5_1     3.4 GFLOPS (128 runs) | Q8_0     3.9 GFLOPS (128 runs)
  64 x   64: F16      3.8 GFLOPS (128 runs) | F32      3.5 GFLOPS (128 runs)
 128 x  128: Q4_0    27.9 GFLOPS (128 runs) | Q4_1    28.0 GFLOPS (128 runs)
 128 x  128: Q5_0    28.2 GFLOPS (128 runs) | Q5_1    28.0 GFLOPS (128 runs) | Q8_0    25.1 GFLOPS (128 runs)
 128 x  128: F16     25.2 GFLOPS (128 runs) | F32     25.1 GFLOPS (128 runs)
 256 x  256: Q4_0   171.0 GFLOPS (128 runs) | Q4_1   171.8 GFLOPS (128 runs)
 256 x  256: Q5_0   169.0 GFLOPS (128 runs) | Q5_1   172.1 GFLOPS (128 runs) | Q8_0   158.9 GFLOPS (128 runs)
 256 x  256: F16    152.0 GFLOPS (128 runs) | F32    155.1 GFLOPS (128 runs)
 512 x  512: Q4_0   651.6 GFLOPS (128 runs) | Q4_1   660.8 GFLOPS (128 runs)
 512 x  512: Q5_0   666.6 GFLOPS (128 runs) | Q5_1   662.9 GFLOPS (128 runs) | Q8_0   647.6 GFLOPS (128 runs)
 512 x  512: F16    615.5 GFLOPS (128 runs) | F32    549.1 GFLOPS (128 runs)
1024 x 1024: Q4_0  1945.6 GFLOPS (128 runs) | Q4_1  1944.4 GFLOPS (128 runs)
1024 x 1024: Q5_0  1928.3 GFLOPS (128 runs) | Q5_1  1900.2 GFLOPS (128 runs) | Q8_0  1784.3 GFLOPS (128 runs)
1024 x 1024: F16   1664.1 GFLOPS (128 runs) | F32   1403.1 GFLOPS (128 runs)
2048 x 2048: Q4_0  3900.1 GFLOPS (128 runs) | Q4_1  3947.6 GFLOPS (128 runs)
2048 x 2048: Q5_0  3900.6 GFLOPS (128 runs) | Q5_1  3840.9 GFLOPS (128 runs) | Q8_0  3745.9 GFLOPS (128 runs)
2048 x 2048: F16   3400.8 GFLOPS (128 runs) | F32   2619.8 GFLOPS (128 runs)
4096 x 4096: Q4_0  7538.0 GFLOPS ( 55 runs) | Q4_1  7372.8 GFLOPS ( 54 runs)
4096 x 4096: Q5_0  7445.0 GFLOPS ( 55 runs) | Q5_1  7437.4 GFLOPS ( 55 runs) | Q8_0  7143.9 GFLOPS ( 53 runs)
4096 x 4096: F16   6546.9 GFLOPS ( 48 runs) | F32   4950.8 GFLOPS ( 37 runs)
Master
whisper_print_timings:     load time =   409.18 ms
whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:      mel time =     0.00 ms
whisper_print_timings:   sample time =     0.00 ms /     1 runs (    0.00 ms per run)
whisper_print_timings:   encode time =   274.91 ms /     1 runs (  274.91 ms per run)
whisper_print_timings:   decode time =   530.36 ms /   256 runs (    2.07 ms per run)
whisper_print_timings:   prompt time =  1032.40 ms /    16 runs (   64.52 ms per run)
whisper_print_timings:    total time =  1837.76 ms

ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090 Ti, compute capability 8.6
  64 x   64: Q4_0     2.7 GFLOPS (128 runs) | Q4_1     3.2 GFLOPS (128 runs)
  64 x   64: Q5_0     3.6 GFLOPS (128 runs) | Q5_1     3.6 GFLOPS (128 runs) | Q8_0     3.6 GFLOPS (128 runs)
  64 x   64: F16      3.6 GFLOPS (128 runs) | F32      3.5 GFLOPS (128 runs)
 128 x  128: Q4_0    28.1 GFLOPS (128 runs) | Q4_1    27.7 GFLOPS (128 runs)
 128 x  128: Q5_0    29.2 GFLOPS (128 runs) | Q5_1    28.0 GFLOPS (128 runs) | Q8_0    24.0 GFLOPS (128 runs)
 128 x  128: F16     24.8 GFLOPS (128 runs) | F32     25.0 GFLOPS (128 runs)
 256 x  256: Q4_0   168.7 GFLOPS (128 runs) | Q4_1   171.0 GFLOPS (128 runs)
 256 x  256: Q5_0   171.0 GFLOPS (128 runs) | Q5_1   168.1 GFLOPS (128 runs) | Q8_0   156.1 GFLOPS (128 runs)
 256 x  256: F16    161.0 GFLOPS (128 runs) | F32    153.7 GFLOPS (128 runs)
 512 x  512: Q4_0   657.4 GFLOPS (128 runs) | Q4_1   655.8 GFLOPS (128 runs)
 512 x  512: Q5_0   635.9 GFLOPS (128 runs) | Q5_1   664.7 GFLOPS (128 runs) | Q8_0   647.8 GFLOPS (128 runs)
 512 x  512: F16    613.6 GFLOPS (128 runs) | F32    546.8 GFLOPS (128 runs)
1024 x 1024: Q4_0  1960.1 GFLOPS (128 runs) | Q4_1  1944.5 GFLOPS (128 runs)
1024 x 1024: Q5_0  1953.9 GFLOPS (128 runs) | Q5_1  1941.3 GFLOPS (128 runs) | Q8_0  1831.5 GFLOPS (128 runs)
1024 x 1024: F16   1674.1 GFLOPS (128 runs) | F32   1419.4 GFLOPS (128 runs)
2048 x 2048: Q4_0  4077.5 GFLOPS (128 runs) | Q4_1  3989.5 GFLOPS (128 runs)
2048 x 2048: Q5_0  3965.7 GFLOPS (128 runs) | Q5_1  3926.2 GFLOPS (128 runs) | Q8_0  3800.3 GFLOPS (128 runs)
2048 x 2048: F16   3417.4 GFLOPS (128 runs) | F32   2679.9 GFLOPS (128 runs)
4096 x 4096: Q4_0  7663.1 GFLOPS ( 56 runs) | Q4_1  7623.9 GFLOPS ( 56 runs)
4096 x 4096: Q5_0  7596.4 GFLOPS ( 56 runs) | Q5_1  7460.4 GFLOPS ( 55 runs) | Q8_0  7270.6 GFLOPS ( 53 runs)
4096 x 4096: F16   6621.9 GFLOPS ( 49 runs) | F32   5056.4 GFLOPS ( 37 runs)

@ggerganov (Owner, author)

Yup, the mul mat benchmark is not very relevant to this PR because it still copies the data to the GPU, performs the multiplication and copies the data back to the CPU. The changes here should not affect the performance of this test.

The bench-all.sh script will look for multilingual models in the models folder and bench them.
You can get the models by running:

./models/download-ggml-model.sh tiny
./models/download-ggml-model.sh base
./models/download-ggml-model.sh small
./models/download-ggml-model.sh medium
./models/download-ggml-model.sh large

@bobqianic (Collaborator)

Just tried out this PR on my RTX 3060 Mobile and it's incredibly fast: a 27-minute audio file was transcribed in just 25 seconds. Plus, the transcription quality is not degraded.

whisper_model_load: loading model
whisper_model_load: n_vocab       = 51865
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 512
whisper_model_load: n_audio_head  = 8
whisper_model_load: n_audio_layer = 6
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 512
whisper_model_load: n_text_head   = 8
whisper_model_load: n_text_layer  = 6
whisper_model_load: n_mels        = 80
whisper_model_load: ftype         = 1
whisper_model_load: qntvr         = 0
whisper_model_load: type          = 2 (base)
whisper_model_load: adding 1608 extra tokens
whisper_model_load: n_langs       = 99
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3060 Laptop GPU, compute capability 8.6
whisper_model_load: using CUDA backend
whisper_model_load:     CUDA buffer size =   149.41 MB
whisper_model_load: model size    =  149.33 MB
whisper_init_state: kv self size  =    5.25 MB
whisper_init_state: kv cross size =   17.58 MB
whisper_init_state: compute buffer (conv)   =   14.11 MB
whisper_init_state: compute buffer (encode) =   81.95 MB
whisper_init_state: compute buffer (cross)  =    4.49 MB
whisper_init_state: compute buffer (decode) =   24.70 MB

system_info: n_threads = 4 / 20 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | METAL = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 0 | VSX = 0 | CUDA = 1 | COREML = 0 | OPENVINO = 0 |

main: processing 'C:\Users\qianp\Downloads\whisper.cpp_build-fix\bin\Release\diffusion2023-07-03.wav' (26718958 samples, 1669.9 sec), 4 threads, 1 processors, lang = auto, task = transcribe, timestamps = 1 ...

whisper_full_with_state: auto-detected language: en (p = 0.969585)

[...]

whisper_print_timings:     load time =   833.29 ms
whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:      mel time =  1076.23 ms
whisper_print_timings:   sample time =  2914.18 ms /  5311 runs (    0.55 ms per run)
whisper_print_timings:   encode time =  1623.05 ms /    65 runs (   24.97 ms per run)
whisper_print_timings:   decode time = 18438.33 ms /  5248 runs (    3.51 ms per run)
whisper_print_timings:   prompt time =   592.44 ms /    64 runs (    9.26 ms per run)
whisper_print_timings:    total time = 25668.54 ms

@slaren (Collaborator) commented Nov 10, 2023

CPU OS Config Model Th Enc. Dec. PP Commit
3090Ti WSL AVX2 BLAS CUDA tiny 1 4.92 1.21 10.29 3bfc43e
3090Ti WSL AVX2 BLAS CUDA tiny-q5_0 1 5.22 1.04 10.82 3bfc43e
3090Ti WSL AVX2 BLAS CUDA tiny-q5_1 1 5.03 1.04 9.95 3bfc43e
3090Ti WSL AVX2 BLAS CUDA base 1 9.46 1.82 11.94 3bfc43e
3090Ti WSL AVX2 BLAS CUDA base-q5_0 1 9.31 1.50 12.30 3bfc43e
3090Ti WSL AVX2 BLAS CUDA base-q5_1 1 9.56 1.50 11.72 3bfc43e
3090Ti WSL AVX2 BLAS CUDA small 1 25.18 3.30 14.75 3bfc43e
3090Ti WSL AVX2 BLAS CUDA small-q5_0 1 29.09 2.74 15.25 3bfc43e
3090Ti WSL AVX2 BLAS CUDA small-q5_1 1 27.79 2.89 16.06 3bfc43e
3090Ti WSL AVX2 BLAS CUDA medium 1 71.69 6.82 25.11 3bfc43e
3090Ti WSL AVX2 BLAS CUDA medium-q5_0 1 72.64 5.37 29.76 3bfc43e
3090Ti WSL AVX2 BLAS CUDA medium-q5_1 1 74.90 5.45 28.03 3bfc43e
3090Ti WSL AVX2 BLAS CUDA large 1 120.91 9.73 37.65 3bfc43e
3090Ti WSL AVX2 BLAS CUDA large-q5_0 1 123.26 7.38 42.26 3bfc43e
3090Ti WSL AVX2 BLAS CUDA large-q5_1 1 120.69 7.38 42.57 3bfc43e
Master
CPU OS Config Model Th Enc. Dec. PP Commit
3090Ti WSL AVX2 BLAS tiny 8 139.19 1.32 39.58 ec7a6f0
3090Ti WSL AVX2 BLAS tiny-q5_0 8 141.64 0.62 36.45 ec7a6f0
3090Ti WSL AVX2 BLAS tiny-q5_1 8 157.43 0.71 35.53 ec7a6f0
3090Ti WSL AVX2 BLAS base 8 293.83 1.87 75.99 ec7a6f0
3090Ti WSL AVX2 BLAS base-q5_0 8 273.09 1.19 72.18 ec7a6f0
3090Ti WSL AVX2 BLAS base-q5_1 8 274.03 1.11 72.07 ec7a6f0
3090Ti WSL AVX2 BLAS small 8 822.09 5.43 232.15 ec7a6f0
3090Ti WSL AVX2 BLAS small-q5_0 8 826.08 3.13 206.77 ec7a6f0
3090Ti WSL AVX2 BLAS small-q5_1 8 793.31 3.05 211.41 ec7a6f0
3090Ti WSL AVX2 BLAS medium 8 2136.57 16.08 621.28 ec7a6f0
3090Ti WSL AVX2 BLAS medium-q5_0 8 2074.32 8.77 562.22 ec7a6f0
3090Ti WSL AVX2 BLAS medium-q5_1 8 2087.68 8.91 566.57 ec7a6f0
3090Ti WSL AVX2 BLAS large 8 3585.90 30.75 1080.71 ec7a6f0
3090Ti WSL AVX2 BLAS large-q5_0 8 3445.49 15.97 935.09 ec7a6f0
3090Ti WSL AVX2 BLAS large-q5_1 8 3436.17 17.08 946.02 ec7a6f0

@ggerganov (Owner, author)

Yes, the quantize-all.sh script is broken.

@slaren (Collaborator) commented Nov 10, 2023

Under native Windows I very rarely get an out-of-memory error in ggml-alloc. This is probably related to some allocation returning an unaligned memory address; I will look into it more tomorrow.

whisper_init_from_file_with_params_no_state: loading model from './models/ggml-tiny-q5_0.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51865
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 384
whisper_model_load: n_audio_head  = 6
whisper_model_load: n_audio_layer = 4
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 384
whisper_model_load: n_text_head   = 6
whisper_model_load: n_text_layer  = 4
whisper_model_load: n_mels        = 80
whisper_model_load: ftype         = 8
whisper_model_load: qntvr         = 2
whisper_model_load: type          = 1 (tiny)
whisper_model_load: adding 1608 extra tokens
whisper_model_load: n_langs       = 99
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090 Ti, compute capability 8.6
whisper_model_load: using CUDA backend
whisper_model_load:     CUDA buffer size =    34.59 MB
whisper_model_load: model size    =   34.53 MB
whisper_init_state: kv self size  =    2.62 MB
whisper_init_state: kv cross size =    8.79 MB
whisper_init_state: compute buffer (conv)   =   11.54 MB
whisper_init_state: compute buffer (encode) =   59.65 MB
whisper_init_state: compute buffer (cross)  =    3.76 MB
whisper_init_state: compute buffer (decode) =   18.92 MB

system_info: n_threads = 1 / 32 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | METAL = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 0 | VSX = 0 | CUDA = 1 | COREML = 0 | OPENVINO = 0 |
ggml_tallocr_alloc: not enough space in the buffer (needed 54000000, largest block available 51696128)
GGML_ASSERT: C:\CODE\whisper.cpp\ggml-alloc.c:116: !"not enough space in the buffer"

@dreness commented Nov 11, 2023

I see a notable improvement in encoder times from this PR - nice work :) I also noticed that with this PR, performance is pretty flat from 4 through 10 threads. With main @ ec7a6f0 there is a bit of improvement for me up through 8 threads, but even at 8 threads it's slower than this PR.

main @ ec7a6f0

CPU OS Config Model Th Enc. Dec. PP Commit
Apple M1 Max 14.1 NEON BLAS METAL tiny 4 27.62 1.48 4.54 ec7a6f0
Apple M1 Max 14.1 NEON BLAS METAL base 4 49.84 2.27 7.79 ec7a6f0
Apple M1 Max 14.1 NEON BLAS METAL small 4 137.73 4.85 21.98 ec7a6f0
Apple M1 Max 14.1 NEON BLAS METAL medium 4 360.75 9.94 57.96 ec7a6f0
Apple M1 Max 14.1 NEON BLAS METAL large 4 633.79 15.24 101.62 ec7a6f0
Apple M1 Max 14.1 NEON BLAS METAL tiny 6 25.15 1.55 4.58 ec7a6f0
Apple M1 Max 14.1 NEON BLAS METAL base 6 45.27 2.31 7.83 ec7a6f0
Apple M1 Max 14.1 NEON BLAS METAL small 6 127.70 4.93 22.11 ec7a6f0
Apple M1 Max 14.1 NEON BLAS METAL medium 6 337.98 10.09 58.36 ec7a6f0
Apple M1 Max 14.1 NEON BLAS METAL large 6 599.25 15.43 100.36 ec7a6f0
Apple M1 Max 14.1 NEON BLAS METAL tiny 8 22.67 1.60 4.63 ec7a6f0
Apple M1 Max 14.1 NEON BLAS METAL base 8 42.72 2.34 7.89 ec7a6f0
Apple M1 Max 14.1 NEON BLAS METAL small 8 122.77 4.98 22.01 ec7a6f0
Apple M1 Max 14.1 NEON BLAS METAL medium 8 337.37 9.95 58.24 ec7a6f0
Apple M1 Max 14.1 NEON BLAS METAL large 8 588.09 15.55 101.77 ec7a6f0
Apple M1 Max 14.1 NEON BLAS METAL tiny 10 27.00 1.66 4.63 ec7a6f0
Apple M1 Max 14.1 NEON BLAS METAL base 10 45.32 2.44 7.96 ec7a6f0
Apple M1 Max 14.1 NEON BLAS METAL small 10 127.46 5.02 22.31 ec7a6f0
Apple M1 Max 14.1 NEON BLAS METAL medium 10 342.60 9.75 58.30 ec7a6f0
Apple M1 Max 14.1 NEON BLAS METAL large 10 593.89 15.67 102.05 ec7a6f0

ggml-backend-no-sched @ 3bfc43e

CPU OS Config Model Th Enc. Dec. PP Commit
Apple M1 Max 14.1 NEON BLAS METAL tiny 4 20.03 1.52 4.54 3bfc43e
Apple M1 Max 14.1 NEON BLAS METAL base 4 38.58 2.21 7.72 3bfc43e
Apple M1 Max 14.1 NEON BLAS METAL small 4 115.09 4.89 22.05 3bfc43e
Apple M1 Max 14.1 NEON BLAS METAL medium 4 318.40 9.84 58.26 3bfc43e
Apple M1 Max 14.1 NEON BLAS METAL large 4 564.22 15.28 101.51 3bfc43e
Apple M1 Max 14.1 NEON BLAS METAL tiny 6 20.21 1.60 4.61 3bfc43e
Apple M1 Max 14.1 NEON BLAS METAL base 6 38.61 2.31 7.80 3bfc43e
Apple M1 Max 14.1 NEON BLAS METAL small 6 115.40 4.96 22.18 3bfc43e
Apple M1 Max 14.1 NEON BLAS METAL medium 6 318.59 10.12 58.47 3bfc43e
Apple M1 Max 14.1 NEON BLAS METAL large 6 564.97 15.36 101.84 3bfc43e
Apple M1 Max 14.1 NEON BLAS METAL tiny 8 20.30 1.63 4.63 3bfc43e
Apple M1 Max 14.1 NEON BLAS METAL base 8 38.82 2.37 7.93 3bfc43e
Apple M1 Max 14.1 NEON BLAS METAL small 8 115.64 4.96 22.21 3bfc43e
Apple M1 Max 14.1 NEON BLAS METAL medium 8 318.38 10.19 58.61 3bfc43e
Apple M1 Max 14.1 NEON BLAS METAL large 8 564.98 15.50 101.93 3bfc43e
Apple M1 Max 14.1 NEON BLAS METAL tiny 10 20.49 1.67 4.66 3bfc43e
Apple M1 Max 14.1 NEON BLAS METAL base 10 38.91 2.37 7.97 3bfc43e
Apple M1 Max 14.1 NEON BLAS METAL small 10 116.40 5.05 22.27 3bfc43e
Apple M1 Max 14.1 NEON BLAS METAL medium 10 318.70 10.23 58.60 3bfc43e
Apple M1 Max 14.1 NEON BLAS METAL large 10 564.15 15.62 101.70 3bfc43e

Comment on lines 209 to 213
// TODO: check if other platforms can benefit from this optimization
// TODO: CUDA is currently broken - seems ggml_mul_mat does not handle views correctly
#if defined(GGML_USE_METAL)
#define ggml_mul_mat ggml_mul_mat_pad
#endif
@ggerganov (Owner, author) commented Nov 11, 2023

The ggml_mul_mat_pad trick is very useful for the Metal kernels and provides significant improvement for the encoder.

Currently, this trick does not work with CUDA because we seem to have issues in some cases when the src are non-contiguous views. At the very least ggml_cuda_mul_mat_mat_batched_cublas does not handle all cases correctly for src1 being non-contiguous because ggml_get_to_fp16_cuda() assumes data without "holes" (i.e. contiguously-permuted), but there might be other issues as well.

We should keep this in mind and either fix it or assert properly.
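
For intuition, here is a minimal standalone C++ check (an illustration of the idea, not the whisper.cpp implementation) of why the padding trick is mathematically safe: zero-padding the shared dimension of both operands leaves the product unchanged, so the kernels can always be fed nicely aligned sizes.

// Zero-padding the shared dimension K of a matrix product does not change the
// result, since the extra terms in each dot product are multiplications by zero.
#include <cassert>
#include <vector>

// naive row-major matmul: C[M x N] = A[M x K] * B[K x N]
static std::vector<float> matmul(const std::vector<float> & A, const std::vector<float> & B, int M, int K, int N) {
    std::vector<float> C(M*N, 0.0f);
    for (int i = 0; i < M; ++i)
        for (int k = 0; k < K; ++k)
            for (int j = 0; j < N; ++j)
                C[i*N + j] += A[i*K + k]*B[k*N + j];
    return C;
}

// zero-pad the K dimension of A (M x K) up to K_pad
static std::vector<float> pad_rows(const std::vector<float> & A, int M, int K, int K_pad) {
    std::vector<float> out(M*K_pad, 0.0f);
    for (int i = 0; i < M; ++i)
        for (int k = 0; k < K; ++k)
            out[i*K_pad + k] = A[i*K + k];
    return out;
}

// zero-pad the K dimension of B (K x N) up to K_pad
static std::vector<float> pad_cols(const std::vector<float> & B, int K, int N, int K_pad) {
    std::vector<float> out(K_pad*N, 0.0f);
    for (int k = 0; k < K; ++k)
        for (int j = 0; j < N; ++j)
            out[k*N + j] = B[k*N + j];
    return out;
}

int main() {
    const int M = 3, K = 5, N = 4;
    const int pad = 8;                          // hypothetical alignment requirement
    const int K_pad = ((K + pad - 1)/pad)*pad;  // K rounded up to a multiple of pad

    std::vector<float> A(M*K), B(K*N);
    for (int i = 0; i < M*K; ++i) A[i] = 0.1f*i;
    for (int i = 0; i < K*N; ++i) B[i] = 0.2f*i - 1.0f;

    const auto C  = matmul(A, B, M, K, N);
    const auto Cp = matmul(pad_rows(A, M, K, K_pad), pad_cols(B, K, N, K_pad), M, K_pad, N);

    for (int i = 0; i < M*N; ++i) {
        assert(C[i] == Cp[i]); // exact: the appended terms are all +0.0f
    }
    return 0;
}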

@dreness commented Nov 11, 2023

Figured I'd also include a comparison of this PR to main in benchmarks with 1-4 threads. Encoder times with ggml-backend-no-sched @ 0867e69 are still flat. I won't pretend to understand all the code, but this does feel like "no scheduling" to me :)

ggml-backend-no-sched @ 0867e69

CPU OS Config Model Th Enc. Dec. PP Commit
Apple M1 Max 14.1 NEON BLAS METAL tiny 1 19.73 1.45 4.51 0867e69
Apple M1 Max 14.1 NEON BLAS METAL base 1 38.25 2.20 7.78 0867e69
Apple M1 Max 14.1 NEON BLAS METAL small 1 114.96 4.89 21.79 0867e69
Apple M1 Max 14.1 NEON BLAS METAL medium 1 317.80 10.10 58.41 0867e69
Apple M1 Max 14.1 NEON BLAS METAL large 1 564.51 15.42 101.79 0867e69
Apple M1 Max 14.1 NEON BLAS METAL tiny 2 19.76 1.45 4.42 0867e69
Apple M1 Max 14.1 NEON BLAS METAL base 2 38.21 2.18 7.66 0867e69
Apple M1 Max 14.1 NEON BLAS METAL small 2 114.94 4.83 21.95 0867e69
Apple M1 Max 14.1 NEON BLAS METAL medium 2 317.65 10.00 58.29 0867e69
Apple M1 Max 14.1 NEON BLAS METAL large 2 564.53 15.00 100.43 0867e69
Apple M1 Max 14.1 NEON BLAS METAL tiny 3 19.56 1.47 4.39 0867e69
Apple M1 Max 14.1 NEON BLAS METAL base 3 37.86 2.18 7.59 0867e69
Apple M1 Max 14.1 NEON BLAS METAL small 3 115.23 4.84 21.95 0867e69
Apple M1 Max 14.1 NEON BLAS METAL medium 3 319.08 9.97 57.92 0867e69
Apple M1 Max 14.1 NEON BLAS METAL large 3 564.12 15.04 100.82 0867e69
Apple M1 Max 14.1 NEON BLAS METAL tiny 4 19.94 1.53 4.51 0867e69
Apple M1 Max 14.1 NEON BLAS METAL base 4 39.15 2.23 7.77 0867e69
Apple M1 Max 14.1 NEON BLAS METAL small 4 115.17 4.85 22.02 0867e69
Apple M1 Max 14.1 NEON BLAS METAL medium 4 317.77 9.93 58.31 0867e69
Apple M1 Max 14.1 NEON BLAS METAL large 4 564.59 15.24 101.59 0867e69

main @ ec7a6f0

CPU OS Config Model Th Enc. Dec. PP Commit
Apple M1 Max 14.1 NEON BLAS METAL tiny 1 54.36 1.45 4.42 ec7a6f0
Apple M1 Max 14.1 NEON BLAS METAL base 1 95.79 2.24 7.71 ec7a6f0
Apple M1 Max 14.1 NEON BLAS METAL small 1 230.22 4.87 21.98 ec7a6f0
Apple M1 Max 14.1 NEON BLAS METAL medium 1 506.86 10.02 58.11 ec7a6f0
Apple M1 Max 14.1 NEON BLAS METAL large 1 871.68 15.16 99.93 ec7a6f0
Apple M1 Max 14.1 NEON BLAS METAL tiny 2 35.83 1.48 4.45 ec7a6f0
Apple M1 Max 14.1 NEON BLAS METAL base 2 63.46 2.22 7.74 ec7a6f0
Apple M1 Max 14.1 NEON BLAS METAL small 2 169.18 4.78 21.72 ec7a6f0
Apple M1 Max 14.1 NEON BLAS METAL medium 2 408.08 9.97 56.43 ec7a6f0
Apple M1 Max 14.1 NEON BLAS METAL large 2 708.51 15.21 101.22 ec7a6f0
Apple M1 Max 14.1 NEON BLAS METAL tiny 3 30.43 1.49 4.47 ec7a6f0
Apple M1 Max 14.1 NEON BLAS METAL base 3 54.36 2.24 7.59 ec7a6f0
Apple M1 Max 14.1 NEON BLAS METAL small 3 147.03 4.76 21.95 ec7a6f0
Apple M1 Max 14.1 NEON BLAS METAL medium 3 372.93 9.94 57.87 ec7a6f0
Apple M1 Max 14.1 NEON BLAS METAL large 3 653.22 15.23 101.46 ec7a6f0
Apple M1 Max 14.1 NEON BLAS METAL tiny 4 27.01 1.52 4.52 ec7a6f0
Apple M1 Max 14.1 NEON BLAS METAL base 4 49.59 2.22 7.70 ec7a6f0
Apple M1 Max 14.1 NEON BLAS METAL small 4 136.85 4.84 21.97 ec7a6f0
Apple M1 Max 14.1 NEON BLAS METAL medium 4 355.03 9.94 57.95 ec7a6f0
Apple M1 Max 14.1 NEON BLAS METAL large 4 626.61 15.20 101.43 ec7a6f0
[plot: encoder times, ggml-backend-no-sched vs main]

@ggerganov (Owner, author)

Nice plot! Yeah, on master, a small part of the Encoder (the 2 convolutions + GELU activations) was performed on the CPU because we didn't have the necessary Metal kernels. With some recent help from @FSSRepo, we now have the kernels for both Metal and CUDA, so with this PR no computation is done on the CPU anymore and the performance should not depend on the number of threads.
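
For readers unfamiliar with the approach: im2col unrolls the convolution input into a matrix so that the convolution itself becomes a single matrix multiplication, which is exactly the kind of work the GPU mul-mat kernels already handle well. A minimal CPU-side sketch of the 1-D case in C++ (an illustration of the technique, not the actual CUDA/Metal kernel code):

// 1-D im2col: the input signal has T time steps and C_in channels, laid out as
// in[t*C_in + c]; the output matrix has one row per output position, containing
// that position's full receptive field (K taps x C_in channels).
#include <vector>

static std::vector<float> im2col_1d(const std::vector<float> & in, int T, int C_in, int K, int stride, int pad) {
    const int T_out = (T + 2*pad - K)/stride + 1;
    std::vector<float> cols(T_out*K*C_in, 0.0f);   // zero-initialized -> implicit zero padding
    for (int t = 0; t < T_out; ++t) {
        for (int k = 0; k < K; ++k) {
            const int src = t*stride + k - pad;    // input position for this tap
            if (src < 0 || src >= T) continue;     // outside the signal: keep the zeros
            for (int c = 0; c < C_in; ++c) {
                cols[(t*K + k)*C_in + c] = in[src*C_in + c];
            }
        }
    }
    return cols;
}

// after im2col, the convolution is just one matmul:
//   out[T_out x C_out] = cols[T_out x (K*C_in)] * W[(K*C_in) x C_out]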

@slaren

This comment was marked as outdated.

@ggerganov (Owner, author)

"looks like sometimes different buffers may be allocated in the same address and that can confuse ggml-alloc"

Let me know if I can help debug this somehow. I haven't been able to reproduce it on Linux or macOS yet.

@slaren (Collaborator) commented Nov 11, 2023

The issue is that the encoder graph uses tensors from a previous graph. During measure, these tensors are allocated in a measure buffer which has already been freed (when the measure allocator was freed). Sometimes, malloc will return the same address for the encode measure buffer as for the measure buffer used in the previous graph. This causes ggml-alloc to think that these tensors are from the same buffer and tries to reuse their memory. As a result, when that happens the encode buffer size is measured to be smaller.

A workaround would be to keep the same measure allocators alive until all the graphs have been measured, and only then reallocate the buffers and allocators with the correct sizes. I suppose that whisper.cpp is using freed tensors, so it's not unreasonable to consider this "undefined behavior", but practically this is not a good limitation to have. I want to fix this in ggml-alloc/ggml-backend by allowing the same buffers to be reallocated, but that's not going to be a quick fix.
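
As a tiny standalone C++ illustration of that failure mode (whether the address is actually reused depends on the system allocator, so this may or may not print the message on any given run): once a buffer has been freed, the next allocation can legally come back at the exact same address, and bookkeeping keyed on the raw pointer will then conflate the two buffers.

#include <cstdint>
#include <cstdio>
#include <cstdlib>

int main() {
    void * measure = malloc(1 << 20);                 // "measure" buffer for one graph
    const uintptr_t measure_addr = (uintptr_t) measure;
    printf("measure buffer: %p\n", measure);
    free(measure);                                    // the measure allocator is freed ...

    void * encode = malloc(1 << 20);                  // ... and the next buffer may land
    printf("encode  buffer: %p\n", encode);           //     at exactly the same address

    if ((uintptr_t) encode == measure_addr) {
        // any bookkeeping that matches tensors to buffers by raw address would now
        // treat tensors of the freed buffer as if they lived in the new one -
        // the confusion described above
        printf("same address reused\n");
    }
    free(encode);
    return 0;
}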

@slaren (Collaborator) commented Nov 11, 2023

This doesn't fix that issue, but while looking into this I also found other problems:

diff --git a/whisper.cpp b/whisper.cpp
index eb69f96..a786593 100644
--- a/whisper.cpp
+++ b/whisper.cpp
@@ -636,12 +636,11 @@ static void whisper_allocr_graph_init(struct whisper_allocr & allocr, ggml_backe
     auto & meta   = allocr.meta;
     auto & buffer = allocr.buffer;

-    const int tensor_alignment = ggml_backend_get_alignment(backend);
-    alloc = ggml_allocr_new_measure(tensor_alignment);
+    alloc = ggml_allocr_new_measure_from_backend(backend);

     meta.resize(ggml_tensor_overhead()*WHISPER_MAX_NODES + ggml_graph_overhead());

-    const size_t alloc_size = ggml_allocr_alloc_graph(alloc, get_graph()) + tensor_alignment;
+    const size_t alloc_size = ggml_allocr_alloc_graph(alloc, get_graph());

     ggml_allocr_free(alloc);

@@ -1284,7 +1283,7 @@ static bool whisper_model_load(struct whisper_model_loader * loader, whisper_con

         // initialize the backends
 #ifdef GGML_USE_CUBLAS
-        if (wctx.params.use_gpu > 0) {
+        if (wctx.params.use_gpu) {
             WHISPER_LOG_INFO("%s: using CUDA backend\n", __func__);
             backend_gpu = ggml_backend_cuda_init();
             if (!backend_gpu) {

@ggerganov (Owner, author) commented Nov 11, 2023

"A workaround would be to keep the same measure allocators alive until all the graphs have been measured"

Ok, I'll try to apply this. If it is a quick fix, feel free to apply it here since I don't have a Windows machine to test with.

I also realized another issue - the -p option can be used to split an audio file into chunks and process those chunks in parallel with multiple whisper_state instances. Currently, the states share the same backend instance, which is stored in whisper_context. But this is not thread-safe, because during ggml_backend_graph_compute() the same backend work buffer will be used by all states.

I plan to create a new backend instance for each new whisper_state, while also keeping the backend instance in whisper_context for creating the buffer holding the model tensors. Does this sound ok?
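
A rough C++ sketch of the ownership layout being proposed (simplified stand-in types rather than the real ggml/whisper API - treat it only as an illustration of the plan, not the eventual implementation):

// The context keeps one backend used to allocate the buffer holding the model
// tensors, while every whisper_state gets its own backend instance, so parallel
// states never share a backend work buffer during graph compute.
#include <memory>
#include <vector>

struct backend_t {           // stand-in for a ggml backend instance
    std::vector<char> work;  // per-backend compute work buffer (not safe to share between threads)
};

struct context_sketch {
    std::unique_ptr<backend_t> backend_model;   // used for the model tensor buffer
    // ... model tensors, hparams, ...
};

struct state_sketch {
    std::unique_ptr<backend_t> backend_compute; // one instance per state -> isolated graph compute
    // ... KV caches, compute buffers, ...
};

// each of the -p parallel states constructs its own backend instance:
static state_sketch make_state(const context_sketch & /*ctx*/) {
    state_sketch st;
    st.backend_compute = std::make_unique<backend_t>();
    return st;
}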

@slaren (Collaborator) commented Nov 11, 2023

Yes, that should work. I also realized that this would be an issue in llama.cpp when creating multiple llama_context instances from the same llama_model. My conclusion is that buffers need to be decoupled from the backend instances, but that's a bigger change.

@slaren (Collaborator) commented Nov 11, 2023

This should fix the issue with MSVC:

diff --git a/whisper.cpp b/whisper.cpp
index d16492c..471d9a8 100644
--- a/whisper.cpp
+++ b/whisper.cpp
@@ -642,7 +642,7 @@ struct whisper_allocr {
 };

 static size_t whisper_allocr_size(struct whisper_allocr & allocr) {
-    return allocr.meta.size() + ggml_backend_buffer_get_size(allocr.buffer);
+    return allocr.meta.size() + ggml_allocr_max_size(allocr.alloc);
 }

 // measure the memory usage of a graph and prepare the allocr's internal data buffer
@@ -655,12 +655,19 @@ static void whisper_allocr_graph_init(struct whisper_allocr & allocr, ggml_backe

     meta.resize(ggml_tensor_overhead()*WHISPER_MAX_NODES + ggml_graph_overhead());

-    const size_t alloc_size = ggml_allocr_alloc_graph(alloc, get_graph());
+    ggml_allocr_alloc_graph(alloc, get_graph());
+}
+
+static void whisper_allocr_graph_realloc(struct whisper_allocr & allocr, ggml_backend_t backend) {
+    auto & alloc  = allocr.alloc;
+    auto & buffer = allocr.buffer;
+
+    size_t size = ggml_allocr_max_size(alloc);

     ggml_allocr_free(alloc);

-    buffer = ggml_backend_alloc_buffer(backend, alloc_size);
-    alloc  = ggml_allocr_new_from_buffer(buffer);
+    buffer = ggml_backend_alloc_buffer(backend, size);
+    alloc = ggml_allocr_new_from_buffer(buffer);
 }

 static void whisper_allocr_free(struct whisper_allocr & allocr) {
@@ -2915,6 +2922,11 @@ struct whisper_state * whisper_init_state(whisper_context * ctx) {
         WHISPER_LOG_INFO("%s: compute buffer (decode) = %7.2f MB\n", __func__, whisper_allocr_size(state->alloc_decode) / 1024.0 / 1024.0);
     }

+    whisper_allocr_graph_realloc(state->alloc_conv, ctx->backend);
+    whisper_allocr_graph_realloc(state->alloc_encode, ctx->backend);
+    whisper_allocr_graph_realloc(state->alloc_cross, ctx->backend);
+    whisper_allocr_graph_realloc(state->alloc_decode, ctx->backend);
+
     state->rng = std::mt19937(0);

     return state;

Native Windows bench:

CPU OS Config Model Th Enc. Dec. PP Commit
3090Ti Win11 AVX2 BLAS CUDA tiny 1 5.31 1.36 8.64 fc8565d
3090Ti Win11 AVX2 BLAS CUDA tiny-q5_0 1 5.09 1.13 8.06 fc8565d
3090Ti Win11 AVX2 BLAS CUDA tiny-q5_1 1 5.15 1.14 8.83 fc8565d
3090Ti Win11 AVX2 BLAS CUDA base 1 9.39 1.90 9.15 fc8565d
3090Ti Win11 AVX2 BLAS CUDA base-q5_0 1 9.54 1.59 9.85 fc8565d
3090Ti Win11 AVX2 BLAS CUDA base-q5_1 1 9.53 1.58 9.94 fc8565d
3090Ti Win11 AVX2 BLAS CUDA small 1 25.58 3.72 13.55 fc8565d
3090Ti Win11 AVX2 BLAS CUDA small-q5_0 1 26.21 2.96 14.32 fc8565d
3090Ti Win11 AVX2 BLAS CUDA small-q5_1 1 26.11 2.96 14.48 fc8565d
3090Ti Win11 AVX2 BLAS CUDA medium 1 69.94 7.68 23.45 fc8565d
3090Ti Win11 AVX2 BLAS CUDA medium-q5_0 1 71.93 5.78 25.80 fc8565d
3090Ti Win11 AVX2 BLAS CUDA medium-q5_1 1 71.91 5.77 25.67 fc8565d
3090Ti Win11 AVX2 BLAS CUDA large 1 116.59 10.81 33.64 fc8565d
3090Ti Win11 AVX2 BLAS CUDA large-q5_0 1 120.39 7.74 37.87 fc8565d
3090Ti Win11 AVX2 BLAS CUDA large-q5_1 1 119.76 7.69 37.96 fc8565d

@ggerganov (Owner, author)

Thanks. The backend fix seems to work for the CPU, but it breaks with Metal because each backend (i.e. ggml_metal_context) keeps track of the associated buffers, and they can end up in different backend instances. I will look for a fix tomorrow.

@bobqianic (Collaborator)

It appears that the setup-qemu-action is experiencing problems, impacting a significant portion of our CI testing. docker/setup-qemu-action#110

* whisper : try to fix the parallel whisper_state functionality

* whisper : fix multi-state Metal

* whisper : free backend instances in whisper_state
@ggerganov merged commit b050283 into master on Nov 12, 2023
68 of 72 checks passed
felrock pushed a commit to felrock/whisper.cpp that referenced this pull request Nov 18, 2023
* whisper : migrate to ggml-backend

* whisper : fix logit reading

* whisper : fix tensor allocation during load

* whisper : fix beam-search with CUDA

* whisper : free backends + fix compile warning

* whisper : print when CUDA is enabled

* whisper : fix CoreML

* make : clean-up

* talk : fix compile warning

* whisper : support ggml_conv with CUDA and Metal (ggerganov#1473)

* ggml : add CUDA support for ggml_conv

* whisper : remove ggml_repeat for conv bias + single backend

* cuda : fix im2col kernel

* metal : add im2col support + mul mat-vec f16 x f16

* bench-all : add q4 models

* whisper : clean-up

* quantize-all : fix

* ggml : im2col opts

* whisper : avoid whisper_model_data wrapper

* whisper : add note that ggml_mul_mat_pad does not work with CUDA

* whisper : factor out graph compute in common function

* whisper : fixes

* whisper : fix UB with measure buffers

* whisper : try to fix the parallel whisper_state functionality (ggerganov#1479)

* whisper : try to fix the parallel whisper_state functionality

* whisper : fix multi-state Metal

* whisper : free backend instances in whisper_state
@100tomer

In my testing on an M1 Pro it's slower on the GPU compared to 8/10 CPU threads. Does this make any sense?
I tested converting 2 audio files: on the CPU it took 1:12, and on the GPU around 3 minutes...
Also, I made sure it is using the GPU - it's at 100% usage...

landtanin pushed a commit to landtanin/whisper.cpp that referenced this pull request Dec 16, 2023
iThalay pushed a commit to iThalay/whisper.cpp that referenced this pull request Sep 23, 2024