
whisper : use flash attention #2152

Merged: 9 commits merged into master from gg/flash-attn on May 15, 2024

Conversation

ggerganov (Owner) commented on May 14, 2024

Flash attention can now be enabled via `whisper_context_params.flash_attn = true`. The examples accept the command-line argument `-fa` to enable the kernels (similar to llama.cpp).

Performance gains are expected on the Metal and CUDA backends. On the CPU, enabling FA will likely degrade performance.
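For reference, a minimal sketch of enabling the new parameter from the C API (assuming the `whisper_context_default_params` / `whisper_init_from_file_with_params` entry points; the model path is only illustrative):

```c
// Sketch: enabling flash attention when creating a whisper context.
// Field and function names are as introduced around this PR; verify against
// the whisper.h shipped with your version.
#include "whisper.h"

int main(void) {
    struct whisper_context_params cparams = whisper_context_default_params();
    cparams.flash_attn = true;  // opt in to the flash-attention kernels (Metal / CUDA)

    // model path is illustrative
    struct whisper_context * ctx =
        whisper_init_from_file_with_params("models/ggml-base.en.bin", cparams);
    if (ctx == NULL) {
        return 1;
    }

    // ... run whisper_full() as usual ...

    whisper_free(ctx);
    return 0;
}
```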

M1 Pro

| CPU | Config | Model | Th | FA | Enc. | Dec. | Bch5 | PP | Commit |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| M1 Pro | METAL | tiny | 1 | 0 | 39.21 | 1.74 | 0.61 | 0.04 | 22c96b4 |
| M1 Pro | METAL | base | 1 | 0 | 70.76 | 2.60 | 0.93 | 0.06 | 22c96b4 |
| M1 Pro | METAL | small | 1 | 0 | 217.28 | 6.42 | 2.14 | 0.17 | 22c96b4 |
| M1 Pro | METAL | medium | 1 | 0 | 596.74 | 14.43 | 4.75 | 0.45 | 22c96b4 |

| CPU | Config | Model | Th | FA | Enc. | Dec. | Bch5 | PP | Commit |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| M1 Pro | METAL | tiny | 1 | 1 | 30.77 | 1.59 | 0.54 | 0.03 | 22c96b4 |
| M1 Pro | METAL | base | 1 | 1 | 60.42 | 2.29 | 0.81 | 0.05 | 22c96b4 |
| M1 Pro | METAL | small | 1 | 1 | 183.82 | 5.12 | 1.81 | 0.14 | 22c96b4 |
| M1 Pro | METAL | medium | 1 | 1 | 517.92 | 11.60 | 4.01 | 0.38 | 22c96b4 |

M2 Ultra

| CPU | Config | Model | Th | FA | Enc. | Dec. | Bch5 | PP | Commit |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| M2 ULTRA | METAL | tiny | 1 | 0 | 12.32 | 1.35 | 0.49 | 0.01 | 22c96b4 |
| M2 ULTRA | METAL | tiny-q5_0 | 1 | 0 | 11.65 | 1.30 | 0.51 | 0.01 | 22c96b4 |
| M2 ULTRA | METAL | tiny-q5_1 | 1 | 0 | 12.08 | 1.30 | 0.51 | 0.01 | 22c96b4 |
| M2 ULTRA | METAL | base | 1 | 0 | 17.58 | 1.90 | 0.76 | 0.02 | 22c96b4 |
| M2 ULTRA | METAL | base-q5_0 | 1 | 0 | 18.89 | 1.86 | 0.79 | 0.02 | 22c96b4 |
| M2 ULTRA | METAL | base-q5_1 | 1 | 0 | 20.69 | 1.88 | 0.79 | 0.02 | 22c96b4 |
| M2 ULTRA | METAL | small | 1 | 0 | 49.32 | 3.85 | 1.71 | 0.05 | 22c96b4 |
| M2 ULTRA | METAL | small-q5_0 | 1 | 0 | 54.91 | 3.81 | 1.82 | 0.06 | 22c96b4 |
| M2 ULTRA | METAL | small-q5_1 | 1 | 0 | 54.92 | 3.81 | 1.79 | 0.06 | 22c96b4 |
| M2 ULTRA | METAL | medium | 1 | 0 | 134.34 | 8.04 | 3.82 | 0.13 | 22c96b4 |
| M2 ULTRA | METAL | medium-q5_0 | 1 | 0 | 151.68 | 7.59 | 4.07 | 0.14 | 22c96b4 |
| M2 ULTRA | METAL | medium-q5_1 | 1 | 0 | 151.58 | 7.67 | 4.07 | 0.14 | 22c96b4 |
| M2 ULTRA | METAL | medium-dis | 1 | 0 | 120.82 | 1.07 | 0.41 | 0.02 | 22c96b4 |
| M2 ULTRA | METAL | large-v2 | 1 | 0 | 235.63 | 12.27 | 5.85 | 0.22 | 22c96b4 |
| M2 ULTRA | METAL | large-v2-q5_0 | 1 | 0 | 273.38 | 11.17 | 6.40 | 0.26 | 22c96b4 |
| M2 ULTRA | METAL | large-v2-q5_1 | 1 | 0 | 272.44 | 11.32 | 6.29 | 0.26 | 22c96b4 |
| M2 ULTRA | METAL | large-v2-dis | 1 | 0 | 212.51 | 1.20 | 0.47 | 0.02 | 22c96b4 |

| CPU | Config | Model | Th | FA | Enc. | Dec. | Bch5 | PP | Commit |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| M2 ULTRA | METAL | tiny | 1 | 1 | 9.07 | 1.33 | 0.45 | 0.01 | 22c96b4 |
| M2 ULTRA | METAL | tiny-q5_0 | 1 | 1 | 9.74 | 1.33 | 0.47 | 0.01 | 22c96b4 |
| M2 ULTRA | METAL | tiny-q5_1 | 1 | 1 | 8.93 | 1.31 | 0.46 | 0.01 | 22c96b4 |
| M2 ULTRA | METAL | base | 1 | 1 | 15.75 | 1.87 | 0.71 | 0.02 | 22c96b4 |
| M2 ULTRA | METAL | base-q5_0 | 1 | 1 | 17.04 | 1.83 | 0.74 | 0.02 | 22c96b4 |
| M2 ULTRA | METAL | base-q5_1 | 1 | 1 | 17.17 | 1.83 | 0.74 | 0.02 | 22c96b4 |
| M2 ULTRA | METAL | small | 1 | 1 | 42.33 | 3.64 | 1.60 | 0.05 | 22c96b4 |
| M2 ULTRA | METAL | small-q5_0 | 1 | 1 | 47.61 | 3.63 | 1.70 | 0.05 | 22c96b4 |
| M2 ULTRA | METAL | small-q5_1 | 1 | 1 | 47.70 | 3.66 | 1.68 | 0.05 | 22c96b4 |
| M2 ULTRA | METAL | medium | 1 | 1 | 114.42 | 7.53 | 3.55 | 0.11 | 22c96b4 |
| M2 ULTRA | METAL | medium-q5_0 | 1 | 1 | 132.63 | 7.02 | 3.77 | 0.13 | 22c96b4 |
| M2 ULTRA | METAL | medium-q5_1 | 1 | 1 | 132.28 | 7.10 | 3.76 | 0.13 | 22c96b4 |
| M2 ULTRA | METAL | medium-dis | 1 | 1 | 102.34 | 1.01 | 0.42 | 0.01 | 22c96b4 |
| M2 ULTRA | METAL | large-v2 | 1 | 1 | 203.01 | 11.03 | 5.45 | 0.20 | 22c96b4 |
| M2 ULTRA | METAL | large-v2-q5_0 | 1 | 1 | 240.05 | 10.18 | 5.98 | 0.23 | 22c96b4 |
| M2 ULTRA | METAL | large-v2-q5_1 | 1 | 1 | 239.22 | 10.23 | 5.87 | 0.23 | 22c96b4 |
| M2 ULTRA | METAL | large-v2-dis | 1 | 1 | 181.14 | 1.14 | 0.48 | 0.02 | 22c96b4 |

Ryzen 9 5950X + RTX 2060

| GPU | Config | Model | Th | FA | Enc. | Dec. | Bch5 | PP | Commit |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| RTX 2060 | AVX2 CUDA | tiny | 8 | 0 | 12.54 | 0.93 | 0.29 | 0.02 | 22c96b4 |
| RTX 2060 | AVX2 CUDA | tiny-q5_0 | 8 | 0 | 12.73 | 0.98 | 0.24 | 0.02 | 22c96b4 |
| RTX 2060 | AVX2 CUDA | tiny-q5_1 | 8 | 0 | 12.72 | 0.99 | 0.24 | 0.02 | 22c96b4 |
| RTX 2060 | AVX2 CUDA | base | 8 | 0 | 24.14 | 1.28 | 0.41 | 0.03 | 22c96b4 |
| RTX 2060 | AVX2 CUDA | base-q5_0 | 8 | 0 | 24.58 | 1.38 | 0.35 | 0.03 | 22c96b4 |
| RTX 2060 | AVX2 CUDA | base-q5_1 | 8 | 0 | 24.58 | 1.37 | 0.35 | 0.03 | 22c96b4 |
| RTX 2060 | AVX2 CUDA | small | 8 | 0 | 74.70 | 2.91 | 0.84 | 0.07 | 22c96b4 |
| RTX 2060 | AVX2 CUDA | small-q5_0 | 8 | 0 | 76.12 | 2.84 | 0.77 | 0.08 | 22c96b4 |
| RTX 2060 | AVX2 CUDA | small-q5_1 | 8 | 0 | 76.14 | 2.84 | 0.76 | 0.08 | 22c96b4 |
| RTX 2060 | AVX2 CUDA | medium | 8 | 0 | 200.69 | 6.46 | 1.83 | 0.17 | 22c96b4 |
| RTX 2060 | AVX2 CUDA | medium-q5_0 | 8 | 0 | 204.80 | 5.90 | 1.65 | 0.19 | 22c96b4 |
| RTX 2060 | AVX2 CUDA | medium-q5_1 | 8 | 0 | 205.61 | 5.85 | 1.61 | 0.19 | 22c96b4 |
| RTX 2060 | AVX2 CUDA | medium-dis | 8 | 0 | 186.17 | 0.86 | 0.24 | 0.02 | 22c96b4 |
| RTX 2060 | AVX2 CUDA | large-v2 | 8 | 0 | 347.22 | 10.36 | 2.82 | 0.29 | 22c96b4 |
| RTX 2060 | AVX2 CUDA | large-v2-q5_0 | 8 | 0 | 357.06 | 8.81 | 2.58 | 0.34 | 22c96b4 |
| RTX 2060 | AVX2 CUDA | large-v2-q5_1 | 8 | 0 | 356.97 | 8.62 | 2.49 | 0.33 | 22c96b4 |
| RTX 2060 | AVX2 CUDA | large-v2-dis | 8 | 0 | 318.05 | 1.03 | 0.34 | 0.04 | 22c96b4 |

| GPU | Config | Model | Th | FA | Enc. | Dec. | Bch5 | PP | Commit |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| RTX 2060 | AVX2 CUDA | tiny | 8 | 1 | 7.21 | 0.76 | 0.29 | 0.02 | 22c96b4 |
| RTX 2060 | AVX2 CUDA | tiny-q5_0 | 8 | 1 | 7.42 | 0.82 | 0.18 | 0.02 | 22c96b4 |
| RTX 2060 | AVX2 CUDA | tiny-q5_1 | 8 | 1 | 7.38 | 0.82 | 0.18 | 0.02 | 22c96b4 |
| RTX 2060 | AVX2 CUDA | base | 8 | 1 | 13.49 | 1.04 | 0.36 | 0.02 | 22c96b4 |
| RTX 2060 | AVX2 CUDA | base-q5_0 | 8 | 1 | 13.94 | 1.13 | 0.26 | 0.03 | 22c96b4 |
| RTX 2060 | AVX2 CUDA | base-q5_1 | 8 | 1 | 13.94 | 1.14 | 0.26 | 0.03 | 22c96b4 |
| RTX 2060 | AVX2 CUDA | small | 8 | 1 | 42.81 | 2.33 | 0.69 | 0.05 | 22c96b4 |
| RTX 2060 | AVX2 CUDA | small-q5_0 | 8 | 1 | 44.43 | 2.25 | 0.59 | 0.06 | 22c96b4 |
| RTX 2060 | AVX2 CUDA | small-q5_1 | 8 | 1 | 44.11 | 2.24 | 0.58 | 0.06 | 22c96b4 |
| RTX 2060 | AVX2 CUDA | medium | 8 | 1 | 115.47 | 5.17 | 1.45 | 0.11 | 22c96b4 |
| RTX 2060 | AVX2 CUDA | medium-q5_0 | 8 | 1 | 120.37 | 4.63 | 1.25 | 0.13 | 22c96b4 |
| RTX 2060 | AVX2 CUDA | medium-q5_1 | 8 | 1 | 120.28 | 4.55 | 1.21 | 0.13 | 22c96b4 |
| RTX 2060 | AVX2 CUDA | medium-dis | 8 | 1 | 101.69 | 0.75 | 0.20 | 0.02 | 22c96b4 |
| RTX 2060 | AVX2 CUDA | large-v2 | 8 | 1 | 205.67 | 8.49 | 2.19 | 0.18 | 22c96b4 |
| RTX 2060 | AVX2 CUDA | large-v2-q5_0 | 8 | 1 | 214.07 | 6.88 | 1.94 | 0.22 | 22c96b4 |
| RTX 2060 | AVX2 CUDA | large-v2-q5_1 | 8 | 1 | 213.98 | 6.70 | 1.86 | 0.22 | 22c96b4 |
| RTX 2060 | AVX2 CUDA | large-v2-dis | 8 | 1 | 176.71 | 0.91 | 0.31 | 0.03 | 22c96b4 |

V100

| GPU | Config | Model | Th | FA | Enc. | Dec. | Bch5 | PP | Commit |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| V100 | AVX2 CUDA | tiny | 1 | 0 | 6.21 | 1.11 | 0.30 | 0.02 | 22c96b4 |
| V100 | AVX2 CUDA | tiny-q5_1 | 1 | 0 | 5.97 | 1.10 | 0.26 | 0.02 | 22c96b4 |
| V100 | AVX2 CUDA | base | 1 | 0 | 10.95 | 1.47 | 0.42 | 0.03 | 22c96b4 |
| V100 | AVX2 CUDA | base-q5_1 | 1 | 0 | 11.13 | 1.53 | 0.36 | 0.03 | 22c96b4 |
| V100 | AVX2 CUDA | small | 1 | 0 | 31.57 | 2.96 | 0.84 | 0.05 | 22c96b4 |
| V100 | AVX2 CUDA | small-q5_1 | 1 | 0 | 32.19 | 3.14 | 0.75 | 0.05 | 22c96b4 |
| V100 | AVX2 CUDA | medium | 1 | 0 | 85.88 | 6.49 | 1.80 | 0.10 | 22c96b4 |
| V100 | AVX2 CUDA | medium-q5_0 | 1 | 0 | 87.53 | 5.82 | 1.37 | 0.10 | 22c96b4 |
| V100 | AVX2 CUDA | large-v2 | 1 | 0 | 142.23 | 8.92 | 2.62 | 0.15 | 22c96b4 |

| GPU | Config | Model | Th | FA | Enc. | Dec. | Bch5 | PP | Commit |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| V100 | AVX2 CUDA | tiny | 1 | 1 | 3.96 | 0.82 | 0.24 | 0.02 | 22c96b4 |
| V100 | AVX2 CUDA | tiny-q5_1 | 1 | 1 | 4.05 | 0.85 | 0.18 | 0.02 | 22c96b4 |
| V100 | AVX2 CUDA | base | 1 | 1 | 7.21 | 1.16 | 0.36 | 0.02 | 22c96b4 |
| V100 | AVX2 CUDA | base-q5_1 | 1 | 1 | 7.39 | 1.21 | 0.26 | 0.02 | 22c96b4 |
| V100 | AVX2 CUDA | small | 1 | 1 | 19.81 | 2.41 | 0.71 | 0.04 | 22c96b4 |
| V100 | AVX2 CUDA | small-q5_1 | 1 | 1 | 20.50 | 2.31 | 0.51 | 0.04 | 22c96b4 |
| V100 | AVX2 CUDA | medium | 1 | 1 | 56.02 | 4.89 | 1.44 | 0.07 | 22c96b4 |
| V100 | AVX2 CUDA | medium-q5_0 | 1 | 1 | 57.85 | 4.73 | 1.09 | 0.08 | 22c96b4 |
| V100 | AVX2 CUDA | large-v2 | 1 | 1 | 92.73 | 7.18 | 2.14 | 0.10 | 22c96b4 |

ggerganov force-pushed the gg/flash-attn branch 3 times, most recently from 497dbf4 to bfbfde8 on May 14, 2024 16:07
ggerganov marked this pull request as ready for review on May 14, 2024 17:25
ggerganov (Owner, Author) commented:

Looking for feedback on performance and accuracy. The plan is to merge this PR and release v1.6.0.

Run the tools as usual and add `-fa` to the command line to enable flash attention.
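For example, assuming a local `ggml-base.en.bin` model and the bundled `jfk.wav` sample (paths are illustrative), an invocation along the lines of `./main -m models/ggml-base.en.bin -f samples/jfk.wav -fa` should exercise the new kernels; the same `-fa` flag applies to the bench tool.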

ggerganov merged commit 7094ea5 into master on May 15, 2024
94 of 98 checks passed
ggerganov deleted the gg/flash-attn branch on May 15, 2024 06:38
bygreencn added a commit to bygreencn/whisper.cpp that referenced this pull request Aug 9, 2024
* tag 'v1.6.2':
  release : v1.6.2
  Revert "whisper : remove extra backend instance (huh?)" (ggerganov#2182)
  server : fix typo (ggerganov#2181)
  ruby : update bindings (ggerganov#2154)
  release : v1.6.1
  examples : add support for decoding input with ffmpeg (Linux) (ggerganov#2133)
  node : add flash_attn param (ggerganov#2170)
  ci: Update build.yml to suppress warnings about node.js versions (ggerganov#2166)
  release : v1.6.0
  whisper : use flash attention (ggerganov#2152)
  talk-llama : reject runs without required arguments (ggerganov#2153)
  sync : ggml
  metal : support FA without mask + add asserts (llama/7278)
  ggml : add RPC backend (llama/6829)
  rm wait() (llama/7233)
  CUDA: add FP32 FlashAttention vector kernel (llama/7188)
  scripts : sync ggml-rpc
iThalay pushed a commit to iThalay/whisper.cpp that referenced this pull request Sep 23, 2024
* whisper : use flash attention in the encoder

* whisper : add kv_pad

* whisper : remove extra backend instance (huh?)

* whisper : use FA for cross-attention

* whisper : use FA for self-attention

* whisper : simplify encoder FA

* whisper : add flash_attn runtime parameter

* scripts : add bench log

* scripts : add M1 Pro bench log