is it possible to run openai-whisper ggml model on raspberry pi hardware? #7
Comments
@ggerganov could you please help with this?
It will probably work - why don't you give it a try?
Good news! I just tried it on a Raspberry Pi 4 Model B from 2018 and it works! If you want to try it, use the raspberry branch:
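(The build commands themselves aren't captured in this thread snapshot; a minimal sketch of trying the branch, assuming the standard whisper.cpp clone-and-make flow and the usual model download script:)
# hedged sketch - clone, switch to the raspberry branch, build, and transcribe the bundled sample
git clone https://github.com/ggerganov/whisper.cpp
cd whisper.cpp
git checkout raspberry
bash ./models/download-ggml-model.sh tiny.en
make
./main -m models/ggml-tiny.en.bin -f samples/jfk.wav -t 4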
I don't currently have a Raspberry Pi board, but I will run it as soon as I get one.
You can try running it on whatever Raspberry Pi you have - use the same instructions.
@ggerganov Thanks, I appreciate your quick response.
Some more experiments - enabling NEON instructions reduces the time all the way down to just ~15 seconds to process 30 seconds of audio.
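(The flags used for that experiment aren't quoted here; on a 32-bit OS the NEON-enabled build roughly corresponds to compiler options like the following - treat the exact set as an assumption:)
# hedged sketch - NEON-related CFLAGS for a 32-bit build on an ARMv8 board such as the Pi 4
CFLAGS += -mfpu=neon-fp-armv8 -mfp16-format=ieee -mno-unaligned-access -funsafe-math-optimizations
# then rebuild
make clean && make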
this is awesome
@ggerganov Is it possible to do audio streaming on a Raspberry Pi and convert it to captions live?
@ggerganov
@ggerganov On a Linux computer, I tried the following commands, and streaming performed as expected.
@ggerganov I used the following command to run a stream on a Raspberry Pi 4, but its decoding speed is slow (performance is poor).
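(The command isn't shown in this capture; the stream example in whisper.cpp is typically invoked along these lines - the step/length values are illustrative, and SDL2 is required to build it:)
# hedged sketch - real-time microphone transcription with the stream example
make stream
./stream -m ./models/ggml-tiny.en.bin -t 4 --step 4000 --length 8000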
@nyadla-sys Based on this table, you need a device with a Cortex-A75 CPU: https://en.wikipedia.org/wiki/Comparison_of_Armv8-A_processors
From a quick Google search, none of the existing Raspberry Pi products come with this processor. There are rumours that the Raspberry Pi 5 will use a newer core that would meet this requirement.
@ggerganov
8-bit is not supported yet - maybe in the future
Do you need me to test this on a raspi-zero? I bet it would be very, very slow.
It will be very slow - yes. But it would still be interesting to see how long it takes to process a short sample.
No cigar. I have the old Raspberry Pi Zero W, and it was not connected to the internet to update the clock.
Short Log:
Full Log:
I think it's this flag: -mfpu=neon-fp-armv8, as we are on ARMv6... Extremely unwell. Will continue experiments soon. I hope this will help you.
Yes - you are probably right.
GCC info:
CPU Info:
Info: https://gist.github.com/fm4dd/c663217935dc17f0fc73c9c81b0aa845
Yeah, I'm not an expert when it comes to ARM architectures and compile flags. Maybe try replacing
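(The suggested replacement flag isn't captured above; since the Pi Zero W's ARM1176 core is ARMv6 with VFPv2 and no NEON, a plausible set to experiment with would be something like this - an assumption, not the flags from the thread:)
# hedged sketch - ARMv6 floating-point flags for a Raspberry Pi Zero W (ARM1176JZF-S: VFPv2, no NEON)
CFLAGS += -march=armv6 -mfpu=vfp -mfloat-abi=hard -funsafe-math-optimizations
make clean && make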
Got cigar after 35 minutes! But it was damn slow, and I could only take a screenshot because I was using mosh. First it did not work with the Makefile compiler flags alone; I also had to comment out a line in ggml.c.
Makefile line 38:
ggml.c line 70:
I have a Kindle which is jailbroken, has an Alpine distro and X; I'll try it on that too when I am well. It has an i.MX 6ULL with a Cortex-A7 @ 528 MHz. I had already booted a custom x86_64 Linux with X on it using qemu under Alpine ARM, and it sort of worked well, to my surprise. I think for whisper it may be twice as fast as the raspi 0. How hard would it be for you to support OpenCL as an additional backend for ggml? It would be a great use case, as OpenCL could help accelerate it even on raspis, AMD systems, Android phones and other low-power devices that have a GPU.
@nyadla-sys I think the answer is yes, and this could probably be closed. :)
Could you run a simple test for comparison:
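(The test command isn't captured here; a typical comparison would be to run the bundled sample twice, roughly like this - model and thread count are illustrative:)
# hedged sketch - transcribe the bundled sample twice; the second run loads the model from the page cache
./main -m ./models/ggml-tiny.en.bin -f ./samples/jfk.wav -t 4
./main -m ./models/ggml-tiny.en.bin -f ./samples/jfk.wav -t 4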
Note the load time though, as the 2nd run ran from memory.
Thanks. Taskset doesn't really change anything for me, just the usual random fluctuations:
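(The exact pinning isn't shown; on an RK3588/RK3588S board the Cortex-A76 cores are commonly cpu4-cpu7, so a typical attempt looks like this - the core numbering is an assumption for the specific board:)
# hedged sketch - pin whisper.cpp to the big (Cortex-A76) cores and use one thread per big core
taskset -c 4-7 ./main -m ./models/ggml-tiny.en.bin -f ./samples/jfk.wav -t 4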
I'm very confused 😅
Try the vendor-supplied distros.
Did a
followed by 🤦‍♂️:
@fquirin Is it more stable with
Doesn't look like it:
Directly after this:
Before trying a completely new OS I gave the Ubuntu 23 slim Docker image a chance.
Got a new all-time low ^^:
but average looks more like:
I've seen everything from 1.7s to 3.3s in no particular order. Of course this could still be an issue with the host OS.
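(The Docker invocation isn't captured; a minimal sketch of benchmarking inside a slim Ubuntu container, with the image tag and mount path as assumptions:)
# hedged sketch - build and bench whisper.cpp inside an Ubuntu 23.04 container
docker run --rm -it -v "$PWD":/src -w /src ubuntu:23.04 bash -c \
  "apt-get update && apt-get install -y build-essential && make clean && make && ./bench -m models/ggml-tiny.en.bin -t 4"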
Just tried the 8-bit model on my RPi4, which is running a 32-bit OS:
pi@raspberrypi:~/whisper.cpp $ getconf LONG_BIT
32
pi@raspberrypi:~/whisper.cpp $ ./main -m ./models/ggml-tiny.en-q8_0.bin ./samples/jfk.wav -t 3
whisper_init_from_file_no_state: loading model from './models/ggml-tiny.en-q8_0.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab = 51864
whisper_model_load: n_audio_ctx = 1500
whisper_model_load: n_audio_state = 384
whisper_model_load: n_audio_head = 6
whisper_model_load: n_audio_layer = 4
whisper_model_load: n_text_ctx = 448
whisper_model_load: n_text_state = 384
whisper_model_load: n_text_head = 6
whisper_model_load: n_text_layer = 4
whisper_model_load: n_mels = 80
whisper_model_load: ftype = 7
whisper_model_load: type = 1
whisper_model_load: mem required = 172.00 MB (+ 3.00 MB per decoder)
whisper_model_load: adding 1607 extra tokens
whisper_model_load: model ctx = 43.18 MB
whisper_model_load: model size = 43.14 MB
whisper_init_state: kv self size = 2.62 MB
whisper_init_state: kv cross size = 8.79 MB
system_info: n_threads = 3 / 4 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | VSX = 0 | COREML = 0 |
main: processing './samples/jfk.wav' (176000 samples, 11.0 sec), 3 threads, 1 processors, lang = en, task = transcribe, timestamps = 1 ...
[00:00:00.000 --> 00:00:08.000] And so my fellow Americans ask not what your country can do for you
[00:00:08.000 --> 00:00:11.000] ask what you can do for your country.
whisper_print_timings: load time = 433.38 ms
whisper_print_timings: fallbacks = 0 p / 0 h
whisper_print_timings: mel time = 1068.06 ms
whisper_print_timings: sample time = 192.17 ms / 27 runs ( 7.12 ms per run)
whisper_print_timings: encode time = 9107.05 ms / 1 runs ( 9107.05 ms per run)
whisper_print_timings: decode time = 762.21 ms / 27 runs ( 28.23 ms per run)
whisper_print_timings: total time = 11918.20 ms
pi@raspberrypi:~/whisper.cpp $ ./main -m ./models/ggml-tiny.en-q8_0.bin ./samples/jfk.wav -t 3
whisper_init_from_file_no_state: loading model from './models/ggml-tiny.en-q8_0.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab = 51864
whisper_model_load: n_audio_ctx = 1500
whisper_model_load: n_audio_state = 384
whisper_model_load: n_audio_head = 6
whisper_model_load: n_audio_layer = 4
whisper_model_load: n_text_ctx = 448
whisper_model_load: n_text_state = 384
whisper_model_load: n_text_head = 6
whisper_model_load: n_text_layer = 4
whisper_model_load: n_mels = 80
whisper_model_load: ftype = 7
whisper_model_load: type = 1
whisper_model_load: mem required = 172.00 MB (+ 3.00 MB per decoder)
whisper_model_load: adding 1607 extra tokens
whisper_model_load: model ctx = 43.18 MB
whisper_model_load: model size = 43.14 MB
whisper_init_state: kv self size = 2.62 MB
whisper_init_state: kv cross size = 8.79 MB
system_info: n_threads = 3 / 4 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | VSX = 0 | COREML = 0 |
main: processing './samples/jfk.wav' (176000 samples, 11.0 sec), 3 threads, 1 processors, lang = en, task = transcribe, timestamps = 1 ...
[00:00:00.000 --> 00:00:08.000] And so my fellow Americans ask not what your country can do for you
[00:00:08.000 --> 00:00:11.000] ask what you can do for your country.
whisper_print_timings: load time = 429.34 ms
whisper_print_timings: fallbacks = 0 p / 0 h
whisper_print_timings: mel time = 1062.75 ms
whisper_print_timings: sample time = 77.46 ms / 27 runs ( 2.87 ms per run)
whisper_print_timings: encode time = 10014.02 ms / 1 runs (10014.02 ms per run)
whisper_print_timings: decode time = 413.60 ms / 27 runs ( 15.32 ms per run)
whisper_print_timings: total time = 12351.25 ms
pi@raspberrypi:~/whisper.cpp $ ./main -m ./models/ggml-tiny.en-q8_0.bin ./samples/jfk.wav -t 3
whisper_init_from_file_no_state: loading model from './models/ggml-tiny.en-q8_0.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab = 51864
whisper_model_load: n_audio_ctx = 1500
whisper_model_load: n_audio_state = 384
whisper_model_load: n_audio_head = 6
whisper_model_load: n_audio_layer = 4
whisper_model_load: n_text_ctx = 448
whisper_model_load: n_text_state = 384
whisper_model_load: n_text_head = 6
whisper_model_load: n_text_layer = 4
whisper_model_load: n_mels = 80
whisper_model_load: ftype = 7
whisper_model_load: type = 1
whisper_model_load: mem required = 172.00 MB (+ 3.00 MB per decoder)
whisper_model_load: adding 1607 extra tokens
whisper_model_load: model ctx = 43.18 MB
whisper_model_load: model size = 43.14 MB
whisper_init_state: kv self size = 2.62 MB
whisper_init_state: kv cross size = 8.79 MB
system_info: n_threads = 3 / 4 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | VSX = 0 | COREML = 0 |
main: processing './samples/jfk.wav' (176000 samples, 11.0 sec), 3 threads, 1 processors, lang = en, task = transcribe, timestamps = 1 ...
[00:00:00.000 --> 00:00:08.000] And so my fellow Americans ask not what your country can do for you
[00:00:08.000 --> 00:00:11.000] ask what you can do for your country.
whisper_print_timings: load time = 433.39 ms
whisper_print_timings: fallbacks = 0 p / 0 h
whisper_print_timings: mel time = 890.49 ms
whisper_print_timings: sample time = 77.42 ms / 27 runs ( 2.87 ms per run)
whisper_print_timings: encode time = 9910.22 ms / 1 runs ( 9910.22 ms per run)
whisper_print_timings: decode time = 417.30 ms / 27 runs ( 15.46 ms per run)
whisper_print_timings: total time = 12083.65 ms
pi@raspberrypi:~/whisper.cpp $ ./main -m ./models/ggml-tiny.en-q8_0.bin ./samples/jfk.wav -t 3
whisper_init_from_file_no_state: loading model from './models/ggml-tiny.en-q8_0.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab = 51864
whisper_model_load: n_audio_ctx = 1500
whisper_model_load: n_audio_state = 384
whisper_model_load: n_audio_head = 6
whisper_model_load: n_audio_layer = 4
whisper_model_load: n_text_ctx = 448
whisper_model_load: n_text_state = 384
whisper_model_load: n_text_head = 6
whisper_model_load: n_text_layer = 4
whisper_model_load: n_mels = 80
whisper_model_load: ftype = 7
whisper_model_load: type = 1
whisper_model_load: mem required = 172.00 MB (+ 3.00 MB per decoder)
whisper_model_load: adding 1607 extra tokens
whisper_model_load: model ctx = 43.18 MB
whisper_model_load: model size = 43.14 MB
whisper_init_state: kv self size = 2.62 MB
whisper_init_state: kv cross size = 8.79 MB
system_info: n_threads = 3 / 4 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | VSX = 0 | COREML = 0 |
main: processing './samples/jfk.wav' (176000 samples, 11.0 sec), 3 threads, 1 processors, lang = en, task = transcribe, timestamps = 1 ...
[00:00:00.000 --> 00:00:08.000] And so my fellow Americans ask not what your country can do for you
[00:00:08.000 --> 00:00:11.000] ask what you can do for your country.
whisper_print_timings: load time = 435.73 ms
whisper_print_timings: fallbacks = 0 p / 0 h
whisper_print_timings: mel time = 1075.16 ms
whisper_print_timings: sample time = 77.48 ms / 27 runs ( 2.87 ms per run)
whisper_print_timings: encode time = 8273.19 ms / 1 runs ( 8273.19 ms per run)
whisper_print_timings: decode time = 414.45 ms / 27 runs ( 15.35 ms per run)
whisper_print_timings: total time = 10632.44 ms
pi@raspberrypi:~/whisper.cpp $
The total time fluctuates around 12s but there is a big variation as well. The last run dropped to 10.6s.
Download the Focal OPi image to get it out of the equation.
Consecutive
Indeed, I quickly flashed Ubuntu Jammy server (Ubuntu 22.04.2 LTS) onto an SD card 😲:
[EDIT]
Can you bench the q8_0 model?
# quantize to 8-bits
./quantize models/ggml-tiny.bin models/ggml-tiny-q8_0.bin q8_0
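(The follow-up bench call isn't shown explicitly; presumably something like this on the freshly quantized file:)
# hedged sketch - benchmark the quantized model
./bench -m models/ggml-tiny-q8_0.bin -t 4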
With Q8_0 I'm getting pretty consistent results:
and with Q5_0:
Running the benchmark with only the encoder gives pretty stable results:
Maybe the 'ggml_mul_mat' benchmark leads to throttling of the CPU after some time 🤔, but a drop from '3397' to '1179' seems pretty drastic.
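(The encoder-only invocation isn't captured; the bench tool selects what to measure with its -w option, so the comparison was presumably along these lines - the flag semantics here are from memory, so verify against the tool's help output:)
# hedged sketch - encoder-only benchmark vs. the ggml_mul_mat stress benchmark
./bench -m ./models/ggml-tiny.en.bin -t 4 -w 0   # whisper encoder only
./bench -m ./models/ggml-tiny.en.bin -t 4 -w 2   # ggml_mul_mat benchmark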
Since we get very similar and stable results in single runs now, I decided to investigate the degrading performance for longer runs a bit more. Here are 2 consecutive benchmark runs (encoder-only):
And here are a few consecutive single runs with the small model:
I'd say this is a pretty strong indication that my Orange Pi 5 is throttling after about 30s of cooking the CPU 🤔.
@fquirin Have you ever just opened up another CLI window and monitored the temps vs. clock speed? 75°C is the throttle point, which is quite low for a CPU. I can run stress-ng --cpu 8 --vm 2 --vm-bytes 128M --fork 4 constantly and settle at approx 60°C max.
If you want to go all out on a cooling solution then https://www.amazon.com/dp/B0C2T9N9L2?
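(One way to do that from a second terminal, using generic sysfs paths - thermal zone and CPU numbering vary by board, so treat the paths as assumptions:)
# hedged sketch - poll SoC temperature (millidegrees C) and current CPU frequencies once per second
watch -n 1 'cat /sys/class/thermal/thermal_zone*/temp; cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq'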
Hi, I did try to run the bench with CLBlast:
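(For context, whisper.cpp exposes CLBlast support through a CMake option; a hedged sketch of such a build - option and package names may differ by version and distro:)
# hedged sketch - build whisper.cpp with CLBlast (OpenCL) support and run the benchmark
sudo apt-get install -y libclblast-dev opencl-headers ocl-icd-opencl-dev
cmake -B build -DWHISPER_CLBLAST=ON
cmake --build build -j
./build/bin/bench -m models/ggml-tiny.en.bin -t 4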
On Armbian I used
I'll test that again on the Ubuntu system 👍
Looks pretty fancy 😎
Yeah, I got one of the 'armor' cases and it's a strange implementation, as the fan sits flat on the metal base and has no space to push air through. The above is prob the most OP cooler you can get for the Opi5, and after all the additions I prob spent about the same. But as said, a 30/40mm stick with a fan will suffice; the SoC is capable of a 12 watt TDP if I remember correctly.
@StuartIanNaylor Hi, have you tried running whisper.cpp with the GPU on the rk3588? I'm having a lot of trouble trying to cross-compile CLBlast for my rk3588 dev board, so I'm wondering how much of a speed boost GPU acceleration with CLBlast can bring and whether it's worth doing.
Yeah, it's not all roses, I get
I just ignore it and run the CLI again and it works, even though it's slow. I think there are problems with the driver and the Mali G610: when you run some of the CLBlast tuners you get errors (I forget which), but they are located at /usr/bin/clblast_tuner_copy_fast and so on, each being a separately named binary rather than taking a param. The methods that are needed do work, though. Out of curiosity I tried something non-AMD, as I don't have a Vega board; an HD630 installs the same way and seems to behave similarly. Things run, but they are slow, and it's a question of how much is actually running on the GPU - I wonder if it's trying and just returning a fail.
Also, spuriously, things go awry
It's prob still a bit fresh for the Mali G610, but I did successfully run the ArmNN OpenCL-based Wav2Vec example they provide. It's very different to Whisper though, as nearly all the load is on the GPU, and it looks fantastic even if slightly slower than the CPU. I think the Intel is similar, and far less fresh than the Mali G610, which really is still waiting for kernel changes. To be honest I have a gut feeling OpenCL might be similar to OpenGL: not granular enough, so it tends to limit performance. Maybe what Nvidia says is the way to go, i.e. create open-source versions of https://developer.nvidia.com/blog/machine-learning-acceleration-vulkan-cooperative-matrices/ https://gist.github.com/itzmeanjan/84613bc7595372c5e6b6c22481d42f9a https://github.com/bartwojcik/vulkano-matmul https://www.khronos.org/assets/uploads/developers/presentations/Cooperative_Matrix_May22.pdf
FYI: You can run Whisper models with onnxruntime in C++ using sherpa-onnx on Raspberry Pi. You can find the documentation at
The following is the RTF running
is it possible to run this ggml model on raspberry pi hardware?