Core ML support #566
Conversation
Great work! I tested the coreml branch on a Mac Mini M2 (base $599 model). The performance gain seems to be more than x5 compared to 4-thread CPU (thanks to the much faster ANE on the M2; the 8-thread CPU on the base Mac Mini M2 is slower than 4 threads). Performance benchmarks for the Encoder with (top) and without (bottom) Core ML:
I compiled whisper.cpp with coreml support. Is there anything else I'm missing? 🤔
On the command line, you still have to specify the non-coreml model:
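A minimal sketch of what this looks like (paths and model name are just examples; the assumption is that whisper.cpp derives the Core ML encoder path from the ggml model path):

```bash
# -m still points at the regular ggml model, not the .mlmodelc folder
./main -m models/ggml-base.en.bin -f samples/jfk.wav
# The Core ML encoder is picked up automatically if it sits next to it:
#   models/ggml-base.en-encoder.mlmodelc
```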
@ggerganov Thank you! I was very confused why it wasn't working even though I did everything right
This is great. Excited to see how this feature develops. Leveraging the ANE would be huge, even more so if the decoder could also be ported to it.
Just saw this was announced; is it useful? https://github.com/apple/ml-ane-transformers
Does this mean we have to bundle both files with the app? Asking since the file size gets fairly large having to include them all.
Hey, thanks for this awesome project! I am trying to run the whisper.objc example with CoreML but running into some issues. Has someone successfully done this and could guide me on how to set it up?
The solution is to produce an encoder-only CoreML model in one file and a decoder-only standard model in another file. This is not very difficult to achieve, but supporting so many model files might get too difficult for me. So probably I will rely on someone helping out and demonstrating how this can be done, either as an example in this repo or in a fork.
This is almost ready to merge. I am hoping to do it tomorrow. The most important part that currently needs testing is the creation of the CoreML models, following the instructions here. If you give this a try, please let us know the results and whether you encountered any issues.
1.4 GB for medium sounds fine for users, but you're saying there are other limitations against it?
@aehlke The scripts for generating Core ML models support all sizes, but on my M1 Pro it takes a very long time (i.e. more than half an hour) to generate the bigger models. In any case, you can follow the instructions in this PR and see how it works on your device.
Great work!
When running this script: ./models/generate-coreml-model.sh base.en, I got this error:
Is it me, or is the link to the CoreML models missing on Hugging Face? Btw, @ggerganov, if you need help converting the models, I'd be glad to contribute. It seems to me that it only needs to be done once. :)
For now, you should generate the Core ML models locally following the instructions.
In that regard, I'd like to ask for help, since I can't seem to succeed with it:
100%|█████████████████████████████████████| 72.1M/72.1M [00:05<00:00, 14.3MiB/s]
ModelDimensions(n_mels=80, n_audio_ctx=1500, n_audio_state=384, n_audio_head=6, n_audio_layer=4, n_vocab=51865, n_text_ctx=448, n_text_state=384, n_text_head=6, n_text_layer=4)
/opt/homebrew/lib/python3.10/site-packages/whisper/model.py:166: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
assert x.shape[1:] == self.positional_embedding.shape, "incorrect audio shape"
/opt/homebrew/lib/python3.10/site-packages/whisper/model.py:97: UserWarning: __floordiv__ is deprecated, and its behavior will change in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values. To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor').
scale = (n_state // self.n_head) ** -0.25
Converting PyTorch Frontend ==> MIL Ops: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████▋| 367/368 [00:00<00:00, 6681.50 ops/s]
Running MIL frontend_pytorch pipeline: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:00<00:00, 1047.63 passes/s]
Running MIL default pipeline: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 57/57 [00:00<00:00, 147.77 passes/s]
Running MIL backend_mlprogram pipeline: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 2599.51 passes/s]
Traceback (most recent call last):
File "/Users/luke/dev/whisper.cpp/./models/convert-whisper-to-coreml.py", line 331, in <module>
decoder = convert_decoder(hparams, decoder, quantize=args.quantize)
File "/Users/luke/dev/whisper.cpp/./models/convert-whisper-to-coreml.py", line 283, in convert_decoder
traced_model = torch.jit.trace(model, (token_data, audio_data))
File "/opt/homebrew/lib/python3.10/site-packages/torch/jit/_trace.py", line 741, in trace
return trace_module(
File "/opt/homebrew/lib/python3.10/site-packages/torch/jit/_trace.py", line 958, in trace_module
module._c._create_method_from_trace(
File "/opt/homebrew/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/homebrew/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1098, in _slow_forward
result = self.forward(*input, **kwargs)
File "/opt/homebrew/lib/python3.10/site-packages/whisper/model.py", line 211, in forward
x = block(x, xa, mask=self.mask, kv_cache=kv_cache)
File "/opt/homebrew/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/homebrew/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1098, in _slow_forward
result = self.forward(*input, **kwargs)
File "/opt/homebrew/lib/python3.10/site-packages/whisper/model.py", line 138, in forward
x = x + self.cross_attn(self.cross_attn_ln(x), xa, kv_cache=kv_cache)[0]
File "/opt/homebrew/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/homebrew/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1098, in _slow_forward
result = self.forward(*input, **kwargs)
File "/opt/homebrew/lib/python3.10/site-packages/whisper/model.py", line 83, in forward
k = self.key(x if xa is None else xa)
File "/opt/homebrew/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/homebrew/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1098, in _slow_forward
result = self.forward(*input, **kwargs)
File "/opt/homebrew/lib/python3.10/site-packages/whisper/model.py", line 37, in forward
return F.linear(
RuntimeError: mat1 and mat2 shapes cannot be multiplied (384x1500 and 384x384)
@vadi2 Transcription speed is actually slower on macOS 14 (latest beta), but the CoreML optimisation for "small" takes 35 secs on 14 vs 17 minutes on macOS 13. I assume there is some bugfix, because "tiny" and "base" have similar CoreML optimisation times. The tables below show time in seconds for CPU and CoreML transcription on a 5 sec and a 30 sec sample, and CoreML caching times.
Table for device 'arm64' with OS Version 'Version 13.4.1 (c) (Build 22F770820d)':
Table for device 'arm64' with OS Version 'Version 14.0 (Build 23A5312d)':
Very interesting, thanks so much for sharing. The improved cache times are great; I wonder what's up with the slower processing times, though. Presumably those are related. But CPU-only is also slower than the CoreML encoding speed, so it can't just be CoreML changes. 🤔
These are stuck forever on an M1 64GB. I waited for 12 hours but still got no more messages. macOS 13.5 (22G74).
I finally managed to get it to work on the "beta".
Does anyone know what happened during those 11 hours and why it runs faster now? If the model got "compiled" or whatever, can't I just upload it for other people to use? I don't see any changes to the model files since I downloaded them 🤔
Can you upload it, please?
@cust0mphase Upload what? The CoreML model link is in my comment above, and as far as I can see, the files have not changed since I downloaded them.
I confirm that it works well with the model from Hugging Face (of course, I use large). The performance boost in Ventura 13.5 (22G74) is not that big, maybe 20%, but it's definitely faster. Can't wait for the new OS to come out.
FYI: comparing CPU+GPU vs. CPU+ANE:
I don't have precise before/after numbers, but CoreML Whisper sure seems a lot faster on Sonoma. Not just the "first run on a device may take a while …" step, which is almost instant now, but the actual encoding seems better? Maybe this is something improved in the latest versions of whisper.cpp itself, but it runs at close to 100% GPU usage now, which I don't remember if that was always the case. ~5x faster than realtime with the
Here is an update after Sonoma.
* coreml : use Core ML encoder inference
* coreml : simplify whisper_encode + log messages
* whisper : resolve rebase conflicts
* coreml : add scripts for CoreML model generation
* bench-all : recognize COREML flag
I was happy with the regular setup on my M1 (Sonoma), so I gave the CoreML setup a try, expecting it to be even better. However, I am very surprised to see that it completely degraded performance, at least for the models I am using (medium.en and large-v3). I'll revert to the regular setup, but I am very curious as to why using the ANE degraded performance so much; it is counterintuitive. I don't spot any errors, the CoreML models do seem to load, and I can see the ANE kick in using powermetrics. Disclaimer, in case it makes a difference: I used the prebuilt models on HF.
Did you try the same model twice? There is still a considerable delay for me the first time the CoreML models run, but it is a little faster than the standard build for me after that. However, I see very little ANE usage when I compile for CoreML; it's almost all GPU for me.
Several times, yes. The first run took easily 15 minutes to prepare, as warned. But what I described applies to subsequent runs, when everything was ready; it really underperforms, to the point of being unusable. Now I am back to the regular setup (Metal) and everything is fine once again; I can easily use the large-v3 model with
I'm not 100% sure, but after this PR it might be worth trying to convert the models to CoreML yourself, depending on when/how the Hugging Face CoreML models were made.
Just converted it myself; it took about 10 mins on an M2 Pro with 16 GB RAM.
My model does not start; it just says that it cannot find the file, although the model is compiled and is in the folder.
I have posted this in the main issues section too (I apologise for the double post), but maybe people here will be able to reply, since this is a specific CoreML thread. My problem is about using CoreML in iOS apps: I have noticed that the size of the app jumps dramatically every time CoreML is fired up. Downloading the app container in Xcode doesn't seem to show why "documents and data" increases by many MB and sometimes GB with repeated usage. So I was wondering if anyone here has used the Objective-C sample or similar: can they check the app size after running (Settings -> General -> Storage)? Where could the app be saving CoreML files? What could be going on? This only happens with CoreML, not Metal. Does this issue mean it can't be deployed in production-ready apps? Please, someone help!
Hey @sahmed53! Have you solved it?
When you load a CoreML model for the first time, it does an optimisation and saves that optimised model somewhere. I could never figure out where – it is something internal and hidden. I suspect that's what you're seeing. Sometimes the OS will delete those files (I assume when storage is low), and then when you load the CoreML model again it will do the optimisation step again. This can take a very long time on some devices. This is why I've stopped using CoreML for my app and only use the Metal version.
@bjnortier Does WhisperKit suffer from the same issue? It became quite popular and relies on CoreML rather than Metal.
@aehlke Yes, if you use the WhisperKit macOS TestFlight app you will see "Specializing [...] for your device... This can take several minutes on first load"
Running Whisper inference on Apple Neural Engine (ANE) via Core ML
This PR extends whisper.cpp to run the Whisper Encoder on the ANE through Core ML inference. The performance gain is more than x3 compared to 8-thread CPU for the tiny, base and small models.
Here are initial performance benchmarks for the Encoder on M1 Pro with (top) and without (bottom) Core ML:
This PR adds a helper script models/generate-coreml-model.sh that can be used to easily generate a Core ML Encoder model yourself. For now, I don't plan on hosting the Core ML models as there is some chance that the implementation will change in the future. Therefore, it is recommended that everyone simply generate them locally with that script. See the instructions below.
There are a couple of drawbacks:
- The first run on a device is slow, since the ANE service compiles the Core ML model to a device-specific format; all follow-up runs are fast
- medium and large models take a long time to be converted to Core ML (tens of minutes) and require a lot of RAM. The first run on a device is also very slow for them, so I'm not sure if these are viable for production use
Acknowledgements
Huge thanks to @wangchou for the initial demonstration of how to use Core ML in whisper.cpp (#548)
Thanks to @RobertRiachi for optimizing for ANE execution and improving the model export process
Thanks to everyone else who participated in #548 and helped with insights, testing and ideas
Usage
Install dependencies:
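A minimal sketch of this step, assuming the Python packages used by the conversion script (openai-whisper, coremltools, ane_transformers); exact package names and versions may differ:

```bash
# Python dependencies assumed for models/convert-whisper-to-coreml.py
pip install openai-whisper
pip install coremltools
pip install ane_transformers
```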
Generate a Core ML model. For example, to generate a base.en model, use the command sketched below. This will generate the folder models/ggml-base.en-encoder.mlmodelc
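A sketch of the generation step, using the helper script named above (base.en is just the example model):

```bash
# Produces models/ggml-base.en-encoder.mlmodelc
./models/generate-coreml-model.sh base.en
```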
Build whisper.cpp with Core ML support, then run the examples as usual. Both steps are sketched below.
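A hedged sketch of both steps; the WHISPER_COREML make flag and the sample paths follow the project's README conventions and should be treated as assumptions here:

```bash
# Build with Core ML support enabled (Makefile build)
make clean
WHISPER_COREML=1 make -j

# Run an example as usual, passing the regular ggml model
./main -m models/ggml-base.en.bin -f samples/jfk.wav
```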
The first run on a device is slow, since the ANE service compiles the Core ML model to some device-specific format.
Next runs are faster.
TODO
- Answer: Yes, but it is slow
- Convert medium and large models to Core ML format and upload to HF: need a Mac Silicon with 64GB RAM to do the conversion from PyTorch -> Core ML
- Does not seem viable - too slow
- ggml + coreml model file: we currently load both the full ggml model (encoder + decoder) and the coreml encoder - not optimal. Will be done in the future, hopefully via community contributions
- Currently we support only loading from a folder on the disk. Low-prio, hoping for contributions
- Does not look possible. Any CoreML experts?
- Not needed - the Encoder compute buffer is less than 20MB even for the large model
- Does not look possible. Any CoreML experts?
- The medium model takes more than 30 minutes to convert on the first run. Is there a work-around? I think no
- Looks like not worth it
Future work
- An optimized version of the Encoder currently does not work correctly in whisper.cpp and the transcription gets corrupted. The optimized version should be about 1.5x faster than the original one
- Embed the Core ML model inside the ggml models. This will avoid having to store the Encoder data 2 times on disk / memory. Currently, it is stored one time in the ggml model and another time in the Core ML model. This will reduce both disk and memory usage