Core ML support #566
Conversation
Great work! I tested the coreml branch on a Mac Mini M2 (base $599 model). The performance gain seems to be more than x5 compared to 4-thread CPU (thanks to the much faster ANE on the M2; the 8-thread CPU on the base Mac Mini M2 is slower than 4 threads). Performance benchmarks for the Encoder with (top) and without (bottom) Core ML:
I compiled whisper.cpp with coreml support. Is there anything else I'm missing? 🤔
On the command line, you still have to specify the non-coreml model:
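A minimal sketch of what this looks like (paths and model name are just examples; the assumption is that whisper.cpp derives the Core ML encoder path from the ggml model path):

```bash
# -m still points at the regular ggml model, not the .mlmodelc folder
./main -m models/ggml-base.en.bin -f samples/jfk.wav
# The Core ML encoder is picked up automatically if it sits next to it:
#   models/ggml-base.en-encoder.mlmodelc
```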
@ggerganov Thank you! I was very confused why it wasn't working even though I did everything right
This is great. Excited to see how this feature develops. Leveraging the ANE would be huge, even more so if the decoder could also be ported to it.
Just saw this was announced; is it useful? https://github.com/apple/ml-ane-transformers
Does this mean we have to bundle both files with the app? Asking since the file size gets fairly large having to include them all.
Hey, thanks for this awesome project! I am trying to run the whisper.objc example with CoreML but running into some issues. Has someone successfully done this and could guide me on how to set it up?
The solution is to produce an encoder-only CoreML model in one file and a decoder-only standard model in another file. This is not very difficult to achieve, but supporting so many model files might get too difficult for me. So probably I will rely on someone helping out and demonstrating how this can be done, either as an example in this repo or in a fork.
This is almost ready to merge. I am hoping to do it tomorrow. The most important part that currently needs testing is the creation of the CoreML models, following the instructions here. If you give this a try, please let us know the results and whether you encountered any issues.
1.4 GB for medium sounds fine for users, but you're saying there are other limitations against it?
@aehlke The scripts for generating Core ML models support all sizes, but on my M1 Pro it takes a very long time (i.e. more than half an hour) to generate the bigger models. In any case, you can follow the instructions in this PR and see how it works on your device.
Great work!
When running this script: ./models/generate-coreml-model.sh base.en, I got this error:
Is it me, or is the link to the CoreML models missing on Hugging Face? Btw, @ggerganov, if you need help converting the models, I'd be glad to contribute. It seems to me that it only needs to be done once. :)
For now, you should generate the Core ML models locally following the instructions.
In that regard, I'd like to ask for help, since I can't seem to succeed with it:
100%|█████████████████████████████████████| 72.1M/72.1M [00:05<00:00, 14.3MiB/s]
ModelDimensions(n_mels=80, n_audio_ctx=1500, n_audio_state=384, n_audio_head=6, n_audio_layer=4, n_vocab=51865, n_text_ctx=448, n_text_state=384, n_text_head=6, n_text_layer=4)
/opt/homebrew/lib/python3.10/site-packages/whisper/model.py:166: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
assert x.shape[1:] == self.positional_embedding.shape, "incorrect audio shape"
/opt/homebrew/lib/python3.10/site-packages/whisper/model.py:97: UserWarning: __floordiv__ is deprecated, and its behavior will change in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values. To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor').
scale = (n_state // self.n_head) ** -0.25
Converting PyTorch Frontend ==> MIL Ops: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████▋| 367/368 [00:00<00:00, 6681.50 ops/s]
Running MIL frontend_pytorch pipeline: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:00<00:00, 1047.63 passes/s]
Running MIL default pipeline: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 57/57 [00:00<00:00, 147.77 passes/s]
Running MIL backend_mlprogram pipeline: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 2599.51 passes/s]
Traceback (most recent call last):
File "/Users/luke/dev/whisper.cpp/./models/convert-whisper-to-coreml.py", line 331, in <module>
decoder = convert_decoder(hparams, decoder, quantize=args.quantize)
File "/Users/luke/dev/whisper.cpp/./models/convert-whisper-to-coreml.py", line 283, in convert_decoder
traced_model = torch.jit.trace(model, (token_data, audio_data))
File "/opt/homebrew/lib/python3.10/site-packages/torch/jit/_trace.py", line 741, in trace
return trace_module(
File "/opt/homebrew/lib/python3.10/site-packages/torch/jit/_trace.py", line 958, in trace_module
module._c._create_method_from_trace(
File "/opt/homebrew/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/homebrew/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1098, in _slow_forward
result = self.forward(*input, **kwargs)
File "/opt/homebrew/lib/python3.10/site-packages/whisper/model.py", line 211, in forward
x = block(x, xa, mask=self.mask, kv_cache=kv_cache)
File "/opt/homebrew/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/homebrew/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1098, in _slow_forward
result = self.forward(*input, **kwargs)
File "/opt/homebrew/lib/python3.10/site-packages/whisper/model.py", line 138, in forward
x = x + self.cross_attn(self.cross_attn_ln(x), xa, kv_cache=kv_cache)[0]
File "/opt/homebrew/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/homebrew/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1098, in _slow_forward
result = self.forward(*input, **kwargs)
File "/opt/homebrew/lib/python3.10/site-packages/whisper/model.py", line 83, in forward
k = self.key(x if xa is None else xa)
File "/opt/homebrew/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/homebrew/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1098, in _slow_forward
result = self.forward(*input, **kwargs)
File "/opt/homebrew/lib/python3.10/site-packages/whisper/model.py", line 37, in forward
return F.linear(
RuntimeError: mat1 and mat2 shapes cannot be multiplied (384x1500 and 384x384)
@vadi2 Transcription speed is actually slower on macOS 14 (latest beta), but the CoreML optimisation for "small" takes 35 secs on 14 vs 17 minutes on macOS 13. I assume there is some bugfix, because "tiny" and "base" have similar CoreML optimisation times. The tables below show time in seconds for CPU and CoreML transcription on a 5 sec and a 30 sec sample, and CoreML caching times.
Table for device 'arm64' with OS Version 'Version 13.4.1 (c) (Build 22F770820d)':
Table for device 'arm64' with OS Version 'Version 14.0 (Build 23A5312d)':
Very interesting, thanks so much for sharing. The improved cache times are great; I wonder what's up with the slower processing times, though. Presumably those are related. But CPU-only is also slower than the CoreML encoding speed, so it can't just be CoreML changes. 🤔
These are stuck forever on an M1 64GB. I waited for 12 hours but still got no more messages. macOS 13.5 (22G74).
I finally managed to get it to work on the "beta".
Does anyone know what happened during those 11 hours and why it runs faster now? If the model got "compiled" or whatever, can't I just upload it for other people to use? I don't see any changes to the model files since I downloaded them 🤔
Can you upload it, please?
@cust0mphase Upload what? The CoreML model link is in my comment above, and as far as I can see, the files have not changed since I downloaded them.
I confirm that it works well with the model from Hugging Face (of course, I use large). The performance boost in Ventura 13.5 (22G74) is not that big, maybe 20%, but it's definitely faster. Can't wait for the new OS to come out.
FYI: comparing CPU+GPU vs. CPU+ANE:
I don't have precise before/after numbers, but CoreML Whisper sure seems a lot faster on Sonoma. Not just the "first run on a device may take a while …" step, which is almost instant now, but the actual encoding seems better? Maybe this is something improved in the latest versions of whisper.cpp itself, but it runs at close to 100% GPU usage now, which I don't remember if that was always the case. ~5x faster than realtime with the
Here is an update after Sonoma.
* coreml : use Core ML encoder inference
* coreml : simplify whisper_encode + log messages
* whisper : resolve rebase conflicts
* coreml : add scripts for CoreML model generation
* bench-all : recognize COREML flag
I was happy with the regular setup on my M1 (Sonoma), so I gave the CoreML setup a try, expecting it to be even better. However, I am very surprised to see that it completely degraded performance, at least for the models I am using (medium.en and large-v3). I'll revert to the regular setup, but I am very curious as to why using the ANE degraded performance so much; it is counterintuitive. I don't spot any errors, the CoreML models do seem to load, and I can see the ANE kick in using powermetrics. Disclaimer, in case it makes a difference: I used the prebuilt models on HF.
Did you try the same model twice? There is still a considerable delay for me the first time the CoreML models run, but it is a little faster than the standard build for me after that. However, I see very little ANE usage when I compile for CoreML; it's almost all GPU for me.
Several times, yes. The first run took easily 15 minutes to prepare, as warned. But what I described applies to subsequent runs, when everything was ready; it really underperforms, to the point of being unusable. Now I am back to the regular setup (Metal) and everything is fine once again; I can easily use the large-v3 model with
I'm not 100% sure, but after this PR it might be worth trying to convert the models to CoreML yourself, depending on when/how the Hugging Face CoreML models were made.
Just converted it myself; it took about 10 mins on an M2 Pro with 16 GB RAM.
My model does not start; it just says that it cannot find the file, although the model is compiled and is in the folder.
I have posted this in the main issues section too (I apologise for the double post), but maybe people here will be able to reply, since this is a specific CoreML thread. My problem is about using CoreML in iOS apps: I have noticed that the size of the app jumps dramatically every time CoreML is fired up. Downloading the app container in Xcode doesn't seem to show why "documents and data" increases by many MB and sometimes GB with repeated usage. So I was wondering if anyone here has used the Objective-C sample or similar: can they check the app size after running (Settings -> General -> Storage)? Where could the app be saving CoreML files? What could be going on? This only happens with CoreML, not Metal. Does this issue mean it can't be deployed in production-ready apps? Please, someone help!
Hey @sahmed53! Have you solved it?
When you load a CoreML model for the first time, it does an optimisation and saves that optimised model somewhere. I could never figure out where – it is something internal and hidden. I suspect that's what you're seeing. Sometimes the OS will delete those files (I assume when storage is low), and then when you load the CoreML model again it will do the optimisation step again. This can take a very long time on some devices. This is why I've stopped using CoreML for my app and only use the Metal version.
@bjnortier Does WhisperKit suffer from the same issue? It became quite popular and relies on CoreML rather than Metal.
@aehlke Yes, if you use the WhisperKit macOS TestFlight app you will see "Specializing [...] for your device... This can take several minutes on first load"
Running Whisper inference on Apple Neural Engine (ANE) via Core ML
This PR extends whisper.cpp to run the Whisper Encoder on the ANE through Core ML inference. The performance gain is more than x3 compared to 8-thread CPU for the tiny, base and small models.
Here are initial performance benchmarks for the Encoder on M1 Pro with (top) and without (bottom) Core ML:
This PR adds a helper script models/generate-coreml-model.sh that can be used to easily generate a Core ML Encoder model yourself. For now, I don't plan on hosting the Core ML models as there is some chance that the implementation will change in the future. Therefore, it is recommended that everyone simply generate them locally with that script. See the instructions below.
There are a couple of drawbacks:
- The first run on a device is slow, since the ANE service compiles the Core ML model to a device-specific format; all follow-up runs are fast
- medium and large models take a long time to be converted to Core ML (tens of minutes) and require a lot of RAM. The first run on a device is also very slow for them, so I'm not sure if these are viable for production use
Acknowledgements
Huge thanks to @wangchou for the initial demonstration of how to use Core ML in whisper.cpp (#548)
Thanks to @RobertRiachi for optimizing for ANE execution and improving the model export process
Thanks to everyone else who participated in #548 and helped with insights, testing and ideas
Usage
Install dependencies:
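A minimal sketch of this step, assuming the Python packages used by the conversion script (openai-whisper, coremltools, ane_transformers); exact package names and versions may differ:

```bash
# Python dependencies assumed for models/convert-whisper-to-coreml.py
pip install openai-whisper
pip install coremltools
pip install ane_transformers
```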
Generate a Core ML model. For example, to generate a base.en model, use the command sketched below. This will generate the folder models/ggml-base.en-encoder.mlmodelc
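A sketch of the generation step, using the helper script named above (base.en is just the example model):

```bash
# Produces models/ggml-base.en-encoder.mlmodelc
./models/generate-coreml-model.sh base.en
```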
Build whisper.cpp with Core ML support, then run the examples as usual. Both steps are sketched below.
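A hedged sketch of both steps; the WHISPER_COREML make flag and the sample paths follow the project's README conventions and should be treated as assumptions here:

```bash
# Build with Core ML support enabled (Makefile build)
make clean
WHISPER_COREML=1 make -j

# Run an example as usual, passing the regular ggml model
./main -m models/ggml-base.en.bin -f samples/jfk.wav
```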
The first run on a device is slow, since the ANE service compiles the Core ML model to some device-specific format.
Next runs are faster.
TODO
- Answer: Yes, but it is slow
- Convert medium and large models to Core ML format and upload to HF: need a Mac Silicon with 64GB RAM to do the conversion from PyTorch -> Core ML
- Does not seem viable - too slow
- ggml + coreml model file: we currently load both the full ggml model (encoder + decoder) and the coreml encoder - not optimal. Will be done in the future, hopefully via community contributions
- Currently we support only loading from a folder on the disk. Low-prio, hoping for contributions
- Does not look possible. Any CoreML experts?
- Not needed - the Encoder compute buffer is less than 20MB even for the large model
- Does not look possible. Any CoreML experts?
- The medium model takes more than 30 minutes to convert on the first run. Is there a work-around? I think no
- Looks like not worth it
Future work
- An optimized version of the Encoder currently does not work correctly in whisper.cpp and the transcription gets corrupted. The optimized version should be about 1.5x faster than the original one
- Embed the Core ML model inside the ggml models. This will avoid having to store the Encoder data 2 times on disk / memory. Currently, it is stored one time in the ggml model and another time in the Core ML model. This will reduce both disk and memory usage