CoreML Model Loading and Caching ("First Time" Init) every time? #67
I don't have any experience with macOS, so I'm not sure about this at all. Hopefully someone with more experience can shed some light on it.
I believe this is a processor-dependent implementation issue on Apple's side, not a programming-language API issue. If you look at the execution output in your original post, it shows it is loading the Core ML model.

On my Mac Mini M2 Pro, the medium model behaves as expected: it takes a minute or so the first time, then just a few seconds afterwards. Future runs usually load quickly even if I make changes to the calling process. On my MacBook Air M1, on the other hand, the small model works as expected, but the medium model usually hangs indefinitely after the "first run on a device may take a while ..." message. I usually don't wait long enough to find out whether it will eventually load. (See also the reports in ggerganov/whisper.cpp#773 around force-quitting ANECompilerService.)

My understanding is that on initial load, ANECompilerService takes the compiled CoreML model, specializes it for the specific processor being used, and caches the result for future use. I am guessing that for each Apple Silicon processor there is a maximum model size above which this just doesn't work reliably. I don't know whether that is inherent to the compilation process or just a bug, but either way it is up to Apple to fix it or document the limits.
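If anyone wants to confirm that ANECompilerService is the process doing the work (or hanging), standard macOS tools are enough; this is generic process inspection, nothing whisper-rs specific:

```sh
# Watch for the Apple Neural Engine compiler while the model loads;
# it typically pegs a core during first-time specialization.
ps aux | grep -i "[A]NECompilerService"

# Last-resort workaround discussed in the whisper.cpp thread: force-quit
# the compiler service (at your own risk) and retry the load.
sudo killall ANECompilerService
```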
Thanks for sharing your thoughts in such a detailed manner! It's the first time I've come across a reasonable explanation for this. It's quite frustrating that we have to poke around in the dark and infer things due to a complete lack of documentation on Apple's side.

Could you also say how much memory your devices have? I would also be interested in how well your M2 Pro performs in the CoreML benchmark results. I'm considering upgrading to an M2 Pro/M2 Max if the performance is good enough, since both llama.cpp and whisper.cpp seem to support CoreML now, but I'm looking for solid numbers. The above link only has numbers for the M2.
When building with `-F coreml` and running the `audio_transcription` example, you see the "first run on a device may take a while ..." message.
And then one needs to wait an agonizingly long time (3.5 h for the medium model on my M1 Mac Mini) for the first-time load.
After this, it's noticeably faster than CPU-only execution.
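For reference, the invocation is along these lines (assuming a checkout of this repo; the example name matches the one in `examples/`):

```sh
# Build with the CoreML backend enabled and run the bundled example.
cargo run --release -F coreml --example audio_transcription
```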
However, the catch is that this caching turns out to be per-process (which I did not expect) and per-model (which I did expect).
So if I make some changes to the code and recompile, the "first-time run" again takes a while, as the sketch below illustrates.
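To make concrete where the time goes, here is a minimal sketch of what I mean by a "first-time run" (a hedged example: it assumes the `WhisperContext::new` constructor from whisper-rs and whisper.cpp's `*-encoder.mlmodelc` naming convention; adjust paths to your setup):

```rust
use std::time::Instant;
use whisper_rs::WhisperContext;

fn main() {
    // With the `coreml` feature, whisper.cpp looks for a compiled CoreML
    // encoder (e.g. ggml-medium-encoder.mlmodelc) next to the ggml model
    // and hands it to CoreML; ANECompilerService specializes it on load.
    let started = Instant::now();
    let _ctx = WhisperContext::new("models/ggml-medium.bin")
        .expect("failed to load model");

    // First run in a *new process*: minutes to hours while the ANE plan
    // is built. Re-running the same binary is fast, but recompiling the
    // calling code makes this line slow again -- the behavior above.
    println!("context init took {:?}", started.elapsed());
}
```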
To me this seems to be a consequence of not passing some kind of cache hint to Apple's CoreML/ANE compiler service. Surely there must be some way to make it remember the cached model and avoid recomputing it?
Looking at Apple's official documentation at https://github.com/apple/ml-stable-diffusion#faq, I see the following:
"and using the Swift library reduces this to just a few seconds", it seems absurd for the CoreML pipeline caching feature to be dependent on what programming language you call this from. Can I not use this within Rust then?