models : Fix `n_mel` mismatch in convert-whisper-to-coreml.py #1457

bobqianic · 2023-11-08T17:39:32Z

Because whisper-large-v3 uses n_mel=128, we cannot hardcode the input_shape = (1, 80, 3000) into the script.
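For illustration, here is a minimal sketch of deriving the shape from the model dimensions instead of hardcoding it. `Dims` and `encoder_input_shape` are hypothetical names used only for this example (a stand-in for the `ModelDimensions` object the script prints), and this is not necessarily the exact change made in this PR.

```python
from dataclasses import dataclass

N_FRAMES = 3000  # mel frames per 30 s window, as in the original hardcoded shape


@dataclass
class Dims:
    """Hypothetical stand-in for whisper's ModelDimensions (only the field used here)."""
    n_mels: int


def encoder_input_shape(dims: Dims) -> tuple:
    # Take the mel-bin count from the loaded model rather than assuming 80.
    return (1, dims.n_mels, N_FRAMES)


print(encoder_input_shape(Dims(n_mels=80)))   # base, small, medium, large-v2
print(encoder_input_shape(Dims(n_mels=128)))  # large-v3
```

With this approach the same script covers both the 80-bin models and large-v3 without a per-model branch.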

piotr-sikora-v · 2023-11-08T18:43:44Z

After that change I get some errors in other files, and the sample transcription gives a strange result.

converting:

# ./models/generate-coreml-model.sh large
ModelDimensions(n_mels=128, n_audio_ctx=1500, n_audio_state=1280, n_audio_head=20, n_audio_layer=32, n_vocab=51866, n_text_ctx=448, n_text_state=1280, n_text_head=20, n_text_layer=32)
/opt/homebrew/lib/python3.11/site-packages/whisper/model.py:166: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  assert x.shape[1:] == self.positional_embedding.shape, "incorrect audio shape"
Converting PyTorch Frontend ==> MIL Ops: 100%|█████████▉| 2613/2614 [00:01<00:00, 2302.76 ops/s]
Running MIL frontend_pytorch pipeline: 100%|██████████| 5/5 [00:00<00:00, 34.79 passes/s]
Running MIL default pipeline: 100%|██████████| 66/66 [00:17<00:00,  3.86 passes/s]
Running MIL backend_mlprogram pipeline: 100%|██████████| 12/12 [00:00<00:00, 378.86 passes/s]
done converting
/xxxx/whisper.cpp/models/coreml-encoder-large.mlmodelc/coremldata.bin
models/coreml-encoder-large.mlmodelc -> models/ggml-large-encoder.mlmodelc

and test:


===============================================
Running large on all samples in ./samples ...
===============================================

----------------------------------------------
[+] Running large on samples/jfk.wav ... (run 'ffplay samples/jfk.wav' to listen)
----------------------------------------------

whisper_init_from_file_with_params_no_state: loading model from 'models/ggml-large.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51866
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 1280
whisper_model_load: n_audio_head  = 20
whisper_model_load: n_audio_layer = 32
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 1280
whisper_model_load: n_text_head   = 20
whisper_model_load: n_text_layer  = 32
whisper_model_load: n_mels        = 128
whisper_model_load: ftype         = 1
whisper_model_load: qntvr         = 0
whisper_model_load: type          = 5 (large v3)
whisper_model_load: adding 1609 extra tokens
whisper_model_load: n_langs       = 100
whisper_model_load: model ctx     = 2951.63 MB
whisper_model_load: model size    = 2951.01 MB
whisper_init_state: kv self size  =   70.00 MB
whisper_init_state: kv cross size =  234.38 MB
whisper_init_state: loading Core ML model from 'models/ggml-large-encoder.mlmodelc'
whisper_init_state: first run on a device may take a while ...
whisper_init_state: Core ML model loaded
whisper_init_state: compute buffer (conv)   =   10.35 MB
whisper_init_state: compute buffer (cross)  =    8.89 MB
whisper_init_state: compute buffer (decode) =   59.40 MB
whisper_init_state: Metal context initialized
whisper_init_state: max tensor size =   126.63 MB

system_info: n_threads = 4 / 12 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | METAL = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | COREML = 1 | OPENVINO = 0 | 

main: processing 'samples/jfk.wav' (176000 samples, 11.0 sec), 4 threads, 1 processors, lang = en, task = transcribe, timestamps = 1 ...


[00:00:00.000 --> 00:00:29.980]   Thank you.


whisper_print_timings:     load time =  1070.39 ms
whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:      mel time =     7.04 ms
whisper_print_timings:   sample time =     1.75 ms /     5 runs (    0.35 ms per run)
whisper_print_timings:   encode time =   166.97 ms /     1 runs (  166.97 ms per run)
whisper_print_timings:   decode time =    79.27 ms /     4 runs (   19.82 ms per run)
whisper_print_timings:   prompt time =    40.05 ms /     1 runs (   40.05 ms per run)
whisper_print_timings:    total time =  4032.09 ms

I also tried base to verify, and it works well. The conversion shows the same warning as above, but the test runs without problems:

convert:

# ./models/generate-coreml-model.sh base
ModelDimensions(n_mels=80, n_audio_ctx=1500, n_audio_state=512, n_audio_head=8, n_audio_layer=6, n_vocab=51865, n_text_ctx=448, n_text_state=512, n_text_head=8, n_text_layer=6)
/opt/homebrew/lib/python3.11/site-packages/whisper/model.py:166: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  assert x.shape[1:] == self.positional_embedding.shape, "incorrect audio shape"
Converting PyTorch Frontend ==> MIL Ops: 100%|█████████▊| 533/534 [00:00<00:00, 11082.02 ops/s]
Running MIL frontend_pytorch pipeline: 100%|██████████| 5/5 [00:00<00:00, 620.42 passes/s]
Running MIL default pipeline: 100%|██████████| 66/66 [00:00<00:00, 89.21 passes/s]
Running MIL backend_mlprogram pipeline: 100%|██████████| 12/12 [00:00<00:00, 1899.38 passes/s]
done converting
/xxxxx/whisper.cpp/models/coreml-encoder-base.mlmodelc/coremldata.bin
models/coreml-encoder-base.mlmodelc -> models/ggml-base-encoder.mlmodelc

and test:


----------------------------------------------
[+] Running base on samples/jfk.wav ... (run 'ffplay samples/jfk.wav' to listen)
----------------------------------------------

whisper_init_from_file_with_params_no_state: loading model from 'models/ggml-base.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51865
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 512
whisper_model_load: n_audio_head  = 8
whisper_model_load: n_audio_layer = 6
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 512
whisper_model_load: n_text_head   = 8
whisper_model_load: n_text_layer  = 6
whisper_model_load: n_mels        = 80
whisper_model_load: ftype         = 1
whisper_model_load: qntvr         = 0
whisper_model_load: type          = 2 (base)
whisper_model_load: adding 1608 extra tokens
whisper_model_load: n_langs       = 99
whisper_model_load: model ctx     =  140.66 MB
whisper_model_load: model size    =  140.54 MB
whisper_init_state: kv self size  =    5.25 MB
whisper_init_state: kv cross size =   17.58 MB
whisper_init_state: loading Core ML model from 'models/ggml-base-encoder.mlmodelc'
whisper_init_state: first run on a device may take a while ...
whisper_init_state: Core ML model loaded
whisper_init_state: compute buffer (conv)   =    5.41 MB
whisper_init_state: compute buffer (cross)  =    4.49 MB
whisper_init_state: compute buffer (decode) =   24.70 MB
whisper_init_state: Metal context initialized
whisper_init_state: max tensor size =    50.65 MB

system_info: n_threads = 4 / 12 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | METAL = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | COREML = 1 | OPENVINO = 0 | 

main: processing 'samples/jfk.wav' (176000 samples, 11.0 sec), 4 threads, 1 processors, lang = en, task = transcribe, timestamps = 1 ...


[00:00:00.000 --> 00:00:11.000]   And so my fellow Americans, ask not what your country can do for you, ask what you can do for your country.


whisper_print_timings:     load time =    62.11 ms
whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:      mel time =     6.13 ms
whisper_print_timings:   sample time =    10.64 ms /    27 runs (    0.39 ms per run)
whisper_print_timings:   encode time =   458.16 ms /     1 runs (  458.16 ms per run)
whisper_print_timings:   decode time =    54.93 ms /    26 runs (    2.11 ms per run)
whisper_print_timings:   prompt time =     4.92 ms /     1 runs (    4.92 ms per run)
whisper_print_timings:    total time =   755.12 ms

bobqianic · 2023-11-08T18:51:42Z

> After that change I get some errors in other files, and the sample transcription gives a strange result.

Yes, it looks like n_mel is also hardcoded in whisper-encoder.mm. We ought to address that too.

whisper.cpp/coreml/whisper-encoder.mm

Lines 49 to 60 in baeb733

    
    void whisper_coreml_encode(
            const whisper_coreml_context * ctx,
                                   float * mel,
                                   float * out) {
        MLMultiArray * inMultiArray = [
            [MLMultiArray alloc] initWithDataPointer: mel
                                               shape: @[@1, @80, @3000]
                                            dataType: MLMultiArrayDataTypeFloat32
                                             strides: @[@(240000), @(3000), @1]
                                         deallocator: nil
                                               error: nil
        ];

piotr-sikora-v · 2023-11-08T19:03:59Z

#1458 works for me.
Changing @80 to @128 also works (after recompiling).
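As a worked check of the numbers in the snippet above, the strides of a contiguous row-major array follow directly from its shape, so the hardcoded 240000 is simply 80 * 3000. The sketch below is plain Python arithmetic with a hypothetical helper name (`row_major_strides`); it is not code from whisper.cpp.

```python
def row_major_strides(shape):
    """Element strides for a contiguous row-major array of the given shape."""
    strides = []
    step = 1
    # Walk from the innermost axis outwards; each stride is the product
    # of all dimensions to its right.
    for dim in reversed(shape):
        strides.append(step)
        step *= dim
    return tuple(reversed(strides))


print(row_major_strides((1, 80, 3000)))   # (240000, 3000, 1)
print(row_major_strides((1, 128, 3000)))  # (384000, 3000, 1)
```

For n_mel = 128 the first stride would be 384000 rather than 240000. Because the leading dimension is 1, that stride is never multiplied by a nonzero index, which may be why changing only @80 appeared to be enough; deriving both the shape and the strides from n_mel would keep them consistent.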

Update convert-whisper-to-coreml.py

1ad2701

bobqianic linked an issue Nov 8, 2023 that may be closed by this pull request

Can't generate large-v3 Core ML model #1454

Closed

bobqianic closed this Nov 8, 2023

bobqianic deleted the coreml-fix branch November 8, 2023 20:07

bobqianic mentioned this pull request Nov 8, 2023

models : Fix n_mel mismatch in convert-whisper-to-openvino.py #1459

Merged
