Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

models : Fix n_mel mismatch in convert-whisper-to-coreml.py #1457

Closed
wants to merge 1 commit into from

Conversation

bobqianic
Copy link
Collaborator

Because whisper-large-v3 uses n_mel=128, we cannot hardcode the input_shape = (1, 80, 3000) into the script.

@bobqianic bobqianic linked an issue Nov 8, 2023 that may be closed by this pull request
@piotr-sikora-v
Copy link

after that I have some errors in other files, and also sample transcoding give strange result.

converting:

# ./models/generate-coreml-model.sh large
ModelDimensions(n_mels=128, n_audio_ctx=1500, n_audio_state=1280, n_audio_head=20, n_audio_layer=32, n_vocab=51866, n_text_ctx=448, n_text_state=1280, n_text_head=20, n_text_layer=32)
/opt/homebrew/lib/python3.11/site-packages/whisper/model.py:166: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  assert x.shape[1:] == self.positional_embedding.shape, "incorrect audio shape"
Converting PyTorch Frontend ==> MIL Ops: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▉| 2613/2614 [00:01<00:00, 2302.76 ops/s]
Running MIL frontend_pytorch pipeline: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:00<00:00, 34.79 passes/s]
Running MIL default pipeline: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 66/66 [00:17<00:00,  3.86 passes/s]
Running MIL backend_mlprogram pipeline: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 12/12 [00:00<00:00, 378.86 passes/s]
done converting
/xxxx/whisper.cpp/models/coreml-encoder-large.mlmodelc/coremldata.bin
models/coreml-encoder-large.mlmodelc -> models/ggml-large-encoder.mlmodelc

and test:


===============================================
Running large on all samples in ./samples ...
===============================================

----------------------------------------------
[+] Running large on samples/jfk.wav ... (run 'ffplay samples/jfk.wav' to listen)
----------------------------------------------

whisper_init_from_file_with_params_no_state: loading model from 'models/ggml-large.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51866
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 1280
whisper_model_load: n_audio_head  = 20
whisper_model_load: n_audio_layer = 32
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 1280
whisper_model_load: n_text_head   = 20
whisper_model_load: n_text_layer  = 32
whisper_model_load: n_mels        = 128
whisper_model_load: ftype         = 1
whisper_model_load: qntvr         = 0
whisper_model_load: type          = 5 (large v3)
whisper_model_load: adding 1609 extra tokens
whisper_model_load: n_langs       = 100
whisper_model_load: model ctx     = 2951.63 MB
whisper_model_load: model size    = 2951.01 MB
whisper_init_state: kv self size  =   70.00 MB
whisper_init_state: kv cross size =  234.38 MB
whisper_init_state: loading Core ML model from 'models/ggml-large-encoder.mlmodelc'
whisper_init_state: first run on a device may take a while ...
whisper_init_state: Core ML model loaded
whisper_init_state: compute buffer (conv)   =   10.35 MB
whisper_init_state: compute buffer (cross)  =    8.89 MB
whisper_init_state: compute buffer (decode) =   59.40 MB
whisper_init_state: Metal context initialized
whisper_init_state: max tensor size =   126.63 MB

system_info: n_threads = 4 / 12 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | METAL = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | COREML = 1 | OPENVINO = 0 | 

main: processing 'samples/jfk.wav' (176000 samples, 11.0 sec), 4 threads, 1 processors, lang = en, task = transcribe, timestamps = 1 ...


[00:00:00.000 --> 00:00:29.980]   Thank you.


whisper_print_timings:     load time =  1070.39 ms
whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:      mel time =     7.04 ms
whisper_print_timings:   sample time =     1.75 ms /     5 runs (    0.35 ms per run)
whisper_print_timings:   encode time =   166.97 ms /     1 runs (  166.97 ms per run)
whisper_print_timings:   decode time =    79.27 ms /     4 runs (   19.82 ms per run)
whisper_print_timings:   prompt time =    40.05 ms /     1 runs (   40.05 ms per run)
whisper_print_timings:    total time =  4032.09 ms

I also try base to verify, and it works good... convert show me same warning as above, but test work without problem:

convert:

# ./models/generate-coreml-model.sh base
ModelDimensions(n_mels=80, n_audio_ctx=1500, n_audio_state=512, n_audio_head=8, n_audio_layer=6, n_vocab=51865, n_text_ctx=448, n_text_state=512, n_text_head=8, n_text_layer=6)
/opt/homebrew/lib/python3.11/site-packages/whisper/model.py:166: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  assert x.shape[1:] == self.positional_embedding.shape, "incorrect audio shape"
Converting PyTorch Frontend ==> MIL Ops: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▊| 533/534 [00:00<00:00, 11082.02 ops/s]
Running MIL frontend_pytorch pipeline: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:00<00:00, 620.42 passes/s]
Running MIL default pipeline: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 66/66 [00:00<00:00, 89.21 passes/s]
Running MIL backend_mlprogram pipeline: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 12/12 [00:00<00:00, 1899.38 passes/s]
done converting
/xxxxx/whisper.cpp/models/coreml-encoder-base.mlmodelc/coremldata.bin
models/coreml-encoder-base.mlmodelc -> models/ggml-base-encoder.mlmodelc

and test:


----------------------------------------------
[+] Running base on samples/jfk.wav ... (run 'ffplay samples/jfk.wav' to listen)
----------------------------------------------

whisper_init_from_file_with_params_no_state: loading model from 'models/ggml-base.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51865
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 512
whisper_model_load: n_audio_head  = 8
whisper_model_load: n_audio_layer = 6
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 512
whisper_model_load: n_text_head   = 8
whisper_model_load: n_text_layer  = 6
whisper_model_load: n_mels        = 80
whisper_model_load: ftype         = 1
whisper_model_load: qntvr         = 0
whisper_model_load: type          = 2 (base)
whisper_model_load: adding 1608 extra tokens
whisper_model_load: n_langs       = 99
whisper_model_load: model ctx     =  140.66 MB
whisper_model_load: model size    =  140.54 MB
whisper_init_state: kv self size  =    5.25 MB
whisper_init_state: kv cross size =   17.58 MB
whisper_init_state: loading Core ML model from 'models/ggml-base-encoder.mlmodelc'
whisper_init_state: first run on a device may take a while ...
whisper_init_state: Core ML model loaded
whisper_init_state: compute buffer (conv)   =    5.41 MB
whisper_init_state: compute buffer (cross)  =    4.49 MB
whisper_init_state: compute buffer (decode) =   24.70 MB
whisper_init_state: Metal context initialized
whisper_init_state: max tensor size =    50.65 MB

system_info: n_threads = 4 / 12 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | METAL = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | COREML = 1 | OPENVINO = 0 | 

main: processing 'samples/jfk.wav' (176000 samples, 11.0 sec), 4 threads, 1 processors, lang = en, task = transcribe, timestamps = 1 ...


[00:00:00.000 --> 00:00:11.000]   And so my fellow Americans, ask not what your country can do for you, ask what you can do for your country.


whisper_print_timings:     load time =    62.11 ms
whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:      mel time =     6.13 ms
whisper_print_timings:   sample time =    10.64 ms /    27 runs (    0.39 ms per run)
whisper_print_timings:   encode time =   458.16 ms /     1 runs (  458.16 ms per run)
whisper_print_timings:   decode time =    54.93 ms /    26 runs (    2.11 ms per run)
whisper_print_timings:   prompt time =     4.92 ms /     1 runs (    4.92 ms per run)
whisper_print_timings:    total time =   755.12 ms

@bobqianic
Copy link
Collaborator Author

after that I have some errors in other files, and also sample transcoding give strange result.

Yes, it looks like the n_mel parameter is hardcoded in the whisper-encoder.mm. We ought to address that too.

void whisper_coreml_encode(
const whisper_coreml_context * ctx,
float * mel,
float * out) {
MLMultiArray * inMultiArray = [
[MLMultiArray alloc] initWithDataPointer: mel
shape: @[@1, @80, @3000]
dataType: MLMultiArrayDataTypeFloat32
strides: @[@(240000), @(3000), @1]
deallocator: nil
error: nil
];

@piotr-sikora-v
Copy link

#1458 - this working for me.
Also changing @80 to @128 works too (after recompile)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Can't generate large-v3 Core ML model
2 participants