
What kind of performance can we expect? #2157

Closed
genglinxiao opened this issue May 16, 2024 · 2 comments


@genglinxiao

I'm experimenting with the streaming mode on an M2 MacBook Air and found that roughly a third of the speech is not recognized. Is that expected, do I need more RAM, or did something else go wrong? I tried both the medium and large-v3 models.
Here's one of the commands and its initial output:
```
./stream --model models/ggml-large-v3.bin --language zh --step 0 --length 4000
init: found 2 capture devices:
init: - Capture device #0: 'MacBook Air麦克风'
init: - Capture device #1: 'Microsoft Teams Audio'
init: attempt to open default capture device ...
init: obtained spec for input device (SDL Id = 2):
init: - sample rate: 16000
init: - format: 33056 (required: 33056)
init: - channels: 1 (required: 1)
init: - samples per frame: 1024
whisper_init_from_file_with_params_no_state: loading model from 'models/ggml-large-v3.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab = 51866
whisper_model_load: n_audio_ctx = 1500
whisper_model_load: n_audio_state = 1280
whisper_model_load: n_audio_head = 20
whisper_model_load: n_audio_layer = 32
whisper_model_load: n_text_ctx = 448
whisper_model_load: n_text_state = 1280
whisper_model_load: n_text_head = 20
whisper_model_load: n_text_layer = 32
whisper_model_load: n_mels = 128
whisper_model_load: ftype = 1
whisper_model_load: qntvr = 0
whisper_model_load: type = 5 (large v3)
whisper_model_load: adding 1609 extra tokens
whisper_model_load: n_langs = 100
whisper_backend_init: using Metal backend
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M2
ggml_metal_init: picking default device: Apple M2
ggml_metal_init: default.metallib not found, loading from source
ggml_metal_init: GGML_METAL_PATH_RESOURCES = nil
ggml_metal_init: loading '/Users/linxiaogeng/whisper.cpp/ggml-metal.metal'
ggml_metal_init: GPU name: Apple M2
ggml_metal_init: GPU family: MTLGPUFamilyApple8 (1008)
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: GPU family: MTLGPUFamilyMetal3 (5001)
ggml_metal_init: simdgroup reduction support = true
ggml_metal_init: simdgroup matrix mul. support = true
ggml_metal_init: hasUnifiedMemory = true
ggml_metal_init: recommendedMaxWorkingSetSize = 17179.89 MB
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size = 2951.02 MiB, ( 2952.89 / 16384.02)
whisper_model_load: Metal total size = 3094.36 MB
whisper_model_load: model size = 3094.36 MB
whisper_backend_init: using Metal backend
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M2
ggml_metal_init: picking default device: Apple M2
ggml_metal_init: default.metallib not found, loading from source
ggml_metal_init: GGML_METAL_PATH_RESOURCES = nil
ggml_metal_init: loading '/Users/linxiaogeng/whisper.cpp/ggml-metal.metal'
ggml_metal_init: GPU name: Apple M2
ggml_metal_init: GPU family: MTLGPUFamilyApple8 (1008)
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: GPU family: MTLGPUFamilyMetal3 (5001)
ggml_metal_init: simdgroup reduction support = true
ggml_metal_init: simdgroup matrix mul. support = true
ggml_metal_init: hasUnifiedMemory = true
ggml_metal_init: recommendedMaxWorkingSetSize = 17179.89 MB
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size = 210.00 MiB, ( 3163.89 / 16384.02)
whisper_init_state: kv self size = 220.20 MB
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size = 234.38 MiB, ( 3398.27 / 16384.02)
whisper_init_state: kv cross size = 245.76 MB
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size = 32.97 MiB, ( 3431.23 / 16384.02)
whisper_init_state: compute buffer (conv) = 36.26 MB
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size = 889.44 MiB, ( 4320.67 / 16384.02)
whisper_init_state: compute buffer (encode) = 934.34 MB
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size = 7.33 MiB, ( 4328.00 / 16384.02)
whisper_init_state: compute buffer (cross) = 9.38 MB
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size = 197.95 MiB, ( 4525.95 / 16384.02)
whisper_init_state: compute buffer (decode) = 209.26 MB

main: processing 0 samples (step = 0.0 sec / len = 4.0 sec / keep = 0.0 sec), 4 threads, lang = zh, task = transcribe, timestamps = 1 ...
main: using VAD, will transcribe on speech activity

[Start speaking]
```

@jensdraht1999

@genglinxiao I think the large-v3 model has some kind of bug that makes it work poorly, and it can't be fixed. The medium model should be okay, but it is far from perfect and never received a v2 revision, so its performance won't be great either. I would try large v1 and v2. Please close this issue.
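For reference, a sketch of how you might try the large-v2 model as suggested, assuming you are in a whisper.cpp checkout (the download script ships with the repo; the `stream` flags are copied from the command earlier in this issue):

```shell
# Fetch the large-v2 weights using the helper script bundled with whisper.cpp
bash ./models/download-ggml-model.sh large-v2

# Re-run the streaming example with the same settings, but the v2 model
./stream --model models/ggml-large-v2.bin --language zh --step 0 --length 4000
```

With `--step 0`, `stream` runs in VAD mode and transcribes on detected speech activity, as shown in the log above.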

@genglinxiao
Author

Thanks. I'm closing this issue now.
