@genglinxiao I think the large-v3 model has some kind of bug that makes it misbehave, and it cannot be fixed on our end. The medium model should be okay, but it is far from perfect and never received a v2 update, so its performance will not be great either. I would try large-v1 and large-v2. Please close this issue.
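For reference, a minimal sketch of trying the large-v2 model instead, assuming the standard whisper.cpp repository layout and its bundled `models/download-ggml-model.sh` helper (the flags mirror the command reported below):

```shell
# Fetch the ggml large-v2 model using the helper script that ships with whisper.cpp
bash ./models/download-ggml-model.sh large-v2

# Run the streaming example against it with the same settings as the original report
./stream --model models/ggml-large-v2.bin --language zh --step 0 --length 4000
```

The same steps work for `large-v1` by swapping the model name.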
I'm experimenting with streaming mode on an M2 MacBook Air and found that roughly a third of the speech is not recognized. Is that expected, or do I need more RAM, or did something else go wrong? I tried both the medium and large-v3 models.
Here's one of the commands and its initial output:
```
./stream --model models/ggml-large-v3.bin --language zh --step 0 --length 4000
init: found 2 capture devices:
init: - Capture device #0: 'MacBook Air麦克风'
init: - Capture device #1: 'Microsoft Teams Audio'
init: attempt to open default capture device ...
init: obtained spec for input device (SDL Id = 2):
init: - sample rate: 16000
init: - format: 33056 (required: 33056)
init: - channels: 1 (required: 1)
init: - samples per frame: 1024
whisper_init_from_file_with_params_no_state: loading model from 'models/ggml-large-v3.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab = 51866
whisper_model_load: n_audio_ctx = 1500
whisper_model_load: n_audio_state = 1280
whisper_model_load: n_audio_head = 20
whisper_model_load: n_audio_layer = 32
whisper_model_load: n_text_ctx = 448
whisper_model_load: n_text_state = 1280
whisper_model_load: n_text_head = 20
whisper_model_load: n_text_layer = 32
whisper_model_load: n_mels = 128
whisper_model_load: ftype = 1
whisper_model_load: qntvr = 0
whisper_model_load: type = 5 (large v3)
whisper_model_load: adding 1609 extra tokens
whisper_model_load: n_langs = 100
whisper_backend_init: using Metal backend
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M2
ggml_metal_init: picking default device: Apple M2
ggml_metal_init: default.metallib not found, loading from source
ggml_metal_init: GGML_METAL_PATH_RESOURCES = nil
ggml_metal_init: loading '/Users/linxiaogeng/whisper.cpp/ggml-metal.metal'
ggml_metal_init: GPU name: Apple M2
ggml_metal_init: GPU family: MTLGPUFamilyApple8 (1008)
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: GPU family: MTLGPUFamilyMetal3 (5001)
ggml_metal_init: simdgroup reduction support = true
ggml_metal_init: simdgroup matrix mul. support = true
ggml_metal_init: hasUnifiedMemory = true
ggml_metal_init: recommendedMaxWorkingSetSize = 17179.89 MB
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size = 2951.02 MiB, ( 2952.89 / 16384.02)
whisper_model_load: Metal total size = 3094.36 MB
whisper_model_load: model size = 3094.36 MB
whisper_backend_init: using Metal backend
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M2
ggml_metal_init: picking default device: Apple M2
ggml_metal_init: default.metallib not found, loading from source
ggml_metal_init: GGML_METAL_PATH_RESOURCES = nil
ggml_metal_init: loading '/Users/linxiaogeng/whisper.cpp/ggml-metal.metal'
ggml_metal_init: GPU name: Apple M2
ggml_metal_init: GPU family: MTLGPUFamilyApple8 (1008)
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: GPU family: MTLGPUFamilyMetal3 (5001)
ggml_metal_init: simdgroup reduction support = true
ggml_metal_init: simdgroup matrix mul. support = true
ggml_metal_init: hasUnifiedMemory = true
ggml_metal_init: recommendedMaxWorkingSetSize = 17179.89 MB
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size = 210.00 MiB, ( 3163.89 / 16384.02)
whisper_init_state: kv self size = 220.20 MB
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size = 234.38 MiB, ( 3398.27 / 16384.02)
whisper_init_state: kv cross size = 245.76 MB
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size = 32.97 MiB, ( 3431.23 / 16384.02)
whisper_init_state: compute buffer (conv) = 36.26 MB
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size = 889.44 MiB, ( 4320.67 / 16384.02)
whisper_init_state: compute buffer (encode) = 934.34 MB
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size = 7.33 MiB, ( 4328.00 / 16384.02)
whisper_init_state: compute buffer (cross) = 9.38 MB
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size = 197.95 MiB, ( 4525.95 / 16384.02)
whisper_init_state: compute buffer (decode) = 209.26 MB
main: processing 0 samples (step = 0.0 sec / len = 4.0 sec / keep = 0.0 sec), 4 threads, lang = zh, task = transcribe, timestamps = 1 ...
main: using VAD, will transcribe on speech activity
[Start speaking]
```