whisper : support speaker segmentation (local diarization) of mono audio via tinydiarize #1058

akashmjn · 2023-06-27T18:16:01Z

As discussed in #64, this PR adds experimental support for local diarization (marking of speaker turns) via integration of checkpoints from this project https://github.com/akashmjn/tinydiarize/tree/main.

This is an early functional prototype done for the small.en models.

@ggerganov - this should be functionally done save for the last two points on the checklist, for which i'd appreciate some comments on the right way to expose this.

(also please excuse my C++ , I haven't written a lot of it, so this is heavily copilot-assisted 😉 )

Example usage

make
./models/download-ggml-model.sh small.en-tdrz

make samples
./main -m models/ggml-small.en-tdrz.bin -f samples/a13.wav

After running the above, you should see speaker turns marked in the transcript output.

JSON output contains an extra speaker_turn_next field for each segment with this information.

Example JSON output

{
	"systeminfo": "AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 | COREML = 0 | ",
	"model": {
		"type": "small",
		"multilingual": false,
		"vocab": 51864,
		"audio": {
			"ctx": 1500,
			"state": 768,
			"head": 12,
			"layer": 12
		},
		"text": {
			"ctx": 448,
			"state": 768,
			"head": 12,
			"layer": 12
		},
		"mels": 80,
		"ftype": 1
	},
	"params": {
		"model": "models/whisper-small.en.tdrz/ggml-small.en-tdrz.bin",
		"language": "en",
		"translate": false
	},
	"result": {
		"language": "en"
	},
	"transcription": [
		{
			"timestamps": {
				"from": "00:00:00,000",
				"to": "00:00:03,800"
			},
			"offsets": {
				"from": 0,
				"to": 3800
			},
			"text": " Okay Houston, we've had a problem here. [SPEAKER TURN]"
			"speaker_turn_next": true
		},
                ...
	]
}
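
As a rough illustration of how the speaker_turn_next field could be consumed, here is a minimal Python sketch that groups consecutive segments into speaker turns. The helper name group_turns is hypothetical (it is not part of whisper.cpp), and the input is a hand-written sample shaped like the "transcription" array above, with the in-text turn markers left out for brevity.

```python
import json


def group_turns(segments):
    """Group consecutive segments into speaker turns.

    A segment whose `speaker_turn_next` flag is true closes the current turn.
    """
    turns, current = [], []
    for seg in segments:
        current.append(seg["text"].strip())
        if seg.get("speaker_turn_next", False):
            turns.append(" ".join(current))
            current = []
    if current:  # trailing segments that were not closed by a turn flag
        turns.append(" ".join(current))
    return turns


# Illustrative input only: shaped like the JSON output, not copied from a real run.
doc = json.loads("""
{"transcription": [
  {"text": " Okay Houston, we've had a problem here.", "speaker_turn_next": true},
  {"text": " This is Houston. Say again please.", "speaker_turn_next": true},
  {"text": " Uh Houston we've had a problem.", "speaker_turn_next": false},
  {"text": " We've had a main beam up on a volt.", "speaker_turn_next": true}
]}
""")

for i, turn in enumerate(group_turns(doc["transcription"])):
    print(f"turn {i}: {turn}")
```

Note that this only splits the transcript at turn boundaries; it does not say which speaker is talking, since no clustering is done.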

Checklist:

script for conversion of pytorch checkpoints: "Make convert-pt-to-ggml.py backwards compatible with older vocab.json tokenizer files" #1001
updated download script with hosted checkpoints: akashmjn@7f0dc9b
fixing a bug in the repo where translate/transcribe token_ids were incorrect for .en models: akashmjn@c8e1ed6 [I can separate this out into another PR if you like]
inference code changes for decoding of speaker turn tokens
recording speaker turns in all outputs
expose configurable behaviour to user via flag
resolve consistency with existing --diarize flag

Some terminology context for the last two points: this is technically not complete diarization yet, but speaker segmentation https://www.perplexity.ai/search/d01e6743-d2dc-4f5e-b5c2-2bf2212068f7?s=u (which can be thought of as local diarization).
Also technically the stereo audio input used by the current --diarize flag is already diarized (as it is separated into individual channels), so the naming isn't strictly consistent here either?

JianbangZ · 2023-06-28T15:22:46Z

Does this support multi language or just English?

skye-repos · 2023-06-29T01:36:48Z

Excited! Will this support multiple speaker labelling or will it just mark speaker turns?

akashmjn · 2023-06-30T19:17:25Z

Hi @Harith163 and @JianbangZ:

at the moment, just speaker turns and no clustering
this PR is merging a PoC done for the small.en models, so English-only

Both of these are doable I think, but are a little more involved and honestly depend on how the project evolves.

For multilingual - I think it's easiest done by OpenAI themselves, since ultimately that boils down to a reasonably multilingual finetuning dataset, and I'm pretty sure all released Whisper models had a final finetuning stage.

I'd say clustering has fewer dependencies and is a bit more tractable. I will sketch a rough plan for that once a few immediate things are done.

You can take a look at the immediate roadmap over at https://github.com/akashmjn/tinydiarize/tree/main#roadmap.

akashmjn · 2023-06-30T19:37:32Z

In fact @ggerganov I notice that you've already implemented C-means by hand in cpp here #130 😅 . Once I free up a little, I'll try running some clustering experiments over on the python repo.

In the meantime if you are interested, this is the best method out there NME-SC:

walkthrough from slide 10 onwards here
implementation: mostly standard matrix/linalg operations + k-means

ggerganov · 2023-06-30T19:43:06Z

Yes :) Felt like doing some experiments (I cannot guarantee correctness of that implementation)

Btw, will be reviewing the PR over the weekend. Adding a diarization flag should be easy

akashmjn · 2023-07-02T08:48:37Z

> Yes :) Felt like doing some experiments (I cannot guarantee correctness of that implementation)
>
> Btw, will be reviewing the PR over the weekend. Adding a diarization flag should be easy

Sounds good! For the last two points on my checklist, I'll wait for your review for now. I've left //TODO@Akash at places where the behaviour needs to be toggled. If you find it more efficient, feel free to modify the PR directly in whatever way you think best exposes this feature.

I think it should just be clear to the user that this is an experimental feature and requires using a specific *.tdrz checkpoint.

ggerganov · 2023-07-02T19:15:21Z

I synced the latest ggml from llama.cpp; tomorrow I will add the config option for tinydiarize and merge.

ohmguru · 2023-07-03T17:13:52Z

Excited to see this PR merged. Noticed that this PR doesn't yet support the word-level timestamp flag. I wanted to flag that for consideration as Word level timestamps are quite helpful when building applications that show diarization output.

ggerganov · 2023-07-03T17:35:53Z

@akashmjn

This should be ready to merge now. Please take a look at my changes and let me know if you agree.
For now, let's leave the stereo "diarize" flag as it is - will rename it later to reflect what it actually does.

The most important change is that I added token_tdrz and kept token_solm as it is.

Also, you now have to add the -tdrz flag to explicitly enable speaker turn detection, even when using tinydiarize models.
The flag should not do anything if the model used is not a tinydiarize one.

$ ./main -f ./samples/a13.wav -m ./models/ggml-small.en-tdrz.bin -tdrz

main: processing './samples/a13.wav' (480000 samples, 30.0 sec), 4 threads, 1 processors, lang = en, task = transcribe, tdrz = 1, timestamps = 1 ...

[00:00:00.000 --> 00:00:03.800]   Okay Houston, we've had a problem here. [SPEAKER_TURN]
[00:00:03.800 --> 00:00:06.200]   This is Houston. Say again please. [SPEAKER_TURN]
[00:00:06.200 --> 00:00:08.260]   Uh Houston we've had a problem.
[00:00:08.260 --> 00:00:11.320]   We've had a main beam up on a volt. [SPEAKER_TURN]
[00:00:11.320 --> 00:00:13.820]   Roger main beam interval. [SPEAKER_TURN]
[00:00:13.820 --> 00:00:15.100]   Uh uh [SPEAKER_TURN]
[00:00:15.100 --> 00:00:18.020]   So okay stand, by thirteen we're looking at it. [SPEAKER_TURN]
[00:00:18.020 --> 00:00:25.740]   Okay uh right now uh Houston the uh voltage is uh is looking good um.
[00:00:27.620 --> 00:00:29.940]   And we had a a pretty large bank or so.

Here is without it:

$ ./main -f ./samples/a13.wav -m ./models/ggml-small.en-tdrz.bin

main: processing './samples/a13.wav' (480000 samples, 30.0 sec), 4 threads, 1 processors, lang = en, task = transcribe, timestamps = 1 ...

[00:00:00.000 --> 00:00:03.760]   Okay Houston, we've had a problem here.
[00:00:03.760 --> 00:00:08.340]   Uh Houston we've had a problem.
[00:00:08.340 --> 00:00:11.320]   We've had a main beam up on a volt.
[00:00:11.320 --> 00:00:13.760]   Roger main beam interval.
[00:00:13.760 --> 00:00:17.960]   So okay stand, by thirteen we're looking at it.
[00:00:17.960 --> 00:00:25.740]   Okay uh right now uh Houston the uh voltage is uh is looking good um.
[00:00:27.620 --> 00:00:29.940]   And we had a a pretty large bank or so.
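
For anyone post-processing the terminal output rather than the JSON, the [SPEAKER_TURN] markers in the -tdrz run above can be picked out with a small parser. This is only a sketch under the assumption that lines keep the "[start --> end]   text" layout shown in these logs; parse_line and LINE_RE are hypothetical names, not part of whisper.cpp.

```python
import re

# Matches lines such as "[00:00:00.000 --> 00:00:03.800]   some text"
LINE_RE = re.compile(
    r"^\[(\d{2}:\d{2}:\d{2}\.\d{3}) --> (\d{2}:\d{2}:\d{2}\.\d{3})\]\s*(.*)$"
)
MARKER = "[SPEAKER_TURN]"


def parse_line(line):
    """Return (start, end, text, turn_next), or None for non-segment lines."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    start, end, text = m.groups()
    text = text.strip()
    turn_next = text.endswith(MARKER)
    if turn_next:
        # Drop the marker so only the spoken text remains.
        text = text[: -len(MARKER)].rstrip()
    return start, end, text, turn_next


sample = """\
main: processing './samples/a13.wav' ...
[00:00:00.000 --> 00:00:03.800]   Okay Houston, we've had a problem here. [SPEAKER_TURN]
[00:00:03.800 --> 00:00:06.200]   This is Houston. Say again please. [SPEAKER_TURN]
[00:00:06.200 --> 00:00:08.260]   Uh Houston we've had a problem.
"""

for line in sample.splitlines():
    parsed = parse_line(line)
    if parsed is not None:
        print(parsed)
```

Parsing the JSON output is likely the more robust route, since the marker text changed between the early prototype ([SPEAKER TURN]) and the final version ([SPEAKER_TURN]).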

Here is word-level timestamps with speaker turn detection:

$ ./main -f ./samples/a13.wav -m ./models/ggml-small.en-tdrz.bin -ml 1 -sow -tdrz

main: processing './samples/a13.wav' (480000 samples, 30.0 sec), 4 threads, 1 processors, lang = en, task = transcribe, tdrz = 1, timestamps = 1 ...

[00:00:00.000 --> 00:00:00.060]  
[00:00:00.060 --> 00:00:00.500]   Okay
[00:00:00.500 --> 00:00:01.340]   Houston,
[00:00:01.340 --> 00:00:01.850]   we've
[00:00:01.850 --> 00:00:02.160]   had
[00:00:02.160 --> 00:00:02.260]   a
[00:00:02.260 --> 00:00:02.990]   problem
[00:00:02.990 --> 00:00:03.800]   here. [SPEAKER_TURN]
[00:00:03.800 --> 00:00:04.030]   This
[00:00:04.030 --> 00:00:04.140]   is
[00:00:04.140 --> 00:00:04.710]   Houston.
[00:00:04.710 --> 00:00:04.880]   Say
[00:00:04.880 --> 00:00:05.170]   again
[00:00:05.170 --> 00:00:06.200]   please. [SPEAKER_TURN]
[00:00:06.200 --> 00:00:06.340]   Uh
[00:00:06.340 --> 00:00:06.850]   Houston
[00:00:06.850 --> 00:00:07.210]   we've
[00:00:07.210 --> 00:00:07.430]   had
[00:00:07.430 --> 00:00:07.530]   a
[00:00:07.530 --> 00:00:08.260]   problem.
[00:00:08.260 --> 00:00:08.770]   We've
[00:00:08.770 --> 00:00:09.080]   had
[00:00:09.080 --> 00:00:09.180]   a
[00:00:09.180 --> 00:00:09.610]   main
[00:00:09.610 --> 00:00:10.000]   beam
[00:00:10.000 --> 00:00:10.200]   up
[00:00:10.200 --> 00:00:10.400]   on
[00:00:10.400 --> 00:00:10.500]   a
[00:00:10.500 --> 00:00:11.320]   volt. [SPEAKER_TURN]
[00:00:11.320 --> 00:00:11.840]   Roger
[00:00:11.840 --> 00:00:12.250]   main
[00:00:12.250 --> 00:00:12.740]   beam
[00:00:12.740 --> 00:00:13.820]   interval. [SPEAKER_TURN]
[00:00:13.820 --> 00:00:15.080]   Uh
[00:00:15.080 --> 00:00:15.100]   uh [SPEAKER_TURN]
[00:00:15.100 --> 00:00:15.230]   So
[00:00:15.230 --> 00:00:15.500]   okay
[00:00:15.500 --> 00:00:15.970]   stand,
[00:00:15.970 --> 00:00:16.100]   by
[00:00:16.100 --> 00:00:16.660]   thirteen
[00:00:16.660 --> 00:00:16.980]   we're
[00:00:16.980 --> 00:00:17.460]   looking
[00:00:17.460 --> 00:00:17.610]   at
[00:00:17.610 --> 00:00:18.020]   it. [SPEAKER_TURN]
[00:00:18.020 --> 00:00:18.570]   Okay
[00:00:18.570 --> 00:00:18.840]   uh
[00:00:18.840 --> 00:00:19.530]   right
[00:00:19.530 --> 00:00:19.940]   now
[00:00:19.940 --> 00:00:20.210]   uh
[00:00:20.210 --> 00:00:21.170]   Houston
[00:00:21.170 --> 00:00:21.580]   the
[00:00:21.580 --> 00:00:21.850]   uh
[00:00:21.850 --> 00:00:22.810]   voltage
[00:00:22.810 --> 00:00:23.080]   is
[00:00:23.080 --> 00:00:23.400]   uh
[00:00:23.400 --> 00:00:23.730]   is
[00:00:23.730 --> 00:00:24.810]   looking
[00:00:24.810 --> 00:00:25.440]   good
[00:00:25.440 --> 00:00:25.740]   um.
[00:00:27.620 --> 00:00:27.670]  
[00:00:27.670 --> 00:00:27.840]   And
[00:00:27.840 --> 00:00:27.980]   we
[00:00:27.980 --> 00:00:28.210]   had
[00:00:28.210 --> 00:00:28.270]   a
[00:00:28.270 --> 00:00:28.340]   a
[00:00:28.340 --> 00:00:28.780]   pretty
[00:00:28.780 --> 00:00:29.150]   large
[00:00:29.150 --> 00:00:29.440]   bank
[00:00:29.440 --> 00:00:29.580]   or
[00:00:29.580 --> 00:00:29.940]   so.
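
With word-level output, an application will usually want to fold the words back into one entry per speaker turn, with a start and end time. The sketch below shows one way to do that, assuming the words have already been parsed into (start, end, text, turn_next) tuples; to_ms and merge_words are hypothetical helper names, and the sample words are taken from the output above.

```python
def to_ms(ts):
    """Convert an 'HH:MM:SS.mmm' timestamp to integer milliseconds."""
    hms, ms = ts.split(".")
    h, m, s = (int(x) for x in hms.split(":"))
    return ((h * 60 + m) * 60 + s) * 1000 + int(ms)


def merge_words(words):
    """Merge (start, end, text, turn_next) word entries into turns.

    Returns a list of (start_ms, end_ms, text) tuples, one per speaker turn.
    """
    turns, buf = [], []
    start = last_end = None
    for w_start, w_end, text, turn_next in words:
        if start is None:
            start = w_start
        last_end = w_end
        if text:  # skip the empty segments that appear in the output
            buf.append(text)
        if turn_next:
            turns.append((to_ms(start), to_ms(w_end), " ".join(buf)))
            buf, start = [], None
    if buf:  # words after the last turn marker
        turns.append((to_ms(start), to_ms(last_end), " ".join(buf)))
    return turns


words = [
    ("00:00:03.800", "00:00:04.030", "This", False),
    ("00:00:04.030", "00:00:04.140", "is", False),
    ("00:00:04.140", "00:00:04.710", "Houston.", False),
    ("00:00:04.710", "00:00:04.880", "Say", False),
    ("00:00:04.880", "00:00:05.170", "again", False),
    ("00:00:05.170", "00:00:06.200", "please.", True),
    ("00:00:06.200", "00:00:06.340", "Uh", False),
    ("00:00:06.340", "00:00:06.850", "Houston", False),
]

for start_ms, end_ms, text in merge_words(words):
    print(start_ms, end_ms, text)
```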

akashmjn

Added some comments relating to some tricky token ID stuff

tingyuchang · 2023-10-13T08:01:07Z

@karolszafranski I don't think any special settings are needed: set tdrz_enable to true and you can read the flag for each segment via whisper_full_get_segment_speaker_turn_next.

khimaros · 2023-12-01T23:18:23Z

i'm not sure if this is expected, but with medium.en-q5_0, i'm seeing that speaker turns are pretty reliably marked with >>. i'm not using the --diarize or --tdrz flags.

i wasn't seeing this behavior with large-v2, large-v3, or large-v3-q5_0. any thoughts on why that would be happening?

rben01 · 2024-01-31T16:55:33Z

Is there a way to use this with coreml models?

whisper_init_from_file_with_params_no_state: loading model from './models/ggml-small.en-tdrz.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51864
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 768
whisper_model_load: n_audio_head  = 12
whisper_model_load: n_audio_layer = 12
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 768
whisper_model_load: n_text_head   = 12
whisper_model_load: n_text_layer  = 12
whisper_model_load: n_mels        = 80
whisper_model_load: ftype         = 1
whisper_model_load: qntvr         = 0
whisper_model_load: type          = 3 (small)
whisper_model_load: adding 1608 extra tokens
whisper_model_load: n_langs       = 99
whisper_backend_init: using Metal backend
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M2
ggml_metal_init: picking default device: Apple M2
ggml_metal_init: default.metallib not found, loading from source
ggml_metal_init: GGML_METAL_PATH_RESOURCES = nil
ggml_metal_init: error: could not use bundle path to find ggml-metal.metal, falling back to trying cwd
ggml_metal_init: loading 'ggml-metal.metal'
ggml_metal_init: GPU name:   Apple M2
ggml_metal_init: GPU family: MTLGPUFamilyApple8 (1008)
ggml_metal_init: hasUnifiedMemory              = true
ggml_metal_init: recommendedMaxWorkingSetSize  = 11453.25 MB
ggml_metal_init: maxTransferRate               = built-in GPU
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size =   464.64 MiB, (  466.27 / 10922.67)
whisper_model_load:    Metal buffer size =   487.20 MB
whisper_model_load: model size    =  487.00 MB
whisper_backend_init: using Metal backend
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M2
ggml_metal_init: picking default device: Apple M2
ggml_metal_init: default.metallib not found, loading from source
ggml_metal_init: GGML_METAL_PATH_RESOURCES = nil
ggml_metal_init: error: could not use bundle path to find ggml-metal.metal, falling back to trying cwd
ggml_metal_init: loading 'ggml-metal.metal'
ggml_metal_init: GPU name:   Apple M2
ggml_metal_init: GPU family: MTLGPUFamilyApple8 (1008)
ggml_metal_init: hasUnifiedMemory              = true
ggml_metal_init: recommendedMaxWorkingSetSize  = 11453.25 MB
ggml_metal_init: maxTransferRate               = built-in GPU
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size =    47.25 MiB, (  513.52 / 10922.67)
whisper_init_state: kv self size  =   49.55 MB
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size =    52.73 MiB, (  566.25 / 10922.67)
whisper_init_state: kv cross size =   55.30 MB
whisper_init_state: loading Core ML model from './models/ggml-small.en-tdrz-encoder.mlmodelc'
whisper_init_state: first run on a device may take a while ...
whisper_init_state: failed to load Core ML model from './models/ggml-small.en-tdrz-encoder.mlmodelc'
ggml_metal_free: deallocating
error: failed to initialize whisper context

barolo · 2024-02-09T00:09:58Z

> i'm not sure if this is expected, but with medium.en-q5_0, i'm seeing that speaker turns are pretty reliably marked with >>. i'm not using the --diarize or --tdrz flags.
>
> i wasn't seeing this behavior with large-v2, large-v3, or large-v3-q5_0. any thoughts on why that would be happening?

It also happens with the small model, either on its own or when nudged with a ">>" prompt. Unfortunately, for the life of me I cannot combine it with my other prompt, which resulted in proper quote-unquote behavior, e.g.:

Knock on the door and I had to be like, "Oh my God, please, is there anybody in there?"
And she was like, "Okay, let's see how this goes"

And quotes only happen when using -oved GPU (unfortunately it hallucinates a lot), whereas -oved CPU is much more likely to trigger ">>" diarization on its own.
This is so weird...

kuro337 · 2024-05-27T18:57:47Z

Hello!

I was wondering - how does the integration work with the ./server?

I ran it through both the binary and the server, and the diarization output seemed to be missing from the server response.

Example:

./main -f ../audio/multi.wav -m ./models/ggml-small.en-tdrz.bin -tdrz --print-colors

# output

[00:00:00.080 --> 00:00:04.820]   Let's go down. So your sister's going off How. old is she? [SPEAKER_TURN]
[00:00:04.820 --> 00:00:08.620]   She's twenty five. [SPEAKER_TURN]
[00:00:08.620 --> 00:00:12.560]   Alright. And is she going to go to do a job or is she's gonna travel? [SPEAKER_TURN]
[00:00:12.560 --> 00:00:19.940]   Um she's going to work when she's there and do like bits of jobs and then move around at the same time. [SPEAKER_TURN]
[00:00:19.940 --> 00:00:22.520]   So is she's goin

Same example using the server

./server -m models/ggml-small.en-tdrz.bin -tdrz -pc -debug 

curl 127.0.0.1:8080/inference \
-H "Content-Type: multipart/form-data" \
-F file="@../audio/multi.wav" \
-F response_format="json" \
-F tinydiarize=true

Output

{"text":" Mm. Okay.\n So your sister's going off How. old is she?\n She's twenty five.\n Alright. And is she going to go to do a job or is she's gonna travel?\n Um she's going to work when she's there and do like bits of jobs and then move around at the same time.\n So she's going straight to Australia?\n Um no first she's going to Thailand.\n And then she's going to Australia.\n And then move somewhere and then in America.\n Brilliant. So if she's bought one of these year t tickets you can go around the world f in a year or something Is. that what she's done with these airline tickets yeah, Yeah? So would you like to travel?\n Yeah.\n Mm-hmm.\n That's a good a reason though. Yeah. Actually I think it probably is because I mean I know it sounds straight forward but you can sort of add E_s and A_s and things on the end of things and it normally sounds right anyway. We've got a Spanish girl working with us at the moment so, So this is a a two year course now is, it G_C_S_E_s?\n Yeah. It's from year ten to year E_ el" }

Tried all of these formats and the same thing -

json | text | srt | verbose_json | vtt

No issues if it is not supported - was just wondering if it was possible because the docs mentioned we can pass -tdrz to the server so was wondering if I was doing anything wrong!

Cheers

shoryamalani · 2024-07-12T17:59:12Z

Hey guys, is this functionality coming to the larger models or could we compile it ourselves?

Thank you so much

akashmjn added 6 commits June 26, 2023 02:42

add HuggingFace mirror to download ggml model

7f0dc9b

support tdrz via simple hack overriding solm tokens

62c851b

fix incorrect translate/transcribe token_ids that are not static const

c8e1ed6

add apollo 13 sample for tdrz demo

700c282

render [SPEAKER TURN] consistently in all terminal output using vocab.id_to_token

4083a39

extend whisper_segment with speaker_turn_next field and save in json output

713c5b6

akashmjn changed the title from "whisper: support speaker segmentation (local diarization) of mono audio via integration of tinydiarize" to "whisper: support speaker segmentation (local diarization) of mono audio via tinydiarize" on Jun 27, 2023

akashmjn mentioned this pull request Jun 27, 2023

whisper : mark speakers/voices (diarization) #64

Open

akashmjn added 2 commits June 27, 2023 11:26

fix failing go build

edd2348

slipped in some python syntax whoops

77825ec

sandrohanea mentioned this pull request Jul 1, 2023

Identifying two speakers sandrohanea/whisper.net#87

Closed

Merge branch 'master' into tdrz-integrate-1

12d3f90

ggerganov mentioned this pull request Jul 2, 2023

runs perfectly with the regular models, but not the quantized ones #993

Open

Merge branch 'master' into tdrz-integrate-1

024211f

ggerganov added 2 commits July 3, 2023 20:24

whisper : finalize tinydiarize support (add flag + fixes)

59e9055

whisper : tdrz support for word-level timestamps (respect max_len)

8ee5af4

ggerganov added 3 commits July 3, 2023 20:44

java : try to fix tests after adding tdrz_enable flag

5fa32da

main : remove TODO leftover

6828be7

java : fix params order list after adding "tdrz_enable"

09c32a6

JEF1056 added a commit to JEF1056/whisper.rn that referenced this pull request Sep 29, 2023

feat: enable tdrz

e268f94

Enables tinydiarize models ggerganov/whisper.cpp#1058

jacobwu-b pushed a commit to jacobwu-b/Transcriptify-by-whisper.cpp that referenced this pull request Oct 24, 2023

readme : add tinydiarize instructions (ggerganov#1058)

09e58c3

Macoron mentioned this pull request Nov 23, 2023

Is it possible for in-game audio sources to be omitted from whisper? Macoron/whisper.unity#62

Open

nipierre mentioned this pull request Nov 26, 2023

how to use tdrz params? tazz4843/whisper-rs#80

Closed

bobqianic mentioned this pull request Dec 11, 2023

Thank you so much for this amazing tool!!! #1621

Open

landtanin pushed a commit to landtanin/whisper.cpp that referenced this pull request Dec 16, 2023

readme : add tinydiarize instructions (ggerganov#1058)

c83d39b

This was referenced Jan 23, 2024

Regression cases akashmjn/tinydiarize#24

Open

Cannot find diarization model #1715

Closed

Diarization (speaker turn recognition) sometimes happens unexpectedly? (medium.en model) #1810

Open

iThalay pushed a commit to iThalay/whisper.cpp that referenced this pull request Sep 23, 2024

readme : add tinydiarize instructions (ggerganov#1058)

b42a249

ggerganov pushed a commit that referenced this pull request Jan 14, 2025

ggml : allow loading backend with env variable (ggml/1059)

19c147b

ref: #1058

lyapple2008 pushed a commit to lyapple2008/whisper.cpp.mars that referenced this pull request Feb 4, 2025

ggml : allow loading backend with env variable (ggml/1059)

adc67b7

ref: ggerganov#1058
