Implement word-level timestamps approach proposed by OpenAI #375

ggerganov · 2023-01-05T19:57:18Z

See notebook, section "Word-level timestamps using attention weights":

https://github.com/openai/whisper/blob/main/notebooks/Multilingual_ASR.ipynb

djthorpe · 2023-01-16T17:19:09Z

This came up in my "explore" feed as a way to implement accurate word-level timestamps:
https://github.com/m-bain/whisperX

kamranjon · 2023-01-31T00:23:57Z

There is a functioning implementation of the attention weights approach here: https://github.com/linto-ai/whisper-timestamped which might be a useful reference for implementing in whisper.cpp eventually.

HaujetZhao · 2023-03-23T03:25:22Z

The Whisper Python module itself provides a word-timestamp output option, which could serve as a reference. I tested it with this command:

python -m whisper --model tiny --language en --word_timestamps True --output_dir "test_out" "test.wav"

It generated five files in the test_out folder:

test.json
test.txt
test.srt
test.vtt
test.tsv

In the test.json file, the content is:

{
    "text": " So, here's a great city of New York, and I realized now, going out public is a big, busy place. I'm going to be recognized, people are going to know why I am. Now I'm here, I'm on vacation, I'm with my family, I just want to have the money back. I just want to be a normal person, so... I'm going to go to the kitchen. I'm at this girl's girl, she seemed to hit me tight. The need to surface she was more about her perfect life. It's not the best thing, she drives the main thing. And when I'm dreaming her to scream and daddy make it. She's a gold digger lover, she got it from her mom. She's never stepfather body, all they should want. She's a gold digger lover, she's a gold digger lover. If you recognize me now, don't you? I'm the only one. So, real life, nobody really knows who the heck I am. So, I have a plan, but I gotta make myself known. I gotta do this somehow, I gotta get my name out there. She's a gold digger lover, she's a gold digger lover. She's a gold digger lover, she's a gold digger lover. Have you heard of her? You never heard of her? Oh, it's great. She won't do anything like gold in time and fame. But pop up on a mark, the birds will fade away with a ramycine. Last blow, sing, last tip of two, she could last. By the way, do you know what's going to be? Can I get caught by a goodness? I don't know. She's a gold digger lover, she's been on the cover, she's a brand. No, pop up on a mark, you just look her up, she's great. She's a gold digger lover, she's a gold digger lover, she's a gold digger lover. Thank you. Thank you. Thank you. I thought it was a party party party, can you? So, okay, New York City, you may not know me yet. And all, as I've learned, you may not have heard of Lindsey Sterling, he hit my violinist before. What? Think you're looking up for me. Think you're on the bright side. Hello, how you doing? Yes. Okay. So, subscribe to my YouTube channel. Stop it this drought. I'm just gonna be fine. I got some great stuff coming through away. Do. Ace that. More come. Yeah. Let me surely sign me out.",
    "segments": [
        {
            "id": 0,
            "seek": 0,
            "start": 3.6,
            "end": 10.6,
            "text": " So, here's a great city of New York, and I realized now, going out public is a big, busy place.",
            "tokens": [
                50364, 407, 11, 510, 311, 257, 869, 2307, 295, 1873, 3609, 11, 293, 286, 5334, 586, 11, 516, 484, 1908, 307, 257, 955, 11, 5856, 1081, 13, 50914
            ],
            "temperature": 0.0,
            "avg_logprob": -0.5041744733097577,
            "compression_ratio": 1.5665024630541873,
            "no_speech_prob": 0.08891408145427704,
            "words": [
                {"word": " So,","start": 3.6,"end": 3.96,"probability": 0.5301069021224976},
                {"word": " here's","start": 3.42,"end": 4.32,"probability": 0.6140210628509521},
                {"word": " a","start": 4.32,"end": 4.42,"probability": 0.1545887440443039},
                {"word": " great","start": 4.42,"end": 4.7,"probability": 0.6114427447319031},
                {"word": " city","start": 4.7,"end": 5.08,"probability": 0.9124268293380737},
                {"word": " of","start": 5.08,"end": 5.36,"probability": 0.9507943987846375},
                {"word": " New","start": 5.36,"end": 5.44,"probability": 0.9982349872589111},
                {"word": " York,","start": 5.44,"end": 6.18,"probability": 0.9951660633087158},
                {"word": " and","start": 6.44,"end": 6.56,"probability": 0.9580233097076416},
                {"word": " I","start": 6.56,"end": 6.66,"probability": 0.5875958204269409},
                {"word": " realized","start": 6.66,"end": 7.02,"probability": 0.5471060872077942},
                {"word": " now,","start": 7.02,"end": 7.86,"probability": 0.6020179390907288},
                {"word": " going","start": 8.04,"end": 8.12,"probability": 0.7494494318962097},
                {"word": " out","start": 8.12,"end": 8.38,"probability": 0.9883183240890503},
                {"word": " public","start": 8.38,"end": 8.72,"probability": 0.6699197888374329},
                {"word": " is","start": 8.72,"end": 8.98,"probability": 0.3241350054740906},
                {"word": " a","start": 8.98,"end": 9.14,"probability": 0.7641012072563171},
                {"word": " big,","start": 9.14,"end": 9.5,"probability": 0.4375719726085663},
                {"word": " busy","start": 9.5,"end": 9.94,"probability": 0.6939781308174133},
                {"word": " place.","start": 9.94,"end": 10.6,"probability": 0.8924348950386047}
            ]
        },
        {
            "id": 1,
            "seek": 0,
            "start": 11.7,
            "end": 15.16,
            "text": " I'm going to be recognized, people are going to know why I am.",
            "tokens": [
                50914, 286, 478, 516, 281, 312, 9823, 11, 561, 366, 516, 281, 458, 983, 286, 669, 13, 51114
            ],
            "temperature": 0.0,
            "avg_logprob": -0.5041744733097577,
            "compression_ratio": 1.5665024630541873,
            "no_speech_prob": 0.08891408145427704,
            "words": [
                {"word": " I'm","start": 11.7,"end": 11.8,"probability": 0.980172872543335},
                {"word": " going","start": 11.8,"end": 11.94,"probability": 0.32428041100502014},
                {"word": " to","start": 11.94,"end": 12.04,"probability": 0.9828474521636963},
                {"word": " be","start": 12.04,"end": 12.16,"probability": 0.9843984842300415},
                {"word": " recognized,","start": 12.16,"end": 12.58,"probability": 0.3810001611709595},
                {"word": " people","start": 13.22,"end": 13.5,"probability": 0.9561352729797363},
                {"word": " are","start": 13.5,"end": 13.6,"probability": 0.9821558594703674},
                {"word": " going","start": 13.6,"end": 13.78,"probability": 0.7550729513168335},
                {"word": " to","start": 13.78,"end": 13.8,"probability": 0.9977655410766602},
                {"word": " know","start": 13.8,"end": 14.0,"probability": 0.9933110475540161},
                {"word": " why","start": 14.0,"end": 14.32,"probability": 0.7471684813499451},
                {"word": " I","start": 14.32,"end": 14.58,"probability": 0.31861186027526855},
                {"word": " am.","start": 14.58,"end": 15.16,"probability": 0.9440820217132568}
            ]
        }
    ],
    "language": "en"
}

From a practical standpoint, the JSON word-timestamp file is quite useful.
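For anyone who wants to consume that file, here is a minimal sketch of reading the per-word entries. It assumes only the "segments" and "words" layout shown in the JSON above; the helper name flatten_words is made up for illustration, and the sample is a trimmed copy of the first two words.

```python
import json

# Trimmed sample in the same shape as test.json above (values copied from it).
sample = """
{
  "segments": [
    {
      "id": 0,
      "start": 3.6,
      "end": 10.6,
      "words": [
        {"word": " So,", "start": 3.6, "end": 3.96, "probability": 0.5301069021224976},
        {"word": " here's", "start": 3.42, "end": 4.32, "probability": 0.6140210628509521}
      ]
    }
  ]
}
"""


def flatten_words(result):
    """Return (word, start, end) tuples for every word in every segment."""
    words = []
    for segment in result.get("segments", []):
        # Segments produced without word timestamps have no "words" key.
        for w in segment.get("words", []):
            words.append((w["word"].strip(), w["start"], w["end"]))
    return words


words = flatten_words(json.loads(sample))
for word, start, end in words:
    print(f"{start:6.2f} -> {end:6.2f}  {word}")
```

To use it on the real output, replace json.loads(sample) with json.load() on the opened test.json file.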

bmurray · 2023-07-25T23:11:58Z

The method whisper.cpp currently uses to get per-word timestamps is pretty bad; the Python version is substantially better. I'm struggling to sort out how to do it in whisper.cpp, but it seems like "whisper_exp_compute_token_level_timestamps" needs to be replaced with something similar to what's in the "timing.py" of OpenAI's implementation.

iceychris · 2023-08-04T16:51:24Z

I'd love to help with implementing OpenAI's per-word timestamps approach based on DTW and cross-attention weights in whisper.cpp.

I think the main steps required for this consist of:

implementing DTW transform in ggml (or whisper.cpp)
collecting all the things (cross-attention weights, tokens, alignment heads) from the right places
and writing a new top-level function like whisper_compute_word_alignment containing the logic of this function
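As a rough illustration of the last step, here is a sketch of how an alignment path could be turned into per-token start times. It assumes the path has already been computed (for example by DTW over the cross-attention weights) and that there are 50 attention frames per second of audio, i.e. 20 ms per frame; the function name and the example path are hypothetical and not part of whisper.cpp.

```python
FRAMES_PER_SECOND = 50  # assumption: one attention frame per 20 ms of audio


def token_start_times(text_indices, time_indices):
    """Given a monotonic alignment path, return the start time of each token.

    text_indices[k] and time_indices[k] are the token index and audio-frame
    index of the k-th step on the path. A token starts at the first step
    where its index appears.
    """
    starts = []
    previous = None
    for token, frame in zip(text_indices, time_indices):
        if token != previous:
            starts.append(frame / FRAMES_PER_SECOND)
            previous = token
    return starts


# Hypothetical path: 3 tokens aligned to 6 frames.
path_tokens = [0, 0, 1, 1, 1, 2]
path_frames = [0, 1, 2, 3, 4, 5]
starts = token_start_times(path_tokens, path_frames)
print(starts)
```

Word boundaries would then come from grouping tokens into words and taking the start time of each word's first token.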

Is this on the roadmap and is anyone willing to collaborate on this?

bmurray · 2023-08-04T20:07:40Z

I think the roadmap is pretty open to whatever you want to contribute. I don't know of anyone else working on it.

I did take a look at trying to implement it, but found that I just don't know the inner workings of GGML and PyTorch well enough to build something that won't be a total mess. I'm definitely willing to collaborate on it, but I'm not sure how much use I can be.

ggerganov · 2023-08-06T08:19:14Z

Would be great to implement this in whisper.cpp and I want to give it a try, but I won't be able to work on this anytime soon as there are more things with higher priority in llama.cpp. If anyone is interested - please open a PR and we can discuss the implementation.

From what I remember, DTW is a dynamic programming algorithm and its implementation should be part of whisper.cpp. It can be implemented as a first step, with some unit tests to make sure it works correctly.
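For reference, a minimal pure-Python sketch of DTW as a dynamic program, with a tiny check of the kind a unit test could contain. This is not the whisper.cpp or OpenAI code; the function name dtw_path and the example matrix are made up. In the word-timestamp setting the cost matrix would presumably be derived from the cross-attention weights (for example their negation), with rows as text tokens and columns as audio frames.

```python
import math


def dtw_path(cost):
    """Return the minimum-cost monotonic path through a cost matrix.

    cost[i][j] is the cost of aligning row i (e.g. a text token) with
    column j (e.g. an audio frame). The path starts at (0, 0), ends at
    (n-1, m-1), and each step moves down, right, or diagonally.
    """
    n, m = len(cost), len(cost[0])
    # acc[i][j] = cheapest total cost of any path ending at cell (i-1, j-1).
    acc = [[math.inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best = min(acc[i - 1][j - 1], acc[i - 1][j], acc[i][j - 1])
            acc[i][j] = cost[i - 1][j - 1] + best

    # Backtrack from the bottom-right corner to recover the path.
    i, j = n, m
    path = []
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        candidates = [
            (acc[i - 1][j - 1], i - 1, j - 1),  # diagonal
            (acc[i - 1][j], i - 1, j),          # up
            (acc[i][j - 1], i, j - 1),          # left
        ]
        _, i, j = min(candidates, key=lambda c: c[0])
    path.reverse()
    return path


# Zero-cost cells trace the expected alignment.
cost = [
    [0, 1, 1, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 0],
]
path = dtw_path(cost)
print(path)
```

The table costs O(n*m) time and memory, which should be manageable for one 30-second window of tokens against audio frames.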

denersc · 2023-10-16T19:36:14Z

I would like to try my hand at this, would you be willing to offer me some guidance @ggerganov ?

I'll probably start as suggested, implementing the DTW algorithm (in the whisper.cpp file, correct?) and some tests (maybe a dtw.cpp in the tests folder? I'm open to suggestions). I'll create a PR as soon as I have DTW figured out so we can go from there.

What I will probably need help figuring out is the information collection. Two points in particular trouble me:

We need to retrieve the output of the decoder cross-attention layers. How hard would it be to cache these outputs when executing inference (e.g. saving them on whisper_state) so they could be used when computing our timestamps, e.g. in some whisper_compute_token_level_timestamps function run at the end of whisper_full_with_state?
In the original OpenAI implementation they have hard-coded boolean arrays for each model size that indicate which cross-attention heads are highly correlated with timing (i.e. the alignment heads). Apparently these are the only cross-attention outputs actually used when computing DTW:

# base85-encoded (n_layers, n_heads) boolean arrays indicating the cross-attention heads that are
# highly correlated to the word-level timing, i.e. the alignment between audio and text tokens.
_ALIGNMENT_HEADS = {
    "tiny.en": b"ABzY8J1N>@0{>%R00Bk>$p{7v037`oCl~+#00",
    "tiny": b"ABzY8bu8Lr0{>%RKn9Fp%m@SkK7Kt=7ytkO",
    "base.en": b"ABzY8;40c<0{>%RzzG;p*o+Vo09|#PsxSZm00",
    "base": b"ABzY8KQ!870{>%RzyTQH3`Q^yNP!>##QT-<FaQ7m",
    "small.en": b"ABzY8>?_)10{>%RpeA61k&I|OI3I$65C{;;pbCHh0B{qLQ;+}v00",
    "small": b"ABzY8DmU6=0{>%Rpa?J`kvJ6qF(V^F86#Xh7JUGMK}P<N0000",
    "medium.en": b"ABzY8usPae0{>%R7<zz_OvQ{)4kMa0BMw6u5rT}kRKX;$NfYBv00*Hl@qhsU00",
    "medium": b"ABzY8B0Jh+0{>%R7}kK1fFL7w6%<-Pf*t^=N)Qr&0RR9",
    "large-v1": b"ABzY8r9j$a0{>%R7#4sLmoOs{s)o3~84-RPdcFk!JR<kSfC2yj",
    "large-v2": b"ABzY8zd+h!0{>%R7=D0pU<_bnWW*tkYAhobTNnu$jnkEkXqp)j;w1Tzk)UH3X%SZd&fFZ2fC2yj",
    "large": b"ABzY8zd+h!0{>%R7=D0pU<_bnWW*tkYAhobTNnu$jnkEkXqp)j;w1Tzk)UH3X%SZd&fFZ2fC2yj",
}

Considering the conversion from PyTorch to ggml, would these indices still point to the same attention heads?
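To make the (layer, head) indexing concrete, here is a sketch of decoding such a dump into a boolean mask. It assumes each dump is a gzip-compressed byte-per-flag array that was then base85-encoded, laid out row-major as (n_layers, n_heads); the gzip step is my assumption and goes beyond what the quoted comment states. Both helper names are made up, and the round trip uses an invented 2x3 mask rather than a real model's data. It does not answer whether the indices survive the ggml conversion.

```python
import base64
import gzip


def decode_alignment_heads(dump, n_layers, n_heads):
    """Decode a base85 + gzip dump into an n_layers x n_heads boolean mask."""
    raw = gzip.decompress(base64.b85decode(dump))
    if len(raw) != n_layers * n_heads:
        raise ValueError("dump does not match the given dimensions")
    flat = [byte != 0 for byte in raw]
    # Row-major layout: row = decoder layer, column = attention head.
    return [flat[layer * n_heads:(layer + 1) * n_heads] for layer in range(n_layers)]


def encode_alignment_heads(mask):
    """Inverse of decode_alignment_heads, used here only for a round trip."""
    raw = bytes(int(flag) for row in mask for flag in row)
    return base64.b85encode(gzip.compress(raw))


# Hypothetical 2-layer, 3-head mask (not taken from any real model).
mask = [[False, True, False], [True, False, False]]
decoded = decode_alignment_heads(encode_alignment_heads(mask), 2, 3)

# List the (layer, head) pairs that are flagged as alignment heads.
heads = [(l, h) for l, row in enumerate(decoded) for h, flag in enumerate(row) if flag]
print(heads)
```

If the assumptions hold, passing one of the entries above together with that model's decoder layer and head counts should yield its list of alignment heads, which could then be compared against the head ordering on the ggml side.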

mrienstra · 2024-05-15T18:40:12Z

Now that #1485 -- great work @denersc! -- has merged, seems like it would be prudent to summarize outstanding tasks needed to close this issue.

ggerganov added the enhancement New feature or request label Jan 5, 2023

ggerganov mentioned this issue Mar 22, 2023

Request: word-level timestamps in transcribe #632

Closed

HaujetZhao mentioned this issue Apr 14, 2023

Word-level timestamp method --max-len 1 works bad for CJK language. #761

Open

anuejn mentioned this issue May 7, 2023

use whispers attention weights to generate initial text-timestamps bugbakery/transcribee#139

Open

bmurray mentioned this issue Jul 26, 2023

Improve Timestamp Accuracy #958

Open

streamer45 mentioned this issue Nov 10, 2023

[MM-54242] Improve timestamp accuracy mattermost/calls-transcriber#3

Merged

denersc mentioned this issue Nov 13, 2023

[DRAFT] Token level timestamps with DTW (#375) #1485

Merged

11 tasks

aiaimimi0920 mentioned this issue Jan 16, 2024

Missing some text V-Sekai/godot-whisper#38

Closed

achimmihca mentioned this issue Jul 28, 2024

Improve timestamp accuracy Macoron/whisper.unity#95

Open
