CSV format export trims spaces #437

Closed
alex-bacart opened this issue Jan 23, 2023 · 3 comments · Fixed by #444

@alex-bacart
Contributor

The CSV export trims leading spaces, and that is a problem. The vtt and srt formats don't do this.

Command I use to transcribe audio file: ./main --model ./models/ggml-large.bin --file audio.wav --output-csv --max-len 1

The comment on this line https://github.com/ggerganov/whisper.cpp/pull/340/files#diff-2d3599a9fad195f2c3c60bd06691bc1815325b3560b5feda41a91fa71194e805R344 says that every time we get a space we should remove it. That is not correct when a word is split into several chunks. An example of such a split:

8630, 9070, "greatest"
9070, 9230, "Pon"
9230, 9340, "zi"
9340, 9670, "scheme"
9670, 9780, "in"
9780, 10050, "human"
10050, 10480, "history"

Ponzi is a single word.
Here is the same part of transcription using srt format.

31
00:00:08,630 --> 00:00:09,070
 greatest

32
00:00:09,070 --> 00:00:09,230
 Pon

33
00:00:09,230 --> 00:00:09,340
zi

34
00:00:09,340 --> 00:00:09,670
 scheme

35
00:00:09,670 --> 00:00:09,780
 in

36
00:00:09,780 --> 00:00:10,050
 human

37
00:00:10,050 --> 00:00:10,480
 history

Every word/chunk except "zi" has a space before it, so the chunks can be glued back into correct sentences. Unfortunately, the csv format doesn't allow this, because the leading spaces are gone.
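
To make the difference concrete, here is a minimal Python sketch of the gluing step. It is only an illustration: `join_chunks` is a hypothetical helper, not part of whisper.cpp, and the chunk lists are typed in by hand from the srt and csv outputs shown above.

```python
def join_chunks(chunks):
    """Concatenate chunk texts verbatim; a leading space marks a new word."""
    return "".join(chunks).strip()


# Chunk texts as they appear in the srt output (leading spaces preserved).
srt_chunks = [" greatest", " Pon", "zi", " scheme", " in", " human", " history"]

# The same chunks as they appear in the csv output (leading spaces trimmed).
csv_chunks = [c.lstrip() for c in srt_chunks]

# With the spaces preserved, plain concatenation restores the sentence.
sentence_from_srt = join_chunks(srt_chunks)

# With the spaces trimmed, there is no way to tell where words begin:
# plain concatenation fuses everything, and joining with " " splits "Ponzi".
sentence_from_csv = join_chunks(csv_chunks)
naive_from_csv = " ".join(csv_chunks)

print(sentence_from_srt)
print(sentence_from_csv)
print(naive_from_csv)
```

The first line printed is the intended sentence. The other two show that neither obvious way of joining the trimmed csv chunks recovers it.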

This issue is a follow-up to #340
cc @NielsMayer

@alex-bacart
Contributor Author

@ggerganov BTW, is it OK that whisper.cpp splits "Ponzi" into "Pon" and "zi"? Using --max-len 1, I get tons of such chunks in the transcription results.

@alex-bacart
Contributor Author

I'm not a C++ developer, but here is my attempt at a fix: #444. Please take a look.

@ggerganov
Owner

@alex-bacart
The --max-len 1 option means that at most 1 token is output per text segment.
The word " Ponzi" consists of 2 tokens, Pon and zi, and is therefore split.
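
If word-level rather than token-level segments are needed, one possible post-processing step is to merge every segment whose text does not start with a space into the previous one. The sketch below assumes the leading spaces are preserved in the output and that a leading space reliably marks a word boundary (true for the English example above, not necessarily for other languages). `merge_tokens_into_words` is a hypothetical helper, not a whisper.cpp function.

```python
def merge_tokens_into_words(segments):
    """Merge token segments into word segments.

    segments: list of (start_ms, end_ms, text) tuples, with leading
    spaces in text preserved. A segment whose text does not start with
    a space is treated as a continuation of the previous word.
    """
    words = []
    for start, end, text in segments:
        if words and not text.startswith(" "):
            # Continuation: extend the previous word's end time and text.
            prev_start, _, prev_text = words[-1]
            words[-1] = (prev_start, end, prev_text + text)
        else:
            words.append((start, end, text))
    # Strip the boundary-marking spaces only after merging is done.
    return [(s, e, t.strip()) for s, e, t in words]


segments = [
    (8630, 9070, " greatest"),
    (9070, 9230, " Pon"),
    (9230, 9340, "zi"),
    (9340, 9670, " scheme"),
]

words = merge_tokens_into_words(segments)
for start, end, text in words:
    print(f'{start}, {end}, "{text}"')
```

Here "Pon" and "zi" collapse into a single "Ponzi" entry spanning 9070 to 9340 ms. Punctuation tokens without a leading space would be attached to the preceding word in the same way.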
