CSV format export trims spaces #437

alex-bacart · 2023-01-23T21:18:43Z

The CSV export trims leading spaces, which is a problem. The vtt and srt formats don't do this.

The command I use to transcribe the audio file: ./main --model ./models/ggml-large.bin --file audio.wav --output-csv --max-len 1

The comment on this line https://github.com/ggerganov/whisper.cpp/pull/340/files#diff-2d3599a9fad195f2c3c60bd06691bc1815325b3560b5feda41a91fa71194e805R344 says that every time we get a space we should remove it. That is not correct when a word is divided into chunks. An example of such a division:

8630, 9070, "greatest"
9070, 9230, "Pon"
9230, 9340, "zi"
9340, 9670, "scheme"
9670, 9780, "in"
9780, 10050, "human"
10050, 10480, "history"

Ponzi is a single word.
Here is the same part of the transcription in srt format:

31
00:00:08,630 --> 00:00:09,070
 greatest

32
00:00:09,070 --> 00:00:09,230
 Pon

33
00:00:09,230 --> 00:00:09,340
zi

34
00:00:09,340 --> 00:00:09,670
 scheme

35
00:00:09,670 --> 00:00:09,780
 in

36
00:00:09,780 --> 00:00:10,050
 human

37
00:00:10,050 --> 00:00:10,480
 history

Every word/chunk except "zi" has a space before it, so the chunks can be glued back into correct sentences. Unfortunately, the csv format doesn't allow this.
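To illustrate the difference, here is a minimal Python sketch that uses the chunk texts from the srt output above. The helper name glue_chunks is hypothetical and not part of whisper.cpp; the "trimmed" list only simulates what the CSV export currently produces.

```python
def glue_chunks(chunks):
    """Concatenate chunks verbatim; a leading space marks the start of a new word."""
    return "".join(chunks).strip()


# Chunk texts as they appear in the srt output (leading spaces kept).
srt_chunks = [" greatest", " Pon", "zi", " scheme", " in", " human", " history"]

# Simulation of the trimmed CSV export: the word-boundary information is gone.
csv_chunks = [c.strip() for c in srt_chunks]

with_spaces = glue_chunks(srt_chunks)
without_spaces = " ".join(csv_chunks)

print(with_spaces)     # greatest Ponzi scheme in human history
print(without_spaces)  # greatest Pon zi scheme in human history
```

With the spaces preserved, plain concatenation restores "Ponzi"; once they are trimmed, there is no way to tell "Pon" + "zi" from two separate words.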

This issue follows from #340.
cc @NielsMayer

alex-bacart · 2023-01-23T21:34:14Z

@ggerganov BTW, is it OK that whisper.cpp divides "Ponzi" into "Pon" and "zi"? Using --max-len 1, I get tons of such chunks in the transcription results.

alex-bacart · 2023-01-24T17:26:02Z

I'm not a C++ developer, but here is my attempt at a fix: #444. Please take a look.

ggerganov · 2023-02-04T06:48:13Z

@alex-bacart
--max-len 1 means that at most 1 token is output per text segment.
The word " Ponzi" consists of 2 tokens, "Pon" and "zi", and is therefore split.
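For anyone who wants word-level timestamps from token-level segments, a possible post-processing step is sketched below in Python. It assumes the leading spaces are preserved in the exported text and that rows are already parsed into (start_ms, end_ms, text) tuples; merge_tokens is a hypothetical helper, not part of whisper.cpp. Tokens without a leading space (including punctuation) are attached to the preceding word, which may not suit languages that do not separate words with spaces.

```python
def merge_tokens(rows):
    """Merge token-level rows into word-level rows.

    rows: iterable of (start_ms, end_ms, text); a leading space in text
    is assumed to mark the start of a new word.
    """
    words = []
    for start, end, text in rows:
        if text.startswith(" ") or not words:
            # New word: open a fresh entry without the marker space.
            words.append([start, end, text.strip()])
        else:
            # Continuation token: extend the end time and append the text.
            words[-1][1] = end
            words[-1][2] += text
    return [tuple(w) for w in words]


# Timestamps taken from the example in this issue, with spaces preserved.
rows = [
    (8630, 9070, " greatest"),
    (9070, 9230, " Pon"),
    (9230, 9340, "zi"),
    (9340, 9670, " scheme"),
]

merged = merge_tokens(rows)
for start, end, word in merged:
    print(start, end, word)
```

In this example the rows for " Pon" and "zi" collapse into a single entry for "Ponzi" spanning 9070 to 9340.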

alex-bacart mentioned this issue Jan 24, 2023

CSV format export trimmed spaces fix #444

Merged

ggerganov closed this as completed in #444 Feb 4, 2023
