[IMPROVEMENT] [FIX] Improve the start and end timestamps of extracted burned in captions #962

saurabhshah0410 · 2018-03-12T00:29:06Z

In raising this pull request, I confirm the following (please check boxes):

I have read and understood the contributors guide.
I have checked that another pull request for this purpose does not exist.
I have considered, and confirmed that this submission will be valuable to others.
I accept that this submission may not be used, and the pull request closed at the will of the maintainer.
I give this submission freely, and claim no ownership to its content.

My familiarity with the project is as follows (check one):

I have never used CCExtractor.
I have used CCExtractor just a couple of times.
I absolutely love CCExtractor, but have not contributed previously.
I am an active contributor to CCExtractor.

The start and end timestamps of extracted burned-in captions are inaccurate,
often by several seconds. In addition, the start time of the first extracted
caption is always zero, even when no caption is on screen at that point, and
consecutive captions are always emitted back to back, with no gap between them.

To see that, you can download this file from the UK TV Samples in the samples repository:
https://drive.google.com/open?id=0B_61ywKPmI0TdlRWcVdnajVJUWs

Since the duration of that file is 15 minutes, we can trim it down to 30 seconds for our purposes:
ffmpeg -i BBC1.mp4 -acodec copy -vcodec copy -scodec copy -ss 00:00:00 -t 00:00:30 bbc.mp4

This writes the first 30 seconds of BBC1.mp4 to bbc.mp4.
Before this commit, compiling ccextractor with hard subs enabled and running
./ccextractor bbc.mp4 -hardsubx -sub_color yellow -conf_thresh 60
generates the following bbc.srt (ignore the stray characters; they are most likely just poor OCR output):

1
00:00:00,000 --> 00:00:08,500
Oh, no. No, lim tired.

2
00:00:08,502 --> 00:00:09,380
.‘7 -
Oh, no. No, I'm tired.

3
00:00:09,382 --> 00:00:14,380
Baby shower was rubbish.
I'm just going to go to bed.

4
00:00:14,382 --> 00:00:15,340
Are you OK? '_‘ -' _: 1:. . .. '... .

5
00:00:15,342 --> 00:00:16,500
Are you OK?

6
00:00:16,502 --> 00:00:18,340
Are you OK? ‘

7
00:00:18,342 --> 00:00:23,460
All right. Well, don't stay up
'too late. You've got a lot to do.

8
00:00:23,462 --> 00:00:28,380
I ..' Night-night.

9
00:00:28,382 --> 00:00:28,380
“Sir-ht :l 1 .

The timings are clearly wrong, in places by several seconds. Because of the way
the code is written, the first extracted caption always starts at 00:00:00.
The extracted subtitles also always follow one another directly (there is only
a 2 ms gap between consecutive captions), which makes it look as if the video
had hard subs on screen throughout.
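
The 2 ms figure can be checked directly from the listing above. The following is a minimal sketch in C and is not part of CCExtractor; srt_time_to_ms and cue_gap_ms are hypothetical helper names introduced here only for illustration.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: parse an SRT timestamp of the form "HH:MM:SS,mmm"
   into milliseconds. Returns -1 if the string does not match that layout. */
long srt_time_to_ms(const char *ts)
{
    int h, m, s, ms;

    assert(ts != NULL);
    if (sscanf(ts, "%d:%d:%d,%d", &h, &m, &s, &ms) != 4)
        return -1;
    return ((h * 60L + m) * 60L + s) * 1000L + ms;
}

/* Hypothetical helper: gap in milliseconds between the end of one cue
   and the start of the next one. */
long cue_gap_ms(const char *prev_end, const char *next_start)
{
    return srt_time_to_ms(next_start) - srt_time_to_ms(prev_end);
}
```

For cues 1 and 2 above, cue_gap_ms("00:00:08,500", "00:00:08,502") evaluates to 2. For cue 9, subtracting the start (00:00:28,382) from the end (00:00:28,380) gives -2 ms, so that cue ends before it starts.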

This commit improves the start and end timestamps of the extracted
burned in captions and reduces the error significantly, bringing the
timestamps fairly close to the actual timings as they appear in the
media file.
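
As a rough illustration of the general idea, and explicitly not the code merged in this pull request, the sketch below derives cue timings from per-frame OCR results: a cue starts at the first frame that shows a given text and ends at the last frame that shows it, so stretches without captions remain as gaps. The types frame_text and cue and the function build_cues are hypothetical names, and the per-frame input is an assumption about how the data might be laid out.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical per-frame OCR result: frame time and recognized text
   (an empty string means no caption was detected in that frame). */
struct frame_text {
    long time_ms;
    const char *text;
};

/* Hypothetical output cue. */
struct cue {
    long start_ms;
    long end_ms;
    const char *text;
};

/* Group consecutive frames with identical, non-empty text into cues.
   Returns the number of cues written (at most max_cues). */
size_t build_cues(const struct frame_text *frames, size_t n_frames,
                  struct cue *cues, size_t max_cues)
{
    size_t n_cues = 0;
    int open = 0; /* 1 while the most recent cue is still being extended */

    assert(frames != NULL && cues != NULL);
    for (size_t i = 0; i < n_frames; i++) {
        const char *t = frames[i].text;
        int has_text = (t != NULL && t[0] != '\0');

        if (open && has_text && strcmp(t, cues[n_cues - 1].text) == 0) {
            /* Same caption still on screen: extend the current cue. */
            cues[n_cues - 1].end_ms = frames[i].time_ms;
            continue;
        }
        open = 0; /* caption changed or disappeared: close the current cue */
        if (has_text && n_cues < max_cues) {
            /* New caption: its start is this frame, not the previous end. */
            cues[n_cues].start_ms = frames[i].time_ms;
            cues[n_cues].end_ms = frames[i].time_ms;
            cues[n_cues].text = t;
            n_cues++;
            open = 1;
        }
    }
    return n_cues;
}
```

With this grouping, the first cue starts at the first frame in which text is detected rather than at zero, and a frame with no text closes the current cue instead of stretching it to the next caption.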

With these changes included, running the same command
./ccextractor bbc.mp4 -hardsubx -sub_color yellow -conf_thresh 60
generates the following bbc.srt:

1
00:00:07,342 --> 00:00:08,500
.‘7 -
Oh, no. No, I'm tired.

2
00:00:09,382 --> 00:00:10,300
Baby shower was rubbish.
I'm just going to go to bed.

3
00:00:14,382 --> 00:00:15,340
Are you OK?

4
00:00:15,342 --> 00:00:16,500
Are you OK? ‘

5
00:00:17,382 --> 00:00:22,380
All right. Well, don't stay up*
too late. You've got a lot to do.

6
00:00:22,382 --> 00:00:23,460
I ..' Night-night.

While the OCR text is largely the same, the timings have improved and
are closer to the actual timings in the media file.


cfsmp3 · 2018-03-12T18:18:13Z

src/lib_ccx/hardsubx_decoder.c

+				if(subtitle_text) {
+					char *double_enter = strstr(subtitle_text,"\n\n");
+					if(double_enter!=NULL)
+						*(double_enter)='\0';


Are you sure this is correct? It terminates the string at the first \n, not the second one, so it removes both \n characters.

saurabhshah0410 · 2018-03-13

I didn't add this line. It was already in the source code before I created this PR.
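
The behaviour raised in the review can be reproduced in isolation. The sketch below is a standalone illustration and is not taken from hardsubx_decoder.c beyond the strstr pattern quoted in the hunk; the function names truncate_at_blank_line and truncate_keep_newline are hypothetical, and the second function is only one possible variant, not a proposed fix.

```c
#include <assert.h>
#include <string.h>

/* Cut the string at the first blank line ("\n\n"), as the quoted hunk does.
   Writing '\0' over the first '\n' drops both newline characters and
   everything after them. Returns s for convenience. */
char *truncate_at_blank_line(char *s)
{
    char *double_enter;

    assert(s != NULL);
    double_enter = strstr(s, "\n\n");
    if (double_enter != NULL)
        *double_enter = '\0';
    return s;
}

/* Hypothetical variant: terminate at the second '\n' instead, so that
   one trailing newline is kept. */
char *truncate_keep_newline(char *s)
{
    char *double_enter;

    assert(s != NULL);
    double_enter = strstr(s, "\n\n");
    if (double_enter != NULL)
        double_enter[1] = '\0';
    return s;
}
```

Both functions modify the buffer in place, so they must be given a writable array rather than a string literal. A string that contains no "\n\n" is left unchanged by either one.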

cfsmp3 reviewed Mar 12, 2018


cfsmp3 merged commit 86356ba into CCExtractor:master Mar 12, 2018

saurabhshah0410 deleted the improvement branch March 16, 2018 18:52

saurabhshah0410 commented Mar 12, 2018

cfsmp3 Mar 12, 2018

saurabhshah0410 Mar 13, 2018