[IMPROVEMENT] [FIX] Improve the start and end timestamps of extracted burned in captions #962

Merged
merged 1 commit into from
Mar 12, 2018

saurabhshah0410
Contributor

In raising this pull request, I confirm the following (please check boxes):

  • I have read and understood the contributors guide.
  • I have checked that another pull request for this purpose does not exist.
  • I have considered, and confirmed that this submission will be valuable to others.
  • I accept that this submission may not be used, and the pull request closed at the will of the maintainer.
  • I give this submission freely, and claim no ownership to its content.

My familiarity with the project is as follows (check one):

  • I have never used CCExtractor.
  • I have used CCExtractor just a couple of times.
  • I absolutely love CCExtractor, but have not contributed previously.
  • I am an active contributor to CCExtractor.

The start and end timestamps of extracted burned-in captions are flawed
and off by a large margin. Also, the start time of the first extracted
burned-in caption is always zero, which is not always correct. And the
extracted captions always appear with contiguous timestamps.

To see that, you can download this file from the UK TV Samples in the samples repository:
https://drive.google.com/open?id=0B_61ywKPmI0TdlRWcVdnajVJUWs

Since that file is 15 minutes long, we can trim it down to 30 seconds for our purposes:
ffmpeg -i BBC1.mp4 -acodec copy -vcodec copy -scodec copy -ss 00:00:00 -t 00:00:30 bbc.mp4

This writes the first 30 seconds of BBC1.mp4 to bbc.mp4.
Now, before this commit, if I compile ccextractor with hard subs enabled and run the following command:
./ccextractor bbc.mp4 -hardsubx -sub_color yellow -conf_thresh 60
the generated bbc.srt is (ignore the weird-looking characters; that is just the OCR not giving good output, I guess):

1
00:00:00,000 --> 00:00:08,500
Oh, no. No, lim tired.

2
00:00:08,502 --> 00:00:09,380
.‘7 -
Oh, no. No, I'm tired.

3
00:00:09,382 --> 00:00:14,380
Baby shower was rubbish.
I'm just going to go to bed.

4
00:00:14,382 --> 00:00:15,340
Are you OK? '_‘ -' _: 1:. . .. '... .

5
00:00:15,342 --> 00:00:16,500
Are you OK?

6
00:00:16,502 --> 00:00:18,340
Are you OK? ‘

7
00:00:18,342 --> 00:00:23,460
All right. Well, don't stay up
'too late. You've got a lot to do.

8
00:00:23,462 --> 00:00:28,380
I ..' Night-night.

9
00:00:28,382 --> 00:00:28,380
“Sir-ht :l 1 .

The timings are clearly flawed and off by a large margin.
Also, because of the way the code is written, the first extracted caption always has
a starting timestamp of 00:00:00. In addition, the extracted subtitles always follow one
another directly in time (i.e., there is only a 2 ms gap between two consecutive captions),
making it look as if the whole video had hard subs throughout.
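
One way to avoid both problems is to derive each caption's interval from the frames in which its text was actually detected. The following is a minimal sketch of that idea, not the code from this commit: `frame_sample`, `caption`, and `build_captions` are hypothetical names used only for illustration, and the sketch assumes the OCR results arrive as per-frame samples in time order, with NULL text when no subtitle is on screen.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical per-frame OCR sample; text is NULL when no subtitle is detected. */
struct frame_sample {
    long time_ms;
    const char *text;
};

/* Hypothetical caption interval built from the samples. */
struct caption {
    long start_ms;
    long end_ms;
    const char *text;
};

/* Returns 1 when both samples carry the same text (two NULLs count as the same). */
static int same_text(const char *a, const char *b)
{
    if (a == NULL || b == NULL)
        return a == b;
    return strcmp(a, b) == 0;
}

/*
 * Walk the samples in time order. A caption starts at the first frame where
 * its text is seen and ends at the last frame where it is still seen, so
 * stretches without text remain gaps and the first caption does not have to
 * start at 0. Returns the number of captions written to out.
 */
static int build_captions(const struct frame_sample *s, int n,
                          struct caption *out, int max_out)
{
    assert(s != NULL && out != NULL);
    int count = 0;
    const char *cur = NULL;
    for (int i = 0; i < n; i++) {
        if (!same_text(s[i].text, cur)) {
            cur = s[i].text;
            if (cur != NULL) {
                if (count == max_out)
                    break;
                out[count].start_ms = s[i].time_ms;
                out[count].end_ms = s[i].time_ms;
                out[count].text = cur;
                count++;
            }
        } else if (cur != NULL) {
            /* Same text as the open caption: extend its end time. */
            out[count - 1].end_ms = s[i].time_ms;
        }
    }
    return count;
}
```

With samples where the text first appears at 1000 ms, the first caption in this sketch starts at 1000 ms rather than at 0, and a frame without text between two captions leaves a gap instead of a forced 2 ms step.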

This commit improves the start and end timestamps of the extracted
burned in captions and reduces the error significantly, bringing the
timestamps fairly close to the actual timings as they appear in the
media file.

With these changes included, running the same command:
./ccextractor bbc.mp4 -hardsubx -sub_color yellow -conf_thresh 60
generates the following bbc.srt:

1
00:00:07,342 --> 00:00:08,500
.‘7 -
Oh, no. No, I'm tired.

2
00:00:09,382 --> 00:00:10,300
Baby shower was rubbish.
I'm just going to go to bed.

3
00:00:14,382 --> 00:00:15,340
Are you OK?

4
00:00:15,342 --> 00:00:16,500
Are you OK? ‘

5
00:00:17,382 --> 00:00:22,380
All right. Well, don't stay up*
too late. You've got a lot to do.

6
00:00:22,382 --> 00:00:23,460
I ..' Night-night.

One can see that while the OCR output is the same, the timings have improved and
are closer to the actual timings in the media file.

if(subtitle_text) {
    char *double_enter = strstr(subtitle_text,"\n\n");
    if(double_enter!=NULL)
        *(double_enter)='\0';
Contributor

Are you sure this is correct? It would terminate the string on the first one, not the second one, so it removes both \n.

Contributor Author

I didn't add this line. It was already in the source code before I created this PR.
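
To make the behavior under discussion concrete, here is a small standalone sketch. `truncate_at_blank_line` mirrors the reviewed snippet; `truncate_keep_newline` is a hypothetical alternative shown only for comparison and is not part of CCExtractor.

```c
#include <assert.h>
#include <string.h>

/*
 * Cut the text at the first blank line ("\n\n"), as the reviewed snippet does.
 * Writing '\0' over the first '\n' drops both newlines and everything after.
 */
static void truncate_at_blank_line(char *text)
{
    assert(text != NULL);
    char *double_enter = strstr(text, "\n\n");
    if (double_enter != NULL)
        *double_enter = '\0';
}

/*
 * Hypothetical alternative: terminate on the second '\n' instead, which
 * keeps one trailing newline.
 */
static void truncate_keep_newline(char *text)
{
    assert(text != NULL);
    char *double_enter = strstr(text, "\n\n");
    if (double_enter != NULL)
        *(double_enter + 1) = '\0';
}
```

Given "Are you OK?\n\nnoise", the first function leaves "Are you OK?" and the second leaves "Are you OK?\n". Which of the two is wanted depends on how the caller writes the text out afterwards.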

cfsmp3 merged commit 86356ba into CCExtractor:master on Mar 12, 2018
saurabhshah0410 deleted the improvement branch on March 16, 2018 18:52