[IMPROVEMENT] [FIX] Improve the start and end timestamps of extracted burned in captions #962
In raising this pull request, I confirm the following (please check boxes):
My familiarity with the project is as follows (check one):
The start and end timestamps of extracted burned-in captions are inaccurate,
often by a wide margin. In addition, the first extracted caption always
starts at zero, which does not always match the video, and consecutive
captions are always given back-to-back timestamps, even when the video has gaps between them.
To reproduce this, download this file from the UK TV Samples in the samples repository:
https://drive.google.com/open?id=0B_61ywKPmI0TdlRWcVdnajVJUWs
Since that file is 15 minutes long, we can trim it down to the first 30 seconds for our purposes:
ffmpeg -i BBC1.mp4 -acodec copy -vcodec copy -scodec copy -ss 00:00:00 -t 00:00:30 bbc.mp4
This writes the first 30 seconds of BBC1.mp4 to bbc.mp4.
Now, before this commit, if I compile ccextractor with hard subs enabled and run the following command:
./ccextractor bbc.mp4 -hardsubx -sub_color yellow -conf_thresh 60
the generated bbc.srt (ignore the odd-looking characters; that is just the OCR not giving good output, I guess) is:
One can see that the timings are clearly flawed and off by a large margin.
Also, because of how the code is written, the first extracted caption always has
a start timestamp of 00:00:00. The extracted subtitles also always follow one another
directly in time (i.e., there is only a 2 ms gap between two consecutive captions),
making it look as if the video had hard subs on screen throughout.
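The two symptoms described above (a first caption that starts at zero, and a constant tiny gap between consecutive captions) can be checked mechanically on any generated .srt file. The following is a minimal sketch in Python, not part of ccextractor; the helper names srt_intervals and gaps_ms are made up for illustration, and it assumes the standard SRT timing line format HH:MM:SS,mmm --> HH:MM:SS,mmm.

```python
import re
from typing import List, Tuple

# Matches a standard SRT timing line, e.g. "00:00:01,000 --> 00:00:02,500".
TIMING = re.compile(
    r"(\d{2}):(\d{2}):(\d{2}),(\d{3})\s*-->\s*(\d{2}):(\d{2}):(\d{2}),(\d{3})"
)


def _to_ms(hours: str, minutes: str, seconds: str, millis: str) -> int:
    """Convert the four captured timestamp fields to milliseconds."""
    return ((int(hours) * 60 + int(minutes)) * 60 + int(seconds)) * 1000 + int(millis)


def srt_intervals(srt_text: str) -> List[Tuple[int, int]]:
    """Return (start_ms, end_ms) for every caption found in the SRT text."""
    intervals = []
    for match in TIMING.finditer(srt_text):
        fields = match.groups()
        intervals.append((_to_ms(*fields[:4]), _to_ms(*fields[4:])))
    return intervals


def gaps_ms(intervals: List[Tuple[int, int]]) -> List[int]:
    """Return the gap between each caption's end and the next caption's start."""
    return [nxt[0] - cur[1] for cur, nxt in zip(intervals, intervals[1:])]
```

If srt_intervals reports a first start of 0 and gaps_ms reports 2 for every pair of captions, the file shows the behaviour described above.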
This commit improves the start and end timestamps of the extracted
burned in captions and reduces the error significantly, bringing the
timestamps fairly close to the actual timings as they appear in the
media file.
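To illustrate the general idea of taking caption timings from the frames where the text is actually visible, here is a minimal sketch in Python. It is not the code in this commit (ccextractor is written in C), and it assumes a hypothetical input of per-frame OCR results as (timestamp_ms, text) pairs, with text set to None when no subtitle is detected. The names group_captions and format_srt_time are invented for this example.

```python
from typing import Iterable, List, Optional, Tuple

Caption = Tuple[int, int, str]  # (start_ms, end_ms, text)


def group_captions(frames: Iterable[Tuple[int, Optional[str]]]) -> List[Caption]:
    """Collapse per-frame OCR results into timed captions.

    `frames` yields (timestamp_ms, text) pairs in presentation order;
    text is None for frames without a detected subtitle (assumed input).
    """
    captions: List[Caption] = []
    current_text: Optional[str] = None
    start_ms = 0
    last_seen_ms = 0
    for ts_ms, text in frames:
        if text != current_text:
            # The on-screen text changed: close the caption that was open.
            if current_text is not None:
                captions.append((start_ms, last_seen_ms, current_text))
            current_text = text
            start_ms = ts_ms  # start at the first frame showing the new text
        if text is not None:
            last_seen_ms = ts_ms  # end at the last frame still showing it
    if current_text is not None:
        captions.append((start_ms, last_seen_ms, current_text))
    return captions


def format_srt_time(ms: int) -> str:
    """Format milliseconds as an SRT timestamp (HH:MM:SS,mmm)."""
    hours, rem = divmod(ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    seconds, millis = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d},{millis:03d}"
```

With this approach the first caption starts at the first frame where text is detected rather than at zero, and stretches of video without subtitles show up as gaps between captions.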
With these changes included, and running the command:
./ccextractor bbc.mp4 -hardsubx -sub_color yellow -conf_thresh 60
the generated bbc.srt is:
One can see that while the OCR output is the same, the timings have improved and
are closer to the actual timings in the media file.