[BUG] DVB Teletext subtitle incomplete #922
Comments
I can confirm this happens and I've looked into it a bit but I'm leaving the solution for GSoC students. Some pointers though.
Processing row: 22 13 | 20 |
0/B Start Box ("Set-After")
So you can see that everything to the left of the right-most '0xb' is going to be ignored.
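The behaviour described above can be illustrated with a small sketch. This is not CCExtractor's actual code; the function name and the sample row are hypothetical, and the sketch only models "search for the right-most 0xb and drop everything to its left".

```python
START_BOX = 0x0B  # teletext spacing attribute 0/B ("Start Box")


def visible_after_rightmost_start_box(row):
    """Hypothetical model of the reported behaviour.

    Scans the row from right to left and returns only the cells that
    follow the right-most 0x0B. Returns an empty list if the row has
    no start box at all.
    """
    for col in range(len(row) - 1, -1, -1):
        if row[col] == START_BOX:
            return row[col + 1:]
    return []


# Two boxed fragments on one row: "A" and "B".
row = [0x0B, 0x0B, 0x41, 0x0A, 0x0B, 0x0B, 0x42]
remaining = visible_after_rightmost_start_box(row)
print([hex(c) for c in remaining])
```

With a row that contains two boxed fragments, only the second one survives, which matches the symptom of the first part of a subtitle going missing.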
Hi @mkver, thank you for the additional samples. The test results are as follows. Sample 1: characters in bold are bounded by a start box (denoted by double 0/B characters) but are ignored by CCExtractor. According to the ETS 300 706 documentation, the introduction of 0/A cancels the effect of 0/B. Hence, I replace both the 0/A and 0/B characters if 0/A is found within the start box. The correct output for this row should be: man dazu noch? "Letzter Wille."
Characters in bold are bounded by a start box (denoted by double 0/B characters). Here 0/A is not found within the start box area, hence it is intended that only | 50 | 61 | 70 | 61 | 21 | 0a | 0a | 20 | 20 | 20 would be extracted. The correct output for this row would be: Papa!
Characters in bold are bounded by a start box (denoted by double 0/B characters). Here 0/A is not found within the start box area, hence it is intended that only | 48 | 69 | 21 | 0a | 0a | 20 | 20 | 20 | 20 | 20 | 20 | 20 | 20 would be extracted. The correct output for this row would be: Hi!
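As a quick check of the two byte dumps above, the following sketch converts the pipe-separated hex cells into text. The helper name is hypothetical, and it assumes the bytes already have their parity bit stripped and map directly to ASCII; a real decoder would apply the teletext G0 set with the national option subset, which differs from ASCII for a few codes.

```python
def decode_cells(dump):
    """Decode a '| 50 | 61 | ...' hex dump into text (illustrative helper).

    Assumption: plain ASCII mapping. Codes below 0x20 are spacing
    attributes (such as 0x0a) and are rendered as a space.
    """
    codes = [int(cell.strip(), 16) for cell in dump.split("|") if cell.strip()]
    text = "".join(chr(c) if c >= 0x20 else " " for c in codes)
    return text.strip()


print(decode_cells("| 50 | 61 | 70 | 61 | 21 | 0a | 0a | 20 | 20 | 20"))
print(decode_cells("| 48 | 69 | 21 | 0a | 0a | 20 | 20 | 20 | 20 | 20 | 20 | 20 | 20"))
```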
The correct output for this row should be: Was? Sei nicht so naiv!
Here 0/A is not found within the start box area, hence it is intended that it would be ignored.
I also noticed that even though captions are present in the output SRT file, the extraction process for samples 1, 3, 4, 5 and 6 reports "No captions were found in input." Is this a possible bug? @cfsmp3
Once again, thank you @BPYap!
Thank you @mkver for your response 😄 I will research deeper into startbox and endbox behavior, hopefully I can come up with a solution tomorrow 😄
Yes, this agrees with libzvbi.
The relevant part of the standard is section 12.2. It contains two important pieces of information: "Unless operating in "Hold Mosaics" mode, each character space occupied by a spacing attribute is displayed as a SPACE." So your approach to replace 0xA and 0xB with spaces is what the standard says.
The first two interpretations directly imply that one should not require the presence of 0xA; the third one does not speak against it, but suggests that one should rather check whether not requiring 0xA has adverse effects on other samples. If there are no such samples, not requiring 0xA would be wise. If there are, the best thing would be an option that lets users choose for themselves.
I agree that the subtitle box should be implicitly closed when another 0xB is encountered. I modified the loop so that it scans from the first column until it encounters the first 0xB character (marking this index as the starting index for further processing), then replaces all subsequent 0xB and 0xA characters (except the last 0xA) with 0x20. Further testing on samples 1 to 7 produced results that match what libzvbi produces.
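The modified loop described above might look roughly like the sketch below. This is one reading of the description, not the actual patch: the function name is hypothetical, treating the last 0xA as the end of the extracted text is an assumption, and collapsing runs of spaces in the result is a convenience added here.

```python
START_BOX, END_BOX, SPACE = 0x0B, 0x0A, 0x20


def extract_boxed_text(row):
    """Sketch of the described approach (hypothetical helper).

    1. Scan from the first column to the first 0x0B; that is the start.
    2. Replace every later 0x0B / 0x0A with a space, except the last
       0x0A, which is assumed here to mark where the text ends.
    """
    if START_BOX not in row:
        return ""
    start = row.index(START_BOX)
    cells = list(row)

    # Locate the last End Box after the start, if there is one.
    last_end = None
    for col in range(len(cells) - 1, start, -1):
        if cells[col] == END_BOX:
            last_end = col
            break

    # Blank out all box attributes except that last End Box.
    for col in range(start, len(cells)):
        if cells[col] in (START_BOX, END_BOX) and col != last_end:
            cells[col] = SPACE

    stop = last_end if last_end is not None else len(cells)
    # Any remaining spacing attribute (< 0x20) is shown as a space.
    text = "".join(chr(c) if c >= 0x20 else " " for c in cells[start:stop])
    return " ".join(text.split())  # assumption: collapse repeated spaces


# Row shaped like sample 1: two boxed fragments on the same row.
row = ([START_BOX, START_BOX] + list(b"man dazu noch?") + [END_BOX]
       + [START_BOX, START_BOX] + list(b'"Letzter Wille."') + [END_BOX, END_BOX])
print(extract_boxed_text(row))
```

Because the scan starts at the first 0xB rather than the right-most one, both fragments of a row shaped like sample 1 are kept.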
CCExtractor version (using the --version parameter preferably) : 0.86 (Git commit 5fa8339)
In raising this issue, I confirm the following (please check boxes, eg [X] - and delete unchecked ones):
My familiarity with the project is as follows (check one, eg [X] - and delete unchecked ones):
Necessary information
ccextractorwin.exe --gui_mode_reports -autoprogram -out=srt -bom -utf8 -tpage 150
Link to the file
Additional information
When I extract the teletext from page 150 of the sample file (that actually contains another subtitle at page 799, but that is irrelevant), the subtitles are incomplete:
The second subtitle is incomplete. Here is what the libzvbi teletext decoder inside ffmpeg produces:
It unfortunately doesn't emit colours; but it includes the "geeignet."
FYI VLC also detects the "geeignet." -- and it can show the colours: