[BUG] DVB Teletext subtitle incomplete #922

mkver · 2018-01-28T03:25:44Z

CCExtractor version (using the --version parameter preferably) : 0.86 (Git commit 5fa8339)

In raising this issue, I confirm the following (please check boxes, eg [X] - and delete unchecked ones):

I have read and understood the contributors guide.
I have checked that the bug-fix I am reporting can be replicated, or that the feature I am suggesting isn't already present.
I have checked that the issue I'm posting isn't already reported.
I have checked that the issue I'm posting isn't already solved and that no duplicates exist in closed or open issues.
I have checked the pull requests tab for existing solutions/implementations to my issue/suggestion.
I have used the latest available version of CCExtractor to verify this issue exists.

My familiarity with the project is as follows (check one, eg [X] - and delete unchecked ones):

I absolutely love CCExtractor, but have not contributed previously.

Necessary information

Is this a regression (did it work before)? [x] NO | [ ] YES
What platform did you use? [x] Windows - [ ] Linux - [ ] Mac
What were the used arguments? ccextractorwin.exe --gui_mode_reports -autoprogram -out=srt -bom -utf8 -tpage 150

Link to the file
Additional information

When I extract the teletext from page 150 of the sample file (which also contains another subtitle on page 799, but that is irrelevant here), the subtitles are incomplete:

1
00:00:00,200 --> 00:00:04,559
<font color="#00ffff">Sie können ihn nicht von der Schule</font>
<font color="#00ffff">schmeißen! Schulpflicht!</font>

2
00:00:05,080 --> 00:00:09,599
Für diese Klassenstufe ist er nicht
<font color="#00ffff">Stufen Sie ihn zurück!</font>

The second subtitle is incomplete. Here is what the libzvbi teletext decoder inside ffmpeg produces:

1
00:00:00,240 --> 00:00:04,640
Sie können ihn nicht von der Schule
schmeißen! Schulpflicht!

2
00:00:05,120 --> 00:00:09,680
Für diese Klassenstufe ist er nicht
geeignet.     Stufen Sie ihn zurück!

Unfortunately it doesn't emit colours, but it includes the "geeignet."
FYI, VLC also detects the "geeignet." and can show the colours.


cfsmp3 · 2018-01-29T20:33:09Z

I can confirm this happens and I've looked into it a bit but I'm leaving the solution for GSoC students. Some pointers though.

The missing word is present in the stream and processed by our teletext decoder.
The row (number 22) bytes look like this:

Processing row: 22
0 | D |
01 | B | �
02 | B | �
03 | 67 | g
04 | 65 | e
05 | 65 | e
06 | 69 | i
07 | 67 | g
08 | 6E | n
09 | 65 | e
10 | 74 | t
11 | 2E | .
12 | A |
13 | 20 |
14 | 6 | �
15 | B | �
16 | B | �
17 | 53 | S
18 | 74 | t
19 | 75 | u
20 | 66 | f
21 | 65 | e
22 | 6E | n
23 | 20 |
24 | 53 | S
25 | 69 | i
26 | 65 | e
27 | 20 |
28 | 69 | i
29 | 68 | h
30 | 6E | n
31 | 20 |
32 | 7A | z
33 | 75 | u
34 | 72 | r
35 | FC | ⁿ
36 | 63 | c
37 | 6B | k
38 | 21 | !
39 | A |

The relevant teletext document is ETS 300 706, which is a public standard. First result in Google.
For the character 0xB (which you can see above between the first two words in the row), it says:

0/B Start Box ("Set-After")
On pages with the C5 or C6 bits set (Newsflash or subtitle), this code defines (on
each appropriate row) the start of an area that is to be boxed into the normal video
picture. Characters outside this area are not displayed, but changes in display
mode, colour, height etc., will affect the boxed area. Cancelled by an End Box
code (0/A) or by the start of a new row.
NOTE: Protection against false operation is provided by double transmission
of Start Box control characters, with the action taking place between
them.

We have this loop to determine where a row has data:

		uint8_t col_start = 40;
		uint8_t col_stop = 40;

		for (int8_t col = 39; col >= 0; col--)
		{
			if (page->text[row][col] == 0xb)
			{
				col_start = col;
				line_count++;
				break;
			}
		}
		// line is empty
		if (col_start > 39)
			continue;

So you can see that everything to the left of the right-most '0xb' is going to be ignored.
Also note that if we just comment out that loop we'll always have col_start==40 so the line will be considered to be empty. Removing the loop is not a good solution.
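To make the effect concrete, here is a minimal sketch that runs the same right-to-left scan over the row 22 bytes from the dump above. The helper names `rightmost_start_box` and `dropped_chars` are made up for this illustration and are not CCExtractor functions.

```c
#include <stdint.h>

#define TTX_COLS 40

/* Row 22 exactly as listed in the dump above (columns 0 to 39). */
const uint8_t row22[TTX_COLS] = {
    0x0D, 0x0B, 0x0B, 0x67, 0x65, 0x65, 0x69, 0x67, 0x6E, 0x65,
    0x74, 0x2E, 0x0A, 0x20, 0x06, 0x0B, 0x0B, 0x53, 0x74, 0x75,
    0x66, 0x65, 0x6E, 0x20, 0x53, 0x69, 0x65, 0x20, 0x69, 0x68,
    0x6E, 0x20, 0x7A, 0x75, 0x72, 0xFC, 0x63, 0x6B, 0x21, 0x0A
};

/* Same right-to-left scan as the quoted loop: returns the column of the
 * right-most 0x0B, or 40 when the row contains none. */
uint8_t rightmost_start_box(const uint8_t *row)
{
    for (int8_t col = TTX_COLS - 1; col >= 0; col--) {
        if (row[col] == 0x0B)
            return (uint8_t)col;
    }
    return TTX_COLS;
}

/* Counts printable characters (above 0x20) that lie left of col_start and
 * are therefore skipped when processing begins at col_start. */
int dropped_chars(const uint8_t *row, uint8_t col_start)
{
    int dropped = 0;
    for (uint8_t col = 0; col < col_start && col < TTX_COLS; col++) {
        if (row[col] > 0x20)
            dropped++;
    }
    return dropped;
}
```

For this row, `rightmost_start_box(row22)` returns 16, and `dropped_chars(row22, 16)` returns 9, which is exactly the length of the missing "geeignet." in columns 3 to 11.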

mkver · 2018-02-12T03:15:32Z

Thanks to @BPYap and @cfsmp3 for the time you have put into this. I thought it worthwhile to upload a few more samples showing this behaviour. They are here. (I have stripped all the audio and video tracks away to save space.)

BPYap · 2018-02-12T08:04:44Z

Hi @mkver, thank you for the additional samples.
I have modified my original pull request to replace every 0xA that appears before a 0xB with 0x20, instead of replacing only one 0xA character.

The test results are as follows:

Sample 1
The row which the extractor ignored in this sample is row 22 shown below:
0d | 02 | 0b | 0b | 6d | 61 | 6e | 20 | 64 | 61 | 7a | 75 | 20 | 6e | 6f | 63 | 68 | 3f | 0a | 0a |
05 | 0b | 0b | 22 | 4c | 65 | 74 | 7a | 74 | 65 | 72 | 20 | 57 | 69 | 6c | 6c | 65 | 2e | 22 | 0a

The characters in bold are enclosed by a start box (denoted by double 0/B characters) but are ignored by CCExtractor. According to ETS 300 706, a 0/A cancels the effect of 0/B. Hence, I replace both the 0/A and 0/B characters if 0/A is found within the start box.

The correct output for this row should be man dazu noch? "Letzter Wille."

original version output (extracted without any argument):
fixed version output (extracted without any argument):

===================================================================
Sample 2
Nothing seems to be detected when I run ccextractor, with or without the -teletext option, for both the original and the fixed version.

===================================================================
Sample 3
The row concerned is row 22 as shown below:
20 | 20 | 20 | 20 | 20 | 20 | 20 | 20 | 20 | 20 | 20 | 20 | 20 | 20 | 0d | 06 | 0b | 0b | 50 | 61 |
70 | 61 | 21 | 20 | 20 | 20 | 20 | 07 | 0b | 0b | 50 | 61 | 70 | 61 | 21 | 0a | 0a | 20 | 20 | 20

The characters in bold are enclosed by a start box (denoted by double 0/B characters). Here no 0/A is found within the start box area, so only | 50 | 61 | 70 | 61 | 21 | 0a | 0a | 20 | 20 | 20 is intended to be extracted.

The correct output for this row would be Papa!

original version output (extracted without any argument):
fixed version output (extracted without any argument):
same as the original output above

===================================================================
Sample 4
The row concerned is row 22 as shown below:
20 | 20 | 20 | 0d | 05 | 0b | 0b | 48 | 61 | 6c | 6c | 6f | 21 | 06 | 20 | 20 | 20 | 20 | 20 | 20 |
20 | 20 | 20 | 20 | 20 | 0b | 0b | 48 | 69 | 21 | 0a | 0a | 20 | 20 | 20 | 20 | 20 | 20 | 20 | 20

The characters in bold are enclosed by a start box (denoted by double 0/B characters). Here no 0/A is found within the start box area, so only | 48 | 69 | 21 | 0a | 0a | 20 | 20 | 20 | 20 | 20 | 20 | 20 | 20 is intended to be extracted.

The correct output for this row would be Hi!

original version output (extracted without any argument):
fixed version output (extracted without any argument):
same as the original output above

===================================================================
Sample 5
CCextractor is able to extract the teletext completely

===================================================================
Sample 6
The row concerned is row 22 as shown below:
20 | 20 | 20 | 20 | 20 | 20 | 20 | 0d | 02 | 0b | 0b | 57 | 61 | 73 | 3f | 0a | 0a | 07 | 0b | 0b |
53 | 65 | 69 | 20 | 6e | 69 | 63 | 68 | 74 | 20 | 73 | 6f | 20 | 6e | 61 | 69 | 76 | 21 | 0a | 0a

The correct output for this row should be Was? Sei nicht so naiv!

original version output (extracted without any argument):
fixed version output (extracted without any argument):

===================================================================
Sample 7
The row concerned is row 22 as shown below:
06 | 0b | 0b | 53 | 69 | 65 | 68 | 73 | 74 | 20 | 64 | 75 | 3f | 20 | 20 | 05 | 0b | 0b | 44 | 75 |
20 | 68 | 21 | 74 | 74 | 65 | 73 | 74 | 20 | 72 | 65 | 63 | 68 | 74 | 2e | 0a | 0a | 20 | 20 | 20

Here 0/A is not found within the start box area hence it is intended that it would be ignored.

original version output (extracted without any argument):
fixed version output (extracted without any argument):
same as the original output above

I also noticed that although captions are present in the output srt file, the extraction process for samples 1, 3, 4, 5 and 6 reports "No captions were found in input." Is this a possible bug?

@cfsmp3
Please correct me if my interpretation of the ETS 300 706 documentation is wrong. Thanks :)

mkver · 2018-02-12T12:57:49Z

Once again, thank you @BPYap!
Sample 2 needs -in=ts in order to be detected as a transport stream. (Alternatively, you can delete bytes 376-544 of the file. At first, I extracted just the teletext PID and the PAT; ccextractor couldn't work with the resulting files, and it turned out that the PMT was missing. So I inserted the PMT packets manually, and apparently I messed it up with the second sample: the 169 bytes mentioned above are an incomplete ts packet, which results in wrong file type detection.)
libzvbi gives me "Papa! Papa!" for the third sample. Actually, these two words come from two different speakers and should be shown in two different colours; these subtitles use colour information to indicate who is speaking, and VLC shows them in the colours I expect. I'm not saying that your interpretation (that a missing 0xA indicates that parts should not be displayed) is incorrect, but this is a bit odd.
For sample 4 libzvbi gives "Hallo! Hi!".
For sample 5 the first line from libzvbi is "Huhu, Ha-We! Claudia!".
For sample 7 the first line from libzvbi is "Siehst du? Du hattest recht.".
In all these instances, the text produced by libzvbi matches what is actually said.
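The speaker colours mentioned above come from the alpha colour spacing attributes (0x00 to 0x07 in ETS 300 706) that precede each box in the dumps. A rough sketch of how the colour in effect at a given column could be looked up follows; the function names and the hex strings are assumptions for illustration, not CCExtractor's or libzvbi's actual tables.

```c
#include <stddef.h>
#include <stdint.h>

#define TTX_COLS 40

/* Alpha colour codes 0x00..0x07: black, red, green, yellow, blue, magenta,
 * cyan, white. The hex strings are illustrative full-intensity RGB values. */
static const char *const ALPHA_COLOUR_HEX[8] = {
    "#000000", "#ff0000", "#00ff00", "#ffff00",
    "#0000ff", "#ff00ff", "#00ffff", "#ffffff"
};

/* Returns the hex string for an alpha colour code, or NULL for other bytes. */
const char *alpha_colour_hex(uint8_t code)
{
    return code < 8 ? ALPHA_COLOUR_HEX[code] : NULL;
}

/* Returns the alpha colour code in effect at `col`. A row starts in white
 * (0x07), and a colour code only affects the cells after it, so only codes
 * strictly to the left of `col` are considered. */
uint8_t colour_in_effect(const uint8_t *row, int col)
{
    uint8_t colour = 0x07;
    for (int i = 0; i < col && i < TTX_COLS; i++) {
        if (row[i] <= 0x07)
            colour = row[i];
    }
    return colour;
}

/* Row 22 of sample 3, copied from the dump earlier in this thread. */
const uint8_t sample3_row22[TTX_COLS] = {
    0x20, 0x20, 0x20, 0x20, 0x20, 0x20, 0x20, 0x20, 0x20, 0x20,
    0x20, 0x20, 0x20, 0x20, 0x0D, 0x06, 0x0B, 0x0B, 0x50, 0x61,
    0x70, 0x61, 0x21, 0x20, 0x20, 0x20, 0x20, 0x07, 0x0B, 0x0B,
    0x50, 0x61, 0x70, 0x61, 0x21, 0x0A, 0x0A, 0x20, 0x20, 0x20
};
```

With this row, the first "Papa!" (starting at column 18) is under code 0x06 (cyan) and the second (starting at column 30) under 0x07 (white), which fits the observation that the two words belong to different speakers.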

BPYap · 2018-02-12T13:56:14Z

Thank you @mkver for your response 😄
I ran the extraction on sample 2 with -in=ts and got
Bestimmt 1 Jahr. Vielleicht ...
Soll ich mich erkundigen?
which should be correct, because here the 0/A is within the start box.

I will look deeper into start box and end box behavior; hopefully I can come up with a solution tomorrow 😄

mkver · 2018-02-12T14:15:18Z

Yes, this agrees with libzvbi.
I have also started reading ETS 300 706.

mkver · 2018-02-12T18:01:31Z

The relevant part of the standard is section 12.2. It contains two important pieces of information. The first: "Unless operating in "Hold Mosaics" mode, each character space occupied by a spacing attribute is displayed as a SPACE." So your approach of replacing 0xA and 0xB with spaces is what the standard says.
And I have three interpretations concerning the boxes that are not explicitly closed:

1. The first 0xB 0xB simply starts a box that is not explicitly closed, hence the box reaches until the end of the row and all characters after 0xB 0xB are displayed. That there are further 0xB 0xB inside this box doesn't change the fact that the characters between the two pairs of 0xB 0xB are intended for display.
2. The standard also says: "The action of an attribute persists until the end of a row or until the transmission of a further attribute that modifies its action." I think if there is the start of another box, then we could see this as a further attribute that modifies its action (by opening a new box). Hence the box is implicitly closed.
3. This teletext is against the spec. Given that I haven't found anything that says that a box has to be closed before another one can be opened, and given that this comes from HR, a German public broadcaster, and that these broadcasters were very active in the development of the teletext specifications, I doubt it.

The first two interpretations directly imply that one should not require the presence of 0xA. The third does not argue against that either, but in that case one should check whether not requiring 0xA has adverse effects on other samples. If there are no such samples, not requiring 0xA would be wise. If there are, the best thing would be an option that lets the user choose.
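The user option suggested here could look roughly like the sketch below. It marks which columns of a row count as boxed, either requiring an explicit 0xA (the stricter reading used in the earlier sample analysis) or not (interpretations 1 and 2). `mark_boxed` and `require_end_box` are hypothetical names, and the logic is a simplification that treats any single 0xA as an End Box.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define TTX_COLS 40

/* True when a doubled Start Box (0x0B 0x0B) begins at `col`. */
static bool is_start_box_pair(const uint8_t *row, int col)
{
    return col + 1 < TTX_COLS && row[col] == 0x0B && row[col + 1] == 0x0B;
}

/* Sets visible[col] to 1 for every column treated as inside a box.
 * require_end_box == false: a box runs until 0x0A or the end of the row.
 * require_end_box == true:  a box counts only if 0x0A closes it before the
 *                           next Start Box pair or the end of the row. */
void mark_boxed(const uint8_t *row, bool require_end_box, uint8_t *visible)
{
    memset(visible, 0, TTX_COLS);
    int col = 0;
    while (col < TTX_COLS) {
        if (!is_start_box_pair(row, col)) {
            col++;
            continue;
        }
        int begin = col + 2;
        int end = begin;
        bool closed = false;
        while (end < TTX_COLS) {
            if (row[end] == 0x0A) {
                closed = true;
                break;
            }
            if (require_end_box && is_start_box_pair(row, end))
                break;
            end++;
        }
        if (closed || !require_end_box) {
            for (int i = begin; i < end; i++)
                visible[i] = 1;
        }
        /* Resume at the column that ended the box; an End Box there is not
         * a Start Box pair, so the next iteration simply steps past it. */
        col = end;
    }
}

/* Row 22 of sample 4, copied from the dump earlier in this thread. */
const uint8_t sample4_row22[TTX_COLS] = {
    0x20, 0x20, 0x20, 0x0D, 0x05, 0x0B, 0x0B, 0x48, 0x61, 0x6C,
    0x6C, 0x6F, 0x21, 0x06, 0x20, 0x20, 0x20, 0x20, 0x20, 0x20,
    0x20, 0x20, 0x20, 0x20, 0x20, 0x0B, 0x0B, 0x48, 0x69, 0x21,
    0x0A, 0x0A, 0x20, 0x20, 0x20, 0x20, 0x20, 0x20, 0x20, 0x20
};
```

For the sample 4 row, the lenient mode marks columns 7 to 29, covering both "Hallo!" and "Hi!" as libzvbi shows them, while the strict mode marks only columns 27 to 29 ("Hi!").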

BPYap · 2018-02-13T10:34:58Z

I agree that a subtitle box should be implicitly closed when another 0xB is encountered.

I modified the loop so that it starts from the first column and runs until it encounters the first 0xB character (whose index it marks as the starting index for further processing), then replaces all subsequent 0xB and 0xA characters (except the last 0xA) with 0x20.

I ran further tests on samples 1 to 7, and the results match what libzvbi produces.
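As a reading aid, the modified loop described above might be sketched like this. It is reconstructed only from the description in this comment, so the function name `open_first_box` and the details are assumptions; the code actually merged in the pull request may differ.

```c
#include <stdint.h>

#define TTX_COLS 40

/* Returns the column of the first 0x0B, or 40 if the row has none (empty
 * line). Every later 0x0B and 0x0A, except the last 0x0A, is rewritten to
 * 0x20 in place so that it is treated as a plain space afterwards. */
uint8_t open_first_box(uint8_t *row)
{
    uint8_t col_start = TTX_COLS;
    for (uint8_t col = 0; col < TTX_COLS; col++) {
        if (row[col] == 0x0B) {
            col_start = col;
            break;
        }
    }
    if (col_start >= TTX_COLS)
        return TTX_COLS; /* line is empty */

    /* Find the last End Box so that it can be left untouched. */
    int last_end_box = -1;
    for (int col = TTX_COLS - 1; col > col_start; col--) {
        if (row[col] == 0x0A) {
            last_end_box = col;
            break;
        }
    }

    for (int col = col_start + 1; col < TTX_COLS; col++) {
        if (col != last_end_box && (row[col] == 0x0B || row[col] == 0x0A))
            row[col] = 0x20;
    }
    return col_start;
}

/* Mutable copy of row 22 from the original report's dump. */
uint8_t row22[TTX_COLS] = {
    0x0D, 0x0B, 0x0B, 0x67, 0x65, 0x65, 0x69, 0x67, 0x6E, 0x65,
    0x74, 0x2E, 0x0A, 0x20, 0x06, 0x0B, 0x0B, 0x53, 0x74, 0x75,
    0x66, 0x65, 0x6E, 0x20, 0x53, 0x69, 0x65, 0x20, 0x69, 0x68,
    0x6E, 0x20, 0x7A, 0x75, 0x72, 0xFC, 0x63, 0x6B, 0x21, 0x0A
};
```

On the row from the original report, `open_first_box(row22)` returns 1. Afterwards columns 2, 12, 15 and 16 hold 0x20, the colour code 0x06 in column 14 and the final 0x0A in column 39 are unchanged, so "geeignet." is no longer cut off.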

cfsmp3 added bug GSoC-related teletext labels Jan 29, 2018

BPYap mentioned this issue Feb 8, 2018

[FIX] DVB Teletext subtitle incomplete #922 #926

Merged

9 tasks

mkver mentioned this issue Feb 13, 2018

[FIX] Minor issue with spaces in teletext #930

Merged

6 tasks

mkver closed this as completed Feb 16, 2018

mkver mentioned this issue Feb 25, 2018

[BUG] Code page problems #937

Closed

7 tasks
