Issue with Russian recordings #1086

cfsmp3 · 2019-05-13T19:12:05Z

This came from our friends at Red Hen:

You may remember that there is a set of Russian recordings that are broadcast with some sort of non-Cyrillic characters, for instance 2017-07-17_1158_RU_Первый_Новости_с_субтитрами.txt.

AP kindly provided a mapping table for these broken Russian files, and I was able to run it on our existing dataset. However, the files also contain HTML tags, and a simple search and replace converts the characters inside those tags to Cyrillic as well, which destroys them. This happens to lines such as the following:

<font color="#00ff00">-Otca sudili po lovnomu</font> <font color="#00ff00">obwineniè w âpionave.</font>

I can of course fix this (and will probably do that for our current files anyway), but still it would be great if CCExtractor was able to provide the mapped text correctly.

I think you should be able to simply use the following mapping table and be good:

# Comments:

# "E": "Ё" and "e": "ё" had to be taken out because E/e maps to both the variant with and without trema, but it is common in Russian to only have E/e

# "Ъ" cannot occur at the beginning of a word, so there is no upper-case variant.

# Also note that &amp; is sometimes used instead of &. We handle this below with a replacement.

mapping = {"A": "А", "B": "Б", "W": "В", "G": "Г", "D": "Д", "E": "Е", "V": "Ж", "Z": "З", "I": "И", "J": "Й", "K": "К", "L": "Л", "M": "М", "N": "Н", "O": "О", "P": "П", "R": "Р", "S": "С", "T": "Т", "U": "У", "F": "Ф", "H": "Х", "C": "Ц", "î": "Ч", "ë": "Ш", "ù": "Щ", "#": "Ы", "X": "Ь", "ê": "Э", "à": "Ю", "Q": "Я", "a": "а", "b": "б", "w": "в", "g": "г", "d": "д", "e": "е", "v": "ж", "z": "з", "i": "и", "j": "й", "k": "к", "l": "л", "m": "м", "n": "н", "o": "о", "p": "п", "r": "р", "s": "с", "t": "т", "u": "у", "f": "ф", "h": "х", "c": "ц", "ç": "ч", "â": "ш", "û": "щ", "y": "ъ", "&": "ы", "x": "ь", "ô": "э", "è": "ю", "q": "я"}
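For anyone post-processing existing transcripts, here is a minimal sketch of applying the table above without touching the HTML tags. It is not how CCExtractor implements this; the names `map_outside_tags`, `TABLE`, and `TAG_RE` are made up for illustration, and it assumes a tag is simply everything between `<` and the next `>`. It also folds `&amp;` back to `&` before mapping, as the comment in the table describes.

```python
import re
import unicodedata

# Mapping table copied from the issue text above.
MAPPING = {
    "A": "А", "B": "Б", "W": "В", "G": "Г", "D": "Д", "E": "Е", "V": "Ж",
    "Z": "З", "I": "И", "J": "Й", "K": "К", "L": "Л", "M": "М", "N": "Н",
    "O": "О", "P": "П", "R": "Р", "S": "С", "T": "Т", "U": "У", "F": "Ф",
    "H": "Х", "C": "Ц", "î": "Ч", "ë": "Ш", "ù": "Щ", "#": "Ы", "X": "Ь",
    "ê": "Э", "à": "Ю", "Q": "Я",
    "a": "а", "b": "б", "w": "в", "g": "г", "d": "д", "e": "е", "v": "ж",
    "z": "з", "i": "и", "j": "й", "k": "к", "l": "л", "m": "м", "n": "н",
    "o": "о", "p": "п", "r": "р", "s": "с", "t": "т", "u": "у", "f": "ф",
    "h": "х", "c": "ц", "ç": "ч", "â": "ш", "û": "щ", "y": "ъ", "&": "ы",
    "x": "ь", "ô": "э", "è": "ю", "q": "я",
}

# Translation table keyed by code point. NFC normalization ensures accented
# keys such as "î" are single precomposed characters.
TABLE = {ord(unicodedata.normalize("NFC", k)): v for k, v in MAPPING.items()}

# Assumption: a tag is anything from "<" to the next ">". The capture group
# makes re.split keep the tags in its result.
TAG_RE = re.compile(r"(<[^>]*>)")


def map_outside_tags(line: str) -> str:
    """Map Latin placeholders to Cyrillic, leaving HTML tags untouched."""
    parts = TAG_RE.split(unicodedata.normalize("NFC", line))
    # re.split alternates text, tag, text, ...; even indices are plain text.
    for i in range(0, len(parts), 2):
        text = parts[i].replace("&amp;", "&")  # "&amp;" stands for "&" -> "ы"
        parts[i] = text.translate(TABLE)
    return "".join(parts)


if __name__ == "__main__":
    sample = ('<font color="#00ff00">-Otca sudili po lovnomu</font> '
              '<font color="#00ff00">obwineniè w âpionave.</font>')
    print(map_outside_tags(sample))
```

Skipping the tags matters here because `#` is itself a key in the table: a plain search and replace would rewrite `color="#00ff00"` along with the subtitle text.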

thelastpolaris · 2019-05-15T07:40:08Z

@cfsmp3 Could you please provide one of the video files from this set? I would like to take a look at it to be sure that this mapping will cover all possible variants.

thelastpolaris added a commit to thelastpolaris/ccextractor that referenced this issue May 18, 2019

Fixes CCExtractor#1086 by adding -latrusmap option that maps Latin symbols to Cyrillic ones in some of the Russian Teletext files

889fcf6

thelastpolaris mentioned this issue May 18, 2019

Fixes #1086 by adding -latrusmap option that maps Latin symbols to Cy… #1087

Merged

cfsmp3 closed this as completed in 2f09687 May 18, 2019
