Issue with Russian recordings #1086

Closed
cfsmp3 opened this issue May 13, 2019 · 1 comment

cfsmp3 commented May 13, 2019

This came from our friends at Red Hen:

You may remember that there is a set of Russian recordings that are broadcast with non-Cyrillic characters in place of the Cyrillic ones, for instance 2017-07-17_1158_RU_Первый_Новости_с_субтитрами.txt.


AP has kindly provided a mapping table for these broken Russian files, and I was able to run it on our existing dataset. However, the files also contain HTML tags, and a simple search and replace converts the characters inside those tags to Cyrillic as well, effectively destroying the tags. This happens to lines such as the following:

<font color="#00ff00">-Otca sudili po lovnomu</font> <font color="#00ff00">obwineniè w âpionave.</font>

I can of course fix this (and will probably do so for our current files anyway), but it would still be great if CCExtractor were able to output the mapped text correctly.
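One way to keep the tags intact is to split each line into tags and text, and apply the mapping only to the text. The following is a minimal Python sketch of that idea, not how CCExtractor handles it. `MAPPING_EXCERPT` contains only the entries from the mapping table needed for the sample line, and `map_outside_tags` and `TAG_RE` are hypothetical names introduced for illustration. It assumes that anything between `<` and `>` is a tag.

```python
import re

# Excerpt of the mapping table from this issue: only the entries needed for
# the sample line, plus "#", which is what corrupts color="#00ff00".
MAPPING_EXCERPT = {
    "O": "О", "a": "а", "b": "б", "w": "в", "d": "д", "e": "е", "v": "ж",
    "i": "и", "l": "л", "m": "м", "n": "н", "o": "о", "p": "п", "s": "с",
    "t": "т", "u": "у", "c": "ц", "â": "ш", "è": "ю", "#": "Ы",
}
TABLE = str.maketrans(MAPPING_EXCERPT)

# Assumption: anything of the form <...> is an HTML tag and must stay untouched.
TAG_RE = re.compile(r"(<[^>]*>)")


def map_outside_tags(line: str) -> str:
    """Apply the character mapping to the text between tags only."""
    # With one capturing group, re.split returns text at even indexes
    # and the matched tags at odd indexes.
    parts = TAG_RE.split(line)
    return "".join(
        part if index % 2 else part.translate(TABLE)
        for index, part in enumerate(parts)
    )


sample = (
    '<font color="#00ff00">-Otca sudili po lovnomu</font> '
    '<font color="#00ff00">obwineniè w âpionave.</font>'
)
print(map_outside_tags(sample))
```

A plain `sample.translate(TABLE)` would also rewrite `font`, `color` and `#00ff00` inside the tags, which is the breakage described above.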

I think you should be able to simply use the following mapping table and be good:

# Comments:
# "E": "Ё" and "e": "ё" had to be taken out: E/e stands for both the variant with and without the diaeresis (trema), and Russian text commonly uses plain Е/е for both anyway.
# "Ъ" cannot occur at the beginning of a word, so there is no upper-case variant.
# Also note that &amp; is sometimes used instead of &. We handle this below with a replacement.

mapping = {"A": "А", "B": "Б", "W": "В", "G": "Г", "D": "Д", "E": "Е", "V": "Ж", "Z": "З", "I": "И", "J": "Й", "K": "К", "L": "Л", "M": "М", "N": "Н", "O": "О", "P": "П", "R": "Р", "S": "С", "T": "Т", "U": "У", "F": "Ф", "H": "Х", "C": "Ц", "î": "Ч", "ë": "Ш", "ù": "Щ", "#": "Ы", "X": "Ь", "ê": "Э", "à": "Ю", "Q": "Я", "a": "а", "b": "б", "w": "в", "g": "г", "d": "д", "e": "е", "v": "ж", "z": "з", "i": "и", "j": "й", "k": "к", "l": "л", "m": "м", "n": "н", "o": "о", "p": "п", "r": "р", "s": "с", "t": "т", "u": "у", "f": "ф", "h": "х", "c": "ц", "ç": "ч", "â": "ш", "û": "щ", "y": "ъ", "&": "ы", "x": "ь", "ô": "э", "è": "ю", "q": "я"}
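Because `&` is itself a key in the table (it stands for `ы`), the `&amp;` replacement has to happen before the character mapping is applied. Otherwise the leftover `amp;` would be converted as if it were subtitle text. Here is a minimal sketch of that ordering on plain text without tags. `to_cyrillic` is a hypothetical helper and `MAPPING_EXCERPT` is a small excerpt of the table above. Other HTML entities are not covered here, since the issue only mentions `&amp;`.

```python
# Small excerpt of the mapping table above; the full table works the same way.
MAPPING_EXCERPT = {
    "&": "ы", "#": "Ы", "T": "Т", "i": "и", "m": "м", "t": "т", "w": "в",
}
TABLE = str.maketrans(MAPPING_EXCERPT)


def to_cyrillic(text: str) -> str:
    """Undo the &amp; escaping first, then map each character."""
    # Step 1: "&amp;" is an escaped "&", which stands for "ы" in these files.
    text = text.replace("&amp;", "&")
    # Step 2: single-character mapping; unmapped characters pass through.
    return text.translate(TABLE)


print(to_cyrillic("m&amp; i w&"))
```

Both spellings of the character end up as `ы`, so `m&amp;` and `m&` give the same result.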

@thelastpolaris

@cfsmp3 Could you please provide one of the video files from this set? I would like to take a look at it to be sure that this mapping will cover all possible variants.

thelastpolaris added a commit to thelastpolaris/ccextractor that referenced this issue May 18, 2019
…mbols to Cyrillic ones in some of the Russian Teletext files
cfsmp3 closed this as completed in 2f09687 on May 18, 2019