You may remember that there is a set of Russian recordings that are broadcast with some sort of non-Cyrillic characters, for instance 2017-07-17_1158_RU_Первый_Новости_с_субтитрами.txt.
AP has been so kind as to provide a mapping table for these broken Russian files, and I was able to run it on our existing dataset. However, the files also contain HTML tags, and a simple search and replace converts the characters inside those tags to Cyrillic as well, effectively destroying them. This happens to lines such as the following:
<font color="#00ff00">-Otca sudili po lovnomu</font> <font color="#00ff00">obwineniè w âpionave.</font>
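To make the failure concrete, here is a minimal sketch of a naive character-by-character replacement. The `subset` table and the `naive_replace` helper are hypothetical names used only for illustration; the table holds a handful of entries taken from the full mapping.

```python
# A few entries from the full mapping table, enough to show the problem.
subset = {"f": "ф", "o": "о", "n": "н", "t": "т", "c": "ц", "l": "л", "r": "р"}


def naive_replace(line, table):
    """Replace every mapped character, including those inside <...> tags."""
    return "".join(table.get(ch, ch) for ch in line)


broken = naive_replace('<font color="#00ff00">-Otca</font>', subset)
# The tag name and attribute name are now Cyrillic, so the markup is no longer valid HTML.
print(broken)
```

With the full table the damage would likely be worse, since `#` is itself a mapped character and appears in the color value.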
I can of course fix this (and will probably do that for our current files anyway), but still it would be great if CCExtractor was able to provide the mapped text correctly.
I think you should be able to simply use the following mapping table and be good:
# Comments:
# "E": "Ё" and "e": "ё" had to be taken out because E/e maps to both the variant with and without trema, but it is common in Russian to only have E/e
# "Ъ" cannot occur at the beginning of a word, so there is no upper-case variant.
# Also note that sometimes &amp; is used instead of &. We treat this below through a replacement
mapping = {"A": "А", "B": "Б", "W": "В", "G": "Г", "D": "Д", "E": "Е", "V": "Ж", "Z": "З", "I": "И", "J": "Й", "K": "К", "L": "Л", "M": "М", "N": "Н", "O": "О", "P": "П", "R": "Р", "S": "С", "T": "Т", "U": "У", "F": "Ф", "H": "Х", "C": "Ц", "î": "Ч", "ë": "Ш", "ù": "Щ", "#": "Ы", "X": "Ь", "ê": "Э", "à": "Ю", "Q": "Я", "a": "а", "b": "б", "w": "в", "g": "г", "d": "д", "e": "е", "v": "ж", "z": "з", "i": "и", "j": "й", "k": "к", "l": "л", "m": "м", "n": "н", "o": "о", "p": "п", "r": "р", "s": "с", "t": "т", "u": "у", "f": "ф", "h": "х", "c": "ц", "ç": "ч", "â": "ш", "û": "щ", "y": "ъ", "&": "ы", "x": "ь", "ô": "э", "è": "ю", "q": "я"}
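One way to apply the table without touching the markup is to split each line into tags and text and convert only the text. The following is a rough sketch, not CCExtractor's implementation: `TAG_RE` and `transliterate` are hypothetical names, the tag pattern assumes simple `<...>` tags with no `>` inside attribute values, and the `&amp;` handling assumes the comment above refers to that HTML entity standing in for `&`.

```python
import re

# The mapping table from the issue, reproduced so the sketch is self-contained.
mapping = {
    "A": "А", "B": "Б", "W": "В", "G": "Г", "D": "Д", "E": "Е", "V": "Ж",
    "Z": "З", "I": "И", "J": "Й", "K": "К", "L": "Л", "M": "М", "N": "Н",
    "O": "О", "P": "П", "R": "Р", "S": "С", "T": "Т", "U": "У", "F": "Ф",
    "H": "Х", "C": "Ц", "î": "Ч", "ë": "Ш", "ù": "Щ", "#": "Ы", "X": "Ь",
    "ê": "Э", "à": "Ю", "Q": "Я",
    "a": "а", "b": "б", "w": "в", "g": "г", "d": "д", "e": "е", "v": "ж",
    "z": "з", "i": "и", "j": "й", "k": "к", "l": "л", "m": "м", "n": "н",
    "o": "о", "p": "п", "r": "р", "s": "с", "t": "т", "u": "у", "f": "ф",
    "h": "х", "c": "ц", "ç": "ч", "â": "ш", "û": "щ", "y": "ъ", "&": "ы",
    "x": "ь", "ô": "э", "è": "ю", "q": "я",
}

# Assumption: tags are simple <...> spans. The capturing group makes
# re.split keep the tags in the result, at the odd indices.
TAG_RE = re.compile(r"(<[^>]*>)")


def transliterate(line, table=mapping):
    """Map text outside HTML tags to Cyrillic; copy tags through unchanged."""
    out = []
    for index, part in enumerate(TAG_RE.split(line)):
        if index % 2 == 1:
            out.append(part)  # an HTML tag: leave as-is
        else:
            # Assumption: "&amp;" is an escaped "&", which the table maps to "ы".
            text = part.replace("&amp;", "&")
            out.append("".join(table.get(ch, ch) for ch in text))
    return "".join(out)


print(transliterate('<font color="#00ff00">-Otca sudili po lovnomu</font>'))
```

Characters that are not in the table, such as punctuation and digits, pass through unchanged. Other HTML entities in the text (for example `&lt;`) are not handled by this sketch and would need the same kind of special-casing.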
@cfsmp3 Could you please provide one of the video files from this set? I would like to take a look at it to be sure that this mapping will cover all possible variants.
This report came from our friends at Red Hen.