Additional inflection data for RU & UK #3
Hi @1over137, thanks for your feedback! Yes, results for some languages are not particularly good; that's a clear limitation of the approach used by Simplemma. |
Hm, would it be possible to pass a huge number of tokens through PyMorphy2 (let's say from a corpus) and generate such a list yourself? Or would that list be too large? |
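A minimal sketch of that idea, assuming pymorphy2 with its Russian dictionaries installed and a plain-text corpus file (the file names are hypothetical):

import pymorphy2

morph = pymorphy2.MorphAnalyzer()
lemma_table = {}

# Collect a token -> most probable lemma mapping from a raw corpus.
with open("corpus_ru.txt", encoding="utf-8") as corpus:
    for line in corpus:
        for token in line.split():
            token = token.strip(".,!?—:;«»()").lower()
            if token and token not in lemma_table:
                # parse() returns candidates ordered by corpus-based probability.
                lemma_table[token] = morph.parse(token)[0].normal_form

with open("ru_forms.tsv", "w", encoding="utf-8") as out:
    for form, lemma in sorted(lemma_table.items()):
        out.write(f"{form}\t{lemma}\n")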
I think it would be better to understand what's wrong first. Are there any systematic errors that we could correct? |
Coverage is quite low for certain classes of words like participles (mainly adjectives derived from verbs, ending in -мый or -ющий in the nominative), especially those with the reflexive ending -ся (just an impression from clicking on words on pages). Also, it's quite easy to combine words in Russian to make new ones which may not be on the list. Is this supposed to be addressed by using the greedy option? |
Classes of words are a good place to start: maybe they weren't in the word lists I used, or something is wrong with the decomposition rules. The greedy option is indeed supposed to try everything and can even work as a stemmer in some cases, although its logic cannot be extrapolated to all supported languages, e.g. it doesn't work for Urdu. |
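For reference, a quick way to compare the default and greedy behaviour, assuming a recent simplemma release where lemmatize accepts a lang code and a greedy flag (older versions required loading language data separately):

import simplemma

for token in ("скорость", "открывающийся", "словами"):
    default = simplemma.lemmatize(token, lang="ru")
    greedy = simplemma.lemmatize(token, lang="ru", greedy=True)
    print(token, default, greedy)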
Hi, Observations:
Note that the table apparently contains some typos present in the source, so some more improvements should probably be made before adding the data. I should also include another corpus that covers other domains. |
Hi @1over137, thanks for listing the differences. I like your approach, but I'm not sure how to modify the software to improve performance. Judging from your results, most differences come from Simplemma not intervening in certain tricky cases. If you know where to find a list of common words along with their lemmata, I could add it to the data. We could also write rules to cover additional vocabulary without needing a new list; there is a basic example for English at line 111 in 454df56. What do you think?
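A minimal sketch of what such a rule could look like for English (hypothetical, not the actual rules at that line; it only illustrates reversing a regular suffix when the base form is already known):

def apply_english_rules(token, known_words):
    # Hypothetical suffix rule: map regular "-ies" plurals back to "-y",
    # but only when the resulting candidate is an attested word.
    if token.endswith("ies") and len(token) > 5:
        candidate = token[:-3] + "y"
        if candidate in known_words:
            return candidate
    return token

print(apply_english_rules("cherries", {"cherry", "city"}))  # cherry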
|
I'm afraid the inflection rules for Russian are often not very reversible. The first source of such rules I can think of would be Migaku's de-inflection tables. They might not be very suitable, though, because their design allows multiple possible lemma forms for a given ending, which is acceptable for their use case (dictionary lookups; dictionaries only have actual words as entries). One way to use this would be to run the candidates through a known word list (e.g. hunspell dictionaries) to see which possible lemma forms are actually words, but that probably involves a significant architecture change to the program. When I have more time I'll run pymorphy2 over a few more corpora, somehow remove the typos, and filter out the least common words to supplement the current dictionary. |
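A minimal sketch of the candidate-filtering idea, with a hand-made ending table and a plain Python set standing in for a real de-inflection table and a hunspell dictionary:

# Hypothetical ending table: inflected ending -> possible lemma endings.
ENDING_CANDIDATES = {
    "ами": ["а", "о", ""],   # e.g. instrumental plural of several noun classes
    "ости": ["ость"],
}

# Stand-in for a real word list such as a hunspell dictionary.
KNOWN_WORDS = {"книга", "слово", "скорость"}

def deinflect(token):
    for ending, lemma_endings in ENDING_CANDIDATES.items():
        if token.endswith(ending):
            stem = token[: -len(ending)]
            for lemma_ending in lemma_endings:
                candidate = stem + lemma_ending
                if candidate in KNOWN_WORDS:
                    return candidate
    return token

print(deinflect("книгами"))   # книга
print(deinflect("скорости"))  # скорость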
Thanks for the file, it's not particularly long and I'm not sure I could directly use it (what's the source?) but it shows the problem with the current approach. |
Sorry for forgetting about this thread. |
No worries, thanks for providing additional information. I get your point: although the EN Wiktionary is going to get better over time, using the RU Wiktionary would make sense. As I'm more interested in expanding coverage through rules at the moment, I looked at the conjugations in the Migaku data; they don't look so reliable to me, but I'm going to keep an eye on the resources you shared. If you know where to find a list of common suffixes and corresponding forms, I'd be interested. |
https://en.wikipedia.org/wiki/Russian_declension#Nouns is a good start. However, there are going to be a lot of overlaps (as in, a token may either be one form of this hypothetical lemma or another form of that lemma, and there would be no way of distinguishing them without knowledge of the words themselves). I don't really think rules would be very useful, other than for a few narrow categories of words (abstract nouns ending in -ость or -ство, adjectives in -ский).
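A minimal sketch of one such narrow rule, using hypothetical endings for abstract nouns in -ость (their oblique forms map back to the nominative unambiguously):

# Hypothetical rule: inflected endings of -ость nouns -> nominative -ость.
OST_ENDINGS = ("ости", "остью", "остей", "остям", "остями", "остях")

def lemmatize_ost(token):
    for ending in OST_ENDINGS:
        if token.endswith(ending):
            return token[: -len(ending)] + "ость"
    return token

print(lemmatize_ost("скоростью"))  # скорость
print(lemmatize_ost("новостей"))   # новость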
|
@1over137 Yes, I also added the following rules to save dictionary space (endings) and to get better coverage (supervised prefix search). I'm not seeing a real impact on the German and Russian treebanks, but that's probably because the words were already covered. We could now implement more rules or adopt a similar approach for Ukrainian. Could you please make suggestions?
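As an illustration only (not the rules referenced above, which aren't reproduced here), supervised prefix decomposition could look roughly like this: strip a known prefix, look up the remainder in the dictionary, and re-attach the prefix to the lemma.

# Hypothetical prefix list and dictionary lookup used only for this sketch.
KNOWN_PREFIXES = ("пере", "при", "рас", "по")
DICTIONARY = {"писать": "писать", "писал": "писать", "сказал": "сказать"}

def prefix_decomposition(token):
    if token in DICTIONARY:
        return DICTIONARY[token]
    for prefix in KNOWN_PREFIXES:
        if token.startswith(prefix):
            rest = token[len(prefix):]
            if rest in DICTIONARY:
                # Only recombine when the remainder is itself covered.
                return prefix + DICTIONARY[rest]
    return token

print(prefix_decomposition("переписал"))  # переписать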
|
I'm not sure where you got these endings from; they seem quite weird, some of them containing symbols no longer used in modern Russian. I'm not sure what you mean by -ский being too noisy or irregular. Can you elaborate? It may be useful to note, though, that some adjectives ending in -ский get used as nouns by omitting the noun, so the intended dictionary form could be a non-masculine gender (for example, street names, which would be feminine). I tried to fix this list a bit:
|
There are also many more prefixes. Here is a good list. |
Thanks for the suggestions! I don't fully understand everything I change for Russian and Ukrainian, but I tried to adapt suffix lists from the English Wiktionary; I think I took the dated ones by mistake.
I tried a newer version of the English Wiktionary and the Russian Wiktionary to expand the existing language data. Both degrade performance on the benchmark (UD-GSD). I could switch to another treebank, but the main problem now seems to be the normalization of accented entries. Take for example ско́рость: should it stay as it is or be normalized to скорость? |
This is a good question; you have to make a decision on this. Russian normally uses accent marks only for clarification; they are not present in normal text. Another similar, but somewhat more important decision is whether to put ё back when appropriate, remove it in all cases, or simply handle both cases as separate words. Reduced performance is not quite enough information here. Can you compile a table of differences between different choices of rules enabled, dictionaries used, etc.? These corpora are not always 100% accurate either. It would be interesting to see what's behind the different performance. |
@1over137 The updated pymorphy2 fork, pymorphy3, might also be interesting to you. It apparently has Python 3.11 support, and the new spaCy version already uses it. About the integration of the Wiktionary data: I think one probably has to remove the accent marks everywhere to get good results (in Python this can be done with the unicodedata module). I also thought about integrating other languages in my Russian Wiktionary scraper, but I am still considering whether I should try to add the Russian Wiktionary to wiktextract. (It is highly possible that this is nightmarishly difficult, though.) |
Interesting thoughts, thanks. Sadly I lack the time to perform in-depth analyses of what's happening here; I look at the lemmatization accuracy and try to strike a balance. The newest version, out today, comprises two significant changes for Russian and Ukrainian: better data (especially for the latter) and affix decomposition (now on by default). @1over137 You should hopefully see some improvement. @Vuizur Since lemmatization can also be seen as a normalization task, it makes sense to remove accent marks. I'm going to look into it. |
Note that if you remove accent marks this way, you must first normalize the Unicode to NFD or NFKD; you can't guarantee the normalization form of your input. |
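A minimal sketch of accent stripping along these lines, assuming the goal is only to drop the combining acute and grave marks while keeping ё intact:

import unicodedata

COMBINING_ACCENTS = {"\u0301", "\u0300"}  # combining acute and grave

def strip_accents(word):
    # Decompose first so accented letters become base letter + combining mark.
    decomposed = unicodedata.normalize("NFD", word)
    stripped = "".join(ch for ch in decomposed if ch not in COMBINING_ACCENTS)
    # Recompose so ё (е + combining diaeresis) is restored as a single character.
    return unicodedata.normalize("NFC", stripped)

print(strip_accents("ско́рость"))  # скорость
print(strip_accents("ёж"))        # ёж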
I made a PR to let test/udscore.py write CSV files, so it will be easier to compare results by just running the tests with different dictionaries and sending the CSV files here. At a cursory look, I noticed a few odd errors, such as words being lemmatized into unrelated words. The most obvious thing is that they seem to involve common phrases (свобода слова - freedom of speech, День Победы - victory day). Can you check what may have been the cause of this? Also, there are quite a few English words, or otherwise words made of Latin characters. Maybe we should skip tokens containing Latin characters, since a Russian lemmatizer should not be expected to handle them. Also, a large number of the errors are capitalization issues. |
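A minimal sketch of that kind of filter (the function name is only an illustration):

import re

LATIN_RE = re.compile(r"[A-Za-z]")

def should_skip(token):
    # Skip tokens containing Latin characters; a Russian lemmatizer
    # should not be scored on them.
    return bool(LATIN_RE.search(token))

print(should_skip("GitHub"))    # True
print(should_skip("скорость"))  # False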
Sorry for the barrage of comments and PRs, but I decided to do a few experiments myself to create a better dictionary. I combined the current dictionary with the list of word-forms obtained from the Wiktionary dump and passed all of them through PyMorphy2, disregarding all previous lemma data. This is done without using anything from the test dataset. The new dictionary file is 7.7 MiB after sorting, which you can presumably reduce by applying rules. The test results are as follows:
This seems to be a significant improvement over the original. It gets rid of most of the weird results I mentioned, with most remaining errors being pronouns and capitalization issues (for reference, pymorphy2 always lowercases words, even proper names). The advantage of using PyMorphy2 is that it can estimate the probability of each lemma form by corpus frequency, which is optimal for a limited solution like Simplemma that can only provide one answer, whereas the Wiktionary dump seems to contain errors and may override common words with uncommon lemmas, which accounts for the poorer performance.

import json
import unicodedata
import pymorphy2
from tqdm import tqdm
import lzma, pickle

with open("ruwiktionary_words.json") as f:
    wikt = json.load(f)

morph = pymorphy2.MorphAnalyzer()

def removeAccents(word):
    #print("Removing accent marks from query ", word)
    ACCENT_MAPPING = {
        '́': '',
        '̀': '',
        'а́': 'а',
        'а̀': 'а',
        'е́': 'е',
        'ѐ': 'е',
        'и́': 'и',
        'ѝ': 'и',
        'о́': 'о',
        'о̀': 'о',
        'у́': 'у',
        'у̀': 'у',
        'ы́': 'ы',
        'ы̀': 'ы',
        'э́': 'э',
        'э̀': 'э',
        'ю́': 'ю',
        'ю̀': 'ю',
        'я́': 'я',
        'я̀': 'я',
    }
    word = unicodedata.normalize('NFKC', word)
    for old, new in ACCENT_MAPPING.items():
        word = word.replace(old, new)
    return word

def caseAwareLemmatize(word):
    if not word.strip():
        return word
    if word[0].isupper():
        return str(morph.parse(word)[0].normal_form).capitalize()
    else:
        return str(morph.parse(word)[0].normal_form)

with lzma.open('simplemma/data/ru.plzma') as p:
    ru_orig = pickle.load(p)

# Old list, we send it through PyMorphy2
orig_dic = {}
for key in tqdm(ru_orig):
    orig_dic[key] = caseAwareLemmatize(key)

# New list of word-forms from Wiktionary, we send it through PyMorphy2
lem = {}
for entry in tqdm(wikt):
    for inflection in entry['inflections']:
        inflection = removeAccents(inflection)
        lem[inflection] = caseAwareLemmatize(inflection)

d = {k: v for k, v in sorted((orig_dic | lem).items(), key=lambda item: item[1])}
with lzma.open('simplemma/data/ru.plzma', 'wb') as p:
    pickle.dump(d, p) |
Hi @1over137, thank you very much for the deep dive into the data and the build process! I cannot address the topic right now and will come back to it later; here are a few remarks:
Concretely, we are looking at writing a special data import for Russian and Ukrainian. We could talk about how to adapt the main function to this case. |
Can you explain how the dictionary building works? I don't think I ever quite understood it. If writing a special data import, would it be possible to rely on PyMorphy2 for lemmas? Even without using Wiktionary data, only reprocessing the original list through PyMorphy2 increases the accuracy from 0.854 to 0.889, which is a significant improvement before even adding any words. It avoids the complexity of choosing candidates, and personally I have only ever seen it make an error a few times over a long period. Though we don't have a Wiktionary export of word-forms for Ukrainian yet, the same approach can be applied to a corpus or a wordlist, such as one from the wordfreq repository you linked, which should improve coverage quite a bit. Then we could aim to capture the coverage with rules to cut down on size. We can be much more aggressive in applying rules since, statistically, exceptions are only common in high-frequency words (adding back -ский, even adding some more adjective rules). Though, to see any difference in the benchmarks, you might need to switch to a bigger corpus. The UD treebank only has ~90k tokens, so it may not even contain enough low-frequency words to make a difference, especially since your scoring is frequency-weighted. |
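A minimal sketch of the same idea for Ukrainian, assuming pymorphy2 with the pymorphy2-dicts-uk package installed and a plain wordlist file (the file name is hypothetical):

import pymorphy2

# Requires the pymorphy2-dicts-uk package for the Ukrainian dictionaries.
morph_uk = pymorphy2.MorphAnalyzer(lang="uk")

uk_table = {}
with open("uk_wordlist.txt", encoding="utf-8") as wordlist:
    for line in wordlist:
        token = line.strip()
        if token:
            # Most probable analysis first, as with the Russian analyzer.
            uk_table[token] = morph_uk.parse(token)[0].normal_form

print(uk_table.get("словами"))  # слово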
Hi,
I'm the author of SSM, a language-learning utility for quickly making vocabulary flashcards. Thanks for this project! Without it, it would have been difficult to provide multilingual lemmatization, which is an essential aspect of the tool.
However, I found that it is not particularly accurate for Russian. PyMorphy2 is a lemmatizer for Russian that I have used in other projects. It's very fast and accurate in my experience, much more so than spaCy or anything else. Any chance you could include PyMorphy2's data in this library?