2. Morphological Analysis
(English translation made with Google Translate.)
The TrnlpWord class is a dictionary- and rule-based algorithm that I wrote to find the root / stem and suffixes of a word.
To analyze a word, trnlp first finds the word's possible roots with the help of a dictionary. The root detection algorithm works as follows:
- Clean the word: remove punctuation, normalize circumflexed ("capped") letters, strip leading and trailing spaces, and convert to lower case.
- Break the word down by removing letters from the end one at a time. For the word "deneme": ["deneme", "denem", "dene", "den", "de"].
- trnlp first checks the punctuation marks in the original word.
  - If the original word contains "." and the user has enabled the abbreviations dictionary, trnlp looks up the part before the last dot in the abbreviations dictionary (that is, the abbreviations dictionary takes priority) and adds it to the list of possible roots if it is found. The letters after the dot are defined as the "residual".
  - If the original word contains "'" (a single quotation mark) and the user has enabled the proper names and abbreviations dictionaries, trnlp looks up the part before the last single quotation mark in those dictionaries (that is, they take priority) and adds it to the list of possible roots if it is found. The letters after the single quotation mark are defined as the "residual".
  - If the original word begins with a capital letter and the user has enabled the proper names and abbreviations dictionaries, each candidate in the list produced by removing letters is looked up in the proper names dictionary, the main dictionary and the abbreviations dictionary, in that order (that is, the proper names dictionary and the main dictionary take priority). Each candidate found is added to the list of possible roots. The letters left over from the split word are defined as the "residual".
  - If none of these cases applies and the user has enabled the proper names and abbreviations dictionaries, each candidate in the list produced by removing letters is looked up in the main dictionary, the proper names dictionary and the abbreviations dictionary, in that order (that is, the main dictionary and the proper names dictionary take priority). Each candidate found is added to the list of possible roots. The letters left over from the split word are defined as the "residual".
- Finally, trnlp tries to resolve the residual into derivational suffixes and then inflectional suffixes. If an analysis that uses all the letters is found, it is added to the list of results as a dictionary. If no analysis is found, an empty list is returned.
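The root detection steps above can be sketched roughly as follows. This is only an illustration, not trnlp's actual code. The function names (normalize, root_candidates, find_roots), the two-letter minimum candidate length (inferred from the "deneme" example), the rule that the first matching dictionary wins for a candidate, and the toy dictionary are all assumptions made for the example.

```python
import string


def normalize(word: str) -> str:
    """Strip spaces and ASCII punctuation, lower-case with Turkish I/ı rules,
    and replace circumflexed letters with plain ones (assumed cleaning step)."""
    word = word.strip()
    word = word.translate(str.maketrans("", "", string.punctuation))
    # Python's default lower() maps "I" to "i"; Turkish needs "I" -> "ı", "İ" -> "i".
    word = word.replace("I", "ı").replace("İ", "i").lower()
    return word.translate(str.maketrans("âîû", "aiu"))


def root_candidates(word: str, min_len: int = 2) -> list:
    """Drop letters from the end one at a time, longest candidate first."""
    if not word:
        return []
    shortest = min(min_len, len(word))
    return [word[:end] for end in range(len(word), shortest - 1, -1)]


def find_roots(word: str, dictionaries: list) -> list:
    """Return (root, dictionary_name, residual) for every candidate found.

    `dictionaries` is an ordered list of (name, set_of_words) pairs;
    earlier entries have priority (assumption: first match wins).
    """
    results = []
    for candidate in root_candidates(word):
        for name, entries in dictionaries:
            if candidate in entries:
                results.append((candidate, name, word[len(candidate):]))
                break
    return results


# Toy dictionary, made up for the example.
main_dictionary = {"dene", "den", "de"}
print(root_candidates(normalize(" Deneme! ")))
print(find_roots("deneme", [("main", main_dictionary)]))
```

Each tuple pairs a possible root with the "residual" that the suffix analysis would then have to explain.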
After creating a TrnlpWord object, analyze each new word with the .setword(str) method. Because many dictionary files are loaded each time a new object is created, repeatedly creating objects leads to excessive memory consumption. It is therefore better to analyze all words with a single instance.
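A minimal sketch of this reuse pattern: one analyzer object is created once and setword is called for each word. The helper analyze_all and the stand-in class FakeAnalyzer are hypothetical and exist only so that the sketch runs without trnlp installed; with trnlp you would pass a TrnlpWord() instance instead.

```python
def analyze_all(analyzer, words):
    """Analyze many words with ONE analyzer object instead of one object per word."""
    results = {}
    for word in words:
        analyzer.setword(word)         # point the same object at the next word
        results[word] = str(analyzer)  # the text that print(analyzer) would show
    return results


class FakeAnalyzer:
    """Hypothetical stand-in for TrnlpWord, used only to make the sketch runnable."""

    def setword(self, word):
        self._word = word

    def __str__(self):
        return f"<analysis of {self._word}>"


print(analyze_all(FakeAnalyzer(), ["arkadaşlar", "koyun"]))
```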
On my computer (Intel i5-2450M processor, 4 GB memory), between 2250 and 3000 analyses can be performed per second, depending on word length.
Sample usage of the algorithm I wrote to find the root / stem and suffixes of a word is as follows:
from trnlp import TrnlpWord
obj = TrnlpWord()
obj.setword("arkadaşlar")
print(obj)
>> "arka(isim,sıfat)+daş{İi}[4_26]+lar{Çe}[1_1]"
obj.setword("Muvaffakiyetsizleştiricileştiriveremeyebileceklerimizdenmişsiniz")
print(obj)
>> muvaffakiyet(isim)+siz{İi}[4_4]+leş{İf}[5_5]+tir{Ff}[6_11]+ici{Fi}[7_3]+leş{İf}[5_5]+tir{Ff}[6_11]+iver{BfVer}[3_4]+eme{Ytsz}[3_19]+yebil{BfBil}[3_1]+ecek{Fs}[8_9]+ler{Çe}[1_1]+imiz{İe1ç}[1_4]+den{HeUzk}[1_23]+miş{EfGçMiş}[1_38]+siniz{EfKe2ç}[1_50]
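The printed format can be read as root(types) followed by +suffix{tag}[table_row] for each suffix. A small parser for that format might look like the sketch below. The function name parse_analysis and the returned structure are assumptions for illustration, and the reading of the bracketed numbers as table and row is inferred from the 'suffixPlace' field described later on this page.

```python
import re

# root(type1,type2) at the start of the string
ROOT_RE = re.compile(r"^([^(+]+)\(([^)]*)\)")
# +suffix{tag}[table_row], repeated
SUFFIX_RE = re.compile(r"\+([^{+]+)\{([^}]*)\}\[(\d+)_(\d+)\]")


def parse_analysis(text):
    """Split a printed analysis into its root, root types and suffixes."""
    root_match = ROOT_RE.match(text)
    if root_match is None:
        return None
    suffixes = [
        (surface, tag, int(table), int(row))
        for surface, tag, table, row in SUFFIX_RE.findall(text[root_match.end():])
    ]
    return {
        "root": root_match.group(1),
        "types": root_match.group(2).split(","),
        "suffixes": suffixes,
    }


print(parse_analysis("arka(isim,sıfat)+daş{İi}[4_26]+lar{Çe}[1_1]"))
```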
Morphological analysis uses three types of dictionaries:
- Main Dictionary: Contains basic words of types such as noun, adjective and verb. It does not include proper names or abbreviations.
- Proper Names Dictionary: Contains proper names such as person, city and country names.
- Abbreviations Dictionary: Contains abbreviations.
By default, the Main Dictionary and the Proper Names Dictionary are active and the Abbreviations Dictionary is inactive. Dictionaries can be enabled or disabled at any time. If all dictionaries are disabled, the Main Dictionary is activated automatically. The following example shows how dictionaries are used:
from trnlp import TrnlpWord
obj = TrnlpWord()
obj.setword("esat")
print(obj)
>> "Esat(özel)"
obj.usepron = False
obj.setword("esat")
print(obj)
>> ""
Dictionaries must be enabled or disabled before the word is set.
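The default settings and the fallback rule (if every dictionary is disabled, the main dictionary is switched back on) can be expressed in a few lines. This is a sketch of the rule as described, not trnlp's code; the function name effective_dictionaries and its parameter names are hypothetical.

```python
def effective_dictionaries(main=True, proper=True, abbreviations=False):
    """Return which dictionaries end up active for the given settings."""
    if not (main or proper or abbreviations):
        main = True  # everything disabled: fall back to the main dictionary
    return {"main": main, "proper": proper, "abbreviations": abbreviations}


print(effective_dictionaries())                     # the documented defaults
print(effective_dictionaries(False, False, False))  # main is re-enabled
```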
Enables or disables the main dictionary. It can be set to the boolean value True or False.
Enables or disables the proper names dictionary (the usepron attribute in the example above). It can be set to the boolean value True or False.
See the Sözlükler (Dictionaries) wiki page for details of the dictionary files.
get_base and get_base_type return the possible root of the word and its type.
from trnlp import TrnlpWord
obj = TrnlpWord()
obj.setword("arkadaşlar")
print(obj.get_base)
>> "arka"
print(obj.get_base_type)
>> "isim,sıfat"
get_stem and get_stem_type return the possible stem of the word and its type.
from trnlp import TrnlpWord
obj = TrnlpWord()
obj.setword("arkadaşlar")
print(obj.get_stem)
>> "arkadaş"
print(obj.get_stem_type)
>> "isim"
Due to the nature of the Turkish language, a word can have more than one correct analysis. During morphological analysis, trnlp creates a dictionary for each analysis. The ".get_morphology" attribute tries to return the dictionary of the most likely analysis among them.
from trnlp import TrnlpWord
obj = TrnlpWord()
obj.setword("koyun")
print(obj.get_morphology)
>> {'base': 'koyun', 'verifiedBase': 'koyun', 'baseType': ['isim'], 'baseProp': ['0'], 'etymon': 'Türkçe', 'event': 0, 'currentType': ['isim'], 'purview': '0', 'orgWord': 'koyun', 'word': 'koyun', 'suffixes': [], 'suffixPlace': [], 'suffixTypes': [], 'suffixProp': [], 'residual': ''}
The fields of the dictionary are as follows:
- 'base': The root of the word.
- 'verifiedBase': The verified root of the word. (The next example makes this clearer.)
- 'baseType': The type of the root.
- 'baseProp': Flags that describe the sound events of the root. (See Dictionaries)
- 'etymon': The origin of the root.
- 'event': 0 if no sound event occurred in the root, 1 if a sound event occurred.
- 'currentType': The list of types the word takes on after each suffix.
- 'purview': The class of the root. '0' if the class is not defined in the dictionary.
- 'orgWord': The word as entered by the user.
- 'word': The processed form of the entered word: converted to lowercase, with punctuation removed.
- 'suffixes': The list of the word's suffixes.
- 'suffixPlace': The list of table and row numbers of the word's suffixes in the suffix tables. (See Suffix Tables)
- 'suffixTypes': The list of the types of the word's suffixes. (See Suffix Tables)
- 'suffixProp': The list of properties that each suffix has. (See Suffix Tables)
- 'residual': The letters left over after the analysis. In fully analyzed words this field is empty.
from trnlp import TrnlpWord
obj = TrnlpWord()
obj.setword("oğlununmuş")
print(obj.get_morphology)
>> {'base': 'oğl', 'verifiedBase': 'oğul', 'baseType': ['isim', 'ünlem'], 'baseProp': ['UDUS'], 'etymon': 'Türkçe', 'event': 1, 'currentType': ['isim,ünlem', 'isim', 'isim', 'fiil'], 'purview': '0', 'orgWord': 'oğlununmuş', 'word': 'oğlununmuş', 'suffixes': ['un', 'un', 'muş'], 'suffixPlace': [(1, 6), (1, 17), (1, 38)], 'suffixTypes': ['İe2t', 'HeTyn', 'EfGçMiş'], 'suffixProp': [(0, 6), (0,), (0,)], 'residual': ''}
obj.setword("gelcem")
print(obj)
>> gel(fiil)+eceğ{Gkz}[2_6]+im{Ke1t}[2_20]
print(obj.get_morphology)
>> {'base': 'gel', 'verifiedBase': 'gel', 'baseType': ['fiil'], 'baseProp': ['GZ[ir]'], 'etymon': 'Türkçe', 'event': 0, 'currentType': ['fiil', 'fiil', 'fiil'], 'purview': '0', 'orgWord': 'gelcem', 'word': 'geleceğim', 'suffixes': ['eceğ', 'im'], 'suffixPlace': [(2, 6), (2, 20)], 'suffixTypes': ['Gkz', 'Ke1t'], 'suffixProp': [(0, 5), (0, 6)], 'residual': ''}
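In the examples on this page, 'currentType' holds one entry for the root plus one entry per suffix, so it can be lined up with 'verifiedBase' and 'suffixes' to follow how the word type changes. The helper trace_types below is a hypothetical illustration that relies on that observed pattern. The sample dictionary is a trimmed copy of the "oğlununmuş" result above.

```python
def trace_types(analysis):
    """Pair the root and each suffix with the word type in effect after it."""
    pieces = [analysis["verifiedBase"]] + list(analysis["suffixes"])
    return list(zip(pieces, analysis["currentType"]))


# Trimmed copy of the get_morphology result for "oğlununmuş".
example = {
    "verifiedBase": "oğul",
    "suffixes": ["un", "un", "muş"],
    "currentType": ["isim,ünlem", "isim", "isim", "fiil"],
}

for piece, word_type in trace_types(example):
    print(piece, "->", word_type)
```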
get_inf returns a list of all possible analyses.
from trnlp import TrnlpWord
obj = TrnlpWord()
obj.setword("arkadaşlar")
print(obj.get_inf)
>> [{'base': 'arka', 'verifiedBase': 'arka', 'baseType': ['isim', 'sıfat'], 'baseProp': ['0'], 'etymon': 'Türkçe', 'event': 0, 'currentType': ['isim,sıfat', 'isim', 'isim'], 'purview': '0', 'orgWord': 'arkadaşlar', 'word': 'arkadaşlar', 'suffixes': ['daş', 'lar'], 'suffixPlace': [(4, 26), (1, 1)], 'suffixTypes': ['İi', 'Çe'], 'suffixProp': [(2, 4), (2,)], 'residual': ''}, {'base': 'arka', 'verifiedBase': 'arka', 'baseType': ['isim', 'sıfat'], 'baseProp': ['0'], 'etymon': 'Türkçe', 'event': 0, 'currentType': ['isim,sıfat', 'isim', 'fiil'], 'purview': '0', 'orgWord': 'arkadaşlar', 'word': 'arkadaşlar', 'suffixes': ['daş', 'lar'], 'suffixPlace': [(4, 26), (1, 52)], 'suffixTypes': ['İi', 'EfKe3ç'], 'suffixProp': [(2, 4), (2,)], 'residual': ''}, {'base': 'arka', 'verifiedBase': 'arka', 'baseType': ['isim', 'sıfat'], 'baseProp': ['0'], 'etymon': 'Türkçe', 'event': 0, 'currentType': ['isim,sıfat', 'isim', 'fiil', 'fiil'], 'purview': '0', 'orgWord': 'arkadaşlar', 'word': 'arkadaşlar', 'suffixes': ['daş', 'la', 'r'], 'suffixPlace': [(4, 26), (5, 2), (2, 8)], 'suffixTypes': ['İi', 'İf', 'Gz'], 'suffixProp': [(2, 4), (2, 10), (1,)], 'residual': ''}]
The writeable function writes the dictionaries produced by the analysis in the standard format. Its "long" parameter can be True or False and controls whether the suffix tags in the output are written as abbreviations or as full descriptions.
from trnlp import TrnlpWord, writeable
obj = TrnlpWord()
obj.setword("arkadaşlar")
print(writeable(obj.get_morphology))
>> arka(isim,sıfat)+daş{İi}[4_26]+lar{Çe}[1_1]
print(writeable(obj.get_morphology, long=True))
>> arka(isim,sıfat)+daş{İsimden İsim Yapım Eki}[4_26]+lar{Çokluk Eki}[1_1]
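To show how the standard format relates to the dictionary fields, here is a minimal formatter sketch. It is not the implementation of writeable. The function name to_standard is hypothetical, it assumes the 'base' field is the one printed, and it covers only the short form (not long=True).

```python
def to_standard(analysis):
    """Build root(types)+suffix{tag}[table_row]... from an analysis dictionary."""
    text = f"{analysis['base']}({','.join(analysis['baseType'])})"
    for suffix, tag, (table, row) in zip(
        analysis["suffixes"], analysis["suffixTypes"], analysis["suffixPlace"]
    ):
        text += f"+{suffix}{{{tag}}}[{table}_{row}]"
    return text


# Trimmed copy of the first get_inf result for "arkadaşlar".
arkadaslar = {
    "base": "arka",
    "baseType": ["isim", "sıfat"],
    "suffixes": ["daş", "lar"],
    "suffixTypes": ["İi", "Çe"],
    "suffixPlace": [(4, 26), (1, 1)],
}
print(to_standard(arkadaslar))
```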
spelling() returns the syllables of the word as a list. If the word cannot be syllabified, it returns an empty list. If only syllabification is needed, the trnlp.helper.syllabification(word: str) -> list function can be used.
from trnlp import TrnlpWord
obj = TrnlpWord()
obj.setword("arkadaşlar")
print(obj.spelling())
>> ['ar', 'ka', 'daş', 'lar']
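For readers curious how such a split can be computed, the sketch below applies a simplified rule: every syllable has exactly one vowel, and a consonant directly before a vowel opens the new syllable. It is a generic illustration, not the code behind spelling() or trnlp.helper.syllabification. The name syllabify is hypothetical, the input is assumed to be lower case already, and consonant clusters at the start of a syllable in some loanwords may need extra rules.

```python
VOWELS = set("aeıioöuüâîû")


def syllabify(word):
    """Split a lower-case Turkish word into syllables; [] if it has no vowel."""
    vowel_positions = [i for i, ch in enumerate(word) if ch in VOWELS]
    if not vowel_positions:
        return []
    cuts = []
    for pos in vowel_positions[1:]:
        # A consonant right before the vowel belongs to the new syllable;
        # two adjacent vowels are split between them (sa-at).
        cuts.append(pos - 1 if word[pos - 1] not in VOWELS else pos)
    bounds = [0] + cuts + [len(word)]
    return [word[start:end] for start, end in zip(bounds, bounds[1:])]


print(syllabify("arkadaşlar"))
```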
is_negative() checks whether the word is "negative" and returns a float value between 0 and 1.
from trnlp import TrnlpWord
obj = TrnlpWord()
obj.setword("gelmemişler")
print(obj.is_negative())
>> 1.0
obj.setword("gelme")
print(obj.is_negative())
>> 0.5
obj.setword("değil")
print(obj.is_negative())
>> 0.5
obj.setword("hayır")
print(obj.is_negative())
>> 0.5
obj.setword("şekersiz")
print(obj.is_negative())
>> 1.0
is_plural() checks whether the word is "plural" and returns a float value between 0 and 1.
from trnlp import TrnlpWord
obj = TrnlpWord()
obj.setword("gelmemişler")
print(obj.is_plural())
>> 0.6666666666666666
obj.setword("orman")
print(obj.is_plural())
>> 1.0
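This page does not say how these scores are computed. Values such as 0.5 and 0.6666666666666666 would be consistent with a share of the candidate analyses, so the sketch below assumes that reading purely for illustration. The helper share_with_tags is hypothetical. Only the tag 'Çe' is confirmed above as the plural suffix (Çokluk Eki); treating 'EfKe3ç' as a plural marker is an assumption.

```python
def share_with_tags(analyses, tags):
    """Fraction of analyses that contain at least one of the given suffix tags."""
    if not analyses:
        return 0.0
    hits = sum(1 for analysis in analyses if tags & set(analysis["suffixTypes"]))
    return hits / len(analyses)


# 'suffixTypes' of the three get_inf results for "arkadaşlar" shown earlier.
analyses = [
    {"suffixTypes": ["İi", "Çe"]},
    {"suffixTypes": ["İi", "EfKe3ç"]},
    {"suffixTypes": ["İi", "İf", "Gz"]},
]
print(share_with_tags(analyses, {"Çe", "EfKe3ç"}))
```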
s_base(): When called without parameters, it returns the analysis with the shortest root among all the analyses found. If no analysis has been done, it returns an empty list. s_base(), l_base(), s_suffix() and l_suffix() can be passed to one another as parameters. For example:
obj.s_base(obj.s_suffix())
# Among all analyses, those with the fewest suffixes and the shortest root are returned.
l_base(): When called without parameters, it returns the analysis with the longest root among all the analyses found. If no analysis has been done, it returns an empty list. s_base(), l_base(), s_suffix() and l_suffix() can be passed to one another as parameters.
s_suffix(): When called without parameters, it returns the analyses with the fewest suffixes among all the analyses found. If no analysis has been done, it returns an empty list. s_base(), l_base(), s_suffix() and l_suffix() can be passed to one another as parameters.
l_suffix(): When called without parameters, it returns the analyses with the most suffixes among all the analyses found. If no analysis has been done, it returns an empty list. s_base(), l_base(), s_suffix() and l_suffix() can be passed to one another as parameters.
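The way these four selectors can be chained is easy to picture as plain list filters. The sketch below is an illustration under assumptions, not trnlp's implementation. The function names are hypothetical, every analysis tied for the minimum or maximum is kept, and the sample data is made up.

```python
def _select(analyses, key, pick):
    """Keep the analyses whose key equals the min or max key over the list."""
    if not analyses:
        return []
    target = pick(key(analysis) for analysis in analyses)
    return [analysis for analysis in analyses if key(analysis) == target]


def shortest_base(analyses):
    return _select(analyses, lambda a: len(a["base"]), min)


def longest_base(analyses):
    return _select(analyses, lambda a: len(a["base"]), max)


def fewest_suffixes(analyses):
    return _select(analyses, lambda a: len(a["suffixes"]), min)


def most_suffixes(analyses):
    return _select(analyses, lambda a: len(a["suffixes"]), max)


# Made-up candidate analyses, only to exercise the filters.
candidates = [
    {"base": "ar", "suffixes": ["ka", "daş", "lar"]},
    {"base": "arka", "suffixes": ["daş", "lar"]},
    {"base": "arka", "suffixes": ["daş", "la", "r"]},
]

# Same shape as obj.s_base(obj.s_suffix()): filter by suffix count, then by root length.
print(shortest_base(fewest_suffixes(candidates)))
```

Because each filter takes a list and returns a list, the order of chaining decides which criterion is applied first.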