improvement of InformalNormalizer #214
Please remove the test .txt files and any other files that do not need to be in the repo.
Some recommendations are given below to improve speed and code clarity.
Please measure the speed before and after modifying the code.
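One way to do the before/after speed check is a small `timeit` harness, sketched below. The functions `normalize_before` and `normalize_after` are hypothetical stand-ins, not code from this PR; in practice they would wrap the old and new InformalNormalizer calls on the same input.

```python
import timeit


def benchmark(func, sample, repeat=5, number=200):
    """Return the best observed time per call, in seconds, for func(sample)."""
    timings = timeit.repeat(lambda: func(sample), repeat=repeat, number=number)
    # The minimum is the least noisy estimate; each timing covers `number` calls.
    return min(timings) / number


# Hypothetical stand-in for the implementation before the change:
# collapses runs of spaces one character at a time.
def normalize_before(text):
    out = []
    for ch in text:
        if ch == " " and out and out[-1] == " ":
            continue
        out.append(ch)
    return "".join(out).strip()


# Hypothetical stand-in for the implementation after the change.
def normalize_after(text):
    return " ".join(text.split())


if __name__ == "__main__":
    sample = "من  مگر   این را بهت نگفتم ؟ " * 200
    # Confirm both versions agree before comparing their speed.
    assert normalize_before(sample) == normalize_after(sample)
    print("before:", benchmark(normalize_before, sample))
    print("after: ", benchmark(normalize_after, sample))
```

Checking that both versions return the same output first keeps the comparison honest: a faster function that changes the result is not an improvement.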
Code/TurnSemiSpaceFileToSpaceFile.py
Outdated
# sourceFileAddress = "./output-test-formal.txt"
# destinationFileAddress = "./output-test-formal-space.txt"
sourceFileAddress = "./shekasteh-test.tok.formal"
destinationFileAddress = "./shekasteh-test-space.tok.formal"


def main(sourceAddress,destinationAddress):
    with open(sourceAddress, "r", encoding='utf-8') as readFile, open(destinationAddress, "w", encoding='utf-8') as writeFile:
        while True:
            line = readFile.readline().strip()
            if not line:
                break
            # The search strings in the next two lines appear empty in this view; they are
            # presumably invisible zero-width characters such as U+200C (the Persian semi-space).
            line = line.replace('', ' ')
            line = line.replace('', ' ')
            line = line.replace('.', '')
            line = line.replace('؟', '')
            line = line.replace('!', '')
            writeFile.write(line + "\n")


main(sourceFileAddress,destinationFileAddress)
This file is not necessary to include in the repository.
Code/output-dev-broken.txt
Outdated
@@ -0,0 +1,917 @@
. باید جدا بشویم تا فضای بیشتری رو بتوانیم چک کنیم
Not necessary to include in the repo.
Code/output-dev-formal.txt
Outdated
@@ -0,0 +1,917 @@
. باید جدا بشویم تا فضای بیشتری را بتوانیم چک کنیم
Not necessary to include in the repo.
Code/output-test-broken.txt
Outdated
@@ -0,0 +1,1012 @@
من مگر این را بهت نگفتم ؟
Not necessary to include in the repo.
Code/output-test-formal.txt
Outdated
@@ -0,0 +1,1012 @@
من مگر این را بهت نگفتم ؟
Not necessary to include in the repo.
hazm/InformalNormalizer.py
Outdated
def appendSuffixToWord(OneCollectionOfWordAndSuffix):
    mainWord = OneCollectionOfWordAndSuffix["word"]
    suffixList = OneCollectionOfWordAndSuffix["suffix"]
    adhesiveAlphabet = ["ب", "پ", "ت", "ث", "ج", "چ", "ح", "خ", "س", "ش", "ص", "ض", "ع", "غ", "ف", "ق", "ک", "گ", "ل", "م", "ن", "ه", "ی"]
Convert it to a set to speed up membership checks.
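A minimal sketch of what this suggestion could look like, assuming the list is only used to test whether a word ends with one of these letters. The names `ADHESIVE_LETTERS`, `ends_with_adhesive_letter`, and `ends_with_adhesive_letter_loop` are hypothetical and not taken from the PR.

```python
# Built once at module level instead of on every call to the function.
ADHESIVE_LETTERS = frozenset([
    "ب", "پ", "ت", "ث", "ج", "چ", "ح", "خ", "س", "ش", "ص", "ض",
    "ع", "غ", "ف", "ق", "ک", "گ", "ل", "م", "ن", "ه", "ی",
])


def ends_with_adhesive_letter(word):
    """Set-based check: a single hash lookup on the last character."""
    return bool(word) and word[-1] in ADHESIVE_LETTERS


def ends_with_adhesive_letter_loop(word):
    """List-style check for comparison: scans the letters one by one."""
    for letter in ADHESIVE_LETTERS:
        if word.endswith(letter):
            return True
    return False
```

Because every entry is a single character, checking `word[-1]` against the set gives the same answer as looping over `endswith`, while avoiding a scan of up to 23 letters per word.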
hazm/InformalNormalizer.py
Outdated
# if suffixList[i] == "هاست":
#     for alphabet in adhesiveAlphabet:
#         if returnWord.endswith(alphabet):
#             returnWord += ""
#             break
#     returnWord += "ها است"
Please remove commented-out code if it is no longer needed.
Code/standardize.py
Outdated
@@ -0,0 +1,146 @@
# from break_words import *
This file should not be added to the repository.
eliminate comments
eliminate unnecessary files
delete duplicate code
speed up InformalNormalizer
No description provided.