Predicts garbage for Bengali input #110

Closed
hafiz031 opened this issue Jan 6, 2022 · 7 comments


hafiz031 commented Jan 6, 2022

I am trying the "lookup_compound | Keep original casing" example from the documentation on a Bengali corpus of unigrams and bigrams, with a comma as the separator. It does not seem to work: for any misspelled input it just outputs a garbage string.

Owner

mammothb commented Jan 6, 2022

Do you have a sample code snippet that reproduces your error? If possible, can you also provide the dictionary files you used? If the files are too large, a snippet of each dictionary is probably enough.

Author

hafiz031 commented Jan 6, 2022

@mammothb here you go!
unigrams.txt

bigrams.txt

Author

hafiz031 commented Jan 6, 2022

@mammothb for example, try to correct রিসেট (which is already spelled correctly, by the way). It splits the word into individual letters, drops some characters, and outputs something like: র স ট

Owner

mammothb commented Jan 6, 2022

Have you tried setting split_by_space=True in lookup_compound? With the following code, I get the same output as the input:

from pathlib import Path

from symspellpy import SymSpell

sym_spell = SymSpell(max_dictionary_edit_distance=2, prefix_length=7)

sym_spell.load_dictionary(
    Path(__file__).resolve().parent / "unigrams.txt",
    term_index=0,
    count_index=1,
    separator=",",
    encoding="utf-8",
)
sym_spell.load_bigram_dictionary(
    Path(__file__).resolve().parent / "bigrams.txt",
    term_index=0,
    count_index=1,
    separator=",",
    encoding="utf-8",
)

input_term = "রিসেট"
suggestions = sym_spell.lookup_compound(
    input_term, max_edit_distance=2, split_by_space=True
)
for suggestion in suggestions:
    print(suggestion)
print(input_term)

Output:

রিসেট, 0, 1485
রিসেট

Author

hafiz031 commented Jan 6, 2022

@mammothb no, I used the example exactly as it appears in the documentation. I didn’t change any parameters.

Owner

mammothb commented Jan 6, 2022

Try it and see whether split_by_space=True works for you. The default split uses a regex, which I think doesn't work well with Bengali.
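A minimal sketch of why a regex-based word split can break Bengali text, using only the Python standard library. The pattern below is an illustrative assumption, not necessarily the exact regex symspellpy uses. Python's `\w` covers letters and digits but not combining marks, and the Bengali vowel signs in রিসেট are combining marks (Unicode category Mc), so a word-character pattern drops them and cuts the word at each one. Splitting on whitespace leaves the word intact.

```python
import re
import unicodedata

word = "রিসেট"

# Illustrative word pattern (an assumption for this sketch, not taken
# from the symspellpy source): runs of word characters, no underscore.
WORD_PATTERN = re.compile(r"[^\W_]+")

# Show each code point and its Unicode general category.
# The vowel signs are "Mc" (spacing combining marks), which \w does not match.
categories = [unicodedata.category(ch) for ch in word]
for ch, cat in zip(word, categories):
    print(f"U+{ord(ch):04X} {cat} {unicodedata.name(ch)}")

# Regex-based split: the vowel signs are skipped and act as separators.
regex_tokens = WORD_PATTERN.findall(word)
print(regex_tokens)  # ['র', 'স', 'ট']

# Whitespace-based split: the word survives unchanged.
space_tokens = word.split()
print(space_tokens)  # ['রিসেট']
```

The regex tokens match the garbled output reported above (র স ট), which is consistent with split_by_space=True avoiding the problem.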

Author

hafiz031 commented Jan 7, 2022

@mammothb yes it works, thanks!

hafiz031 closed this as completed Jan 7, 2022