Predicts garbage for Bengali input #110

hafiz031 · 2022-01-06T03:17:49Z

I am trying the "lookup_compound | Keep original casing" example from the documentation on a Bengali corpus of unigrams and bigrams, using a comma as the separator. It does not seem to work: for any misspelled input it just outputs a garbage string.

mammothb · 2022-01-06T03:37:02Z

Do you have a sample code snippet that reproduces the error? If possible, can you also provide the dictionary files you used? A snippet of the dictionary is fine if the full file is too large.

hafiz031 · 2022-01-06T04:44:29Z

@mammothb here you go!
unigrams.txt

bigrams.txt

hafiz031 · 2022-01-06T04:47:58Z

@mammothb for example, try to correct রিসেট (which is already spelled correctly, by the way). It splits the word into separate letters and drops some characters, so the output looks like: র স ট

mammothb · 2022-01-06T14:51:24Z

Have you tried setting split_by_space=True in lookup_compound? I was able to get the same output as the input with the following code:

from pathlib import Path

from symspellpy import SymSpell

sym_spell = SymSpell(max_dictionary_edit_distance=2, prefix_length=7)

sym_spell.load_dictionary(
    Path(__file__).resolve().parent / "unigrams.txt",
    term_index=0,
    count_index=1,
    separator=",",
    encoding="utf-8",
)
sym_spell.load_bigram_dictionary(
    Path(__file__).resolve().parent / "bigrams.txt",
    term_index=0,
    count_index=1,
    separator=",",
    encoding="utf-8",
)

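input_term = "রিসেট"
# split_by_space=True splits the phrase on spaces instead of using the
# default regex-based word split.
suggestions = sym_spell.lookup_compound(
    input_term, max_edit_distance=2, split_by_space=True
)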
for suggestion in suggestions:
    print(suggestion)
print(input_term)

Output:

রিসেট, 0, 1485
রিসেট

hafiz031 · 2022-01-06T17:01:44Z

@mammothb no, I just used the example as it is in the documentation... I didn’t change any parameters.

mammothb · 2022-01-06T23:05:55Z

Try it and see if split_by_space=True works for you. The default split uses a regex, which I think doesn't work well with Bengali.
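A minimal sketch of why a regex-based split might produce র স ট, assuming the default tokenizer relies on a `\w`-style word-character class. The pattern below is a hypothetical stand-in for illustration and is not necessarily the exact one symspellpy uses. In Python's `re` module, `\w` matches alphanumeric characters but not combining marks, and Bengali dependent vowel signs are combining marks, so the word breaks apart at each vowel sign:

```python
import re
import unicodedata

# "রিসেট" written with escapes so each code point is explicit:
# RA, VOWEL SIGN I, SA, VOWEL SIGN E, TTA
word = "\u09b0\u09bf\u09b8\u09c7\u099f"

# Unicode general category per code point: the consonants are letters ("Lo"),
# the dependent vowel signs are spacing combining marks ("Mc").
categories = [unicodedata.category(ch) for ch in word]

# Hypothetical \w-style tokenizer pattern (an assumption for illustration).
# Combining marks are not matched, so they act as separators and are dropped.
regex_tokens = re.findall(r"[^\W_]+", word)

# Splitting on whitespace leaves the word intact.
space_tokens = word.split()

print(categories)
print(regex_tokens)
print(space_tokens)
```

Under that assumption, the regex split yields only the three consonants, which matches the garbage output reported above, while the whitespace split returns the original word unchanged.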

hafiz031 · 2022-01-07T08:51:51Z

@mammothb yes it works, thanks!

hafiz031 mentioned this issue Jan 6, 2022

Predicts garbage for Bengali input wolfgarbe/SymSpell#119

Open

hafiz031 closed this as completed Jan 7, 2022
