Word segmentation of LatexEquation123 #39

farleylai · 2019-03-21T17:27:04Z

I recently found that some inputs are not segmented as expected. For instance, LatexEquation123 is segmented as La tex Equ at ion 123, whereas the expected output is Latex Equation 123. I checked frequency_dictionary_en_82_765.txt and found entries for both latex and equation.

Is this the expected behavior of the algorithm?


mammothb · 2019-03-22T02:39:25Z

This is expected behavior. Because the input contains capital letters, the lookup cannot exit early with an exact match (latex). And although latex is in the dictionary, the frequency of la (157960401) is much higher than that of latex (10502825), so Latex is split into La and tex.

If you pass in latexequation123 with max_edit_distance=0, the output will be latex equation 123 as expected.
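To illustrate the point about capital letters, here is a minimal, self-contained sketch. It is not the library's lookup code: the `FREQ` dict and the `levenshtein` helper are toy stand-ins introduced for illustration, and only the two frequency counts are taken from the comment above. It shows that a case-sensitive exact lookup misses Latex, and that the capital letter alone costs one edit.

```python
# Toy stand-in for the frequency dictionary; the two counts are the ones quoted above.
FREQ = {"la": 157960401, "latex": 10502825}


def levenshtein(a: str, b: str) -> int:
    """Plain Levenshtein distance (hypothetical helper, not the library's implementation)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


exact_hit_original = "Latex" in FREQ        # case-sensitive lookup misses
exact_hit_lowered = "latex" in FREQ         # lowercase form is an exact match
case_cost = levenshtein("Latex", "latex")   # the capital L counts as one edit
```

Under these assumptions, the capitalized token can only reach latex through an edit, which is why it competes with the much more frequent la.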

farleylai · 2019-03-22T05:48:42Z

Thanks for the clarification.
Setting max_edit_distance to 1 gives the desired result for lowercase input.
In this camel-case input, however, what I actually want is a case-insensitive match that ignores the edit distance introduced by the capital letter.
That feels somewhat different from increasing max_edit_distance by one.
Is there an option for this that retains the original case without adding capitalized words to the dictionary?

mammothb · 2019-03-22T05:57:17Z

Currently there is no such option. The original author has suggested a possible solution but has not implemented it in the original code, and I am not sure how to implement it in the current code.

farleylai · 2019-03-22T06:02:57Z

Alright, it seems that lowercasing the input beforehand and restoring the capitalization afterwards could be the workaround for now.
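A rough sketch of that workaround, in case it helps anyone else. The names `restore_case` and `segment_preserving_case` are hypothetical helpers, not part of the library, and `segment` is any callable that takes a lowercase string and returns it with spaces inserted. The sketch assumes the segmenter only inserts spaces and does not correct any characters, as with max_edit_distance=0.

```python
from typing import Callable


def restore_case(original: str, segmented: str) -> str:
    """Copy the casing of `original` onto `segmented`.

    Assumes `segmented` has the same characters as `original`
    (ignoring case) plus inserted spaces, and nothing else.
    """
    out = []
    i = 0  # position in `original`
    for ch in segmented:
        if i < len(original) and ch.lower() == original[i].lower():
            out.append(original[i])  # same character: keep the original casing
            i += 1
        elif ch == " ":
            out.append(ch)           # space inserted by the segmenter
        else:
            raise ValueError("segmented text does not align with the original")
    if i != len(original):
        raise ValueError("segmented text is shorter than the original")
    return "".join(out)


def segment_preserving_case(text: str, segment: Callable[[str], str]) -> str:
    """Lowercase, segment with the supplied callable, then restore the casing."""
    return restore_case(text, segment(text.lower()))
```

With the library, `segment` would presumably wrap the word segmentation call with max_edit_distance=0 and return its corrected string. I have not checked that against every version, so treat that wiring as an assumption.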

mammothb closed this as completed Mar 22, 2019
