Word segmentation of LatexEquation123 #39

Closed
farleylai opened this issue Mar 21, 2019 · 4 comments

farleylai commented Mar 21, 2019

I recently found that some inputs are not segmented as expected. For instance, LatexEquation123 is segmented as La tex Equ at ion 123, whereas the expected output is Latex Equation 123. I checked the frequency entries in frequency_dictionary_en_82_765.txt, and both latex and equation are present.

Is this the expected behavior of the algorithm?

@mammothb
Owner

This is expected behavior. Because the input contains capital letters, lookup cannot exit early with an exact match (latex). And although latex is in the dictionary, the frequency of la (157960401) is much higher than that of latex (10502825), so Latex is split into La and tex.

If you pass in latexequation123 with max_edit_distance=0, the output will be latex equation 123 as expected.

@farleylai
Author

farleylai commented Mar 22, 2019

Thanks for the clarification.
Setting max_edit_distance to 1 gives the desired results for lowercase input.
However, for camel-case input like this, what I actually want is a case-insensitive match that ignores the edit distance introduced by the capital letters.
That feels somewhat different from increasing max_edit_distance by one.
Is there an option for this that retains the original case without adding capitalized words to the dictionary?

@mammothb
Owner

Currently there is no such option. The original author has suggested a possible solution but has not implemented it in the original code, and I am not sure how to implement it in the current code.

@farleylai
Author

Alright, it seems that lowercasing the input beforehand and restoring the capitalization afterwards could be the workaround for now.
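A minimal sketch of that workaround, under some assumptions: the segmenter only inserts spaces into the lowercased input (no spelling corrections, as with max_edit_distance=0), and lowercasing does not change the string length. The names restore_case, segment_preserving_case and fake_segment are hypothetical helpers for illustration and are not part of symspellpy.

```python
from typing import Callable


def restore_case(original: str, segmented: str) -> str:
    """Copy the casing of `original` onto `segmented`.

    Assumes `segmented` is `original.lower()` with extra spaces inserted
    and nothing else changed (no spelling corrections).
    """
    out = []
    i = 0  # current position in `original`
    for ch in segmented:
        # A space that is not in the original at this position was
        # inserted by the segmenter: keep it, do not advance `i`.
        if ch == " " and (i >= len(original) or original[i] != " "):
            out.append(ch)
            continue
        if i >= len(original) or original[i].lower() != ch:
            raise ValueError("segmented text does not match the original")
        out.append(original[i])  # emit the character with its original case
        i += 1
    if i != len(original):
        raise ValueError("segmented text is shorter than the original")
    return "".join(out)


def segment_preserving_case(text: str, segment: Callable[[str], str]) -> str:
    """Segment a lowercased copy of `text`, then restore the original case."""
    return restore_case(text, segment(text.lower()))


# Stand-in for a real segmenter so the sketch runs without a dictionary.
def fake_segment(lowered: str) -> str:
    return {"latexequation123": "latex equation 123"}[lowered]


print(segment_preserving_case("LatexEquation123", fake_segment))
# prints: Latex Equation 123
```

With symspellpy, segment could wrap word_segmentation called on the lowercased text with max_edit_distance=0 and return the segmented_string field of the result. That wiring is left out here so the sketch runs without the library or its dictionary file.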
