Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Fix the out of alphabet token handling in analyses generation
Solves #45 Consider alphanumeric characters to be part of the vocabulary.
944ed25
Under a Unicode locale, std::iswalnum returns true for some Unicode characters that are intuitively not alphanumeric. Please see the example at the following URL.
https://en.cppreference.com/w/cpp/string/wide/iswalnum
With this code, lt-proc analysis will treat some Unicode words as unknown even if they are defined in the monodix.
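To make the locale dependence concrete, here is a minimal sketch of how one might probe std::iswalnum under different locales. The helper names try_locale and is_wide_alnum are hypothetical and only for illustration; they are not part of lttoolbox.

```cpp
#include <clocale>
#include <cwctype>

// Hypothetical helper: try to switch the global C locale.
// Returns false if the named locale is not installed on this system.
bool try_locale(const char* name) {
    return std::setlocale(LC_ALL, name) != nullptr;
}

// Hypothetical helper: ask the currently active locale whether `c`
// counts as alphanumeric. The answer can change after try_locale().
bool is_wide_alnum(wchar_t c) {
    return std::iswalnum(static_cast<std::wint_t>(c)) != 0;
}
```

According to the cppreference page linked above, is_wide_alnum(L'\u13ad') (Cherokee letter HA) would be false in the default locale and true after try_locale("en_US.utf8"). The actual result depends on the platform's locale data and on whether that locale is installed.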
944ed25
What example at https://en.cppreference.com/w/cpp/string/wide/iswalnum? The Cherokee letter HA ('Ꭽ')?
Can you give a concrete example of something marked as Unknown even if defined in monodix, along with the relevant extracts from monodix? I don't understand how this code could contribute to that. (You might want to open an issue)
944ed25
Aha, so the problem is actually that lttoolbox expects analyses to be delimited by blanks. If you have an empty alphabet, in lttoolbox 3.5.0, then any symbol is a potential blank, so that allows 보면 to get two analyses even though they're not separated by regular spaces. But if you put any of the symbols into <alphabet>, they would stop being potential blanks, so e.g. on trying to analyse 보 we would see that the next symbol is alphabetic and so the analysis "has to" include the next symbol in the form, so 보 alone can't get an analysis.
The change in 3.5.1 puts any alphanumeric into <alphabet> even if you don't do it yourself, basically the same as saying <alphabet>abcd…보면…食べない…etc</alphabet>. I've opened an issue to support your use-case: #75
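The behaviour described above can be illustrated with a toy model. This is only a sketch under simplifying assumptions, not lttoolbox's actual code: the dictionary is a plain set of surface forms, and Token, boundary_ok and toy_analyse are hypothetical names. The one rule it models is that a match is accepted only if the following symbol could be a blank, i.e. is not in the alphabet.

```cpp
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Hypothetical result type: a surface form and whether it got an analysis.
struct Token {
    std::u32string form;
    bool known;
};

// A form may end at `end` only if the next symbol is a potential blank:
// either the input is finished or the symbol is not in the alphabet.
bool boundary_ok(const std::u32string& in, std::size_t end,
                 const std::set<char32_t>& alphabet) {
    return end == in.size() || alphabet.count(in[end]) == 0;
}

// Toy left-to-right, longest-match analysis over `in`.
std::vector<Token> toy_analyse(const std::u32string& in,
                               const std::set<std::u32string>& dict,
                               const std::set<char32_t>& alphabet) {
    std::vector<Token> out;
    std::size_t pos = 0;
    while (pos < in.size()) {
        // Longest dictionary form starting at pos that ends at a valid boundary.
        std::size_t best = 0;
        for (std::size_t len = in.size() - pos; len >= 1; --len) {
            if (dict.count(in.substr(pos, len)) != 0 &&
                boundary_ok(in, pos + len, alphabet)) {
                best = len;
                break;
            }
        }
        if (best > 0) {
            out.push_back({in.substr(pos, best), true});
            pos += best;
        } else {
            // No analysis: the unknown form extends to the next valid boundary.
            std::size_t end = pos + 1;
            while (!boundary_ok(in, end, alphabet)) ++end;
            out.push_back({in.substr(pos, end - pos), false});
            pos = end;
        }
    }
    return out;
}
```

With a dictionary containing 보 and 면 and an empty alphabet, the input 보면 yields two known tokens in this model. Once both symbols are placed in the alphabet, 보 can no longer end before 면, so the whole of 보면 comes out as a single unknown token.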