feat(querylang): language support for term tokenization #6269
The current tokenization method breaks the use of `anyofterms` and `allofterms` for languages like Chinese or Japanese. The issue was first raised here, where tokenization fails for languages in which the space character isn't used to delimit words (e.g. Chinese). This PR fixes that as a first step toward removing Bleve from term tokenization.
Additional language support is added to the `TermTokenizer`, so that `GetTokenizerByLang` works when a `TermTokenizer` gets passed in. If the term is in a language that we know doesn't use spaces to delimit words, a simple `strings.Split` is called instead, as in the sketch below.
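A minimal sketch of that dispatch, assuming a hard-coded language set and illustrative names rather than the exact identifiers used in the PR:

```go
package tok

import "strings"

// noSpaceDelimited lists languages known not to use the space character
// to delimit words. Illustrative set; the PR's actual list may differ.
var noSpaceDelimited = map[string]bool{
	"zh": true, // Chinese
	"ja": true, // Japanese
	"th": true, // Thai
}

// termTokens returns the term tokens for text in the given language tag.
func termTokens(text, lang string) []string {
	if noSpaceDelimited[lang] {
		// No space delimiters: a plain split keeps the phrase intact as a
		// single term instead of letting Bleve mis-segment it.
		return strings.Split(text, " ")
	}
	// Space-delimited languages keep the existing behavior; strings.Fields
	// stands in here for the Bleve-backed path in the real tokenizer.
	return strings.Fields(text)
}
```

For a space-delimited language nothing changes; for, say, `zh`, the whole phrase is kept as a single term so `anyofterms`/`allofterms` can match it exactly.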
**Issues That Require Resolution**

Currently no attempt is made to clean up punctuation and symbols. The reason is that this gets quite complicated with language-specific punctuation.
An example: `"﹏封神演義﹏"` is a valid string term, where the `"﹏"`, although a punctuation mark, is really more of a kind of "capitalization" in Chinese.

Another example: `"贝拉克·奥巴马"` (Barack Obama in Chinese) has a middle dot, which is part of the string term for formal names. In English, the correct tokenization of "Barack Obama" would be one word. Currently our tokenizer does NOT do the correct thing for such cases. In Chinese, this is made a lot easier by the presence of the middle dot.
"사과(沙果)는"
is one word in Korean (it means apple). Because the rules of Korean is agglutinative, if you split the word up into tokens"사과", "沙果", "는"
that's wrong, because "沙果" isn't hangul, and "는" marks that the parens is part of the wordOne way we can do this is to parse out all punctuation marks and replace them with spaces, then
strings.Split
would simply produce the same semantic results as the English tokenizer.This change is
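A minimal sketch of that punctuation-stripping approach, using the standard `unicode` classes rather than any language-specific rules (the function name is illustrative):

```go
package tok

import (
	"strings"
	"unicode"
)

// splitStrippingPunct replaces every punctuation or symbol rune with a
// space and then splits on whitespace, so the result matches what the
// English tokenizer would produce semantically.
func splitStrippingPunct(text string) []string {
	cleaned := strings.Map(func(r rune) rune {
		if unicode.IsPunct(r) || unicode.IsSymbol(r) {
			return ' '
		}
		return r
	}, text)
	return strings.Fields(cleaned)
}
```

Note that this naive version would also strip the middle dot in `"贝拉克·奥巴马"` and the `"﹏"` marks above, so language-specific exceptions would still be needed on top of it.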