feat(querylang): language support for term tokenization #6269

Merged · 2 commits into master · Sep 7, 2020

Conversation

@chewxy (Contributor) commented on Aug 25, 2020

The current tokenization method breaks the use of anyofterms and allofterms for languages like Chinese or Japanese. The issue was first raised here: tokenization fails in languages where the space character isn't used to delimit words (e.g. Chinese).

This PR fixes that as a first step toward removing Bleve from term tokenization.

Additional language support is added to the TermTokenizer, so that GetTokenizerByLang works when a TermTokenizer is passed in. If the term is in a language that we know doesn't use spaces to delimit words, a simple strings.Split is called.
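A minimal sketch of the dispatch, not the exact code in this PR: only GetTokenizerByLang, TermTokenizer, and the use of strings.Split come from the description above; the helper names, the language list, and the empty-separator split are assumptions for illustration.

```go
// Hypothetical sketch of language-aware term tokenization.
package tok

import "strings"

// langNoSpace lists languages assumed not to use spaces between words.
var langNoSpace = map[string]bool{
	"zh": true, // Chinese
	"ja": true, // Japanese
	"th": true, // Thai
}

// termTokens splits a term into tokens. For languages that don't use
// spaces to delimit words, strings.Split with an empty separator breaks
// the input after each UTF-8 sequence, i.e. one token per character.
// Otherwise we split on whitespace as the English tokenizer would.
func termTokens(term, lang string) []string {
	if langNoSpace[lang] {
		return strings.Split(term, "")
	}
	return strings.Fields(term)
}
```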

Issues That Require Resolution

Currently no attempt is made to clean up punctuation and symbols, because doing so gets quite complicated with language-specific punctuation.

An example: "﹏封神演義﹏" is a valid string term; the "﹏", although a punctuation mark, acts more like a kind of "capitalization" in Chinese.

Another example: "贝拉克·奥巴马" (Barack Obama in Chinese) contains a middle dot, which is part of the string term for formal names. In English, the correct tokenization of "Barack Obama" would be one word, and our tokenizer currently does NOT do the correct thing for such cases. In Chinese, this is made a lot easier by the presence of the middle dot.

Another example: "사과(沙果)는" is one word in Korean (it means apple). Because Korean is agglutinative, splitting it into the tokens "사과", "沙果", and "는" would be wrong: "沙果" isn't hangul, and "는" is a particle marking that the parenthesized part belongs to the word.

One way to handle this is to parse out all punctuation marks and replace them with spaces; strings.Split would then produce the same semantic results as the English tokenizer.
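A hedged sketch of that idea, not something this PR implements, assuming Go's unicode.IsPunct and unicode.IsSymbol are an acceptable approximation of "punctuation marks":

```go
package tok

import (
	"strings"
	"unicode"
)

// stripPunct replaces every punctuation or symbol rune with a space and
// then splits on whitespace. Note that this would also split
// "贝拉克·奥巴马" at the middle dot and "사과(沙果)는" at the parentheses,
// which is exactly the language-specific subtlety described above.
func stripPunct(term string) []string {
	cleaned := strings.Map(func(r rune) rune {
		if unicode.IsPunct(r) || unicode.IsSymbol(r) {
			return ' '
		}
		return r
	}, term)
	return strings.Fields(cleaned)
}
```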



@CLAassistant commented on Aug 25, 2020

CLA assistant check
All committers have signed the CLA.

@manishrjain (Contributor) left a comment


:lgtm:

Reviewed 1 of 3 files at r1, 2 of 2 files at r2.
Reviewable status: :shipit: complete! all files reviewed, all discussions resolved (waiting on @vvbalaji-dgraph)

@chewxy merged commit 20a067b into master on Sep 7, 2020
@joshua-goldstein deleted the chewxy/termtokenizer branch on August 11, 2022