feat(querylang): language support for term tokenization #6269

Merged · 2 commits into master · Sep 7, 2020

Conversation

@chewxy (Contributor) commented on Aug 25, 2020

The current tokenization method breaks the use of anyofterms and allofterms for languages like Chinese or Japanese. The issue was first raised here: tokenization fails in languages where the space character isn't used to delimit words (e.g. Chinese).

This PR fixes that as a first step toward removing Bleve from term tokenization.

Additional language support is added to the TermTokenizer, so that GetTokenizerByLang works when a TermTokenizer is passed in. If the term is in a language that we know doesn't use spaces to delimit words, a simple strings.Split is called.
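A minimal sketch of the dispatch, not the exact code in this PR: only GetTokenizerByLang, TermTokenizer, and the use of strings.Split come from the description above; the helper names, the language list, and the empty-separator split are assumptions for illustration.

```go
// Hypothetical sketch of language-aware term tokenization.
package tok

import "strings"

// langNoSpace lists languages assumed not to use spaces between words.
var langNoSpace = map[string]bool{
	"zh": true, // Chinese
	"ja": true, // Japanese
	"th": true, // Thai
}

// termTokens splits a term into tokens. For languages that don't use
// spaces to delimit words, strings.Split with an empty separator breaks
// the input after each UTF-8 sequence, i.e. one token per character.
// Otherwise we split on whitespace as the English tokenizer would.
func termTokens(term, lang string) []string {
	if langNoSpace[lang] {
		return strings.Split(term, "")
	}
	return strings.Fields(term)
}
```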

Issues That Require Resolution

Currently no attempt is made to clean up punctuation and symbols, because doing so gets quite complicated with language-specific punctuation.

An example: "﹏封神演義﹏" is a valid string term; the "﹏", although a punctuation mark, acts more like a kind of "capitalization" in Chinese.

Another example: "贝拉克·奥巴马" (Barack Obama in Chinese) contains a middle dot, which is part of the string term for formal names. In English, the correct tokenization of "Barack Obama" would be one word, and our tokenizer currently does NOT do the correct thing for such cases. In Chinese, this is made a lot easier by the presence of the middle dot.

Another example: "사과(沙果)는" is one word in Korean (it means apple). Because Korean is agglutinative, splitting it into the tokens "사과", "沙果", and "는" would be wrong: "沙果" isn't hangul, and "는" is a particle marking that the parenthesized part belongs to the word.

One way to handle this is to parse out all punctuation marks and replace them with spaces; strings.Split would then produce the same semantic results as the English tokenizer.
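A hedged sketch of that idea, not something this PR implements, assuming Go's unicode.IsPunct and unicode.IsSymbol are an acceptable approximation of "punctuation marks":

```go
package tok

import (
	"strings"
	"unicode"
)

// stripPunct replaces every punctuation or symbol rune with a space and
// then splits on whitespace. Note that this would also split
// "贝拉克·奥巴马" at the middle dot and "사과(沙果)는" at the parentheses,
// which is exactly the language-specific subtlety described above.
func stripPunct(term string) []string {
	cleaned := strings.Map(func(r rune) rune {
		if unicode.IsPunct(r) || unicode.IsSymbol(r) {
			return ' '
		}
		return r
	}, term)
	return strings.Fields(cleaned)
}
```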



@CLAassistant commented on Aug 25, 2020

CLA assistant check
All committers have signed the CLA.

@manishrjain (Contributor) left a comment


:lgtm:

Reviewed 1 of 3 files at r1, 2 of 2 files at r2.
Reviewable status: :shipit: complete! all files reviewed, all discussions resolved (waiting on @vvbalaji-dgraph)

@chewxy merged commit 20a067b into master on Sep 7, 2020
@joshua-goldstein deleted the chewxy/termtokenizer branch on August 11, 2022