
add nori_number token filter in analysis-nori #53583

Merged (5 commits into elastic:master on Mar 23, 2020)

Conversation

danmuzi
Contributor

@danmuzi danmuzi commented Mar 15, 2020

The KoreanNumberFilter has been included in Nori since Lucene 8.2.0 (LUCENE-8812).
However, it is not yet supported in the Nori analysis plugin. (Kuromoji supports kuromoji_number.)
It seems to have been missed (#30397) because KoreanNumberFilter did not exist in Lucene 7.4.0, the first version to support Nori.
This PR adds it.
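For context, a minimal sketch of how the new filter could be wired into an index's analysis settings (the index and analyzer names here are placeholders, not part of this PR):

```console
PUT /nori_sample
{
  "settings": {
    "analysis": {
      "analyzer": {
        "korean_numbers": {
          "type": "custom",
          "tokenizer": "nori_tokenizer",
          "filter": ["nori_number"]
        }
      }
    }
  }
}
```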

@jrodewig jrodewig added :Search Relevance/Analysis How text is split into tokens >docs General docs changes labels Mar 16, 2020
@elasticmachine
Collaborator

Pinging @elastic/es-search (:Search/Analysis)

@elasticmachine
Collaborator

Pinging @elastic/es-docs (>docs)

@jimczi jimczi added >feature v7.7.0 v8.0.0 and removed >docs General docs changes labels Mar 16, 2020
Contributor

@jimczi jimczi left a comment


Thanks for adding this filter, @danmuzi! I left one comment regarding the discard_punctuation option of the tokenizer, which was added to handle the number filter correctly.

docs/plugins/analysis-nori.asciidoc
@danmuzi
Contributor Author

danmuzi commented Mar 19, 2020

Thanks for your review, @jimczi 👍
I added some commits reflecting your comments.
However, I'm not sure it's right to include the discard_punctuation option in this PR, since this PR is for the nori_number token filter, so I submitted the commits separately.
If you think a new PR for discard_punctuation is better, I will write one; in that case I'll remove the discard_punctuation commit from this PR and rebase onto master after the discard_punctuation PR is merged.
What do you think?

Contributor

@jimczi jimczi left a comment


I left two minor comments, but the change looks good to me.

> But I'm not sure it's right to include the discard_punctuation option in this PR.
> Because this PR is for nori_number token filter.

I think it's OK, since the discard_punctuation option was added specifically for the number token filter. Let's add both in the same PR; thanks for separating the commits, though.

This filter does this kind of normalization, allowing a search for 3200 to match 3.2천 in text,
but it can also be used to build range facets based on the normalized numbers, and so on.

Notice that this analyzer uses a token composition scheme and relies on punctuation tokens, so the tokenizer's discard_punctuation option must be set to false for the filter to work correctly.
Contributor


Maybe add a NOTE: to emphasize this part?

Contributor Author


Done!

@danmuzi
Contributor Author

danmuzi commented Mar 20, 2020

Thanks, Jim.
I added a commit based on your review.
It adds a NOTE to the asciidoc and test cases for discard_punctuation.
Please check it!

@jimczi
Contributor

jimczi commented Mar 23, 2020

@elasticmachine ok to test

@danmuzi
Contributor Author

danmuzi commented Mar 23, 2020

I'm not sure why elasticsearch-ci/2, elasticsearch-ci/bwc, and elasticsearch-ci/default-distro failed.
The failures don't seem related, but I found an indentation problem in this PR.
I also rebased this patch onto the latest master branch commit.

@danmuzi
Contributor Author

danmuzi commented Mar 23, 2020

Because of my rebase mistake, the previous Jenkins build history has been lost in this conversation,
so I'm attaching the failed builds from before the rebase:
elasticsearch-ci/2 : https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+pull-request-2/19332/
elasticsearch-ci/bwc : https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+pull-request-bwc/19021/
elasticsearch-ci/default-distro : https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+pull-request+default-distro/18876/

After the rebase, all Jenkins builds passed.

Contributor

@jimczi jimczi left a comment


LGTM, thanks @danmuzi !

@danmuzi
Contributor Author

danmuzi commented Mar 23, 2020

Thanks for your kind reviews, @jimczi!

@jimczi jimczi merged commit 8d4ff29 into elastic:master Mar 23, 2020
jimczi pushed a commit that referenced this pull request Mar 23, 2020
This change adds the `nori_number` token filter.
It also adds a `discard_punctuation` option in nori_tokenizer that should be used in conjunction with the new filter.
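As a hedged end-to-end sketch of the merged feature (index, tokenizer, and analyzer names below are placeholders): the tokenizer is configured to keep punctuation so that `nori_number` can compose tokens such as 3.2천.

```console
PUT /my_nori_index
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "nori_keep_punct": {
          "type": "nori_tokenizer",
          "discard_punctuation": "false"
        }
      },
      "analyzer": {
        "korean_number_analyzer": {
          "tokenizer": "nori_keep_punct",
          "filter": ["nori_number"]
        }
      }
    }
  }
}

GET /my_nori_index/_analyze
{
  "analyzer": "korean_number_analyzer",
  "text": "3.2천"
}
```

Per the description above, the filter normalizes 3.2천, so a search for 3200 matches.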