add nori_number token filter in analysis-nori #53583
Pinging @elastic/es-search (:Search/Analysis)
Pinging @elastic/es-docs (>docs)
Thanks for adding this filter, @danmuzi! I left one comment regarding the `discard_punctuation` option of the tokenizer that was added to handle the number filter correctly.
Thanks for your review, @jimczi 👍
I left two minor comments but the change looks good to me.
However, I'm not sure it's right to include the `discard_punctuation` option in this PR, because this PR is for the `nori_number` token filter.
I think it's OK, since the `discard_punctuation` option was added specifically for the number token filter. Let's add both in the same PR. Thanks for separating the commits, though.
plugins/analysis-nori/src/test/java/org/elasticsearch/index/analysis/NoriAnalysisTests.java
This filter does this kind of normalization and allows a search for 3200 to match 3.2천 in text,
but can also be used to make range facets based on the normalized numbers and so on.

Notice that this analyzer uses a token composition scheme and relies on punctuation tokens
Maybe add a `NOTE:` to emphasize this part?
Done!
Thanks, Jim.
@elasticmachine ok to test
3864137 to 1a4367e
I'm not sure why elasticsearch-ci/2, elasticsearch-ci/bwc, and elasticsearch-ci/default-distro failed.
Because of my rebase mistake, the previous Jenkins build history has been lost in this conversation. After the rebase, all Jenkins builds passed.
LGTM, thanks @danmuzi!
Thanks for your kind reviews, @jimczi!
This change adds the `nori_number` token filter. It also adds a `discard_punctuation` option in nori_tokenizer that should be used in conjunction with the new filter.
The `KoreanNumberFilter` has been included in Nori since Lucene 8.2.0 (LUCENE-8812). However, it isn't currently supported in the Nori analysis plugin (Kuromoji supports `kuromoji_number`).
It seems to be missing (#30397) because `KoreanNumberFilter` didn't exist in Lucene 7.4.0, the first version to support Nori.
This PR addresses that.
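
To make the normalization concrete (the documentation added in this PR gives the example of a search for 3200 matching 3.2천), here is a minimal, self-contained Java sketch. It is not Lucene's `KoreanNumberFilter` and does not use the Lucene API: the class `KoreanNumberSketch`, its `normalize` method, and the multiplier table are hypothetical names introduced only for illustration, and the sketch assumes the input is an Arabic-digit number followed by at most one Korean multiplier character.

```java
import java.math.BigDecimal;
import java.util.Map;

// Illustrative sketch only; NOT the Lucene KoreanNumberFilter implementation.
final class KoreanNumberSketch {
    // Hypothetical multiplier table, written with Unicode escapes so the
    // source compiles regardless of the platform's default encoding.
    private static final Map<Character, BigDecimal> MULTIPLIERS = Map.of(
        '\uC2ED', new BigDecimal("10"),             // sip   (U+C2ED), ten
        '\uBC31', new BigDecimal("100"),            // baek  (U+BC31), hundred
        '\uCC9C', new BigDecimal("1000"),           // cheon (U+CC9C), thousand
        '\uB9CC', new BigDecimal("10000"),          // man   (U+B9CC), ten thousand
        '\uC5B5', new BigDecimal("100000000"),      // eok   (U+C5B5), hundred million
        '\uC870', new BigDecimal("1000000000000")); // jo    (U+C870), trillion

    // Normalizes "<digits>[multiplier]" to a plain decimal string.
    // Assumption: at most one trailing multiplier; Hangul digits are not handled.
    public static String normalize(String token) {
        if (token.isEmpty()) {
            throw new IllegalArgumentException("empty token");
        }
        char last = token.charAt(token.length() - 1);
        BigDecimal multiplier = MULTIPLIERS.get(last);
        String digits = (multiplier == null)
            ? token
            : token.substring(0, token.length() - 1);
        // Drop thousands separators before parsing the numeric part.
        BigDecimal value = new BigDecimal(digits.replace(",", ""));
        if (multiplier != null) {
            value = value.multiply(multiplier);
        }
        // Remove the fractional ".0" left over from the multiplication.
        return value.stripTrailingZeros().toPlainString();
    }

    public static void main(String[] args) {
        // "3.2" followed by cheon (thousand)
        System.out.println(normalize("3.2\uCC9C")); // prints 3200
    }
}
```

Because both the indexed text and the query are reduced to the same plain number, a query for 3200 and text containing 3.2천 end up with the same term, which is the matching behavior the documentation describes.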