-
Notifications
You must be signed in to change notification settings - Fork 24.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Allow custom characters in token_chars of ngram tokenizers #49250
Conversation
Pinging @elastic/es-search (:Search/Analysis) |
Currently the `token_chars` setting in both `edgeNGram` and `ngram` tokenizers only allows for a list of predefined character classes, which might not fit every use case. For example, including underscore "_" in a token would currently require the `punctuation` class which comes with a lot of other characters. This change adds an additional "custom" option to the `token_chars` setting, which requires an additional `custom_token_chars` setting to be present and which will be interpreted as a set of characters to inlcude into a token. Closes elastic#25894
@elasticmachine run elasticsearch-ci/packaging-sample-matrix |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, one nit about the exception message but no need for another go-round.
@@ -76,7 +79,22 @@ static CharMatcher parseTokenChars(List<String> characterClasses) { | |||
characterClass = characterClass.toLowerCase(Locale.ROOT).trim(); | |||
CharMatcher matcher = MATCHERS.get(characterClass); | |||
if (matcher == null) { | |||
throw new IllegalArgumentException("Unknown token type: '" + characterClass + "', must be one of " + MATCHERS.keySet()); | |||
if (characterClass.equals("custom") == false) { | |||
throw new IllegalArgumentException("Unknown token type: '" + characterClass + "', must be one of " + MATCHERS.keySet()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we need to include custom
in the list here as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, although the message here is a bit unwieldy as it is already. It seems to include a ton of unicode categories from the the java.lang.Character class via the MATCHERS, e.g. currently:
Unknown token type: 'letters', must be one of [symbol, private_use, paragraph_separator, start_punctuation, unassigned, enclosing_mark, connector_punctuation, letter_number, other_number, math_symbol, lowercase_letter, space_separator, surrogate, initial_quote_punctuation, decimal_digit_number, digit, other_punctuation, dash_punctuation, currency_symbol, non_spacing_mark, format, modifier_letter, control, uppercase_letter, other_symbol, end_punctuation, modifier_symbol, other_letter, line_separator, titlecase_letter, letter, punctuation, combining_spacing_mark, final_quote_punctuation, whitespace]
So the "custom" will be burried somewhere in there. If you have any ideas to improve this let me know, otherwise I'll merge this after I pushed the changes and tests passed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks
Currently the `token_chars` setting in both `edgeNGram` and `ngram` tokenizers only allows for a list of predefined character classes, which might not fit every use case. For example, including underscore "_" in a token would currently require the `punctuation` class which comes with a lot of other characters. This change adds an additional "custom" option to the `token_chars` setting, which requires an additional `custom_token_chars` setting to be present and which will be interpreted as a set of characters to inlcude into a token. Closes #25894
Currently the
token_chars
setting in bothedgeNGram
andngram
tokenizersonly allows for a list of predefined character classes, which might not fit
every use case. For example, including underscore "_" in a token would currently
require the
punctuation
class which comes with a lot of other characters.This change adds an additional "custom" option to the
token_chars
setting,which requires an additional
custom_token_chars
setting to be present andwhich will be interpreted as a set of characters to inlcude into a token.
Closes #25894