Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow custom characters in token_chars of ngram tokenizers #49250

Merged
merged 2 commits into from
Nov 20, 2019

Conversation

cbuescher
Copy link
Member

Currently the token_chars setting in both edgeNGram and ngram tokenizers
only allows for a list of predefined character classes, which might not fit
every use case. For example, including underscore "_" in a token would currently
require the punctuation class which comes with a lot of other characters.
This change adds an additional "custom" option to the token_chars setting,
which requires an additional custom_token_chars setting to be present and
which will be interpreted as a set of characters to inlcude into a token.

Closes #25894

@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-search (:Search/Analysis)

Currently the `token_chars` setting in both `edgeNGram` and `ngram` tokenizers
only allows for a list of predefined character classes, which might not fit
every use case. For example, including underscore "_" in a token would currently
require the `punctuation` class which comes with a lot of other characters.
This change adds an additional "custom" option to the `token_chars` setting,
which requires an additional `custom_token_chars` setting to be present and
which will be interpreted as a set of characters to inlcude into a token.

Closes elastic#25894
@cbuescher
Copy link
Member Author

@elasticmachine run elasticsearch-ci/packaging-sample-matrix

@cbuescher cbuescher requested a review from romseygeek November 19, 2019 14:06
Copy link
Contributor

@romseygeek romseygeek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, one nit about the exception message but no need for another go-round.

@@ -76,7 +79,22 @@ static CharMatcher parseTokenChars(List<String> characterClasses) {
characterClass = characterClass.toLowerCase(Locale.ROOT).trim();
CharMatcher matcher = MATCHERS.get(characterClass);
if (matcher == null) {
throw new IllegalArgumentException("Unknown token type: '" + characterClass + "', must be one of " + MATCHERS.keySet());
if (characterClass.equals("custom") == false) {
throw new IllegalArgumentException("Unknown token type: '" + characterClass + "', must be one of " + MATCHERS.keySet());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need to include custom in the list here as well?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, although the message here is a bit unwieldy as it is already. It seems to include a ton of unicode categories from the the java.lang.Character class via the MATCHERS, e.g. currently:

Unknown token type: 'letters', must be one of [symbol, private_use, paragraph_separator, start_punctuation, unassigned, enclosing_mark, connector_punctuation, letter_number, other_number, math_symbol, lowercase_letter, space_separator, surrogate, initial_quote_punctuation, decimal_digit_number, digit, other_punctuation, dash_punctuation, currency_symbol, non_spacing_mark, format, modifier_letter, control, uppercase_letter, other_symbol, end_punctuation, modifier_symbol, other_letter, line_separator, titlecase_letter, letter, punctuation, combining_spacing_mark, final_quote_punctuation, whitespace]

So the "custom" will be burried somewhere in there. If you have any ideas to improve this let me know, otherwise I'll merge this after I pushed the changes and tests passed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks

@cbuescher cbuescher merged commit ed86750 into elastic:master Nov 20, 2019
cbuescher pushed a commit that referenced this pull request Nov 20, 2019
Currently the `token_chars` setting in both `edgeNGram` and `ngram` tokenizers
only allows for a list of predefined character classes, which might not fit
every use case. For example, including underscore "_" in a token would currently
require the `punctuation` class which comes with a lot of other characters.
This change adds an additional "custom" option to the `token_chars` setting,
which requires an additional `custom_token_chars` setting to be present and
which will be interpreted as a set of characters to inlcude into a token.

Closes #25894
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Allow specific characters in token_chars of edge ngram tokenizer in addition to classes
4 participants