[DOCS] Reformat n-gram token filter docs #49438
Conversation
Reformats the edge n-gram and n-gram token filter docs. Changes include:

* Adds title abbreviations
* Updates the descriptions and adds Lucene links
* Reformats parameter definitions
* Adds analyze and custom analyzer snippets
* Adds notes explaining differences between the edge n-gram and n-gram filters

Additional changes:

* Switches titles to use "n-gram" throughout
* Fixes a typo in the edge n-gram tokenizer docs
* Adds an explicit anchor for the `index.max_ngram_diff` setting
Pinging @elastic/es-docs (>docs)

Pinging @elastic/es-search (:Search/Analysis)
Thanks @jrodewig - I left one question and one comment.
`side`::
(Optional, string)
Deprecated. Indicates whether to truncate tokens from the `front` or `back`.
Defaults to `front`.
Maybe add a note here that rather than using `side: back`, users should add a `reverse` filter before and after this filter.
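To sketch what that wrapping pattern looks like in a custom analyzer (the index and analyzer names here are illustrative, not from this PR):

```
PUT my-index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "back_edge": {
          "tokenizer": "standard",
          "filter": [ "reverse", "edge_ngram", "reverse" ]
        }
      }
    }
  }
}
```

The first `reverse` flips each token so the edge n-grams are taken from what was originally the back of the token; the second `reverse` flips the grams back into reading order.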
Thanks for this suggestion. Added with 111bf9b.
--------------------------------------------------
[ t, q, b, f, j ]
--------------------------------------------------
I'm a bit confused here, as the default settings are `min_gram` of 1 and `max_gram` of 2, which should surely produce `[ t, th, q, qu, b, br, f, fo, j, ju ]`?
I experimented with this a bit more in 8.0 and 7.4.2 and found some odd behavior.
The following `_analyze` request produces only unigrams:
```
GET _analyze
{
  "tokenizer": "standard",
  "filter": [ "edge_ngram" ],
  "text": "the quick brown fox jumps"
}
```
However, treating `edge_ngram` as a custom filter with the standard defaults produces both unigrams and bigrams:
```
GET _analyze
{
  "tokenizer": "standard",
  "filter": [
    { "type": "edge_ngram" }
  ],
  "text": "the quick brown fox jumps"
}
```
I updated the analyze example to use the custom filter format with aeab02b.
If you can, let me know if this is a bug, undocumented but expected behavior, or just my misunderstanding of how the `_analyze` API works. I'm happy to create a bug issue or document this behavior if needed.
Thanks!
This looks like a discrepancy between the pre-configured token filter and the default settings for a custom filter; the pre-configured filter uses min & max gram of 1, but the custom defaults are 1 and 2. It's a bit weird, but it's always been done like that, so I guess we just explicitly call it out in the documentation?
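Given that discrepancy, one way to keep an `_analyze` example unambiguous is to pin the gram sizes explicitly rather than rely on either set of defaults (the sizes below simply mirror the custom-filter defaults discussed above):

```
GET _analyze
{
  "tokenizer": "standard",
  "filter": [
    { "type": "edge_ngram", "min_gram": 1, "max_gram": 2 }
  ],
  "text": "the quick brown fox jumps"
}
```

With the sizes spelled out, the request should produce both unigrams and bigrams regardless of which defaults the pre-configured filter uses.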
Thanks @romseygeek. Added with dfd1e3b.
LGTM!
Reformats the edge n-gram and n-gram token filter docs as part of #44726. Changes include:

* Adds title abbreviations
* Updates the descriptions and adds Lucene links
* Reformats parameter definitions
* Adds analyze and custom analyzer snippets
* Adds notes explaining differences between the edge n-gram and n-gram filters

Supporting changes:

* Switches titles to use "n-gram" throughout
* Fixes a typo in the edge n-gram tokenizer docs
* Adds an explicit anchor for the `index.max_ngram_diff` setting