
[DOCS] Reformat n-gram token filter docs #49438

Merged (4 commits into elastic:master, Nov 22, 2019)
Conversation

jrodewig
Contributor

Reformats the edge n-gram and n-gram token filter docs as part of #44726. Changes include:

  • Adds title abbreviations
  • Updates the descriptions and adds Lucene links
  • Reformats parameter definitions
  • Adds analyze and custom analyzer snippets
  • Adds notes explaining differences between the edge n-gram and n-gram
    filters

Supporting changes:

  • Switches titles to use "n-gram" throughout
  • Fixes a typo in the edge n-gram tokenizer docs
  • Adds an explicit anchor for the `index.max_ngram_diff` setting

@jrodewig jrodewig added >docs General docs changes :Search Relevance/Analysis How text is split into tokens v8.0.0 v7.6.0 v7.4.3 v7.5.1 labels Nov 21, 2019
@jrodewig jrodewig requested a review from romseygeek November 21, 2019 14:32
@elasticmachine
Collaborator

Pinging @elastic/es-docs (>docs)

@elasticmachine
Collaborator

Pinging @elastic/es-search (:Search/Analysis)

Contributor

@romseygeek romseygeek left a comment


Thanks @jrodewig - I left one question and one comment.

`side`::
(Optional, string)
Deprecated. Indicates whether to truncate tokens from the `front` or `back`.
Defaults to `front`.
Contributor

Maybe add a note here that rather than using side:back, users should add a reverse filter before and after this filter.
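
The suggested note could be illustrated with a custom analyzer along these lines (a sketch only; the index and analyzer names are invented for illustration, and `reverse` and `edge_ngram` refer to the built-in filter names): reversing tokens before and after the edge n-gram filter yields n-grams anchored to the end of each token, which is what `side: back` produced.

```console
PUT reverse_edge_ngram_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "back_edge_ngrams": {
          "tokenizer": "standard",
          "filter": [ "reverse", "edge_ngram", "reverse" ]
        }
      }
    }
  }
}
```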

Contributor Author

@jrodewig jrodewig Nov 22, 2019

Thanks for this suggestion. Added with 111bf9b.

--------------------------------------------------
[ t, q, b, f, j ]
--------------------------------------------------

Contributor

I'm a bit confused here, as the default settings are min_gram of 1 and max_gram of 2, which should surely produce [ t, th, q, qu, b, br, f, fo, j, ju ]?

Contributor Author

@jrodewig jrodewig Nov 22, 2019

I experimented with this a bit more in 8.0 and 7.4.2 and found some odd behavior.

The following _analyze request produces only unigrams:

GET _analyze
{
  "tokenizer": "standard",
  "filter": [ "edge_ngram" ],
  "text": "the quick brown fox jumps"
}

However, treating edge_ngram as a custom filter with the standard defaults produces both unigrams and bigrams:

GET _analyze
{
  "tokenizer": "standard",
  "filter": [
    { "type": "edge_ngram" }
  ],
  "text": "the quick brown fox jumps"
}

I updated the analyze example to use the custom filter format with aeab02b.

If you can, let me know whether this is a bug, undocumented but expected behavior, or just my misunderstanding of how the _analyze API works. I'm happy to create a bug issue or document this behavior if needed.

Thanks!

Contributor

This looks like a discrepancy between the pre-configured token filter and the default settings for a custom filter; the pre-configured filter uses min & max gram of 1, but the custom defaults are 1 and 2. It's a bit weird, but it's always been done like that, so I guess we just explicitly call it out in the documentation?
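
Concretely, per this explanation the pre-configured `edge_ngram` filter behaves as if both gram sizes were 1, so reproducing the expected [ t, th, q, qu, ... ] output means spelling out the custom defaults explicitly (a sketch, using the same sample text as above; `min_gram` and `max_gram` are the parameters already discussed):

```console
GET _analyze
{
  "tokenizer": "standard",
  "filter": [
    { "type": "edge_ngram", "min_gram": 1, "max_gram": 2 }
  ],
  "text": "the quick brown fox jumps"
}
```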

Contributor Author

Thanks @romseygeek. Added with dfd1e3b.

Contributor

@romseygeek romseygeek left a comment

LGTM!

@jrodewig jrodewig merged commit ddf5c0a into elastic:master Nov 22, 2019
@jrodewig jrodewig deleted the reformat.gram-token-filters branch November 22, 2019 15:38
jrodewig added a commit that referenced this pull request Nov 22, 2019
jrodewig added a commit that referenced this pull request Nov 22, 2019
jrodewig added a commit that referenced this pull request Nov 22, 2019
4 participants