
[DOCS] Reformat n-gram token filter docs #49438

Merged (4 commits into elastic:master, Nov 22, 2019)
Conversation

jrodewig
Contributor

Reformats the edge n-gram and n-gram token filter docs as part of #44726. Changes include:

  • Adds title abbreviations
  • Updates the descriptions and adds Lucene links
  • Reformats parameter definitions
  • Adds analyze and custom analyzer snippets
  • Adds notes explaining differences between the edge n-gram and n-gram
    filters

Supporting changes:

  • Switches titles to use "n-gram" throughout
  • Fixes a typo in the edge n-gram tokenizer docs
  • Adds an explicit anchor for the `index.max_ngram_diff` setting

@jrodewig jrodewig added >docs General docs changes :Search Relevance/Analysis How text is split into tokens v8.0.0 v7.6.0 v7.4.3 v7.5.1 labels Nov 21, 2019
@jrodewig jrodewig requested a review from romseygeek November 21, 2019 14:32
@elasticmachine
Collaborator

Pinging @elastic/es-docs (>docs)

@elasticmachine
Collaborator

Pinging @elastic/es-search (:Search/Analysis)

Contributor

@romseygeek romseygeek left a comment


Thanks @jrodewig - I left one question and one comment.

`side`::
(Optional, string)
Deprecated. Indicates whether to truncate tokens from the `front` or `back`.
Defaults to `front`.
Contributor

Maybe add a note here that rather than using side:back, users should add a reverse filter before and after this filter.
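
The suggested note could be illustrated with a custom analyzer along these lines (a sketch only; the index and analyzer names are invented for illustration, and `reverse` and `edge_ngram` refer to the built-in filter names): reversing tokens before and after the edge n-gram filter yields n-grams anchored to the end of each token, which is what `side: back` produced.

```console
PUT reverse_edge_ngram_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "back_edge_ngrams": {
          "tokenizer": "standard",
          "filter": [ "reverse", "edge_ngram", "reverse" ]
        }
      }
    }
  }
}
```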

Contributor Author

@jrodewig jrodewig Nov 22, 2019

Thanks for this suggestion. Added with 111bf9b.

--------------------------------------------------
[ t, q, b, f, j ]
--------------------------------------------------

Contributor

I'm a bit confused here, as the default settings are min_gram of 1 and max_gram of 2, which should surely produce [ t, th, q, qu, b, br, f, fo, j, ju ]?

Contributor Author

@jrodewig jrodewig Nov 22, 2019

I experimented with this a bit more in 8.0 and 7.4.2 and found some odd behavior.

The following _analyze request produces only unigrams:

GET _analyze
{
  "tokenizer": "standard",
  "filter": [ "edge_ngram" ],
  "text": "the quick brown fox jumps"
}

However, treating edge_ngram as a custom filter with the standard defaults produces both unigrams and bigrams:

GET _analyze
{
  "tokenizer": "standard",
  "filter": [
    { "type": "edge_ngram" }
  ],
  "text": "the quick brown fox jumps"
}

I updated the analyze example to use the custom filter format with aeab02b.

If you can, let me know whether this is a bug, undocumented but expected behavior, or just my misunderstanding of how the _analyze API works. I'm happy to create a bug issue or document this behavior if needed.

Thanks!

Contributor

This looks like a discrepancy between the pre-configured token filter and the default settings for a custom filter; the pre-configured filter uses min & max gram of 1, but the custom defaults are 1 and 2. It's a bit weird, but it's always been done like that, so I guess we just explicitly call it out in the documentation?
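
Concretely, per this explanation the pre-configured `edge_ngram` filter behaves as if both gram sizes were 1, so reproducing the expected [ t, th, q, qu, ... ] output means spelling out the custom defaults explicitly (a sketch, using the same sample text as above; `min_gram` and `max_gram` are the parameters already discussed):

```console
GET _analyze
{
  "tokenizer": "standard",
  "filter": [
    { "type": "edge_ngram", "min_gram": 1, "max_gram": 2 }
  ],
  "text": "the quick brown fox jumps"
}
```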

Contributor Author

Thanks @romseygeek. Added with dfd1e3b.

Contributor

@romseygeek romseygeek left a comment

LGTM!

@jrodewig jrodewig merged commit ddf5c0a into elastic:master Nov 22, 2019
@jrodewig jrodewig deleted the reformat.gram-token-filters branch November 22, 2019 15:38
jrodewig added a commit that referenced this pull request Nov 22, 2019
jrodewig added a commit that referenced this pull request Nov 22, 2019
jrodewig added a commit that referenced this pull request Nov 22, 2019
4 participants