[DOCS] Reformat n-gram token filter docs (#49438)
Reformats the edge n-gram and n-gram token filter docs. Changes include:

* Adds title abbreviations
* Updates the descriptions and adds Lucene links
* Reformats parameter definitions
* Adds analyze and custom analyzer snippets
* Adds notes explaining differences between the edge n-gram and n-gram filters

Additional changes:

* Switches titles to use "n-gram" throughout.
* Fixes a typo in the edge n-gram tokenizer docs
* Adds an explicit anchor for the `index.max_ngram_diff` setting
Showing 5 changed files with 468 additions and 28 deletions.
250 changes: 239 additions & 11 deletions
docs/reference/analysis/tokenfilters/edgengram-tokenfilter.asciidoc
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,16 +1,244 @@ | ||
[[analysis-edgengram-tokenfilter]] | ||
=== Edge NGram Token Filter | ||
=== Edge n-gram token filter | ||
++++ | ||
<titleabbrev>Edge n-gram</titleabbrev> | ||
++++ | ||
|
||
A token filter of type `edge_ngram`. | ||
Forms an https://en.wikipedia.org/wiki/N-gram[n-gram] of a specified length from | ||
the beginning of a token. | ||
|
||
The following are settings that can be set for a `edge_ngram` token | ||
filter type: | ||
For example, you can use the `edge_ngram` token filter to change `quick` to | ||
`qu`. | ||
|
||
[cols="<,<",options="header",] | ||
|====================================================== | ||
|Setting |Description | ||
|`min_gram` |Defaults to `1`. | ||
|`max_gram` |Defaults to `2`. | ||
|`side` |deprecated. Either `front` or `back`. Defaults to `front`. | ||
|====================================================== | ||
When not customized, the filter creates 1-character edge n-grams. | ||
|
||
This filter uses Lucene's | ||
https://lucene.apache.org/core/{lucene_version_path}/analyzers-common/org/apache/lucene/analysis/ngram/EdgeNGramTokenFilter.html[EdgeNGramTokenFilter]. | ||
|
||
[NOTE] | ||
==== | ||
The `edge_ngram` filter is similar to the <<analysis-ngram-tokenfilter,`ngram` | ||
token filter>>. However, the `edge_ngram` filter only outputs n-grams that start | ||
at the beginning of a token. These edge n-grams are useful for | ||
<<search-as-you-type,search-as-you-type>> queries. | ||
==== | ||
|
||
[[analysis-edgengram-tokenfilter-analyze-ex]] | ||
==== Example | ||
|
||
The following <<indices-analyze,analyze API>> request uses the `edge_ngram` | ||
filter to convert `the quick brown fox jumps` to 1-character and 2-character | ||
edge n-grams: | ||
|
||
[source,console] | ||
-------------------------------------------------- | ||
GET _analyze | ||
{ | ||
  "tokenizer": "standard", | ||
  "filter": [ | ||
    { "type": "edge_ngram", | ||
      "min_gram": 1, | ||
      "max_gram": 2 | ||
    } | ||
  ], | ||
  "text": "the quick brown fox jumps" | ||
} | ||
-------------------------------------------------- | ||
|
||
The filter produces the following tokens: | ||
|
||
[source,text] | ||
-------------------------------------------------- | ||
[ t, th, q, qu, b, br, f, fo, j, ju ] | ||
-------------------------------------------------- | ||
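To make the token list above concrete, here is a minimal Python sketch of the edge n-gram idea. It is not Lucene's implementation: the helper name `edge_ngrams` is hypothetical, and a plain whitespace split stands in for the `standard` tokenizer, which is only a safe substitute for simple input like this sentence.

```python
def edge_ngrams(token, min_gram=1, max_gram=2):
    """Return the prefixes of `token` with lengths from min_gram to max_gram.

    Hypothetical helper for illustration only; not the Lucene filter itself.
    """
    longest = min(max_gram, len(token))
    return [token[:n] for n in range(min_gram, longest + 1)]


text = "the quick brown fox jumps"

# str.split() is an assumed stand-in for the standard tokenizer.
tokens = [gram for word in text.split() for gram in edge_ngrams(word)]
print(tokens)
```

Each word contributes only prefixes, which is what distinguishes edge n-grams from ordinary n-grams taken at every offset.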
|
||
///////////////////// | ||
[source,console-result] | ||
-------------------------------------------------- | ||
{ | ||
"tokens" : [ | ||
{ | ||
"token" : "t", | ||
"start_offset" : 0, | ||
"end_offset" : 3, | ||
"type" : "<ALPHANUM>", | ||
"position" : 0 | ||
}, | ||
{ | ||
"token" : "th", | ||
"start_offset" : 0, | ||
"end_offset" : 3, | ||
"type" : "<ALPHANUM>", | ||
"position" : 0 | ||
}, | ||
{ | ||
"token" : "q", | ||
"start_offset" : 4, | ||
"end_offset" : 9, | ||
"type" : "<ALPHANUM>", | ||
"position" : 1 | ||
}, | ||
{ | ||
"token" : "qu", | ||
"start_offset" : 4, | ||
"end_offset" : 9, | ||
"type" : "<ALPHANUM>", | ||
"position" : 1 | ||
}, | ||
{ | ||
"token" : "b", | ||
"start_offset" : 10, | ||
"end_offset" : 15, | ||
"type" : "<ALPHANUM>", | ||
"position" : 2 | ||
}, | ||
{ | ||
"token" : "br", | ||
"start_offset" : 10, | ||
"end_offset" : 15, | ||
"type" : "<ALPHANUM>", | ||
"position" : 2 | ||
}, | ||
{ | ||
"token" : "f", | ||
"start_offset" : 16, | ||
"end_offset" : 19, | ||
"type" : "<ALPHANUM>", | ||
"position" : 3 | ||
}, | ||
{ | ||
"token" : "fo", | ||
"start_offset" : 16, | ||
"end_offset" : 19, | ||
"type" : "<ALPHANUM>", | ||
"position" : 3 | ||
}, | ||
{ | ||
"token" : "j", | ||
"start_offset" : 20, | ||
"end_offset" : 25, | ||
"type" : "<ALPHANUM>", | ||
"position" : 4 | ||
}, | ||
{ | ||
"token" : "ju", | ||
"start_offset" : 20, | ||
"end_offset" : 25, | ||
"type" : "<ALPHANUM>", | ||
"position" : 4 | ||
} | ||
] | ||
} | ||
-------------------------------------------------- | ||
///////////////////// | ||
|
||
[[analysis-edgengram-tokenfilter-analyzer-ex]] | ||
==== Add to an analyzer | ||
|
||
The following <<indices-create-index,create index API>> request uses the | ||
`edge_ngram` filter to configure a new | ||
<<analysis-custom-analyzer,custom analyzer>>. | ||
|
||
[source,console] | ||
-------------------------------------------------- | ||
PUT edge_ngram_example | ||
{ | ||
  "settings": { | ||
    "analysis": { | ||
      "analyzer": { | ||
        "standard_edge_ngram": { | ||
          "tokenizer": "standard", | ||
          "filter": [ "edge_ngram" ] | ||
        } | ||
      } | ||
    } | ||
  } | ||
} | ||
-------------------------------------------------- | ||
|
||
[[analysis-edgengram-tokenfilter-configure-parms]] | ||
==== Configurable parameters | ||
|
||
`max_gram`:: | ||
+ | ||
-- | ||
(Optional, integer) | ||
Maximum character length of a gram. For custom token filters, defaults to `2`. | ||
For the built-in `edge_ngram` filter, defaults to `1`. | ||
|
||
See <<analysis-edgengram-tokenfilter-max-gram-limits>>. | ||
-- | ||
|
||
`min_gram`:: | ||
(Optional, integer) | ||
Minimum character length of a gram. Defaults to `1`. | ||
|
||
`side`:: | ||
+ | ||
-- | ||
(Optional, string) | ||
Deprecated. Indicates whether to truncate tokens from the `front` or `back`. | ||
Defaults to `front`. | ||
|
||
Instead of using the `back` value, you can use the | ||
<<analysis-reverse-tokenfilter,`reverse`>> token filter before and after the | ||
`edge_ngram` filter to achieve the same results. | ||
-- | ||
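The reverse, edge n-gram, reverse chain described for the `side` parameter can be sketched in plain Python. This is an illustration under assumptions, not the filter's code: `edge_ngrams` and `back_edge_ngrams` are hypothetical helper names, and string slicing stands in for the `reverse` token filter.

```python
def edge_ngrams(token, min_gram=1, max_gram=2):
    """Prefixes of `token` with lengths from min_gram to max_gram (illustrative)."""
    longest = min(max_gram, len(token))
    return [token[:n] for n in range(min_gram, longest + 1)]


def back_edge_ngrams(token, min_gram=1, max_gram=2):
    """Emulate the deprecated `back` side: reverse, take prefixes, reverse again."""
    reversed_token = token[::-1]                # first `reverse` step
    grams = edge_ngrams(reversed_token, min_gram, max_gram)
    return [gram[::-1] for gram in grams]       # second `reverse` step


print(back_edge_ngrams("quick"))
```

The second reversal matters: without it the grams would be suffixes spelled backwards rather than the suffixes themselves.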
|
||
[[analysis-edgengram-tokenfilter-customize]] | ||
==== Customize | ||
|
||
To customize the `edge_ngram` filter, duplicate it to create the basis | ||
for a new custom token filter. You can modify the filter using its configurable | ||
parameters. | ||
|
||
For example, the following request creates a custom `edge_ngram` | ||
filter that forms n-grams of 3 to 5 characters. | ||
|
||
[source,console] | ||
-------------------------------------------------- | ||
PUT edge_ngram_custom_example | ||
{ | ||
  "settings": { | ||
    "analysis": { | ||
      "analyzer": { | ||
        "default": { | ||
          "tokenizer": "whitespace", | ||
          "filter": [ "3_5_edgegrams" ] | ||
        } | ||
      }, | ||
      "filter": { | ||
        "3_5_edgegrams": { | ||
          "type": "edge_ngram", | ||
          "min_gram": 3, | ||
          "max_gram": 5 | ||
        } | ||
      } | ||
    } | ||
  } | ||
} | ||
-------------------------------------------------- | ||
|
||
[[analysis-edgengram-tokenfilter-max-gram-limits]] | ||
==== Limitations of the `max_gram` parameter | ||
|
||
The `edge_ngram` filter's `max_gram` value limits the character length of | ||
tokens. When the `edge_ngram` filter is used with an index analyzer, this | ||
means search terms longer than the `max_gram` length may not match any indexed | ||
terms. | ||
|
||
For example, if the `max_gram` is `3`, searches for `apple` won't match the | ||
indexed term `app`. | ||
|
||
To account for this, you can use the | ||
<<analysis-truncate-tokenfilter,`truncate`>> filter with a search analyzer | ||
to shorten search terms to the `max_gram` character length. However, this could | ||
return irrelevant results. | ||
|
||
For example, if the `max_gram` is `3` and search terms are truncated to three | ||
characters, the search term `apple` is shortened to `app`. This means searches | ||
for `apple` return any indexed terms matching `app`, such as `apply`, `approximate`, | ||
and `apple`. | ||
|
||
We recommend testing both approaches to see which best fits your | ||
use case and desired search experience. |
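The trade-off can be simulated without a cluster. The following Python sketch assumes a `max_gram` of `3`, models the index as a mapping from each word to its set of edge n-grams, and uses string slicing in place of the `truncate` filter. The names `edge_ngrams`, `index_terms`, and `matches` are hypothetical, and the sample words are made up for illustration.

```python
MAX_GRAM = 3  # assumed value, matching the example in the text


def edge_ngrams(token, min_gram=1, max_gram=MAX_GRAM):
    """Prefixes of `token` with lengths from min_gram to max_gram (illustrative)."""
    longest = min(max_gram, len(token))
    return [token[:n] for n in range(min_gram, longest + 1)]


# Simulated index analysis: every word is stored as its set of edge n-grams.
indexed_words = ["apple", "apply", "approximate", "banana"]
index_terms = {word: set(edge_ngrams(word)) for word in indexed_words}


def matches(search_term):
    """Return the indexed words that have `search_term` among their stored grams."""
    return [word for word, grams in index_terms.items() if search_term in grams]


untruncated = matches("apple")            # search term is longer than max_gram
truncated = matches("apple"[:MAX_GRAM])   # slicing stands in for `truncate`

print(untruncated)
print(truncated)
```

Without truncation the five-character search term finds nothing, because no stored gram is longer than three characters. With truncation it finds `apple`, but also every other word sharing the `app` prefix.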