From 9eccb1d1ff7d6540972fd03592578491fb19ad40 Mon Sep 17 00:00:00 2001
From: Steve Murphy
Date: Fri, 30 Sep 2022 14:19:51 -0600
Subject: [PATCH] Add analyzer endpoints

Signed-off-by: Steve Murphy
---
 _opensearch/rest-api/analyze-apis/index.md    |  13 +
 .../analyze-apis/perform-text-analysis.md     | 661 ++++++++++++++++++
 .../rest-api/analyze-apis/terminology.md      |  37 +
 3 files changed, 711 insertions(+)
 create mode 100644 _opensearch/rest-api/analyze-apis/index.md
 create mode 100644 _opensearch/rest-api/analyze-apis/perform-text-analysis.md
 create mode 100644 _opensearch/rest-api/analyze-apis/terminology.md

diff --git a/_opensearch/rest-api/analyze-apis/index.md b/_opensearch/rest-api/analyze-apis/index.md
new file mode 100644
index 0000000000..e539bb74c5
--- /dev/null
+++ b/_opensearch/rest-api/analyze-apis/index.md
@@ -0,0 +1,13 @@
+---
+layout: default
+title: Analyze APIs
+parent: REST API reference
+has_children: true
+nav_order: 7
+redirect_from:
+  - /opensearch/rest-api/analyze-apis/
+---
+
+# Analyze APIs
+
+The analyze APIs allow you to perform text analysis, which is the process of converting unstructured text into individual tokens (usually words) that are optimized for search.
\ No newline at end of file
diff --git a/_opensearch/rest-api/analyze-apis/perform-text-analysis.md b/_opensearch/rest-api/analyze-apis/perform-text-analysis.md
new file mode 100644
index 0000000000..2a90cd575c
--- /dev/null
+++ b/_opensearch/rest-api/analyze-apis/perform-text-analysis.md
@@ -0,0 +1,661 @@
+---
+layout: default
+title: Perform text analysis
+parent: Analyze APIs
+grand_parent: REST API reference
+nav_order: 2
+---
+
+## Perform text analysis
+
+Analyzes a text string and returns the resulting tokens.
+
+If you use the security plugin, you must have the `manage index` privilege. If you simply want to analyze text, you must have the `manage cluster` privilege.
+{: .note}
+
+OpenSearch provides the following text analysis endpoints:
+
+`GET /_analyze`
+
+`GET /{index}/_analyze`
+
+`POST /_analyze`
+
+`POST /{index}/_analyze`
+
+Although you can issue analyze requests using both `GET` and `POST`, the two methods have important distinctions. A `GET` request causes data to be cached in the index so that the next time the data is requested, it is retrieved faster. A `POST` request sends a string that does not already exist to the analyzer to be compared to data already in the index. `POST` requests are not cached.
+{: .note}
+
+### Path parameters
+
+| Parameter | Data Type | Description |
+:--- | :--- | :---
+| index | String | Index that is used to derive the analyzer. Optional. |
+
+### Request fields
+
+| Field | Data Type | Description |
+:--- | :--- | :---
+| analyzer | String | The name of the analyzer to apply to the `text` field. The analyzer can be built-in or configured in the index.

If `analyzer` is not specified, the analyze API uses the analyzer defined in the mapping of the `field` field.

If the `field` field is not specified, the analyze API uses the default analyzer for the index.

If no index is specified or the index does not have a default analyzer, the analyze API uses the standard analyzer.

Optional. | +| attributes | Array of Strings | Array of token attributes for filtering the output of the `explain` field.

Optional. | +| char_filter | Array of Strings | Array of character filters for preprocessing characters before the `tokenizer` field.

Optional. | +| explain | Boolean | If true, causes the response to include token attributes and additional details. Defaults to `false`.

Optional. | +| field | String | Field for deriving the analyzer.

If you specify `field`, you must also specify the `index` path parameter.

If you specify the `analyzer` field, it overrides the value of `field`.

If you do not specify `field`, the analyze API uses the default analyzer for the index.

If you do not specify the `index` field, or the index does not have a default analyzer, the analyze API uses the standard analyzer.

Optional. | +| filter | Array of Strings | Array of token filters to apply after the `tokenizer` field.

Optional. | +| normalizer | String | Normalizer for converting text into a single token.

Optional. | +| text | String or Array of Strings | Text to analyze. If you provide an array of strings, it is analyzed as a multi-value field.

Required.| +| tokenizer | String | Tokenizer for converting the `text` field into tokens.

Optional. |
+
+#### Sample requests
+
+[Analyze array of text strings](#analyze-array-of-text-strings)
+
+[Apply a built-in analyzer](#apply-a-built-in-analyzer)
+
+[Apply a custom analyzer](#apply-a-custom-analyzer)
+
+[Apply a custom transient analyzer](#apply-a-custom-transient-analyzer)
+
+[Specify an index](#specify-an-index)
+
+[Derive the analyzer from an index field](#derive-the-analyzer-from-an-index-field)
+
+[Specify a normalizer](#specify-a-normalizer)
+
+[Get token details](#get-token-details)
+
+[Set a token limit](#set-a-token-limit)
+
+#### Analyze array of text strings
+
+When you pass an array of strings to the `text` field, it is analyzed as a multi-value field.
+
+````json
+GET /_analyze
+{
+  "analyzer" : "standard",
+  "text" : ["first array element", "second array element"]
+}
+````
+
+The previous request returns the following fields:
+
+````json
+{
+  "tokens" : [
+    {
+      "token" : "first",
+      "start_offset" : 0,
+      "end_offset" : 5,
+      "type" : "<ALPHANUM>",
+      "position" : 0
+    },
+    {
+      "token" : "array",
+      "start_offset" : 6,
+      "end_offset" : 11,
+      "type" : "<ALPHANUM>",
+      "position" : 1
+    },
+    {
+      "token" : "element",
+      "start_offset" : 12,
+      "end_offset" : 19,
+      "type" : "<ALPHANUM>",
+      "position" : 2
+    },
+    {
+      "token" : "second",
+      "start_offset" : 20,
+      "end_offset" : 26,
+      "type" : "<ALPHANUM>",
+      "position" : 3
+    },
+    {
+      "token" : "array",
+      "start_offset" : 27,
+      "end_offset" : 32,
+      "type" : "<ALPHANUM>",
+      "position" : 4
+    },
+    {
+      "token" : "element",
+      "start_offset" : 33,
+      "end_offset" : 40,
+      "type" : "<ALPHANUM>",
+      "position" : 5
+    }
+  ]
+}
+````
+
+#### Apply a built-in analyzer
+
+If you omit the `index` path parameter, you can apply any of the built-in analyzers to the text string.
+
+The following request analyzes text using the `standard` built-in analyzer:
+
+````json
+GET /_analyze
+{
+  "analyzer" : "standard",
+  "text" : "OpenSearch text analysis"
+}
+````
+
+The previous request returns the following fields:
+
+````json
+{
+  "tokens" : [
+    {
+      "token" : "opensearch",
+      "start_offset" : 0,
+      "end_offset" : 10,
+      "type" : "<ALPHANUM>",
+      "position" : 0
+    },
+    {
+      "token" : "text",
+      "start_offset" : 11,
+      "end_offset" : 15,
+      "type" : "<ALPHANUM>",
+      "position" : 1
+    },
+    {
+      "token" : "analysis",
+      "start_offset" : 16,
+      "end_offset" : 24,
+      "type" : "<ALPHANUM>",
+      "position" : 2
+    }
+  ]
+}
+````
+
+#### Apply a custom analyzer
+
+You can create your own analyzer and specify it in an analyze request.
+
+In this scenario, a custom analyzer named `lowercase_ascii_folding` has already been created and associated with the `books2` index. The analyzer converts text to lowercase and converts non-ASCII characters to their ASCII equivalents.
+
+The following request applies the custom analyzer to the provided text:
+
+````json
+GET /books2/_analyze
+{
+  "analyzer": "lowercase_ascii_folding",
+  "text" : "Le garçon m'a SUIVI."
+}
+````
+
+The previous request returns the following fields:
+
+````json
+{
+  "tokens" : [
+    {
+      "token" : "le",
+      "start_offset" : 0,
+      "end_offset" : 2,
+      "type" : "<ALPHANUM>",
+      "position" : 0
+    },
+    {
+      "token" : "garcon",
+      "start_offset" : 3,
+      "end_offset" : 9,
+      "type" : "<ALPHANUM>",
+      "position" : 1
+    },
+    {
+      "token" : "m'a",
+      "start_offset" : 10,
+      "end_offset" : 13,
+      "type" : "<ALPHANUM>",
+      "position" : 2
+    },
+    {
+      "token" : "suivi",
+      "start_offset" : 14,
+      "end_offset" : 19,
+      "type" : "<ALPHANUM>",
+      "position" : 3
+    }
+  ]
+}
+````
+
+#### Apply a custom transient analyzer
+
+You can build a custom transient analyzer from tokenizers, token filters, and character filters. Use the `filter` parameter to specify token filters. 
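+
+Unlike a transient analyzer, a predefined analyzer such as the `lowercase_ascii_folding` analyzer used in the previous example must exist on the index before you reference it. Its exact definition is not shown here; the following index-creation request is a minimal sketch of how such an analyzer might be configured, assuming it combines the `standard` tokenizer with the built-in `lowercase` and `asciifolding` token filters:
+
+````json
+PUT /books2
+{
+  "settings": {
+    "analysis": {
+      "analyzer": {
+        "lowercase_ascii_folding": {
+          "type": "custom",
+          "tokenizer": "standard",
+          "filter": ["lowercase", "asciifolding"]
+        }
+      }
+    }
+  }
+}
+````
+
+The transient examples that follow, by contrast, define the analysis chain directly in the analyze request.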
+
+The following request uses the `uppercase` token filter to convert the text to uppercase:
+
+````json
+GET /_analyze
+{
+  "tokenizer" : "keyword",
+  "filter" : ["uppercase"],
+  "text" : "OpenSearch filter"
+}
+````
+
+The previous request returns the following fields:
+
+````json
+{
+  "tokens" : [
+    {
+      "token" : "OPENSEARCH FILTER",
+      "start_offset" : 0,
+      "end_offset" : 17,
+      "type" : "word",
+      "position" : 0
+    }
+  ]
+}
+````
+
+
+The following request uses the `html_strip` character filter to remove HTML markup from the text:
+
+````json
+GET /_analyze
+{
+  "tokenizer" : "keyword",
+  "filter" : ["lowercase"],
+  "char_filter" : ["html_strip"],
+  "text" : "<b>Leave</b> right now!"
+}
+````
+
+The previous request returns the following fields:
+
+````json
+{
+  "tokens" : [
+    {
+      "token" : "leave right now!",
+      "start_offset" : 3,
+      "end_offset" : 23,
+      "type" : "word",
+      "position" : 0
+    }
+  ]
+}
+````
+
+
+You can combine filters using an array.
+
+The following request combines a `lowercase` filter with a `stop` filter that removes the words in the `stopwords` array:
+
+````json
+GET /_analyze
+{
+  "tokenizer" : "whitespace",
+  "filter" : ["lowercase", {"type": "stop", "stopwords": [ "to", "in"]}],
+  "text" : "how to train your dog in five steps"
+}
+````
+
+The previous request returns the following fields:
+
+````json
+{
+  "tokens" : [
+    {
+      "token" : "how",
+      "start_offset" : 0,
+      "end_offset" : 3,
+      "type" : "word",
+      "position" : 0
+    },
+    {
+      "token" : "train",
+      "start_offset" : 7,
+      "end_offset" : 12,
+      "type" : "word",
+      "position" : 2
+    },
+    {
+      "token" : "your",
+      "start_offset" : 13,
+      "end_offset" : 17,
+      "type" : "word",
+      "position" : 3
+    },
+    {
+      "token" : "dog",
+      "start_offset" : 18,
+      "end_offset" : 21,
+      "type" : "word",
+      "position" : 4
+    },
+    {
+      "token" : "five",
+      "start_offset" : 25,
+      "end_offset" : 29,
+      "type" : "word",
+      "position" : 6
+    },
+    {
+      "token" : "steps",
+      "start_offset" : 30,
+      "end_offset" : 35,
+      "type" : "word",
+      "position" : 7
+    }
+  ]
+}
+````
+
+#### Specify an index
+
+You can analyze text using an index's default analyzer, or you can specify a different analyzer.
+
+The following request analyzes the provided text using the default analyzer associated with the `books` index:
+
+````json
+GET /books/_analyze
+{
+  "text" : "OpenSearch analyze test"
+}
+````
+
+The previous request returns the following fields:
+
+````json
+{
+  "tokens" : [
+    {
+      "token" : "opensearch",
+      "start_offset" : 0,
+      "end_offset" : 10,
+      "type" : "<ALPHANUM>",
+      "position" : 0
+    },
+    {
+      "token" : "analyze",
+      "start_offset" : 11,
+      "end_offset" : 18,
+      "type" : "<ALPHANUM>",
+      "position" : 1
+    },
+    {
+      "token" : "test",
+      "start_offset" : 19,
+      "end_offset" : 23,
+      "type" : "<ALPHANUM>",
+      "position" : 2
+    }
+  ]
+}
+````
+
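+
+If an index does not declare its own default analyzer, the `standard` analyzer is used. The following index-creation request is a minimal sketch of how a different default analyzer could be assigned to an index; the `whitespace` analyzer shown here is only an illustration, not necessarily how the `books` index above is configured:
+
+````json
+PUT /books
+{
+  "settings": {
+    "analysis": {
+      "analyzer": {
+        "default": {
+          "type": "whitespace"
+        }
+      }
+    }
+  }
+}
+````
+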
+
+The following request analyzes the provided text using the `keyword` analyzer, which returns the entire text value as a single token:
+
+````json
+GET /books/_analyze
+{
+  "analyzer" : "keyword",
+  "text" : "OpenSearch analyze test"
+}
+````
+
+The previous request returns the following fields:
+
+````json
+{
+  "tokens" : [
+    {
+      "token" : "OpenSearch analyze test",
+      "start_offset" : 0,
+      "end_offset" : 23,
+      "type" : "word",
+      "position" : 0
+    }
+  ]
+}
+````
+
+#### Derive the analyzer from an index field
+
+You can pass text and a field in the index. The API looks up the field's analyzer and uses it to analyze the text.
+
+If the mapping does not exist, the API uses the standard analyzer, which converts all text to lowercase and tokenizes based on white space.
+
+The following request causes the analysis to be based on the mapping for `name`:
+
+````json
+GET /books2/_analyze
+{
+  "field" : "name",
+  "text" : "OpenSearch analyze test"
+}
+````
+
+The previous request returns the following fields:
+
+````json
+{
+  "tokens" : [
+    {
+      "token" : "opensearch",
+      "start_offset" : 0,
+      "end_offset" : 10,
+      "type" : "<ALPHANUM>",
+      "position" : 0
+    },
+    {
+      "token" : "analyze",
+      "start_offset" : 11,
+      "end_offset" : 18,
+      "type" : "<ALPHANUM>",
+      "position" : 1
+    },
+    {
+      "token" : "test",
+      "start_offset" : 19,
+      "end_offset" : 23,
+      "type" : "<ALPHANUM>",
+      "position" : 2
+    }
+  ]
+}
+````
+
+#### Specify a normalizer
+
+Instead of using a keyword field, you can use the normalizer associated with the index. A normalizer causes the analysis to produce a single token.
+
+In this example, the `books2` index includes a normalizer called `to_lower_fold_ascii` that converts text to lowercase and translates non-ASCII characters to ASCII.
+
+The following request applies `to_lower_fold_ascii` to the text:
+
+````json
+GET /books2/_analyze
+{
+  "normalizer" : "to_lower_fold_ascii",
+  "text" : "C'est le garçon qui m'a suivi."
+}
+````
+
+The previous request returns the following fields:
+
+````json
+{
+  "tokens" : [
+    {
+      "token" : "c'est le garcon qui m'a suivi.",
+      "start_offset" : 0,
+      "end_offset" : 30,
+      "type" : "word",
+      "position" : 0
+    }
+  ]
+}
+````
+
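+
+The definition of `to_lower_fold_ascii` is not shown above. The following index-creation request is a minimal sketch of how such a normalizer might be defined, assuming it combines the built-in `lowercase` and `asciifolding` filters; your own definition may differ:
+
+````json
+PUT /books2
+{
+  "settings": {
+    "analysis": {
+      "normalizer": {
+        "to_lower_fold_ascii": {
+          "type": "custom",
+          "filter": ["lowercase", "asciifolding"]
+        }
+      }
+    }
+  }
+}
+````
+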
+
+You can create a custom transient normalizer with token and character filters.
+
+The following request uses the `uppercase` token filter to convert the given text to all uppercase:
+
+````json
+GET /_analyze
+{
+  "filter" : ["uppercase"],
+  "text" : "That is the boy who followed me."
+}
+````
+
+The previous request returns the following fields:
+
+````json
+{
+  "tokens" : [
+    {
+      "token" : "THAT IS THE BOY WHO FOLLOWED ME.",
+      "start_offset" : 0,
+      "end_offset" : 32,
+      "type" : "word",
+      "position" : 0
+    }
+  ]
+}
+````
+
+#### Get token details
+
+You can obtain additional details for all tokens by setting the `explain` attribute to `true`.
+
+The following request provides detailed token information for the `reverse` filter used with the `standard` tokenizer:
+
+````json
+GET /_analyze
+{
+  "tokenizer" : "standard",
+  "filter" : ["reverse"],
+  "text" : "OpenSearch analyze test",
+  "explain" : true,
+  "attributes" : ["keyword"]
+}
+````
+
+The previous request returns the following fields:
+
+````json
+{
+  "detail" : {
+    "custom_analyzer" : true,
+    "charfilters" : [ ],
+    "tokenizer" : {
+      "name" : "standard",
+      "tokens" : [
+        {
+          "token" : "OpenSearch",
+          "start_offset" : 0,
+          "end_offset" : 10,
+          "type" : "<ALPHANUM>",
+          "position" : 0
+        },
+        {
+          "token" : "analyze",
+          "start_offset" : 11,
+          "end_offset" : 18,
+          "type" : "<ALPHANUM>",
+          "position" : 1
+        },
+        {
+          "token" : "test",
+          "start_offset" : 19,
+          "end_offset" : 23,
+          "type" : "<ALPHANUM>",
+          "position" : 2
+        }
+      ]
+    },
+    "tokenfilters" : [
+      {
+        "name" : "reverse",
+        "tokens" : [
+          {
+            "token" : "hcraeSnepO",
+            "start_offset" : 0,
+            "end_offset" : 10,
+            "type" : "<ALPHANUM>",
+            "position" : 0
+          },
+          {
+            "token" : "ezylana",
+            "start_offset" : 11,
+            "end_offset" : 18,
+            "type" : "<ALPHANUM>",
+            "position" : 1
+          },
+          {
+            "token" : "tset",
+            "start_offset" : 19,
+            "end_offset" : 23,
+            "type" : "<ALPHANUM>",
+            "position" : 2
+          }
+        ]
+      }
+    ]
+  }
+}
+````
+
+#### Set a token limit
+
+You can set a limit on the number of tokens generated. Setting a lower value reduces a node's memory usage. The default value is 10000.
+
+The following request limits the tokens to four:
+
+````json
+PUT /books2
+{
+  "settings" : {
+    "index.analyze.max_token_count" : 4
+  }
+}
+````
+
+The preceding request is an index API rather than an analyze API. See [Dynamic index settings]({{site.url}}{{site.baseurl}}/opensearch/rest-api/index-apis/create-index/#dynamic-index-settings) for additional details.
+{: .note}
+
+### Response fields
+
+The text analysis endpoints return the following response fields:
+
+| Field | Data Type | Description |
+:--- | :--- | :---
+| tokens | Array | Array of tokens derived from the `text`. See [token object](#token-object). |
+| detail | Object | Details about the analysis and each token. Included only when you request token details. See [detail object](#detail-object). |
+
+#### token object
+
+| Field | Data Type | Description |
+:--- | :--- | :---
+| token | String | The token's text. |
+| start_offset | Integer | Token's starting position within the original text string. Offsets are zero-based. |
+| end_offset | Integer | Token's ending position within the original text string. |
+| type | String | Classification of the token: `<ALPHANUM>`, `<NUM>`, and so on. The tokenizer usually sets the type, but some filters define their own types. For example, the synonym filter defines the `<SYNONYM>` type. |
+| position | Integer | Token's position within the `tokens` array. |
+
+#### detail object
+
+| Field | Data Type | Description |
+:--- | :--- | :---
+| custom_analyzer | Boolean | Whether the analyzer applied to the text is custom or built-in. |
+| charfilters | Array | List of character filters applied to the text. |
+| tokenizer | Object | Name of the tokenizer applied to the text and a list of tokens* with content before the token filters were applied. |
+| tokenfilters | Array | List of token filters applied to the text. Each token filter includes the filter's name and a list of tokens* with content after the filters were applied. Token filters are listed in the order they are specified in the request. |
+
+*See [token object](#token-object) for token field descriptions.
\ No newline at end of file
diff --git a/_opensearch/rest-api/analyze-apis/terminology.md b/_opensearch/rest-api/analyze-apis/terminology.md
new file mode 100644
index 0000000000..7770b24e1e
--- /dev/null
+++ b/_opensearch/rest-api/analyze-apis/terminology.md
@@ -0,0 +1,37 @@
+---
+layout: default
+title: Analysis API Terminology
+parent: Analyze APIs
+grand_parent: REST API reference
+nav_order: 1
+---
+
+## Terminology
+
+The following sections provide descriptions of important text analysis terms.
+
+### Analyzers
+
+Analyzers instruct OpenSearch how to index and search text. An analyzer comprises three types of components: a tokenizer, zero or more token filters, and zero or more character filters.
+
+OpenSearch provides *built-in* analyzers. For example, the `standard` built-in analyzer converts text to lowercase and breaks text into tokens based on word boundaries such as carriage returns and white space. The `standard` analyzer is also called the *default* analyzer and is used when no analyzer is specified in the text analysis request.
+
+If needed, you can combine tokenizers, token filters, and character filters to create a *custom* analyzer.
+
+#### Tokenizers
+
+Tokenizers break unstructured text into tokens and maintain metadata about tokens, such as their starting and ending positions in the text.
+
+#### Character filters
+
+Character filters examine text and perform translations such as changing, removing, and adding characters.
+
+#### Token filters
+
+Token filters modify tokens, performing operations such as converting a token's characters to uppercase and adding or removing tokens.
+
+### Normalizers
+
+Similar to analyzers, normalizers process text, but they return a single token only. Normalizers do not employ tokenizers and make limited use of character and token filters, using only those that operate on one character at a time.
+
+By default, OpenSearch does not apply normalizers. To apply a normalizer, you must add it to your index configuration before indexing your data.
\ No newline at end of file
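+
+The following index-creation request is a minimal sketch of that workflow: it defines a normalizer in the index settings and applies it to a `keyword` field in the mappings. The names `my-index`, `my_normalizer`, and `title`, as well as the `lowercase` and `asciifolding` filters, are illustrative only:
+
+````json
+PUT /my-index
+{
+  "settings": {
+    "analysis": {
+      "normalizer": {
+        "my_normalizer": {
+          "type": "custom",
+          "filter": ["lowercase", "asciifolding"]
+        }
+      }
+    }
+  },
+  "mappings": {
+    "properties": {
+      "title": {
+        "type": "keyword",
+        "normalizer": "my_normalizer"
+      }
+    }
+  }
+}
+````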