New Annotated_text field type #30364
Merged
15 commits
3b27e0b New plugin for annotated_text field type. (markharwood)
f469116 Added analyze api example (markharwood)
9afa0f2 Addressed review comments. Removed dependency on ThreadLocal in parsi… (markharwood)
f9fbf36 Bring in support for MultiPhraseQuery (copied from TextFieldType) (markharwood)
75ced30 Fix for checkstyle violation (markharwood)
75cbcc9 Remove eagerGlobalOrdinals setting, change docs example type “_doc” (markharwood)
12fe50d Addressing review comments - add UncheckedIOException, reject unindex… (markharwood)
e1256da Unused import (markharwood)
a8cb0e9 Added “testNotIndexedField” test and removed irrelevant test. (markharwood)
43720ee Remove support for type=value syntax in annotations. We now throw Ela… (markharwood)
b3b09ef Remove types from test (markharwood)
ce7d879 Remove types from Tests (markharwood)
6dd2085 Fix for changes to TotalHits (markharwood)
c3e8b9f Address @romseygeek review comments. Removed newlines and removed Ann… (markharwood)
0a5422c Moved static getAnalyzer to HighlightUtils (markharwood)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,328 @@ | ||
| [[mapper-annotated-text]] | ||
| === Mapper Annotated Text Plugin | ||
|
|
||
| experimental[] | ||
|
|
||
| The mapper-annotated-text plugin provides the ability to index text that is a | ||
| combination of free-text and special markup that is typically used to identify | ||
| items of interest such as people or organisations (see NER or Named Entity Recognition | ||
| tools). | ||
|
|
||
|
|
||
| The Elasticsearch markup allows one or more additional tokens to be injected, unchanged, into the token | ||
| stream at the same position as the underlying text it annotates. | ||
|
|
||
| :plugin_name: mapper-annotated-text | ||
| include::install_remove.asciidoc[] | ||
|
|
||
| [[mapper-annotated-text-usage]] | ||
| ==== Using the `annotated_text` field | ||
|
|
||
| The `annotated_text` field tokenizes text content in the same way as the more common `text` field (see | ||
| "limitations" below) but also injects any marked-up annotation tokens directly into | ||
| the search index: | ||
|
|
||
| [source,js] | ||
| -------------------------- | ||
| PUT my_index | ||
| { | ||
| "mappings": { | ||
| "_doc": { | ||
| "properties": { | ||
| "my_field": { | ||
| "type": "annotated_text" | ||
| } | ||
| } | ||
| } | ||
| } | ||
| } | ||
| -------------------------- | ||
| // CONSOLE | ||
|
|
||
| Such a mapping would allow marked-up text, e.g. Wikipedia articles, to be indexed as both text | ||
| and structured tokens. The annotations use a markdown-like syntax with URL encoding of | ||
| one or more values separated by the `&` symbol. | ||
|
|
||
|
|
||
| We can use the `_analyze` API to test how an example annotation would be stored as tokens | ||
| in the search index: | ||
|
|
||
|
|
||
| [source,js] | ||
| -------------------------- | ||
| GET my_index/_analyze | ||
| { | ||
| "field": "my_field", | ||
| "text":"Investors in [Apple](Apple+Inc.) rejoiced." | ||
| } | ||
| -------------------------- | ||
| // NOTCONSOLE | ||
|
|
||
| Response: | ||
|
|
||
| [source,js] | ||
| -------------------------------------------------- | ||
| { | ||
| "tokens": [ | ||
| { | ||
| "token": "investors", | ||
| "start_offset": 0, | ||
| "end_offset": 9, | ||
| "type": "<ALPHANUM>", | ||
| "position": 0 | ||
| }, | ||
| { | ||
| "token": "in", | ||
| "start_offset": 10, | ||
| "end_offset": 12, | ||
| "type": "<ALPHANUM>", | ||
| "position": 1 | ||
| }, | ||
| { | ||
| "token": "Apple Inc.", <1> | ||
| "start_offset": 13, | ||
| "end_offset": 18, | ||
| "type": "annotation", | ||
| "position": 2 | ||
| }, | ||
| { | ||
| "token": "apple", | ||
| "start_offset": 13, | ||
| "end_offset": 18, | ||
| "type": "<ALPHANUM>", | ||
| "position": 2 | ||
| }, | ||
| { | ||
| "token": "rejoiced", | ||
| "start_offset": 19, | ||
| "end_offset": 27, | ||
| "type": "<ALPHANUM>", | ||
| "position": 3 | ||
| } | ||
| ] | ||
| } | ||
| -------------------------------------------------- | ||
| // NOTCONSOLE | ||
|
|
||
| <1> Note the whole annotation token `Apple Inc.` is placed, unchanged, as a single token in | ||
| the token stream and at the same position (position 2) as the text token (`apple`) it annotates. | ||
|
|
||
|
|
||
| We can now perform searches for annotations using regular `term` queries that don't tokenize | ||
| the provided search values. Annotations are a more precise way of matching as can be seen | ||
| in this example where a search for `Beck` will not match `Jeff Beck` : | ||
|
|
||
| [source,js] | ||
| -------------------------- | ||
| # Example documents | ||
| PUT my_index/_doc/1 | ||
| { | ||
| "my_field": "[Beck](Beck) announced a new tour"<1> | ||
| } | ||
|
|
||
| PUT my_index/_doc/2 | ||
| { | ||
| "my_field": "[Jeff Beck](Jeff+Beck&Guitarist) plays a strat"<2> | ||
| } | ||
|
|
||
| # Example search | ||
| GET my_index/_search | ||
| { | ||
| "query": { | ||
| "term": { | ||
| "my_field": "Beck" <3> | ||
| } | ||
| } | ||
| } | ||
| -------------------------- | ||
| // CONSOLE | ||
|
|
||
| <1> As well as tokenising the plain text into single words e.g. `beck`, here we | ||
| inject the single token value `Beck` at the same position as `beck` in the token stream. | ||
| <2> Note annotations can inject multiple tokens at the same position - here we inject both | ||
| the very specific value `Jeff Beck` and the broader term `Guitarist`. This enables | ||
| broader positional queries e.g. finding mentions of a `Guitarist` near to `strat`. | ||
| <3> A benefit of searching with these carefully defined annotation tokens is that a query for | ||
| `Beck` will not match document 2, which contains the tokens `jeff`, `beck` and `Jeff Beck`. | ||
|
|
||
| WARNING: Any use of `=` signs in annotation values, e.g. `[Prince](person=Prince)`, will | ||
| cause the document to be rejected with a parse failure. In future we hope to have a use for | ||
| the equals signs, so we will actively reject documents that contain them today. | ||
|
|
||
|
|
||
| [[mapper-annotated-text-tips]] | ||
| ==== Data modelling tips | ||
| ===== Use structured and unstructured fields | ||
|
|
||
| Annotations are normally a way of weaving structured information into unstructured text for | ||
| higher-precision search. | ||
|
|
||
| `Entity resolution` is a form of document enrichment undertaken by specialist software or people | ||
| where references to entities in a document are disambiguated by attaching a canonical ID. | ||
| The ID is used to resolve any number of aliases or distinguish between people with the | ||
| same name. The hyperlinks connecting Wikipedia's articles are a good example of resolved | ||
| entity IDs woven into text. | ||
|
|
||
| These IDs can be embedded as annotations in an annotated_text field but it often makes | ||
| sense to include them in dedicated structured fields to support discovery via aggregations: | ||
|
|
||
| [source,js] | ||
| -------------------------- | ||
| PUT my_index | ||
| { | ||
| "mappings": { | ||
| "_doc": { | ||
| "properties": { | ||
| "my_unstructured_text_field": { | ||
| "type": "annotated_text" | ||
| }, | ||
| "my_twitter_handles": { | ||
| "type": "text", | ||
| "fields": { | ||
| "keyword" :{ | ||
| "type": "keyword" | ||
| } | ||
| } | ||
| } | ||
| } | ||
| } | ||
| } | ||
| } | ||
| -------------------------- | ||
| // CONSOLE | ||
|
|
||
| Applications would then typically provide content and discover it as follows: | ||
|
|
||
| [source,js] | ||
| -------------------------- | ||
| # Example documents | ||
| PUT my_index/_doc/1 | ||
| { | ||
| "my_unstructured_text_field": "[Shay](%40kimchy) created elasticsearch", | ||
| "my_twitter_handles": ["@kimchy"] <1> | ||
| } | ||
|
|
||
| GET my_index/_search | ||
| { | ||
| "query": { | ||
| "query_string": { | ||
| "query": "elasticsearch OR logstash OR kibana",<2> | ||
| "default_field": "my_unstructured_text_field" | ||
| } | ||
| }, | ||
| "aggregations": { | ||
| "top_people" :{ | ||
| "significant_terms" : { <3> | ||
| "field" : "my_twitter_handles.keyword" | ||
| } | ||
| } | ||
| } | ||
| } | ||
| -------------------------- | ||
| // CONSOLE | ||
|
|
||
| <1> Note the `my_twitter_handles` contains a list of the annotation values | ||
| also used in the unstructured text. (Note the annotated_text syntax requires escaping). | ||
| By repeating the annotation values in a structured field this application has ensured that | ||
| the tokens discovered in the structured field can be used for search and highlighting | ||
| in the unstructured field. | ||
| <2> In this example we search for documents that talk about components of the elastic stack | ||
| <3> We use the `my_twitter_handles` field here to discover people who are significantly | ||
| associated with the elastic stack. | ||
|
|
||
| ===== Avoiding over-matching annotations | ||
| By design, the regular text tokens and the annotation tokens co-exist in the same indexed | ||
| field but in rare cases this can lead to some over-matching. | ||
|
|
||
| The value of an annotation often denotes a _named entity_ (a person, place or company). | ||
| The tokens for these named entities are inserted untokenized, and differ from typical text | ||
| tokens because they are normally: | ||
|
|
||
| * Mixed case e.g. `Madonna` | ||
| * Multiple words e.g. `Jeff Beck` | ||
| * Can have punctuation or numbers e.g. `Apple Inc.` or `@kimchy` | ||
|
|
||
| This means, for the most part, a search for a named entity in the annotated text field will | ||
| not have any false positives e.g. when selecting `Apple Inc.` from an aggregation result | ||
| you can drill down to highlight uses in the text without "over matching" on any text tokens | ||
| like the word `apple` in this context: | ||
|
|
||
| the apple was very juicy | ||
|
|
||
| However, a problem arises if your named entity happens to be a single term and lower-case e.g. the | ||
| company `elastic`. In this case, a search on the annotated text field for the token `elastic` | ||
| may match a text document such as this: | ||
|
|
||
| he fired an elastic band | ||
|
|
||
| To avoid such false matches, users should consider prefixing annotation values to ensure | ||
| they don't clash with text tokens e.g. | ||
|
|
||
| [elastic](Company_elastic) released version 7.0 of the elastic stack today | ||
|
|
||
|
|
||
|
|
||
|
|
||
| [[mapper-annotated-text-highlighter]] | ||
| ==== Using the `annotated` highlighter | ||
|
|
||
| The `annotated-text` plugin includes a custom highlighter designed to mark up search hits | ||
| in a way which is respectful of the original markup: | ||
|
|
||
| [source,js] | ||
| -------------------------- | ||
| # Example documents | ||
| PUT my_index/_doc/1 | ||
| { | ||
| "my_field": "The cat sat on the [mat](sku3578)" | ||
| } | ||
|
|
||
| GET my_index/_search | ||
| { | ||
| "query": { | ||
| "query_string": { | ||
| "query": "cat" | ||
| } | ||
| }, | ||
| "highlight": { | ||
| "fields": { | ||
| "my_field": { | ||
| "type": "annotated", <1> | ||
| "require_field_match": false | ||
| } | ||
| } | ||
| } | ||
| } | ||
| -------------------------- | ||
| // CONSOLE | ||
| <1> The `annotated` highlighter type is designed for use with annotated_text fields | ||
|
|
||
| The annotated highlighter is based on the `unified` highlighter and supports the same | ||
| settings but does not use the `pre_tags` or `post_tags` parameters. Rather than using | ||
| html-like markup such as `<em>cat</em>` the annotated highlighter uses the same | ||
| markdown-like syntax used for annotations and injects a key=value annotation where `_hit_term` | ||
| is the key and the matched search term is the value e.g. | ||
|
|
||
| The [cat](_hit_term=cat) sat on the [mat](sku3578) | ||
|
|
||
| The annotated highlighter tries to be respectful of any existing markup in the original | ||
| text: | ||
|
|
||
| * If the search term matches exactly the location of an existing annotation then the | ||
| `_hit_term` key is merged into the url-like syntax used in the `(...)` part of the | ||
| existing annotation. | ||
| * However, if the search term overlaps the span of an existing annotation it would break | ||
| the markup formatting so the original annotation is removed in favour of a new annotation | ||
| with just the search hit information in the results. | ||
| * Any non-overlapping annotations in the original text are preserved in highlighter | ||
| selections | ||
|
|
||
|
|
||
| [[mapper-annotated-text-limitations]] | ||
| ==== Limitations | ||
|
|
||
| The annotated_text field type supports the same mapping settings as the `text` field type | ||
| but with the following exceptions: | ||
|
|
||
| * No support for `fielddata` or `fielddata_frequency_filter` | ||
| * No support for `index_prefixes` or `index_phrases` indexing | ||
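For readers of this diff, the markup described in the docs above can be sketched in a few lines of Python. This is only an illustration of the documented syntax (`[covered text](value1&value2)` with URL-encoded values), not the plugin's Java parser. The helper names `annotate` and `parse` and the regular expression are hypothetical, and rejecting any value containing `=` on the client side is an assumption based on the WARNING in the docs.

```python
import re
from urllib.parse import quote_plus, unquote_plus

# Hypothetical pattern for [covered text](value1&value2).
# A simplification for illustration, not the plugin's actual parser.
ANNOTATION = re.compile(r"\[([^\]\[]*)\]\(([^\)\(]*)\)")


def annotate(text, values):
    """Wrap `text` in annotation markup, URL-encoding each value."""
    for value in values:
        if "=" in value:
            # Assumption: refuse `=` up front, since the docs warn that
            # it causes the document to be rejected with a parse failure.
            raise ValueError("'=' is not allowed in annotation value: %r" % value)
    return "[{}]({})".format(text, "&".join(quote_plus(v) for v in values))


def parse(markup):
    """Return (plain_text, [(start, end, value), ...]) for annotated markup.

    Offsets refer to the plain text with the markup stripped, which is how
    the offsets in the _analyze response above are expressed.
    """
    plain = []
    annotations = []
    length = 0  # length of the plain text emitted so far
    last = 0    # index in `markup` just past the previous annotation
    for match in ANNOTATION.finditer(markup):
        before = markup[last:match.start()]
        plain.append(before)
        length += len(before)
        covered = match.group(1)
        start, end = length, length + len(covered)
        # Several values may share one span, separated by `&`.
        for encoded in match.group(2).split("&"):
            annotations.append((start, end, unquote_plus(encoded)))
        plain.append(covered)
        length = end
        last = match.end()
    plain.append(markup[last:])
    return "".join(plain), annotations


if __name__ == "__main__":
    print(annotate("Jeff Beck", ["Jeff Beck", "Guitarist"]))
    print(parse("Investors in [Apple](Apple+Inc.) rejoiced."))
```

In this sketch, `parse` keeps each value as a raw string, so highlighter output such as `[cat](_hit_term=cat)` comes back as the value `_hit_term=cat`. Splitting that into key and value is left to the caller.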
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,23 @@ | ||
| /* | ||
| * Licensed to Elasticsearch under one or more contributor | ||
| * license agreements. See the NOTICE file distributed with | ||
| * this work for additional information regarding copyright | ||
| * ownership. Elasticsearch licenses this file to you under | ||
| * the Apache License, Version 2.0 (the "License"); you may | ||
| * not use this file except in compliance with the License. | ||
| * You may obtain a copy of the License at | ||
| * | ||
| * http://www.apache.org/licenses/LICENSE-2.0 | ||
| * | ||
| * Unless required by applicable law or agreed to in writing, | ||
| * software distributed under the License is distributed on an | ||
| * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY | ||
| * KIND, either express or implied. See the License for the | ||
| * specific language governing permissions and limitations | ||
| * under the License. | ||
| */ | ||
|
|
||
| esplugin { | ||
| description 'The Mapper Annotated_text plugin adds support for text fields with markup used to inject annotation tokens into the index.' | ||
| classname 'org.elasticsearch.plugin.mapper.AnnotatedTextPlugin' | ||
| } |
Are we planning to mark this 'experimental', to allow for future breaking changes?
Good point!