
Make it easier to optimize search with better analysis #27049

Closed
jpountz opened this issue Oct 19, 2017 · 10 comments
Labels
>feature :Search/Search Search-related issues that do not fall into other categories

Comments

@jpountz
Contributor

jpountz commented Oct 19, 2017

We have a number of filters that can help make search faster:

  • shingles for faster phrases
  • ngrams for infix search
  • edge n-grams for prefix/suffix search

Yet leveraging them to improve search speed typically makes Elasticsearch much harder to use since query parsers are not aware of whether these filters are in use.
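To make this concrete, here is a rough plain-Python sketch (illustrative only, not Elasticsearch code) of the tokens each of these filters produces:

```python
def shingles(tokens, size=2):
    """Emit word shingles: runs of `size` adjacent tokens joined by a space."""
    return [" ".join(tokens[i:i + size]) for i in range(len(tokens) - size + 1)]

def ngrams(term, n=3):
    """Emit all character n-grams of length `n` (enables infix search)."""
    return [term[i:i + n] for i in range(len(term) - n + 1)]

def edge_ngrams(term, min_len=1, max_len=4):
    """Emit n-grams anchored at the start of the term (enables prefix search)."""
    return [term[:i] for i in range(min_len, min(max_len, len(term)) + 1)]

print(shingles(["quick", "brown", "fox"]))  # ['quick brown', 'brown fox']
print(ngrams("foobar"))                     # ['foo', 'oob', 'oba', 'bar']
print(edge_ngrams("foobar"))                # ['f', 'fo', 'foo', 'foob']
```

Each of these trades a larger index for a cheaper query: a phrase, infix, or prefix lookup becomes a plain term lookup against the expanded terms.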

To give the example of prefix search, I'm wondering whether we should add a MappedFieldType.prefixQuery factory method that would be called by query parsers. Regular text fields would still create a PrefixQuery, but we could have a new field type optimized for prefix search that would automatically add a filter to the analysis chain at index time. It would be like the edge n-gram filter except that it would add a marker to differentiate prefixes from actual terms. For instance, if we want to optimize prefix queries for prefixes of up to 4 chars, we could analyze foobar as [foobar, \0f, \0fo, \0foo, \0foob]. I'm using \0 here but anything that can help differentiate prefixes from the original term while preventing collisions would work.

Then at search time, MappedFieldType.prefixQuery would look at the length of the term and prepend a \0 and run a term query if there are 4 chars or less, and run a regular PrefixQuery otherwise.
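Putting the two halves of the idea together, a minimal plain-Python sketch (the \0 marker and the 4-char cutoff come from the example above; the function names are mine, not proposed API):

```python
MARKER = "\0"    # distinguishes indexed prefix terms from real terms
MAX_PREFIX = 4   # longest prefix generated at index time

def index_time_terms(term):
    """Analyze one term into itself plus marked prefixes up to MAX_PREFIX chars."""
    prefixes = [MARKER + term[:i] for i in range(1, min(MAX_PREFIX, len(term)) + 1)]
    return [term] + prefixes

def prefix_query(prefix):
    """Query-side dispatch: a short prefix becomes a single-term lookup on the
    marked terms; a longer one falls back to a regular PrefixQuery."""
    if len(prefix) <= MAX_PREFIX:
        return ("term", MARKER + prefix)    # fast exact term lookup
    return ("prefix", prefix)               # regular multi-term PrefixQuery

print(index_time_terms("foobar"))  # ['foobar', '\x00f', '\x00fo', '\x00foo', '\x00foob']
print(prefix_query("foo"))         # ('term', '\x00foo')
print(prefix_query("fooba"))       # ('prefix', 'fooba')
```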

We could do the same for infix search or phrase queries using similar ideas.

@jpountz jpountz added :Search/Search Search-related issues that do not fall into other categories discuss >feature labels Oct 19, 2017
@nik9000
Member

nik9000 commented Oct 19, 2017

I'd played with something like this a few years ago. It looked like it worked then but I never got around to finishing it. I do really like the idea of automatically optimizing these when the user asks for extra data to be indexed.

I'm curious what the advantage of using a prefix is over using a separate Lucene field under the covers. Lower overhead, certainly, but you have to make sure that things like wildcards don't accidentally match the prefix.

I wonder if it'd be possible to make this a set of flags on text and keyword fields. You could flip them on to automatically optimize prefix search and the like.
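The wildcard concern can be illustrated with a toy sketch (plain Python, with fnmatch standing in for Lucene's automaton matching): a naive leading-wildcard match run over a single-field terms dictionary also hits the synthetic marked prefix terms.

```python
import fnmatch

MARKER = "\0"
# A single-field terms dictionary holding both real terms and marked prefixes:
terms_dict = ["foobar", MARKER + "f", MARKER + "fo", MARKER + "foo", MARKER + "foob"]

# A leading-wildcard query evaluated naively over the whole dictionary...
matches = [t for t in terms_dict if fnmatch.fnmatchcase(t, "*foo*")]
# ...matches the synthetic prefix terms too, which would inflate results:
print(matches)  # ['foobar', '\x00foo', '\x00foob']
```

A two-field design avoids this entirely: wildcard queries only ever see the field that contains real terms.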

@jpountz
Contributor Author

jpountz commented Oct 19, 2017

separate Lucene field under the covers

This is something we need to think about indeed. I initially thought about a single field so that you would not pay twice for analysis, but maintaining multiple fields would avoid the potential issues with wildcard/fuzzy queries, and would probably make things simpler too, especially if you want to optimize for both phrase and prefix queries on the same field.

@jpountz
Contributor Author

jpountz commented Oct 25, 2017

Thinking more about it, the trade-off is hard. Here are the pros/cons I can think of:

| Feature | 1 field | 2 fields |
| --- | --- | --- |
| Fuzzy/wildcard/regex queries | only works well if the generated automaton does not start with a wildcard, otherwise prefix terms might match | works as usual |
| Indexing speed | analysis performed once | analysis performed twice |
| Query speed | a bit slower due to larger terms dict | as usual |
| Span queries | works | requires field-masking (which is problematic) |
| Match phrase prefix query | can be optimized | can't be optimized: phrases require a single field |
| Highlighting | works | prefix queries can't be highlighted |

@jpountz
Contributor Author

jpountz commented Oct 31, 2017

I only realized now that this could help make the match_phrase_prefix query efficient (and more accurate). I like it because it means that every text field optimized for prefix queries could be reasonably good at returning suggestions.

@jimczi
Contributor

jimczi commented Nov 3, 2017

The standard codec in Lucene had an option to handle auto-prefixes in the terms dictionary. It was a nice feature that could build prefixes containing more than N documents. The selection was done efficiently at flush time and merge time, but the feature was removed because it was hard to maintain internally. I think it could be the same with the one-field solution: if we start adding special prefixes to the terms dictionary that should not match certain queries but be used for others, it might become impossible to maintain at some point. I like the two-field solution because it can be contained to specific queries more easily.
For instance we could have three specialized extra fields: prefix, phrase and phrase_prefix.
For these three types we can do all the optimizations we want, but we would use them only for the prefix, phrase or phrase_prefix query. For the phrase_prefix we can build the extra field specifically to handle this type of query, so there might be no need to use the original field at all. It's also more costly because we might have duplicate terms across the two fields.
The major problem IMO is highlighting, which would break if we rewrite some queries internally to use another field, though I tend to see this as a push to prioritize #5172. That's a very good use case for this feature, especially if we ensure that positions and offsets are the same when we index the document.
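A toy sketch of the routing described above (plain Python; the parent field name and the `route` helper are hypothetical, the three subfield types are from the proposal):

```python
# Query types that would be answered by a specialized hidden subfield,
# per the prefix / phrase / phrase_prefix proposal above.
SPECIALIZED = ("prefix", "phrase", "phrase_prefix")

def route(query_type, field="title"):
    """Pick the Lucene field a query of `query_type` should run against.
    Everything else (fuzzy, wildcard, match, ...) stays on the original field,
    which is what keeps the optimization contained to specific queries."""
    if query_type in SPECIALIZED:
        return field + "." + query_type
    return field

print(route("phrase"))  # 'title.phrase'
print(route("fuzzy"))   # 'title'
```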

@jpountz
Contributor Author

jpountz commented Dec 12, 2017

If we start to add special prefix in the terms dictionary that should not match certain query but be used for others it might become impossible to maintain at some point.

I would agree with that statement. I wanted to avoid having to resort to the field-masking span query, but I agree it is not worth making it a maintenance nightmare.

@romseygeek
Contributor

I've been playing around with this for prefix terms. The main issue I'm running into is that it isn't as simple as just adding an n-gram token filter to the end of the analysis chain, because stemming messes things up.

  1. Indexing into a separate field: We'd want to generate a token stream using a normalizer, and wrap it with an EdgeNGramTokenFilter. This means generating a normalizer for all analyzers when the IndexAnalyzers object is built, which is not the case at the moment. This would also help improve standard prefix queries, which I think either don't analyze their input terms at all, or pass them through the full analysis chain.

  2. Term stacking in a single field: We'd want to check whether there are non-multi-term-aware components in the analysis chain, and if so inject the edge n-gram filter before them, possibly extending things so that the n-grams are marked as keywords so they don't get stemmed.

In both cases we're modifying analyzers after they've been built, which isn't trivial because IndexAnalyzers doesn't hold onto that information.
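A toy model of option 2, the keyword-marking idea (plain Python mimicking Lucene's KeywordAttribute, which stemmers honour by leaving flagged tokens untouched; the filter and stemmer here are simplified stand-ins):

```python
# Toy token: (text, is_keyword). Mimics Lucene's KeywordAttribute.
def edge_ngram_filter(tokens, max_len=4):
    out = []
    for text, kw in tokens:
        out.append((text, kw))
        # Emit edge n-grams flagged as keywords so later stages skip them.
        out.extend((text[:i], True) for i in range(1, min(max_len, len(text)) + 1))
    return out

def toy_stemmer(tokens):
    """Stand-in stemmer: strips trailing 's', but honours the keyword flag."""
    return [(t if kw else t.rstrip("s"), kw) for t, kw in tokens]

# Injecting the n-gram filter *before* the stemmer keeps the prefixes intact
# while the original term still gets stemmed:
tokens = toy_stemmer(edge_ngram_filter([("runs", False)]))
print(tokens)  # [('run', False), ('r', True), ('ru', True), ('run', True), ('runs', True)]
```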

@jpountz
Contributor Author

jpountz commented Jan 15, 2018

@romseygeek and I just discussed this and stemming is probably not a concern: results won't be different from a regular wildcard query on a field that has a stemmer in its analysis chain. Furthermore having the same result set regardless of the optimization is probably a good feature.

romseygeek added a commit that referenced this issue Jan 30, 2018
This adds the ability to index term prefixes into a hidden subfield, enabling prefix queries to be run without multitermquery rewrites. The subfield reuses the analysis chain of its parent text field, appending an EdgeNGramTokenFilter. It can be configured with minimum and maximum ngram lengths. Query terms with lengths outside this min-max range fall back to using prefix queries against the parent text field.

The mapping looks like this:

```json
"my_text_field" : {
  "type" : "text",
  "analyzer" : "english",
  "index_prefix" : { "min_chars" : 1, "max_chars" : 10 }
}
```

Relates to #27049
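A sketch of the fallback behaviour this commit describes (plain Python; the `._index_prefix` subfield name and `build_prefix_query` helper are assumptions for illustration, since the commit only says "hidden subfield"):

```python
MIN_CHARS, MAX_CHARS = 1, 10   # from the index_prefix mapping above

def build_prefix_query(term, field="my_text_field"):
    """Route a prefix query: in-range lengths become a plain term query on the
    hidden prefix subfield (no multi-term rewrite needed); out-of-range lengths
    fall back to a regular prefix query on the parent text field."""
    if MIN_CHARS <= len(term) <= MAX_CHARS:
        return ("term", field + "._index_prefix", term)
    return ("prefix", field, term)

print(build_prefix_query("foo"))      # ('term', 'my_text_field._index_prefix', 'foo')
print(build_prefix_query("a" * 12))   # ('prefix', 'my_text_field', 'aaaaaaaaaaaa')
```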
romseygeek added a commit that referenced this issue Jan 30, 2018
@cbuescher
Member

cc @elastic/es-search-aggs

romseygeek added a commit that referenced this issue Jun 4, 2018
Specifying `index_phrases: true` on a text field mapping will add a subsidiary
[field]._index_phrase field, indexing two-term shingles from the parent field.
The parent analysis chain is re-used, wrapped with a FixedShingleFilter.

At query time, if a phrase match query is executed, the mapping will redirect it
to run against the subsidiary field.

This should trade faster phrase querying for a larger index and longer indexing
times.

Relates to #27049
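A toy sketch of this index_phrases behaviour (plain Python; the two-word-only redirect is a simplification of the real query rewriting, and the `._index_phrase` name follows the commit message):

```python
def two_term_shingles(tokens):
    """Rough analog of wrapping the parent chain with a two-term shingle filter."""
    return [tokens[i] + " " + tokens[i + 1] for i in range(len(tokens) - 1)]

def phrase_query(words, field="body"):
    """Redirect a two-word phrase to the shingle subfield as a single term
    query; longer phrases are shown falling back to the parent field here,
    which simplifies the real implementation."""
    if len(words) == 2:
        return ("term", field + "._index_phrase", " ".join(words))
    return ("match_phrase", field, " ".join(words))

print(two_term_shingles(["quick", "brown", "fox"]))  # ['quick brown', 'brown fox']
print(phrase_query(["quick", "brown"]))  # ('term', 'body._index_phrase', 'quick brown')
```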
@jpountz
Contributor Author

jpountz commented Jun 4, 2018

Closed via #28290 and #30450, courtesy of @romseygeek.

@jpountz jpountz closed this as completed Jun 4, 2018
romseygeek added a commit that referenced this issue Jun 4, 2018

7 participants