
Make it easier to optimize search with better analysis #27049

Closed
jpountz opened this issue Oct 19, 2017 · 10 comments
Labels
>feature :Search/Search Search-related issues that do not fall into other categories

Comments

@jpountz
Contributor

jpountz commented Oct 19, 2017

We have a number of filters that can help make search faster:

  • shingles for faster phrases
  • ngrams for infix search
  • edge n-grams for prefix/suffix search

Yet leveraging them to improve search speed typically makes Elasticsearch much harder to use since query parsers are not aware of whether these filters are in use.
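To make this concrete, here is a rough plain-Python sketch (illustrative only, not Elasticsearch code) of the tokens each of these filters produces:

```python
def shingles(tokens, size=2):
    """Emit word shingles: runs of `size` adjacent tokens joined by a space."""
    return [" ".join(tokens[i:i + size]) for i in range(len(tokens) - size + 1)]

def ngrams(term, n=3):
    """Emit all character n-grams of length `n` (enables infix search)."""
    return [term[i:i + n] for i in range(len(term) - n + 1)]

def edge_ngrams(term, min_len=1, max_len=4):
    """Emit n-grams anchored at the start of the term (enables prefix search)."""
    return [term[:i] for i in range(min_len, min(max_len, len(term)) + 1)]

print(shingles(["quick", "brown", "fox"]))  # ['quick brown', 'brown fox']
print(ngrams("foobar"))                     # ['foo', 'oob', 'oba', 'bar']
print(edge_ngrams("foobar"))                # ['f', 'fo', 'foo', 'foob']
```

Each of these trades a larger index for a cheaper query: a phrase, infix, or prefix lookup becomes a plain term lookup against the expanded terms.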

To give the example of prefix search, I'm wondering whether we should add a MappedFieldType.prefixQuery factory method that would be called by query parsers. Regular text fields would still create a PrefixQuery, but we could have a new field type optimized for prefix search that would automatically add a filter to the analysis chain at index time. It would be like the edge n-gram filter except that it would add a marker to differentiate prefixes from actual terms. For instance, if we want to optimize prefix queries for prefixes of up to 4 chars, we could analyze foobar as [foobar, \0f, \0fo, \0foo, \0foob]. I'm using \0 here but anything that can help differentiate prefixes from the original term while preventing collisions would work.

Then at search time, MappedFieldType.prefixQuery would look at the length of the term and prepend a \0 and run a term query if there are 4 chars or less, and run a regular PrefixQuery otherwise.
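Putting the two halves of the idea together, a minimal plain-Python sketch (the \0 marker and the 4-char cutoff come from the example above; the function names are mine, not proposed API):

```python
MARKER = "\0"    # distinguishes indexed prefix terms from real terms
MAX_PREFIX = 4   # longest prefix generated at index time

def index_time_terms(term):
    """Analyze one term into itself plus marked prefixes up to MAX_PREFIX chars."""
    prefixes = [MARKER + term[:i] for i in range(1, min(MAX_PREFIX, len(term)) + 1)]
    return [term] + prefixes

def prefix_query(prefix):
    """Query-side dispatch: a short prefix becomes a single-term lookup on the
    marked terms; a longer one falls back to a regular PrefixQuery."""
    if len(prefix) <= MAX_PREFIX:
        return ("term", MARKER + prefix)    # fast exact term lookup
    return ("prefix", prefix)               # regular multi-term PrefixQuery

print(index_time_terms("foobar"))  # ['foobar', '\x00f', '\x00fo', '\x00foo', '\x00foob']
print(prefix_query("foo"))         # ('term', '\x00foo')
print(prefix_query("fooba"))       # ('prefix', 'fooba')
```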

We could do the same for infix search or phrase queries using similar ideas.

@jpountz jpountz added :Search/Search Search-related issues that do not fall into other categories discuss >feature labels Oct 19, 2017
@nik9000
Member

nik9000 commented Oct 19, 2017

I'd played with something like this a few years ago. It looked like it worked then but I never got around to finishing it. I do really like the idea of automatically optimizing these when the user asks for extra data to be indexed.

I'm curious what the advantage of using a prefix is over using a separate Lucene field under the covers. Lower overhead, certainly, but you have to make sure that things like wildcards don't accidentally match the prefix.

I wonder if it'd be possible to make this a set of flags on text and keyword fields. You could flip them on to automatically optimize prefix search and the like.
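The wildcard concern can be illustrated with a toy sketch (plain Python, with fnmatch standing in for Lucene's automaton matching): a naive leading-wildcard match run over a single-field terms dictionary also hits the synthetic marked prefix terms.

```python
import fnmatch

MARKER = "\0"
# A single-field terms dictionary holding both real terms and marked prefixes:
terms_dict = ["foobar", MARKER + "f", MARKER + "fo", MARKER + "foo", MARKER + "foob"]

# A leading-wildcard query evaluated naively over the whole dictionary...
matches = [t for t in terms_dict if fnmatch.fnmatchcase(t, "*foo*")]
# ...matches the synthetic prefix terms too, which would inflate results:
print(matches)  # ['foobar', '\x00foo', '\x00foob']
```

A two-field design avoids this entirely: wildcard queries only ever see the field that contains real terms.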

@jpountz
Contributor Author

jpountz commented Oct 19, 2017

separate Lucene field under the covers

This is something we need to think about indeed. I initially thought about a single field so that you would not pay twice for analysis, but maintaining multiple fields would avoid the potential issues with wildcard/fuzzy queries, and would probably make things simpler too, especially if you want to optimize for both phrase and prefix queries on the same field.

@jpountz
Contributor Author

jpountz commented Oct 25, 2017

Thinking more about it, the trade-off is hard. Here are the pros/cons I can think of:

| Feature | 1 field | 2 fields |
| --- | --- | --- |
| Fuzzy/wildcard/regex queries | only works well if the generated automaton does not start with a wildcard, otherwise prefix terms might match | works as usual |
| Indexing speed | analysis performed once | analysis performed twice |
| Query speed | a bit slower due to larger terms dict | as usual |
| Span queries | works | requires field-masking (which is problematic) |
| Match phrase prefix query | can be optimized | can't be optimized: phrases require a single field |
| Highlighting | works | prefix queries can't be highlighted |

@jpountz
Contributor Author

jpountz commented Oct 31, 2017

I only realized now that this could help make the match_phrase_prefix query efficient (and more accurate). I like it because it means that every text field optimized for prefix queries could be reasonably good at returning suggestions.

@jimczi
Contributor

jimczi commented Nov 3, 2017

The standard codec in Lucene had an option to handle auto-prefixes in the terms dictionary. It was a nice feature that could build prefixes containing more than N documents. The selection was done efficiently at flush time and merge time, but the feature was removed because it was hard to maintain internally. I think it could be the same with the one-field solution: if we start adding special prefixes to the terms dictionary that should not match certain queries but be used for others, it might become impossible to maintain at some point. I like the two-field solution because it can be contained to specific queries more easily.
For instance we could have three specialized extra fields: prefix, phrase and phrase_prefix.
For these three types we can do all the optimizations we want, but we would use them only for the prefix, phrase or phrase_prefix query. For the phrase_prefix we can build the extra field specifically to handle this type of query, so there might be no need to use the original field at all. It's also more costly because we might have duplicate terms across the two fields.
The major problem IMO is highlighting, which would break if we rewrite some queries internally to use another field, though I tend to see this as a push to prioritize #5172. That's a very good use case for this feature, especially if we ensure that positions and offsets are the same when we index the document.
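A toy sketch of the routing described above (plain Python; the parent field name and the `route` helper are hypothetical, the three subfield types are from the proposal):

```python
# Query types that would be answered by a specialized hidden subfield,
# per the prefix / phrase / phrase_prefix proposal above.
SPECIALIZED = ("prefix", "phrase", "phrase_prefix")

def route(query_type, field="title"):
    """Pick the Lucene field a query of `query_type` should run against.
    Everything else (fuzzy, wildcard, match, ...) stays on the original field,
    which is what keeps the optimization contained to specific queries."""
    if query_type in SPECIALIZED:
        return field + "." + query_type
    return field

print(route("phrase"))  # 'title.phrase'
print(route("fuzzy"))   # 'title'
```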

@jpountz
Contributor Author

jpountz commented Dec 12, 2017

If we start to add special prefix in the terms dictionary that should not match certain query but be used for others it might become impossible to maintain at some point.

I would agree with that statement. I wanted to avoid having to resort to the field-masking span query, but I agree it is not worth making it a maintenance nightmare.

@romseygeek
Contributor

I've been playing around with this for prefix terms. The main issue I'm running into is that it isn't as simple as just adding an n-gram token filter to the end of the analysis chain, because stemming messes things up.

  1. Indexing into a separate field: We'd want to generate a token stream using a normalizer, and wrap it with an EdgeNGramTokenFilter. This means generating a normalizer for all analyzers when the IndexAnalyzers object is built, which is not the case at the moment. This would also help improve standard prefix queries, which I think either don't analyze their input terms at all, or pass them through the full analysis chain.

  2. Term stacking in a single field: We'd want to check whether there are non-multi-term-aware components in the analysis chain, and if so inject the edge n-gram filter before them, possibly extending things so that the n-grams are marked as keywords so they don't get stemmed.

In both cases we're modifying analyzers after they've been built, which isn't trivial because IndexAnalyzers doesn't hold onto that information.
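A toy model of option 2, the keyword-marking idea (plain Python mimicking Lucene's KeywordAttribute, which stemmers honour by leaving flagged tokens untouched; the filter and stemmer here are simplified stand-ins):

```python
# Toy token: (text, is_keyword). Mimics Lucene's KeywordAttribute.
def edge_ngram_filter(tokens, max_len=4):
    out = []
    for text, kw in tokens:
        out.append((text, kw))
        # Emit edge n-grams flagged as keywords so later stages skip them.
        out.extend((text[:i], True) for i in range(1, min(max_len, len(text)) + 1))
    return out

def toy_stemmer(tokens):
    """Stand-in stemmer: strips trailing 's', but honours the keyword flag."""
    return [(t if kw else t.rstrip("s"), kw) for t, kw in tokens]

# Injecting the n-gram filter *before* the stemmer keeps the prefixes intact
# while the original term still gets stemmed:
tokens = toy_stemmer(edge_ngram_filter([("runs", False)]))
print(tokens)  # [('run', False), ('r', True), ('ru', True), ('run', True), ('runs', True)]
```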

@jpountz
Contributor Author

jpountz commented Jan 15, 2018

@romseygeek and I just discussed this and stemming is probably not a concern: results won't be different from a regular wildcard query on a field that has a stemmer in its analysis chain. Furthermore having the same result set regardless of the optimization is probably a good feature.

romseygeek added a commit that referenced this issue Jan 30, 2018
This adds the ability to index term prefixes into a hidden subfield, enabling prefix queries to be run without multitermquery rewrites. The subfield reuses the analysis chain of its parent text field, appending an EdgeNGramTokenFilter. It can be configured with minimum and maximum ngram lengths. Query terms with lengths outside this min-max range fall back to using prefix queries against the parent text field.

The mapping looks like this:

```json
"my_text_field" : {
  "type" : "text",
  "analyzer" : "english",
  "index_prefix" : { "min_chars" : 1, "max_chars" : 10 }
}
```

Relates to #27049
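A sketch of the fallback behaviour this commit describes (plain Python; the `._index_prefix` subfield name and `build_prefix_query` helper are assumptions for illustration, since the commit only says "hidden subfield"):

```python
MIN_CHARS, MAX_CHARS = 1, 10   # from the index_prefix mapping above

def build_prefix_query(term, field="my_text_field"):
    """Route a prefix query: in-range lengths become a plain term query on the
    hidden prefix subfield (no multi-term rewrite needed); out-of-range lengths
    fall back to a regular prefix query on the parent text field."""
    if MIN_CHARS <= len(term) <= MAX_CHARS:
        return ("term", field + "._index_prefix", term)
    return ("prefix", field, term)

print(build_prefix_query("foo"))      # ('term', 'my_text_field._index_prefix', 'foo')
print(build_prefix_query("a" * 12))   # ('prefix', 'my_text_field', 'aaaaaaaaaaaa')
```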
romseygeek added a commit that referenced this issue Jan 30, 2018
@cbuescher
Member

cc @elastic/es-search-aggs

romseygeek added a commit that referenced this issue Jun 4, 2018
Specifying `index_phrases: true` on a text field mapping will add a subsidiary
[field]._index_phrase field, indexing two-term shingles from the parent field.
The parent analysis chain is re-used, wrapped with a FixedShingleFilter.

At query time, if a phrase match query is executed, the mapping will redirect it
to run against the subsidiary field.

This should trade faster phrase querying for a larger index and longer indexing
times.

Relates to #27049
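A toy sketch of this index_phrases behaviour (plain Python; the two-word-only redirect is a simplification of the real query rewriting, and the `._index_phrase` name follows the commit message):

```python
def two_term_shingles(tokens):
    """Rough analog of wrapping the parent chain with a two-term shingle filter."""
    return [tokens[i] + " " + tokens[i + 1] for i in range(len(tokens) - 1)]

def phrase_query(words, field="body"):
    """Redirect a two-word phrase to the shingle subfield as a single term
    query; longer phrases are shown falling back to the parent field here,
    which simplifies the real implementation."""
    if len(words) == 2:
        return ("term", field + "._index_phrase", " ".join(words))
    return ("match_phrase", field, " ".join(words))

print(two_term_shingles(["quick", "brown", "fox"]))  # ['quick brown', 'brown fox']
print(phrase_query(["quick", "brown"]))  # ('term', 'body._index_phrase', 'quick brown')
```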
@jpountz
Contributor Author

jpountz commented Jun 4, 2018

Closed via #28290 and #30450, courtesy of @romseygeek.

@jpountz jpountz closed this as completed Jun 4, 2018
romseygeek added a commit that referenced this issue Jun 4, 2018

7 participants