-
Notifications
You must be signed in to change notification settings - Fork 25k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Make it easier to optimize search with better analysis #27049
Comments
I'd played with something like this a few years ago. It looked like it worked then but I never got around to finishing it. I do really like the idea of automatically optimizing these when the user asks for extra data to be indexed. I'm curious what the advantage of using a prefix is to using a separate Lucene field under the covers. Lower overhead, certainly, but you have to make sure that things like wildcards don't accidentally match the prefix. I wonder if it'd be possible to make this a set of flags on text and keyword fields. You could flip them on to automatically optimize prefix search and the like. |
This is something we need to think about indeed. I initially thought about a single field so that you would not pay 2x the price for analysis, but maintaining multiple fields would avoid potential issues with wildcard/fuzzy indeed, as well as probably making things simpler, especially if you want to optimize for both phrase and prefix queries on the same field. |
Thinking more about it, the trade-off is hard. Here are the pros/cons I can think of:
|
I only realized now that this could help make the |
The standard codec in Lucene had an option to handle auto-prefix in the dictionary. It was a nice feature that was able to build prefixes that contain more than N documents. The selection was done efficiently at flush time and merge time but it was removed because it was hard to maintain internally. I think it could be the same thing with the one field solution. If we start to add special prefix in the terms dictionary that should not match certain query but be used for others it might become impossible to maintain at some point. I like the solution with two fields because it can be contained to some queries more easily. |
I would agree with that statement. I wanted to avoid having to resort to the field-masking span query, but I agree it is not worth making it a maintenance nightmare. |
I've been playing around with this for prefix terms. The main issue I'm running into is that it isn't as simple as just adding an ngram tokenfilter to the end of the analysis chain, because stemming messes things up.
In both cases we're modifying analyzers after they've been built, which isn't trivial because IndexAnalyzers doesn't hold onto that information |
@romseygeek and I just discussed this and stemming is probably not a concern: results won't be different from a regular wildcard query on a field that has a stemmer in its analysis chain. Furthermore having the same result set regardless of the optimization is probably a good feature. |
This adds the ability to index term prefixes into a hidden subfield, enabling prefix queries to be run without multitermquery rewrites. The subfield reuses the analysis chain of its parent text field, appending an EdgeNGramTokenFilter. It can be configured with minimum and maximum ngram lengths. Query terms with lengths outside this min-max range fall back to using prefix queries against the parent text field. The mapping looks like this: "my_text_field" : { "type" : "text", "analyzer" : "english", "index_prefix" : { "min_chars" : 1, "max_chars" : 10 } } Relates to #27049
This adds the ability to index term prefixes into a hidden subfield, enabling prefix queries to be run without multitermquery rewrites. The subfield reuses the analysis chain of its parent text field, appending an EdgeNGramTokenFilter. It can be configured with minimum and maximum ngram lengths. Query terms with lengths outside this min-max range fall back to using prefix queries against the parent text field. The mapping looks like this: "my_text_field" : { "type" : "text", "analyzer" : "english", "index_prefix" : { "min_chars" : 1, "max_chars" : 10 } } Relates to #27049
cc @elastic/es-search-aggs |
Specifying `index_phrases: true` on a text field mapping will add a subsidiary [field]._index_phrase field, indexing two-term shingles from the parent field. The parent analysis chain is re-used, wrapped with a FixedShingleFilter. At query time, if a phrase match query is executed, the mapping will redirect it to run against the subsidiary field. This should trade faster phrase querying for a larger index and longer indexing times. Relates to #27049
Closed via #28290 and #30450, courtesy of @romseygeek. |
Specifying `index_phrases: true` on a text field mapping will add a subsidiary [field]._index_phrase field, indexing two-term shingles from the parent field. The parent analysis chain is re-used, wrapped with a FixedShingleFilter. At query time, if a phrase match query is executed, the mapping will redirect it to run against the subsidiary field. This should trade faster phrase querying for a larger index and longer indexing times. Relates to #27049
We have a number of filters that can help make search faster:
Yet leveraging them to improve search speed typically makes Elasticsearch much harder to use since query parsers are not aware of whether these filters are in use.
To give the example of prefix search, I'm wondering whether we should add a
MappedFieldType.prefixQuery
factory method that would be called by query parsers. Regulartext
fields would still create aPrefixQuery
but we could have a new field type that would be optimized for prefix search which would automatically add a filter to the analysis chain at index time. It would be like the edge n-gram filter except that it would add a marker to differenciate prefixes from actual terms. For instance if we want to optimize prefix queries for prefixes that would be up to 4 chars, we could analyzefoobar
as [foobar
,\0f
,\0fo
,\0foo
,\0foob
]. I'm using\0
here but anything that can help differenciate prefixes from the original term while preventing collisions would work.Then at search time,
MappedFieldType.prefixQuery
would look at the length of the term and prepend a\0
and run aterm
query if there are 4 chars or less, and run a regularPrefixQuery
otherwise.We could do the same for infix search or phrase queries using similar ideas.
The text was updated successfully, but these errors were encountered: