-
Notifications
You must be signed in to change notification settings - Fork 25k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Added max_expansion param to span_multi #30913
Conversation
Pinging @elastic/es-search-aggs |
@nirmalc Thank you for submitting the PR. It nicely addresses #27432 So this PR, although submitted against master, will be for 6.4 version. So, for 6.x version, we want to do following:
Unchecked items still need to be implemented in your PR |
SpanTermQuery q = new SpanTermQuery(term, states); | ||
topLevel.add(q); | ||
} | ||
}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would not throw an error, I would just ensure that we collect only up to maxExpansions
queries. @jimczi what do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks , I will look at checklist . I thought the plan was to throw exception if term expansion count is above max_expansions ( which is defaulted to 1024 ) . The other pattern can be done already by adding rewrite
param to inner multi query
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's better to throw an exception, at least in master. I don't think we need a max_expansion
option though, we can just honor BooleanQuery#getMaxClauseCount()
and fails the query if there are more terms. The span scoring rewrite does not allow to limit the number of expansions, it is all or nothing and I think it's a good property since it will not confuse the users by returning partial responses on prefix queries that match more than 1024 terms. This is the main reason why I opened the issue in the first place, we should never build a disjunction on more than BooleanQuery.getMaxClauseCount()
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @nirmalc. The change looks good but I wonder if we really need a max_expansion
option or if we should only honor BooleanQuery.getMaxClauseCount()
? MultiTermQuery that uses a TopTermRewrite strategy have the same limit so it would be consistent and wouldn't require the addition of another option that could be confusing since the rewrite method can be overridden in the multi term query directly.
} | ||
|
||
public class TopTermSpanBooleanQueryRewriteWithMaxClause extends SpanMultiTermQueryWrapper.SpanRewriteMethod{ | ||
MultiTermQuery multiTermQuery; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you create a constructor to assign this variable and make it final ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 , will do - I thought it will be nice to have more dynamic way to override instead of boolean.max count ( I believe its a static property and requires yml file change ) ; Dynamically changing is very useful in case of cluster with multiple indices/data shapes. But I do agree that it can be confusing .
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or should we use BooleanQuery.getMaxClauseCount()
as default and provide an override ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Dynamically changing is very useful in case of cluster with multiple indices/data shapes.
Which multi_term
query are you using ? Is it to increase the default value of 1024 ? Any multi term query that don't use top terms rewriting is not a good fit for the span query in general. If you need to perform prefix queries within span queries it is preferable to index the prefix using the edge_ngram
filter. This way you can transform any prefix query into a span term query on a single term.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for feedback @jimczi , Multi queries could be prefix , wildcard or worst case -regex. Im using span queries now for exact use case mentioned in ES documentation of span queries - legal discovery. We considered edge ngrams which takes care of part of that use case. data volume( text) and unpredictable cardinality is another issue ( often few thousands for many multi searches)
My request is to consider it as advanced param , but if elastic thinks it is confusing for users - i can change PR to use boolean query limit. Please let me know ! Thanks again
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My original intent was to add a max_expansion
to force the rewrite of all multi terms to use a top terms rewrite when used inside a span query but in this case the max_expansion
param can be interpreted as the number of top terms to keep during the rewrite. However I prefer your approach which throws an exception on multi_term
queries that don't use top terms rewrite and matches more than 1024 terms. In this case we can fail the query with a nice message explaining that a top term rewrite should be used on the multi_term
query if applicable. max_expansion
is confusing in this scenario and doesn't bring much so I think we should simply honor the BooleanQuery max clause limit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 i will update the PR in a day or two
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks @nirmalc !
return new SpanMultiTermQueryBuilder(subQuery).queryName(queryName).boost(boost).maxExpansions(maxExpansionValue); | ||
} | ||
|
||
public class TopTermSpanBooleanQueryRewriteWithMaxClause extends SpanMultiTermQueryWrapper.SpanRewriteMethod{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why does it need to be public ? I'd prefer a private static class since it's not used outside of this class.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just used in one of unit tests..
@jimczi , please let me know if changes look good , I'll squash and rebase with master |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left more comments but I like it overall.
Can you also merge with latest master, I'll run the tests afterward.
|
||
span multi query will hit too many clauses failure if it exceeds the boolean | ||
query limit (defaults to `1024`) unless inner multi query uses top_terms_N | ||
rewrite method. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe add a link to the documentation: <<query-dsl-multi-term-rewrite,rewrite>>
|
||
public class SpanMultiQueryIT extends ESIntegTestCase { | ||
public void testTermExpansionExceptionOnSpanFailure() throws ExecutionException, InterruptedException { | ||
org.elasticsearch.common.settings.Settings.Builder builder = org.elasticsearch.common.settings.Settings.builder(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can import Settings and remove the package name ?
|
||
createIndex("test", builder.build()); | ||
ArrayList<IndexRequestBuilder> reqs = new ArrayList<>(); | ||
for (int i = 0; i < BooleanQuery.getMaxClauseCount() + 1; i++) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a big test to check for an exception. I'd something like:
final int origValue = BooleanQuery.getMaxClauseCount();
BooleanQuery.setMaxClauseCount(5);
try {
// test goes here
} finally {
BooleanQuery.setMaxClauseCount(origValue);
}
import java.util.ArrayList; | ||
import java.util.concurrent.ExecutionException; | ||
|
||
public class SpanMultiQueryIT extends ESIntegTestCase { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you move these tests to org.elasticsearch.search.query.SearchQueryIT, we don't need another IT test to check this.
protected void checkMaxClauseCount(int count) { | ||
if (count > maxExpansions) { | ||
throw new ElasticsearchException("[" + multiTermQuery.toString() + " ] " + | ||
"exceeds maxExpansions [maxExpansions is set to " + maxExpansions + "]"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe mention boolean query max clause and top terms rewrite ?
33c5e4b
to
de00e85
Compare
@jimczi new commit should address all code review comments . |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks good @nirmalc ! I left some minor comments but I think it's getting close.
@@ -1819,4 +1826,52 @@ public void testRangeQueryRangeFields_24744() throws Exception { | |||
assertHitCount(searchResponse, 1); | |||
} | |||
|
|||
public void testTermExpansionExceptionOnSpanFailure() throws ExecutionException, InterruptedException { | |||
org.elasticsearch.common.settings.Settings.Builder builder = org.elasticsearch.common.settings.Settings.builder(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: add an import for Settings
|
||
expectThrows(ElasticsearchException.class, () -> | ||
client().prepareSearch().setIndices("test").setQuery(queryBuilder).get()); | ||
}finally{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: missing space
|
||
} | ||
|
||
public void testDefaultMaxRewriteWithDefaultValue() throws Exception { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a duplicate of the previous test ?
"}," + | ||
"\"boost\" :1.0" + | ||
"}" + | ||
"}"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The xcontent parsing is tested in the parent class, can you modify this test to use the query builders instead ?
@nirmalc can you merge with master and resolve the conflicts ? I'll trigger the tests when it's done. |
de00e85
to
68e2b13
Compare
Done, sorry - my fork was bit old and i had to get rid of some tests since we decided not to do |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @nirmalc , it's really close however I left some comments that need to be addressed before we run the tests.
query limit (defaults to `1024`) unless inner multi query uses top_terms_N | ||
<<query-dsl-multi-term-rewrite,rewrite>> parameter. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please keep the WARNING here and maybe reword this to:
WARNING: `span_multi` queries will hit too many clauses failure if the number of terms that match the query exceeds the boolean query limit (defaults to 1024). To avoid an unbounded expansion you can set the
<<query-dsl-multi-term-rewrite,rewrite method>> of the multi term query to `top_terms_*`
rewrite. Or, if you use `span_multi` on `prefix` query only, you can
activate the <<index-prefix-config,`index_prefixes`>> field option of the `text` field instead. This will
rewrite any prefix query on the field to a a single term query that matches the indexed prefix.
} | ||
|
||
public void testTermExpansionExceptionOnSpan() throws ExecutionException, InterruptedException { | ||
Settings.Builder builder = Settings.builder(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this test is needed. testSpanMultiTermQuery
does the same thing.
By default span_multi query will limit term expansions = boolean max clause. This will limit high heap usage in case of high cardinality term expansions.This applies only if top_terms_N is not used in inner multi query
68e2b13
to
816e37f
Compare
updated doc + removed test you suggested as already there is one testing that scenario. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM thanks @nirmalc , I'll merge if the tests pass
test this please |
By default span_multi query will limit term expansions = boolean max clause. This will limit high heap usage in case of high cardinality term expansions. This applies only if top_terms_N is not used in inner multi query.
Causes span query to throw exception if term expansions with in
multi_term inner query exceeds max_expansion param
Example query:
{"span_multi" : {"match" : {"prefix" : {"user" : {"value" : "ki","boost" : 1.08}}},"max_expansions" : 22 ,"boost" :1.0}}
I was trying to patch one older version of ES to avoid heap crash, but noticed that its logged as an issue here - #27432