Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added max_expansion param to span_multi #30913

Merged
merged 1 commit into from
Jun 7, 2018

Conversation

nirmalc
Copy link
Contributor

@nirmalc nirmalc commented May 28, 2018

Causes span query to throw exception if term expansions with in
multi_term inner query exceeds max_expansion param

Example query:

{"span_multi" : {"match" : {"prefix" : {"user" : {"value" : "ki","boost" : 1.08}}},"max_expansions" : 22 ,"boost" :1.0}}

I was trying to patch one older version of ES to avoid heap crash, but noticed that its logged as an issue here - #27432

@mayya-sharipova mayya-sharipova added the :Search/Search Search-related issues that do not fall into other categories label May 28, 2018
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-search-aggs

@mayya-sharipova
Copy link
Contributor

@nirmalc Thank you for submitting the PR. It nicely addresses #27432

So this PR, although submitted against master, will be for 6.4 version.
For 7.x version, we will need another PR, slightly different.

So, for 6.x version, we want to do following:

  • Add max_expansions into span_multi query as an optional parameter that defaults to 1024
  • Issue a deprecation warning if a user submits span_multi query without max_expansions parameter
  • throw an error if max_expansions is greater than 1024
  • add max_expansions as an override of the rewrite in 6.x
  • update documentation for span_multi query

Unchecked items still need to be implemented in your PR

SpanTermQuery q = new SpanTermQuery(term, states);
topLevel.add(q);
}
};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would not throw an error, I would just ensure that we collect only up to maxExpansions queries. @jimczi what do you think?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks , I will look at checklist . I thought the plan was to throw exception if term expansion count is above max_expansions ( which is defaulted to 1024 ) . The other pattern can be done already by adding rewrite param to inner multi query

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's better to throw an exception, at least in master. I don't think we need a max_expansion option though, we can just honor BooleanQuery#getMaxClauseCount() and fails the query if there are more terms. The span scoring rewrite does not allow to limit the number of expansions, it is all or nothing and I think it's a good property since it will not confuse the users by returning partial responses on prefix queries that match more than 1024 terms. This is the main reason why I opened the issue in the first place, we should never build a disjunction on more than BooleanQuery.getMaxClauseCount().

Copy link
Contributor

@jimczi jimczi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @nirmalc. The change looks good but I wonder if we really need a max_expansion option or if we should only honor BooleanQuery.getMaxClauseCount() ? MultiTermQuery that uses a TopTermRewrite strategy have the same limit so it would be consistent and wouldn't require the addition of another option that could be confusing since the rewrite method can be overridden in the multi term query directly.

}

public class TopTermSpanBooleanQueryRewriteWithMaxClause extends SpanMultiTermQueryWrapper.SpanRewriteMethod{
MultiTermQuery multiTermQuery;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you create a constructor to assign this variable and make it final ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 , will do - I thought it will be nice to have more dynamic way to override instead of boolean.max count ( I believe its a static property and requires yml file change ) ; Dynamically changing is very useful in case of cluster with multiple indices/data shapes. But I do agree that it can be confusing .

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or should we use BooleanQuery.getMaxClauseCount() as default and provide an override ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dynamically changing is very useful in case of cluster with multiple indices/data shapes.

Which multi_term query are you using ? Is it to increase the default value of 1024 ? Any multi term query that don't use top terms rewriting is not a good fit for the span query in general. If you need to perform prefix queries within span queries it is preferable to index the prefix using the edge_ngram filter. This way you can transform any prefix query into a span term query on a single term.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for feedback @jimczi , Multi queries could be prefix , wildcard or worst case -regex. Im using span queries now for exact use case mentioned in ES documentation of span queries - legal discovery. We considered edge ngrams which takes care of part of that use case. data volume( text) and unpredictable cardinality is another issue ( often few thousands for many multi searches)

My request is to consider it as advanced param , but if elastic thinks it is confusing for users - i can change PR to use boolean query limit. Please let me know ! Thanks again

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My original intent was to add a max_expansion to force the rewrite of all multi terms to use a top terms rewrite when used inside a span query but in this case the max_expansion param can be interpreted as the number of top terms to keep during the rewrite. However I prefer your approach which throws an exception on multi_term queries that don't use top terms rewrite and matches more than 1024 terms. In this case we can fail the query with a nice message explaining that a top term rewrite should be used on the multi_term query if applicable. max_expansion is confusing in this scenario and doesn't bring much so I think we should simply honor the BooleanQuery max clause limit.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 i will update the PR in a day or two

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks @nirmalc !

return new SpanMultiTermQueryBuilder(subQuery).queryName(queryName).boost(boost).maxExpansions(maxExpansionValue);
}

public class TopTermSpanBooleanQueryRewriteWithMaxClause extends SpanMultiTermQueryWrapper.SpanRewriteMethod{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does it need to be public ? I'd prefer a private static class since it's not used outside of this class.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just used in one of unit tests..

@nirmalc
Copy link
Contributor Author

nirmalc commented Jun 1, 2018

@jimczi , please let me know if changes look good , I'll squash and rebase with master

Copy link
Contributor

@jimczi jimczi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left more comments but I like it overall.
Can you also merge with latest master, I'll run the tests afterward.


span multi query will hit too many clauses failure if it exceeds the boolean
query limit (defaults to `1024`) unless inner multi query uses top_terms_N
rewrite method.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe add a link to the documentation: <<query-dsl-multi-term-rewrite,rewrite>>


public class SpanMultiQueryIT extends ESIntegTestCase {
public void testTermExpansionExceptionOnSpanFailure() throws ExecutionException, InterruptedException {
org.elasticsearch.common.settings.Settings.Builder builder = org.elasticsearch.common.settings.Settings.builder();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can import Settings and remove the package name ?


createIndex("test", builder.build());
ArrayList<IndexRequestBuilder> reqs = new ArrayList<>();
for (int i = 0; i < BooleanQuery.getMaxClauseCount() + 1; i++) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a big test to check for an exception. I'd something like:

final int origValue = BooleanQuery.getMaxClauseCount();
BooleanQuery.setMaxClauseCount(5);
try {
   // test goes here
} finally {
  BooleanQuery.setMaxClauseCount(origValue);
}

import java.util.ArrayList;
import java.util.concurrent.ExecutionException;

public class SpanMultiQueryIT extends ESIntegTestCase {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you move these tests to org.elasticsearch.search.query.SearchQueryIT, we don't need another IT test to check this.

protected void checkMaxClauseCount(int count) {
if (count > maxExpansions) {
throw new ElasticsearchException("[" + multiTermQuery.toString() + " ] " +
"exceeds maxExpansions [maxExpansions is set to " + maxExpansions + "]");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe mention boolean query max clause and top terms rewrite ?

@nirmalc nirmalc force-pushed the spanMultiRewriteParam branch from 33c5e4b to de00e85 Compare June 1, 2018 13:53
@nirmalc
Copy link
Contributor Author

nirmalc commented Jun 4, 2018

@jimczi new commit should address all code review comments .

Copy link
Contributor

@jimczi jimczi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks good @nirmalc ! I left some minor comments but I think it's getting close.

@@ -1819,4 +1826,52 @@ public void testRangeQueryRangeFields_24744() throws Exception {
assertHitCount(searchResponse, 1);
}

public void testTermExpansionExceptionOnSpanFailure() throws ExecutionException, InterruptedException {
org.elasticsearch.common.settings.Settings.Builder builder = org.elasticsearch.common.settings.Settings.builder();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: add an import for Settings


expectThrows(ElasticsearchException.class, () ->
client().prepareSearch().setIndices("test").setQuery(queryBuilder).get());
}finally{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: missing space


}

public void testDefaultMaxRewriteWithDefaultValue() throws Exception {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a duplicate of the previous test ?

"}," +
"\"boost\" :1.0" +
"}" +
"}";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The xcontent parsing is tested in the parent class, can you modify this test to use the query builders instead ?

@jimczi
Copy link
Contributor

jimczi commented Jun 5, 2018

@nirmalc can you merge with master and resolve the conflicts ? I'll trigger the tests when it's done.

@nirmalc nirmalc force-pushed the spanMultiRewriteParam branch from de00e85 to 68e2b13 Compare June 5, 2018 21:35
@nirmalc
Copy link
Contributor Author

nirmalc commented Jun 5, 2018

Done, sorry - my fork was bit old and i had to get rid of some tests since we decided not to do max_expansion params.

Copy link
Contributor

@jimczi jimczi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @nirmalc , it's really close however I left some comments that need to be addressed before we run the tests.

query limit (defaults to `1024`) unless inner multi query uses top_terms_N
<<query-dsl-multi-term-rewrite,rewrite>> parameter.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please keep the WARNING here and maybe reword this to:

WARNING: `span_multi` queries will hit too many clauses failure if the number of terms that match the query exceeds the boolean query limit (defaults to 1024). To avoid an unbounded expansion you can set the
<<query-dsl-multi-term-rewrite,rewrite method>> of the multi term query to `top_terms_*`
rewrite. Or, if you use `span_multi` on `prefix` query only, you can
activate the <<index-prefix-config,`index_prefixes`>> field option of the `text` field instead. This will
rewrite any prefix query on the field to a a single term query that matches the indexed prefix.

}

public void testTermExpansionExceptionOnSpan() throws ExecutionException, InterruptedException {
Settings.Builder builder = Settings.builder();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this test is needed. testSpanMultiTermQuery does the same thing.

By default span_multi query will limit term expansions = boolean max clause.
This will limit high heap usage in case of high cardinality term
expansions.This applies only if top_terms_N is not used in inner multi
query
@nirmalc nirmalc force-pushed the spanMultiRewriteParam branch from 68e2b13 to 816e37f Compare June 6, 2018 15:28
@nirmalc
Copy link
Contributor Author

nirmalc commented Jun 6, 2018

updated doc + removed test you suggested as already there is one testing that scenario.

Copy link
Contributor

@jimczi jimczi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM thanks @nirmalc , I'll merge if the tests pass

@jimczi
Copy link
Contributor

jimczi commented Jun 6, 2018

test this please

@jimczi jimczi merged commit 75a676c into elastic:master Jun 7, 2018
jimczi pushed a commit that referenced this pull request Jun 7, 2018
By default span_multi query will limit term expansions = boolean max clause.
This will limit high heap usage in case of high cardinality term
expansions. This applies only if top_terms_N is not used in inner multi
query.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>enhancement :Search/Search Search-related issues that do not fall into other categories v6.4.0 v7.0.0-beta1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants