[BUG] StackOverflow crash - large regex produced by Discover filter not limited by index.regex_max_length #1992
Related Issue: #1651
Looking into it.
Hi @gplechuck: Thanks for taking the time to raise this issue. I followed the repro steps mentioned in the description, but was not able to reproduce the issue on my macOS setup with the latest changes from the OpenSearch and OpenSearch-Dashboards repos.
I have the following follow-up questions to ensure I understand this issue correctly:
Can you please share the endpoint used for this query, or a sample request/response?
Where do you see the aggregated regex include the complete string with ".*" appended?
Can you please share the OpenSearch logs showing this instant crash?
Hey @dreamer-89, thanks for picking this up. Just tested again using clean container images and reproduced. Just a quick test, but hopefully enough to point you in the right direction for reproducing the issue. I disabled HTTPS for convenience to grab a traffic capture in order to view the request (alternatively, enabling audit logging for the REST interface should work just as well). Steps taken -
Here's a sample from a traffic capture I ran while doing this test -
Note the first query contains "include": ".*" - this autocompleted the drop-down when no text was entered (e.g. all _id values). The second query was sent when I pasted 50K chars into the value box - I did not submit the request, just pasted the chars. I attached the OpenSearch logs in an earlier comment - see 'opensearch-regex-fatal-error.log'. Here's a snippet from the container stdout viewable in the terminal (please check the earlier attachment for the full error). We can see opensearch-node1 die, after which point opensearch-dashboards can no longer connect.
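For reference, the captured requests are terms-aggregation "suggestions" queries along these lines. This is an illustrative reconstruction rather than the exact capture; the index name is an assumption, and the `_id` field follows the description above:

```
POST /kibana_sample_data_ecommerce/_search
{
  "size": 0,
  "aggs": {
    "suggestions": {
      "terms": {
        "field": "_id",
        "include": ".*"
      }
    }
  }
}
```

When text is pasted into the value box, the `include` value becomes the pasted string with `.*` appended, i.e. `"include": "<50,000 pasted characters>.*"`.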
Hope that helps! Any other questions, shout. Cheers
Hey @dreamer-89! Can we get an update on this issue? Thanks!
Apologies @gplechuck for the delay on this issue, and thank you for sharing the detailed steps for issue reproduction along with the traffic capture. This is very helpful and I can easily reproduce the issue following the provided steps. I am also able to reproduce the bug (node shutdown) locally on my Mac by starting the OpenSearch engine and OpenSearch Dashboards from the codebase and following the replication steps.
Steps to reproduce directly on the engine side:
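The original commands aren't preserved above; a minimal sketch of an equivalent engine-side reproduction against a local cluster (the index name matches the sample data used in the issue body; everything else is an assumption):

```bash
# Build a ~50,000 character string; each character costs one parser stack frame.
# (Brace expansion requires bash.)
LONG=$(printf 'a%.0s' {1..50000})

# Send a terms aggregation whose include regex embeds the long string.
# On affected versions this crashes the node with a StackOverflowError.
curl -s -H 'Content-Type: application/json' \
  'http://localhost:9200/kibana_sample_data_ecommerce/_search' \
  -d '{
    "size": 0,
    "aggs": {
      "suggestions": {
        "terms": { "field": "_id", "include": "'"${LONG}"'.*" }
      }
    }
  }'
```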
The root cause of this looks to be within Lucene's `RegExp` parsing. This scenario can be repro'd with the following unit test:

```java
public void testStackOverflow() {
    // A 50,000-character plain string is enough to exhaust the stack
    // when the include pattern is parsed as a regex.
    StringBuilder strBuilder = new StringBuilder();
    for (int i = 0; i < 50000; i++) {
        strBuilder.append("a");
    }
    new IncludeExclude(strBuilder.toString(), null);
}
```
Hmmm... this may be something we want to submit a fix for upstream in Lucene, regardless of how we mitigate within OpenSearch.
@kartg Can you please open an issue in Lucene?
Update - I've written up a mitigation that attempts to respect the `index.max_regex_length` setting. As for opening an issue with Lucene, I'm still waiting on my account recovery email. In the meantime, I found an old issue that alludes to this - https://issues.apache.org/jira/browse/LUCENE-6156. It was subsequently closed citing the fact that Elasticsearch (at the time) "shrinks the jvm default stack size", so Lucene parsing was not considered the cause of the problem. Once I'm done with the mitigation, I'll need to investigate whether OpenSearch similarly shrinks the default JVM stack size.
I think we're just hitting the fact that the recursive algorithm uses one stack frame per regex operation. See this simple test not using OpenSearch at all:
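The referenced test didn't survive formatting above; a minimal standalone sketch of the same idea, assuming only Lucene's `org.apache.lucene.util.automaton.RegExp` is on the classpath:

```java
import org.apache.lucene.util.automaton.RegExp;

public class RegexRecursionDemo {
    public static void main(String[] args) {
        // Lucene's RegExp constructor parses recursively, consuming roughly
        // one stack frame per character/operation in the pattern.
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 50_000; i++) {
            sb.append('a');
        }
        // With a default -Xss this throws java.lang.StackOverflowError
        // before any OpenSearch code is involved.
        new RegExp(sb.toString());
    }
}
```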
Thanks @andrross! FWIW, I've cut an issue with Lucene, though they may just state this is a limitation of the recursive pattern and it isn't possible to move to a non-recursive model.
My original attempt at a mitigation failed miserably, so I'm back at the drawing board. To reiterate, this change does not seek to prevent the StackOverflow error from reg-ex parsing; doing so is not feasible since the root cause lies within the Lucene implementation and the overflow threshold is dictated by a JVM setting. Instead, we're seeking to correctly enforce the `index.max_regex_length` limit. The complexity here stems from the fact that the logic for parsing the "include" reg-ex is set up at bootstrap/startup time across multiple term parsers (example). Since there is no notion of an "index" in this context, the index-level setting/limit cannot be retrieved/applied here. The right location to enforce this would be at the runtime point where a search query against an index arrives at the node and must be parsed.
@kartg The major problem is that the node dies in this case, isn't it? Shouldn't the fix be a catch all that causes the query to fail instead of the process to die? Then avoiding the exception becomes desirable but not critical. |
Catching the StackOverflowError could be a tactical fix, but I'm pretty wary of it. @kartg Would it make things easier if IncludeExclude was refactored not to store RegExp instances as fields, and instead just store the original string and then create the RegExp instances on demand? It looks like most of the usages of those RegExp fields are to retrieve the original string anyway, and if I'm reading it right there's only one place where it is needed as a RegExp instance. It might be easier to plumb in the length enforcement check at that point.
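A rough sketch of that suggestion, with hypothetical method names (this is not the actual OpenSearch patch):

```java
import org.apache.lucene.util.automaton.RegExp;

// Hypothetical: keep the raw pattern string and defer RegExp parsing to
// query time, where the per-index length limit is known and checked first.
public class IncludeExclude {
    private final String include; // raw regex source instead of a parsed RegExp

    public IncludeExclude(String include, String exclude) {
        this.include = include; // no eager new RegExp(...) at bootstrap
    }

    public String getIncludeString() {
        return include; // most call sites only need the original string back
    }

    // The single place that truly needs a RegExp can enforce the limit here.
    RegExp includeAsRegExp(int maxRegexLength) {
        if (include.length() > maxRegexLength) {
            throw new IllegalArgumentException("regex length [" + include.length()
                + "] exceeds index.max_regex_length [" + maxRegexLength + "]");
        }
        return new RegExp(include);
    }
}
```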
That's a really good idea - I think this approach might work. I'll work on it. Even with the limit check in place, there's no guarantee that we won't still hit a StackOverflowError. @andrross @dblock what do you both think? Should we implement both guardrails?
As long as the default regex limit combined with the default Xss setting makes it impossible to hit the stack overflow condition, I don't think we need additional handling logic. I'm not sure that we need to be super defensive against misconfigurations provided the defaults are safe.
I think I agree with @andrross. On catch-alls: it sounds like, in general, we want the node to go down if it, for example, runs out of memory or goes into some infinite recursion? Is this by design, and do we spell this out anywhere?
Sharing my observations on this issue so far. More variants of aggregations are currently broken (listed below); all of them belong to the same include/exclude parsing path.
Update - Done with the code change, and my empirical testing shows that the node no longer falls over for large strings. However, this appears to be because we're no longer eagerly creating a `RegExp` instance during parsing.
Confirmed that delaying the parsing of the `RegExp` is what avoids the crash. Thus, to expand on @dreamer-89's repro steps:
then
@dreamer-89
@dblock: Thanks for your comment. I verified yesterday that the bug persists in the latest version of Elasticsearch here. Thank you for finding and sharing the bug :)
No problem, and thanks for following up on it.
Describe the bug
It seems to be possible to crash OpenSearch nodes by providing a very large string when attempting to filter on a field value (StackOverflow related to regexp processing). When filtering on a field value, a query containing a 'suggestions' aggregation is sent to the cluster in the background before the filter is saved, in order to populate an autocomplete drop-down. This aggregation includes a regex which is constructed by taking the large string and suffixing it with ".*". The resulting regexp does not seem to respect the default index.max_regex_length limit of 1000 - the query is submitted, causing an instant crash of nodes.
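For reference, `index.max_regex_length` is a dynamic index setting; a quick way to inspect or adjust it (the index name is taken from the sample response below, the endpoints are standard settings APIs):

```
GET /kibana_sample_data_ecommerce/_settings?include_defaults=true&filter_path=**.max_regex_length

PUT /kibana_sample_data_ecommerce/_settings
{
  "index": { "max_regex_length": 1000 }
}
```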
To Reproduce
Reproduced using the latest OpenSearch Docker images -
Expected behavior
Expect that the query will be terminated before being allowed to crash the cluster. The bug is present in some versions of Elasticsearch, but does not appear to be present in the latest version (7.16). It's present in 7.10.2, the last version tracked before the OpenSearch split - so it probably needs to be addressed in the OpenSearch codebase now. In Elasticsearch 7.16, the following response is returned -
```
{"_shards":{"total":1,"successful":0,"failed":1,"failures":[{"shard":0,"index":"kibana_sample_data_ecommerce","status":"INTERNAL_SERVER_ERROR","reason":{"type":"broadcast_shard_operation_failed_exception","reason":"java.lang.IllegalArgumentException: input automaton is too large: 1001 ...
```
Plugins
Nothing additional to the default plugins (security, etc.)
Host/Environment (please complete the following information):