[BUG] StackOverflow crash - large regex produced by Discover filter not limited by index.regex_max_length #1992

Closed
gplechuck opened this issue Jan 27, 2022 · 26 comments · Fixed by #2810
Labels: bug, Indexing & Search, Severity-Critical, v2.0.0

@gplechuck

Describe the bug
It seems to be possible to crash OpenSearch nodes by providing a very large string when attempting to filter on a field value (a StackOverflow related to regexp processing). When filtering on a field value, a query containing a 'suggestions' aggregation is sent to the cluster in the background before the filter is saved, in order to populate an autocomplete drop-down. This aggregation includes a regex constructed by taking the large string and suffixing it with ".*". The resulting regexp does not seem to respect the default index.max_regex_length limit of 1000 - the query is submitted, causing an instant crash of the node.

To Reproduce
Reproduced using the latest OpenSearch Docker images -

{
  "name" : "opensearch-node1",
  "cluster_name" : "opensearch-cluster",
  "cluster_uuid" : "ftk3wyp1RqOa0Yq5SS4ELA",
  "version" : {
    "distribution" : "opensearch",
    "number" : "1.2.4",
    "build_type" : "tar",
    "build_hash" : "e505b10357c03ae8d26d675172402f2f2144ef0f",
    "build_date" : "2022-01-14T03:38:06.881862Z",
    "build_snapshot" : false,
    "lucene_version" : "8.10.1",
    "minimum_wire_compatibility_version" : "6.8.0",
    "minimum_index_compatibility_version" : "6.0.0-beta1"
  },
  "tagline" : "The OpenSearch Project: https://opensearch.org/"
}
  • load sample data in the dashboard
  • in the Discover pane, add a Filter, select a Field, select 'is', and paste a huge string into the search box - 50k chars will do the trick. The node receiving the query will crash.

Expected behavior
Expect that the query will be terminated before being allowed to crash the cluster. The bug is present in some versions of Elasticsearch, but does not appear to be present in the latest version (7.16). It is present in 7.10.2, the last version tracked before the OpenSearch split, so it probably needs to be addressed in the OpenSearch codebase now. In Elasticsearch 7.16 the following response is returned -

{"_shards":{"total":1,"successful":0,"failed":1,"failures":[{"shard":0,"index":"kibana_sample_data_ecommerce","status":"INTERNAL_SERVER_ERROR","reason":{"type":"broadcast_shard_operation_failed_exception","reason":"java.lang.IllegalArgumentException: input automaton is too large: 1001.......................

Plugins
Nothing additional to the default plugins (security, etc.).

Host/Environment (please complete the following information):

~ ❯❯❯ uname -a
5.4.0-96-generic #109-Ubuntu SMP Wed Jan 12 16:49:16 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

~ ❯❯❯ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 20.04.2 LTS
Release:        20.04
Codename:       focal
gplechuck added the bug and untriaged labels on Jan 27, 2022
@gplechuck
Author

[attachment: opensearch-regex-fatal-error.log]

@anasalkouz
Member

Related Issue: #1651

@dreamer-89
Member

Looking into it

dreamer-89 self-assigned this on Feb 3, 2022
@dreamer-89
Member

dreamer-89 commented Feb 8, 2022

Hi @gplechuck:

Thanks for taking the time to raise this issue.

I followed the repro steps mentioned in the description but was not able to reproduce the issue on macOS with the latest changes from the OpenSearch and OpenSearch-Dashboards repos.

* load sample data in the dashboard
* in the Discover pane, add a Filter, select a Field, select 'is', and paste a huge string into the search box - 50k chars will do the trick. The node receiving the query will crash.

I have the following follow-up questions to make sure I understand this issue correctly.

When filtering on a field value, a query containing a 'suggestions' aggregation is sent to the cluster in the background before the filter is saved, in order to populate an autocomplete drop down

Can you please share the endpoint used for this query, or a sample request/response?

This aggregation includes a regex which is constructed from taking the large string and suffixing with a ".*" . The resulting regexp does not seem to respect the default index.max_regex_length limit of 1000

Where do you see that the aggregation regex includes the complete string with ".*" appended?

the query is submitted causing an instant crash of nodes.

Can you please share the OpenSearch logs showing this instant crash?

@gplechuck
Author

Hey @dreamer-89,

Thanks for picking this up. I just tested again using clean container images and reproduced the crash. Just a quick test, but hopefully enough to point you in the right direction for reproducing the issue.

I disabled HTTPS for convenience to grab a traffic capture and view the request (alternatively, enabling audit logging for the REST interface should work just as well).

Steps taken -

  • docker-compose up
  • browse OpenSearch Dashboards
  • load sample data
  • go to the Discover pane
  • pick a field (in this example I chose the ID field)
  • wait for it to start auto-populating the drop-down with ID values
  • paste a large string
  • opensearch-node1 dies, and opensearch-dashboards receives an internal server error

Here's a sample from a traffic capture I ran while doing this test -

POST /opensearch_dashboards_sample_data_logs/_search HTTP/1.1
x-opensearch-product-origin: opensearch-dashboards
x-opaque-id: 922beffe-66c6-43a9-96ba-56a839106c7b
content-type: application/json
Host: opensearch-node1:9200
Content-Length: 188
Connection: keep-alive

{"size":0,"timeout":"1000ms","terminate_after":100000,"query":{"bool":{"filter":[]}},"aggs":{"suggestions":{"terms":{"field":"_id","include":".*","execution_hint":"map","shard_size":10}}}}HTTP/1.1 200 OK
X-Opaque-Id: 922beffe-66c6-43a9-96ba-56a839106c7b
content-type: application/json; charset=UTF-8
content-length: 746

{"took":90,"timed_out":false,"terminated_early":false,"_shards":{"total":1,"successful":1,"skipped":0,"failed":0},"hits":{"total":{"value":10000,"relation":"gte"},"max_score":null,"hits":[]},"aggregations":{"suggestions":{"doc_count_error_upper_bound":0,"sum_other_doc_count":14064,"buckets":[{"key":"-17fmn4B6PJPuEScb9cD","doc_count":1},{"key":"-17fmn4B6PJPuEScb9gG","doc_count":1},{"key":"-17fmn4B6PJPuESccdlt","doc_count":1},{"key":"-17fmn4B6PJPuESccdpv","doc_count":1},{"key":"-17fmn4B6PJPuEScctv0","doc_count":1},{"key":"-17fmn4B6PJPuEScctz1","doc_count":1},{"key":"-17fmn4B6PJPuEScd-M4","doc_count":1},{"key":"-17fmn4B6PJPuEScd-Q5","doc_count":1},{"key":"-17fmn4B6PJPuEScdN0J","doc_count":1},{"key":"-17fmn4B6PJPuEScdN4K","doc_count":1}]}}}POST /opensearch_dashboards_sample_data_logs/_search HTTP/1.1
x-opensearch-product-origin: opensearch-dashboards
x-opaque-id: ddf6ded0-8e21-4cd0-aaa9-fb20dcf9bd3d
content-type: application/json
Host: opensearch-node1:9200
Content-Length: 60194
Connection: keep-alive

{"size":0,"timeout":"1000ms","terminate_after":100000,"query":{"bool":{"filter":[]}},"aggs":{"suggestions":{"terms":{"field":"_id","include":"abcabcabcabcabcabcabcabcabcabcabcabcabcabcabcabcabcabcabcabcabcabcabcabcabcabcabcabcabcabcabcabcabcabcabcabcabcabcabcabcabcabcabcabcabcabcabcabcabcabcabcabcabcabcabcabcabcabcabcabcabcabcabcabcabcabcabcabcabcabcabcabcabcabcabcabcabcabcabcabcabcabcabcabcabcabcabcabcabcabcabcabcabcabcabcabcabcabcabcabcabcabcabcabcabcabcabcabcabcabcabcabcab......................................................

Note the first query contains "include": ".*" - this auto-populated the drop-down when no text was entered (i.e. all _id values). The second query was sent when I pasted 50K chars into the value box - I did not submit the request, just pasted the chars.

I attached the OpenSearch logs in an earlier comment - see 'opensearch-regex-fatal-error.log'. Here's a snippet from the container stdout viewable in the terminal (please check the earlier attachment for the full error). We can see opensearch-node1 die, after which point opensearch-dashboards cannot connect anymore.

...
...
opensearch-node1         |      at org.apache.lucene.util.automaton.RegExp.parseConcatExp(RegExp.java:1146)
opensearch-node1         |      at org.apache.lucene.util.automaton.RegExp.parseConcatExp(RegExp.java:1146)
opensearch-node1         |      at org.apache.lucene.util.automaton.RegExp.parseConcatExp(RegExp.java:1146)
opensearch-node1         |      at org.apache.lucene.util.automaton.RegExp.parseConcatExp(RegExp.java:1146)
opensearch-node1         | Killing performance analyzer process 106
opensearch-node1         | OpenSearch exited with code 126
opensearch-node1         | Performance analyzer exited with code 143
opensearch-dashboards    | {"type":"log","@timestamp":"2022-02-10T21:43:38Z","tags":["error","opensearch","data"],"pid":1,"message":"Request error, retrying\nPOST http://opensearch-node1:9200/opensearch_dashboards_sample_data_logs/_search => socket hang up"}
opensearch-dashboards    | {"type":"log","@timestamp":"2022-02-10T21:43:38Z","tags":["warning","opensearch","data"],"pid":1,"message":"Unable to revive connection: http://opensearch-node1:9200/"}
opensearch-dashboards    | {"type":"log","@timestamp":"2022-02-10T21:43:38Z","tags":["warning","opensearch","data"],"pid":1,"message":"No living connections"}
opensearch-dashboards    | {"type":"error","@timestamp":"2022-02-10T21:43:37Z","tags":[],"pid":1,"level":"error","error":{"message":"Internal Server Error","name":"Error","stack":"Error: Internal Server Error\n    at HapiResponseAdapter.toError (/usr/share/opensearch-dashboards/src/core/server/http/router/response_adapter.js:145:19)\n    at HapiResponseAdapter.toHapiResponse (/usr/share/opensearch-dashboards/src/core/server/http/router/response_adapter.js:99:19)\n    at HapiResponseAdapter.handle (/usr/share/opensearch-dashboards/src/core/server/http/router/response_adapter.js:94:17)\n    at Router.handle (/usr/share/opensearch-dashboards/src/core/server/http/router/router.js:164:34)\n    at process._tickCallback (internal/process/next_tick.js:68:7)"},"url":{"protocol":null,"slashes":null,"auth":null,"host":null,"port":null,"hostname":null,"hash":null,"search":null,"query":{},"pathname":"/api/opensearch-dashboards/suggestions/values/opensearch_dashboards_sample_data_logs","path":"/api/opensearch-dashboards/suggestions/values/opensearch_dashboards_sample_data_logs","href":"/api/opensearch-dashboards/suggestions/values/opensearch_dashboards_sample_data_logs"},"message":"Internal Server Error"}
opensearch-dashboards    | {"type":"response","@timestamp":"2022-02-10T21:43:37Z","tags":[],"pid":1,"method":"post","statusCode":500,"req":{"url":"/api/opensearch-dashboards/suggestions/values/opensearch_dashboards_sample_data_logs","method":"post","headers":{"host":"127.0.0.1:5601","user-agent":"Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:96.0) Gecko/20100101 Firefox/96.0","accept":"*/*","accept-language":"en-US,en;q=0.5","accept-encoding":"gzip, deflate","referer":"http://127.0.0.1:5601/app/discover","content-type":"application/json","osd-version":"1.2.0","origin":"http://127.0.0.1:5601","content-length":"60048","dnt":"1","connection":"keep-alive","sec-fetch-dest":"empty","sec-fetch-mode":"cors","sec-fetch-site":"same-origin"},"remoteAddress":"192.168.32.1","userAgent":"Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:96.0) Gecko/20100101 Firefox/96.0","referer":"http://127.0.0.1:5601/app/discover"},"res":{"statusCode":500,"responseTime":518,"contentLength":9},"message":"POST /api/opensearch-dashboards/suggestions/values/opensearch_dashboards_sample_data_logs 500 518ms - 9.0B"}
opensearch-node1 exited with code 0
opensearch-dashboards    | {"type":"log","@timestamp":"2022-02-10T21:43:39Z","tags":["error","opensearch","data"],"pid":1,"message":"[ConnectionError]: getaddrinfo EAI_AGAIN opensearch-node1 opensearch-node1:9200"}
opensearch-dashboards    | {"type":"log","@timestamp":"2022-02-10T21:43:39Z","tags":["error","savedobjects-service"],"pid":1,"message":"Unable to retrieve version information from OpenSearch nodes."}
opensearch-dashboards    | {"type":"log","@timestamp":"2022-02-10T21:43:41Z","tags":["error","opensearch","data"],"pid":1,"message":"[ConnectionError]: getaddrinfo EAI_AGAIN opensearch-node1 opensearch-node1:9200"}
opensearch-dashboards    | {"type":"log","@timestamp":"2022-02-10T21:43:44Z","tags":["error","opensearch","data"],"pid":1,"message":"[ConnectionError]: getaddrinfo EAI_AGAIN opensearch-node1 opensearch-node1:9200"}
opensearch-dashboards    | {"type":"log","@timestamp":"2022-02-10T21:43:46Z","tags":["error","opensearch","data"],"pid":1,"message":"[ConnectionError]: getaddrinfo EAI_AGAIN opensearch-node1 opensearch-node1:9200"}
opensearch-dashboards    | {"type":"log","@timestamp":"2022-02-10T21:43:49Z","tags":["error","opensearch","data"],"pid":1,"message":"[ConnectionError]: getaddrinfo EAI_AGAIN opensearch-node1 opensearch-node1:9200"}
...
...

Hope that helps! Any other questions, shout.

Cheers

@CEHENKLE
Member

Hey @dreamer-89! Can we get an update on this issue?

Thanks!
/C

@dreamer-89
Member

dreamer-89 commented Apr 4, 2022

Apologies, @gplechuck, for the delay on this issue, and thank you for sharing the detailed steps for issue reproduction along with the traffic capture. This is very helpful and I can easily reproduce the issue following the provided steps.

I am also able to reproduce the bug (node shutdown) locally on my Mac by starting the OpenSearch engine and OpenSearch Dashboards from the codebase and following the replication steps.

@dreamer-89
Member

Steps to reproduce directly on the engine side:

  1. Create the index
curl -XPUT localhost:9200/test-index
  2. Send the search request
curl -XPOST "localhost:9200/test-index/_search?pretty" -H 'Content-Type: application/json' -d '{
  "query": {
    "bool": {"filter": []}
  },
  "aggs": {"suggestions": {"terms": {"field": "_id", "include": "50k+ chars", "execution_hint": "map", "shard_size": 10}}}
}'

@kartg
Member

kartg commented Apr 5, 2022

The root cause of this looks to be within Lucene's RegExp implementation and how it parses strings via recursive method calls. Through a chain of calls from the constructor, we reach this recursive call that parses the string literal. With a large input like the one in this issue, we hit a StackOverflowError.

This scenario can be repro'd with the following unit test:

    public void testStackOverflow() {
        // Build a 50,000-character literal and pass it as the "include" pattern;
        // IncludeExclude eagerly parses it into a Lucene RegExp and overflows the stack.
        StringBuilder strBuilder = new StringBuilder();
        for (int i = 0; i < 50000; i++) {
            strBuilder.append("a");
        }
        new IncludeExclude(strBuilder.toString(), null);
    }

The max_regex_length setting is ignored in this code path - it is only checked in QueryBuilder and QueryParser implementations.
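For context, the guard on the query path amounts to a plain length check on the pattern string before Lucene's recursive parser is ever invoked. Below is a minimal standalone sketch of that kind of check; the class name, the hard-coded default of 1000, and the error wording are illustrative only, not the actual OpenSearch code.

import org.apache.lucene.util.automaton.RegExp;

class MaxRegexLengthGuard {
    // Illustrative default; in OpenSearch the limit comes from the
    // index-level max_regex_length setting, which defaults to 1000.
    static final int DEFAULT_MAX_REGEX_LENGTH = 1000;

    // Reject over-long patterns before Lucene's recursive RegExp parser ever sees them.
    static RegExp parseWithLimit(String pattern, int maxRegexLength) {
        if (pattern.length() > maxRegexLength) {
            throw new IllegalArgumentException("The length of regex [" + pattern.length()
                + "] exceeds the allowed maximum of [" + maxRegexLength + "]");
        }
        return new RegExp(pattern);
    }

    public static void main(String[] args) {
        String huge = "a".repeat(50_000); // same shape as the pasted filter value in this issue
        try {
            parseWithLimit(huge, DEFAULT_MAX_REGEX_LENGTH);
        } catch (IllegalArgumentException e) {
            System.out.println("Rejected: " + e.getMessage()); // the query fails; the node survives
        }
    }
}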

@CEHENKLE
Member

CEHENKLE commented Apr 5, 2022

Hmmmm... this may be something we want to submit a fix for upstream in Lucene, regardless of how we mitigate within OpenSearch.

@dblock
Member

dblock commented Apr 5, 2022

@kartg Can you please open an issue in Lucene? A StackOverflow may not be a problem, though - that's a valid exception.

@kartg
Member

kartg commented Apr 5, 2022

Update - I've written up a mitigation that attempts to respect the max_regex_length limit when parsing the aggregations in a search query. I'm currently working on a test method to verify that the problem is no longer present with this in place. I'm not super happy with how the mitigation is implemented, so I'll raise a draft PR to garner feedback first.

As for opening an issue with Lucene, I'm still waiting on my account recovery email. In the meantime, I found an old issue that alludes to this - https://issues.apache.org/jira/browse/LUCENE-6156. It was subsequently closed citing the fact that Elasticsearch (at the time) "shrinks the jvm default stack size", so Lucene parsing was not the cause of the problem.

Once I'm done with the mitigation, I'll need to investigate whether the OpenSearch Xss configuration is affecting the unit test I outlined above, or whether I can repro the StackOverflow independently of that.

@andrross
Member

andrross commented Apr 5, 2022

I think we're just hitting the fact that the recursive algorithm uses one stack frame per regex operation. See this simple test not using OpenSearch at all:

$ ls
RegExpTest.java       lucene-core-9.1.0.jar
$ cat RegExpTest.java
class RegExpTest {
    public static void main(String[] args) {
        StringBuilder strBuilder = new StringBuilder();
        for (int i = 0; i < 50000; i++) {
            strBuilder.append("a");
        }
        try {
            new org.apache.lucene.util.automaton.RegExp(strBuilder.toString());
        } catch (StackOverflowError e) {
            System.out.println("Stack overflow");
            System.exit(-1);
        }
        System.out.println("Success");
    }
}
$ javac -cp './lucene-core-9.1.0.jar:.' RegExpTest.java
$ java -cp './lucene-core-9.1.0.jar:.' RegExpTest
Stack overflow
$ java -Xss1G -cp './lucene-core-9.1.0.jar:.' RegExpTest
Success

@kartg
Member

kartg commented Apr 5, 2022

Thanks @andrross! FWIW, I've cut an issue with Lucene, though they may just state that this is a limitation of the recursive pattern and that it isn't possible to move to a non-recursive model.

https://issues.apache.org/jira/browse/LUCENE-10501

@kartg
Member

kartg commented Apr 6, 2022

My original attempt at a mitigation failed miserably, so I'm back at the drawing board.

To reiterate, this change does not seek to prevent the StackOverflow error from regex parsing; doing so is not feasible since the root cause lies within the Lucene implementation and the overflow threshold is dictated by a JVM setting. Instead, we're seeking to correctly enforce the index.max_regex_length index-level setting. I believe this is currently enforced for the search query itself (via QueryStringQueryParser), but not for aggregations.

The complexity here stems from the fact that the logic for parsing the "include" regex is set up at bootstrap/startup time across multiple term parsers (example). Since there is no notion of an "index" in this context, the index-level setting/limit cannot be retrieved or applied here.

The right location to enforce this would be at the runtime point where a search query against an index arrives at the node and must be parsed. In the ideal case, the QueryContext object would then be passed to the IncludeExclude parsing implementation, which would own the enforcement of the regex length limit.
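A rough sketch of that idea follows; the types and method names are placeholders for illustration, not the actual OpenSearch classes. The point is that the per-index limit only becomes known once the request is parsed against a concrete index, so a context object would carry it down to wherever the include/exclude regex is handled.

// Placeholder types for illustration only; not the real OpenSearch API.
interface QueryParsingContext {
    int maxRegexLength(); // resolved from the target index's settings at request time
}

final class IncludeExcludeLimitCheck {
    // Called while parsing the aggregation, where the context (and thus the
    // index-level limit) is available, rather than at node startup.
    static void enforce(String regex, QueryParsingContext context) {
        if (regex != null && regex.length() > context.maxRegexLength()) {
            throw new IllegalArgumentException("The length of regex [" + regex.length()
                + "] in the [include] clause exceeds the allowed maximum of ["
                + context.maxRegexLength() + "]");
        }
    }
}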

@dblock
Member

dblock commented Apr 6, 2022

@kartg The major problem is that the node dies in this case, isn't it? Shouldn't the fix be a catch all that causes the query to fail instead of the process to die? Then avoiding the exception becomes desirable but not critical.

@andrross
Member

andrross commented Apr 6, 2022

Shouldn't the fix be a catch all that causes the query to fail instead of the process to die?

Catching the StackOverflowError could be a tactical fix, but I'm pretty wary of it. Catching java.lang.Errors is generally a bad idea. In this case I'd be worried about things like other threads running into problems at the same moment that this thread exhausts the stack size. (The stack size limit might be per-thread in Java, so this may not be an actual problem, but it doesn't change the fact that I'm still wary about handling Errors.)

@kartg Would it make things easier if IncludeExclude was refactored not to store RegExp instances as fields, and instead just store the original string and create the RegExp instances on demand? It looks like most of the usages of those RegExp fields are to retrieve the original string anyway, and if I'm reading it right there's only one place where it is needed as a RegExp instance. It might be easier to plumb in the length enforcement check at that point.
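A minimal sketch of that refactor idea, using a simplified stand-in for the real IncludeExclude (class and method names here are illustrative only, and the limit is assumed to be supplied by the caller at the point the automaton is actually needed):

import org.apache.lucene.util.automaton.Automaton;
import org.apache.lucene.util.automaton.RegExp;

// Simplified stand-in: keep the raw pattern as the field and only build the
// (recursively parsed) RegExp on demand, after the length limit is checked.
final class LazyIncludeExclude {
    private final String include; // raw string instead of an eager RegExp field

    LazyIncludeExclude(String include) {
        this.include = include;
    }

    String includeString() {
        return include; // most call sites only need the original string back
    }

    Automaton includeAutomaton(int maxRegexLength) {
        if (include.length() > maxRegexLength) {
            throw new IllegalArgumentException("regex length [" + include.length()
                + "] exceeds max_regex_length [" + maxRegexLength + "]");
        }
        // The RegExp is constructed only here, on demand, after the check above.
        return new RegExp(include).toAutomaton();
    }
}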

@kartg
Member

kartg commented Apr 6, 2022

Would it make things easier if IncludeExclude was refactored not to store RegExp instances as fields, and instead just store the original string and then create the RegExp instances on demand?

That's a really good idea - I think this approach might work. I'll work on it.

Even with the limit check in place, there's no guarantee that we won't still hit a StackOverflowError, so I think it makes sense to implement a check for the error to try and avoid the entire node going down.

@andrross @dblock what do you both think? Should we implement both guardrails?

@andrross
Member

andrross commented Apr 6, 2022

Even with the limit check in place, there's no guarantee that we won't still hit a StackOverflowError

As long as the default regex limit combined with the default Xss setting makes it impossible to hit the stack overflow condition, I don't think we need additional handling logic. I'm not sure we need to be super defensive against misconfigurations provided the defaults are safe.

@dblock
Member

dblock commented Apr 6, 2022

I think I agree with @andrross.

On catchalls, it sounds like in general we want the node to go down if it, for example, runs out of memory or goes into some infinite recursion? Is this by design and do we spell this out anywhere?

kartg self-assigned this and unassigned dreamer-89 on Apr 6, 2022
@dreamer-89
Member

dreamer-89 commented Apr 6, 2022

Sharing my observations on this issue so far. I see that more variants of aggregations are currently broken (listed below), all of them belonging to bucket aggregations. I think the proposed solution should be able to address them all. I also spot-checked other variants of the _search API utilizing regex fields, but didn't find any that crashed the node. I also verified that the issue persists in, and impacts, the latest version of the Elasticsearch server.

  1. Term aggregation (this issue)
{
    "query": {
        "match": {
            "field":"hello"
        }
    },
    "aggregations": {
        "my_agg": {
            "terms": {
                "field": "field_name",
                "include": "50k+ chars"
            }
        }
    }
}
  2. Rare terms aggregations
{
...
    "aggs": {
        "rare_term_agg": {
            "rare_terms": {
         ...
            }
        }
    }
}
  3. Significant terms aggregations
...
    "aggregations": {
        "sig_term_agg": {
            "significant_terms": {
           ...
            }
        }
    }
}
  4. Significant text aggregations
...
    "aggregations": {
        "sig_text_agg": {
            "significant_text": {
          ...
            }
        }
    }
}

@kartg
Member

kartg commented Apr 7, 2022

Update - Done with the code change, and my empirical testing shows that the node no longer falls over for large strings. However, this appears to be from the fact that we're no longer eagerly creating a RegExp instance and not from the max_regex_length limit being enforced. I'm still working on identifying and verifying the code paths that would hit this limit.

@kartg
Member

kartg commented Apr 7, 2022

Confirmed that delaying the parsing of the RegExp mitigates the issue and correctly enforces the regex limit. The error in my testing above was because I was querying for a field (_id) that did not exist on the index (test-index). The easy workaround is to add a "missing" clause.

Thus, to expand on @dreamer-89's repro steps:

curl -XPUT localhost:9200/test-index

then

curl -XPOST "localhost:9200/test-index/_search?pretty" -H 'Content-Type: application/json' -d '{
  "query": {
    "bool": {"filter": []}
  },
  "aggs": {"suggestions": {"terms": {"field": "_id", "include": "<large string here>", "execution_hint": "map", "missing": "test", "shard_size": 10}}}
}'

@dblock
Member

dblock commented Apr 7, 2022

@dreamer-89 Given that this takes a node out, mind checking whether Elasticsearch has this issue today, and opening a bug for them, please?

elastic/elasticsearch#82923

@dreamer-89
Member

@dreamer-89 Given that this takes a node out, mind checking whether Elasticsearch has this issue today, and opening a bug for them, please?

@dblock: Thanks for your comment. I verified yesterday that the bug persists in the latest version of Elasticsearch. Thank you for finding and sharing the bug :)

@gplechuck
Author

Apologies, @gplechuck, for the delay on this issue, and thank you for sharing the detailed steps for issue reproduction along with the traffic capture. This is very helpful and I can easily reproduce the issue following the provided steps.

I am also able to reproduce the bug (node shutdown) locally on my Mac by starting the OpenSearch engine and OpenSearch Dashboards from the codebase and following the replication steps.

No problem, and thanks for following up on it.
