/_cat/indices: race condition/exception thrown when creating/deleting indices rapidly #17395

Closed
joshuar opened this issue Mar 30, 2016 · 6 comments · Fixed by #18545
Labels
>bug · :Distributed Coordination/Allocation (All issues relating to the decision making around placing a shard, both master logic & on the nodes) · help wanted · adoptme

Comments

@joshuar
Contributor

joshuar commented Mar 30, 2016

Elasticsearch Versions:
2.1.x - 2.2.x

JVM Version:
OpenJDK 64-Bit Server VM (build 25.77-b03, mixed mode)

Description of the problem including expected versus actual behavior:

It looks like if you create and delete indices very rapidly, you can sometimes get a 404 index_not_found_exception from the /_cat/indices API. Rather than seeing the indices that still exist, you get an exception because an index was created and deleted between the endpoint resolving the index names and then calling the indices stats API for those indices.

Steps to reproduce:

Attached is a script that should hopefully reproduce the problem, though it can take some time (a few minutes, depending on how quickly your ES cluster handles requests :)

cat-indices-issue.sh.zip

The script is Bash. It runs a number of background processes that simply spin, creating and then deleting an index with the REST API, and it waits until the output of _cat/indices produces the exception, then exits. NOTE: it will clean up any background processes it creates but won't clean up any test indices that might still exist.
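
Since the attachment isn't inlined here, a minimal sketch of what the script does (the endpoint, index names, and worker count are illustrative, not the original script):

    #!/usr/bin/env bash
    # Sketch of the attached repro script, not the original.
    ES=http://localhost:9200

    churn() {
      # Spin creating and then deleting the same index via the REST API.
      while true; do
        curl -s -XPUT "$ES/testindex$1" >/dev/null
        curl -s -XDELETE "$ES/testindex$1" >/dev/null
      done
    }

    for i in 1 2 3; do churn "$i" & done
    trap 'kill $(jobs -p) 2>/dev/null' EXIT   # clean up background workers

    # Poll _cat/indices until it returns the exception instead of a table.
    while true; do
      out=$(curl -s "$ES/_cat/indices")
      if grep -q index_not_found_exception <<<"$out"; then
        echo "$out"
        break
      fi
    done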

When the script dies, it will produce output like:

{"error":{"root_cause":[{"type":"index_not_found_exception","reason":"no such index","resource.type":"index_or_alias","resource.id":"testindex3","index":"testindex3"}],"type":"index_not_found_exception","reason":"no such index","resource.type":"index_or_alias","resource.id":"testindex3","index":"testindex3"},"status":404}
...output of the script cleaning up after itself...
@kwilczynski

@joshuar thank you for logging this! (via @faeldt)

To add to the above, sometimes a "type": "null_pointer_exception" would show up when querying /_cat/indices or /_cat/shards, etc. (sorry, I don't have the full JSON output at hand).

Since this is machine-parseable output, it would be delightful for it to suppress any errors and just not return JSON (or any errors for that matter, unless you add ?e=true or something).

@clintongormley
Contributor

@kwilczynski if you happen to come across this NPE again, please paste the stack trace from the logs into this issue

@kwilczynski

@clintongormley I will try to fish something out on the Elasticsearch side. Since I fixed this on my side (by retrying on error) and I don't log at debug level in production, I've lost it for now, sadly.
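
(A client-side retry along those lines might look like the sketch below; the function name and endpoint are illustrative assumptions, not the actual code referenced above.)

    # Assumed shape of the retry-on-error workaround: retry _cat/indices
    # a few times if it returns a JSON error payload instead of a table.
    cat_indices_with_retry() {
      local es=${1:-http://localhost:9200} out attempt
      for attempt in 1 2 3 4 5; do
        out=$(curl -s "$es/_cat/indices")
        # Tabular _cat output has no "error" key; the failure mode does.
        if ! grep -q '"error"' <<<"$out"; then
          printf '%s\n' "$out"
          return 0
        fi
        sleep 1
      done
      return 1
    }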

@kwilczynski

@clintongormley I found the following NPE going through the older logs:

RemoteTransportException[[prd11-c-tky-master-search-catalogpf][10.184.20.209:9300][cluster:monitor/stats[n]]]; nested: NullPointerException;
Caused by: java.lang.NullPointerException
     at org.elasticsearch.action.admin.cluster.health.ClusterIndexHealth.<init>(ClusterIndexHealth.java:73)
     at org.elasticsearch.action.admin.cluster.stats.TransportClusterStatsAction.nodeOperation(TransportClusterStatsAction.java:127)
     at org.elasticsearch.action.admin.cluster.stats.TransportClusterStatsAction.nodeOperation(TransportClusterStatsAction.java:58)
     at org.elasticsearch.action.support.nodes.TransportNodesAction$NodeTransportHandler.messageReceived(TransportNodesAction.java:211)
     at org.elasticsearch.action.support.nodes.TransportNodesAction$NodeTransportHandler.messageReceived(TransportNodesAction.java:207)
     at org.elasticsearch.transport.TransportService$4.doRun(TransportService.java:350)
     at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
     at java.lang.Thread.run(Thread.java:745)

@clintongormley
Contributor

thanks @kwilczynski

@djschny
Contributor

djschny commented May 9, 2016

Note I am also receiving a similar error with _cat/indices on ES 2.3.2. It happens when I'm running a node with multiple path.data entries and remove one of the drives backing path.data:

{
   "error": {
      "root_cause": [
         {
            "type": "null_pointer_exception",
            "reason": null
         }
      ],
      "type": "null_pointer_exception",
      "reason": null
   },
   "status": 500
}

With the following stack trace from the node:

[2016-05-09 11:40:24,520][WARN ][rest.suppressed          ] /_cat/indices Params: {}
java.lang.NullPointerException
    at org.elasticsearch.rest.action.cat.RestIndicesAction.buildTable(RestIndicesAction.java:331)
    at org.elasticsearch.rest.action.cat.RestIndicesAction.access$100(RestIndicesAction.java:52)
    at org.elasticsearch.rest.action.cat.RestIndicesAction$1$1$1.buildResponse(RestIndicesAction.java:97)
    at org.elasticsearch.rest.action.cat.RestIndicesAction$1$1$1.buildResponse(RestIndicesAction.java:94)
    at org.elasticsearch.rest.action.support.RestResponseListener.processResponse(RestResponseListener.java:43)
    at org.elasticsearch.rest.action.support.RestActionListener.onResponse(RestActionListener.java:49)
    at org.elasticsearch.action.support.TransportAction$1.onResponse(TransportAction.java:89)
    at org.elasticsearch.action.support.TransportAction$1.onResponse(TransportAction.java:85)
    at org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$AsyncAction.onCompletion(TransportBroadcastByNodeAction.java:378)
    at org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$AsyncAction.onNodeResponse(TransportBroadcastByNodeAction.java:347)
    at org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$AsyncAction$1.handleResponse(TransportBroadcastByNodeAction.java:319)
    at org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$AsyncAction$1.handleResponse(TransportBroadcastByNodeAction.java:311)
    at org.elasticsearch.transport.TransportService$DirectResponseChannel.processResponse(TransportService.java:819)
    at org.elasticsearch.transport.TransportService$DirectResponseChannel.sendResponse(TransportService.java:803)
    at org.elasticsearch.transport.TransportService$DirectResponseChannel.sendResponse(TransportService.java:793)
    at org.elasticsearch.transport.DelegatingTransportChannel.sendResponse(DelegatingTransportChannel.java:58)
    at org.elasticsearch.transport.RequestHandlerRegistry$TransportChannelWrapper.sendResponse(RequestHandlerRegistry.java:134)
    at org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$BroadcastByNodeTransportRequestHandler.messageReceived(TransportBroadcastByNodeAction.java:412)
    at org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$BroadcastByNodeTransportRequestHandler.messageReceived(TransportBroadcastByNodeAction.java:386)
    at org.elasticsearch.transport.TransportRequestHandler.messageReceived(TransportRequestHandler.java:33)
    at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:75)
    at org.elasticsearch.transport.TransportService$4.doRun(TransportService.java:376)
    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
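
For context, on ES 2.x a node with multiple data paths can be started along these lines (paths are illustrative); the NPE above surfaced after one of those backing drives was removed:

    # ES 2.x passes settings as system properties; paths are illustrative.
    bin/elasticsearch -Des.path.data=/mnt/disk1,/mnt/disk2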

ywelsch added a commit to ywelsch/elasticsearch that referenced this issue May 24, 2016
ywelsch added a commit that referenced this issue May 25, 2016
Closed indices are already displayed when no indices are explicitly selected. This commit ensures that closed indices are also shown when wildcard filtering is used.

It also addresses another issue, caused by the fact that the cat action is internally based on three different cluster states (one when we query the cluster state to get all indices, one when we query cluster health, and one when we query indices stats). We currently fail the cat request when the user specifies as a parameter a concrete index that does not exist. The implementation works as intended in that regard: it checks this not only for the first cluster state request but also for the subsequent indices stats request, so if the index is deleted before the cat action has queried the indices stats, it rightfully fails.

When the user provides wildcards (or no parameter at all), however, we pass the resolved concrete indices to the indices stats request and fail to distinguish whether these indices were resolved from wildcards or explicitly requested by the user. This means that if an index is deleted before the indices stats request executes, we fail the overall cat request. The fix is to let the indices stats request do the resolving again instead of passing it the concrete indices.

Closes #16419
Closes #17395
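
The user-visible effect of the fix can be sketched with curl (host and index names illustrative): wildcard or unqualified requests no longer fail when a matched index disappears mid-request, while an explicitly named missing index still returns 404.

    # After the fix, the indices stats step re-resolves the wildcard itself,
    # so a concurrently deleted index is skipped rather than failing the call.
    curl -s 'http://localhost:9200/_cat/indices/testindex*'

    # An explicitly named index that does not exist still fails, by design:
    curl -s 'http://localhost:9200/_cat/indices/testindex3'   # 404 index_not_found_exception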
ywelsch added a commit that referenced this issue May 25, 2016
ywelsch added a commit that referenced this issue May 25, 2016
@lcawl lcawl added :Distributed Indexing/Distributed A catch all label for anything in the Distributed Area. Please avoid if you can. and removed :Allocation labels Feb 13, 2018
@clintongormley clintongormley added :Distributed Coordination/Allocation All issues relating to the decision making around placing a shard (both master logic & on the nodes) and removed :Distributed Indexing/Distributed A catch all label for anything in the Distributed Area. Please avoid if you can. labels Feb 14, 2018