Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

XPack Usage API can return 503 due to missing primary #45250

Closed
Mpdreamz opened this issue Aug 6, 2019 · 6 comments
Closed

XPack Usage API can return 503 due to missing primary #45250

Mpdreamz opened this issue Aug 6, 2019 · 6 comments

Comments

@Mpdreamz
Copy link
Member

Mpdreamz commented Aug 6, 2019

GET /_xpack/usage?pretty=true&error_trace=true

{
  "error" : {
    "root_cause" : [
      {
        "type" : "unavailable_shards_exception",
        "reason" : "at least one primary shard for the security index is unavailable",
        "stack_trace" : "UnavailableShardsException[at least one primary shard for the security index is unavailable]\r\n\tat org.elasticsearch.xpack.security.support.SecurityIndexManager.getUnavailableReason(SecurityIndexManager.java:146)\r\n\tat org.elasticsearch.xpack.security.authc.esnative.NativeUsersStore.getUserCount(NativeUsersStore.java:171)\r\n\tat org.elasticsearch.xpack.security.authc.esnative.NativeRealm.lambda$usageStats$1(NativeRealm.java:53)\r\n\tat org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:61)\r\n\tat org.elasticsearch.xpack.security.authc.support.CachingUsernamePasswordRealm.lambda$usageStats$5(CachingUsernamePasswordRealm.java:203)\r\n\tat org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:61)\r\n\tat org.elasticsearch.xpack.core.security.authc.Realm.usageStats(Realm.java:137)\r\n\tat org.elasticsearch.xpack.security.authc.support.CachingUsernamePasswordRealm.usageStats(CachingUsernamePasswordRealm.java:201)\r\n\tat org.elasticsearch.xpack.security.authc.esnative.NativeRealm.usageStats(NativeRealm.java:52)\r\n\tat org.elasticsearch.xpack.security.authc.Realms.usageStats(Realms.java:243)\r\n\tat org.elasticsearch.xpack.security.SecurityFeatureSet.usage(SecurityFeatureSet.java:144)\r\n\tat org.elasticsearch.xpack.core.action.TransportXPackUsageAction.lambda$masterOperation$0(TransportXPackUsageAction.java:70)\r\n\tat org.elasticsearch.xpack.core.common.IteratingActionListener.run(IteratingActionListener.java:102)\r\n\tat org.elasticsearch.xpack.core.action.TransportXPackUsageAction.masterOperation(TransportXPackUsageAction.java:93)\r\n\tat org.elasticsearch.xpack.core.action.TransportXPackUsageAction.masterOperation(TransportXPackUsageAction.java:31)\r\n\tat org.elasticsearch.action.support.master.TransportMasterNodeAction.masterOperation(TransportMasterNodeAction.java:127)\r\n\tat org.elasticsearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction$2.doRun(TransportMasterNodeAction.java:208)\r\n\tat org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:751)\r\n\tat org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)\r\n\tat java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)\r\n\tat java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)\r\n\tat java.base/java.lang.Thread.run(Thread.java:835)\r\n"
      }
    ],
    "type" : "unavailable_shards_exception",
    "reason" : "at least one primary shard for the security index is unavailable",
    "stack_trace" : "UnavailableShardsException[at least one primary shard for the security index is unavailable]\r\n\tat org.elasticsearch.xpack.security.support.SecurityIndexManager.getUnavailableReason(SecurityIndexManager.java:146)\r\n\tat org.elasticsearch.xpack.security.authc.esnative.NativeUsersStore.getUserCount(NativeUsersStore.java:171)\r\n\tat org.elasticsearch.xpack.security.authc.esnative.NativeRealm.lambda$usageStats$1(NativeRealm.java:53)\r\n\tat org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:61)\r\n\tat org.elasticsearch.xpack.security.authc.support.CachingUsernamePasswordRealm.lambda$usageStats$5(CachingUsernamePasswordRealm.java:203)\r\n\tat org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:61)\r\n\tat org.elasticsearch.xpack.core.security.authc.Realm.usageStats(Realm.java:137)\r\n\tat org.elasticsearch.xpack.security.authc.support.CachingUsernamePasswordRealm.usageStats(CachingUsernamePasswordRealm.java:201)\r\n\tat org.elasticsearch.xpack.security.authc.esnative.NativeRealm.usageStats(NativeRealm.java:52)\r\n\tat org.elasticsearch.xpack.security.authc.Realms.usageStats(Realms.java:243)\r\n\tat org.elasticsearch.xpack.security.SecurityFeatureSet.usage(SecurityFeatureSet.java:144)\r\n\tat org.elasticsearch.xpack.core.action.TransportXPackUsageAction.lambda$masterOperation$0(TransportXPackUsageAction.java:70)\r\n\tat org.elasticsearch.xpack.core.common.IteratingActionListener.run(IteratingActionListener.java:102)\r\n\tat org.elasticsearch.xpack.core.action.TransportXPackUsageAction.masterOperation(TransportXPackUsageAction.java:93)\r\n\tat org.elasticsearch.xpack.core.action.TransportXPackUsageAction.masterOperation(TransportXPackUsageAction.java:31)\r\n\tat org.elasticsearch.action.support.master.TransportMasterNodeAction.masterOperation(TransportMasterNodeAction.java:127)\r\n\tat org.elasticsearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction$2.doRun(TransportMasterNodeAction.java:208)\r\n\tat org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:751)\r\n\tat org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)\r\n\tat java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)\r\n\tat java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)\r\n\tat java.base/java.lang.Thread.run(Thread.java:835)\r\n"
  },
  "status" : 503
}

Hitting this intermittently on our CI, do not have an exact reproduce playlist.

Mpdreamz added a commit to elastic/elasticsearch-net that referenced this issue Aug 6, 2019
@jasontedor
Copy link
Member

It looks like it's the security index that is unavailable. Are you waiting until the cluster is green before you start firing requests after it bootstraps for the first time?

@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-core-features

@Mpdreamz
Copy link
Member Author

Mpdreamz commented Aug 6, 2019

All integration tests wait for yellow but this particular request was issued after the node had been running for a while:

[2019-08-06T16:05:11,694][INFO ][o.e.n.Node               ] [xpack-node-6a4cb29200] started
[2019-08-06T04:05:11,704][INFO ][Managed Elasticsearch    ]  {ValidateRunningVersion} validating the cluster is running the requested version: 7.0.0
[2019-08-06T04:05:11,706][INFO ][Managed Elasticsearch    ]  {Call} [https://localhost:9200/_cat/nodes?h=version&pretty=true] SSL: True Security: True
[2019-08-06T16:05:12,466][INFO ][o.e.c.m.MetaDataIndexTemplateService] [xpack-node-6a4cb29200] adding template [.watches] for index patterns [.watches*]
[2019-08-06T16:05:12,653][INFO ][o.e.c.m.MetaDataIndexTemplateService] [xpack-node-6a4cb29200] adding template [.watch-history-9] for index patterns [.watcher-history-9*]
[2019-08-06T16:05:12,919][INFO ][o.e.c.m.MetaDataIndexTemplateService] [xpack-node-6a4cb29200] adding template [.monitoring-logstash] for index patterns [.monitoring-logstash-7-*]
[2019-08-06T04:05:13,079][INFO ][Managed Elasticsearch    ]  {ValidateClusterStateTask} waiting cluster to go into yellow health state
[2019-08-06T04:05:13,080][INFO ][Managed Elasticsearch    ]  {Call} [https://localhost:9200/_cluster/health?wait_for_status=yellow&timeout=20s&pretty=true] SSL: True Security: True
[2019-08-06T16:05:13,184][INFO ][o.e.c.m.MetaDataIndexTemplateService] [xpack-node-6a4cb29200] adding template [.monitoring-es] for index patterns [.monitoring-es-7-*]
[2019-08-06T04:05:13,226][INFO ][Managed Elasticsearch    ]  {PostLicenseTask} no license file available to post
[2019-08-06T04:05:13,227][INFO ][Managed Elasticsearch    ]  {PostLicenseTask} attempt to start trial license
[2019-08-06T04:05:13,228][INFO ][Managed Elasticsearch    ]  {Call} [https://localhost:9200/_xpack/license/start_trial?acknowledge=true&pretty=true] SSL: True Security: True
[2019-08-06T16:05:13,344][INFO ][o.e.c.m.MetaDataIndexTemplateService] [xpack-node-6a4cb29200] adding template [.monitoring-beats] for index patterns [.monitoring-beats-7-*]
[2019-08-06T16:05:13,437][INFO ][o.e.c.m.MetaDataIndexTemplateService] [xpack-node-6a4cb29200] adding template [.monitoring-alerts-7] for index patterns [.monitoring-alerts-7]
[2019-08-06T16:05:13,516][INFO ][o.e.c.m.MetaDataIndexTemplateService] [xpack-node-6a4cb29200] adding template [.monitoring-kibana] for index patterns [.monitoring-kibana-7-*]
[2019-08-06T16:05:13,984][INFO ][o.e.l.LicenseService     ] [xpack-node-6a4cb29200] license [c65e72c4-243d-4175-8ef9-bcc37f58d384] mode [basic] - valid
[2019-08-06T16:05:13,984][INFO ][o.e.x.i.a.TransportPutLifecycleAction] [xpack-node-6a4cb29200] adding index lifecycle policy [watch-history-ilm-policy]
[2019-08-06T16:05:14,361][INFO ][o.e.l.LicenseService     ] [xpack-node-6a4cb29200] license [38365b5a-8320-4c61-8c0d-4517b7dd62eb] mode [trial] - valid
[2019-08-06T04:05:14,372][INFO ][Managed Elasticsearch    ]  {ValidateLicenseTask} validating the x-pack license is active
[2019-08-06T04:05:14,374][INFO ][Managed Elasticsearch    ]  {Call} [https://localhost:9200/_xpack/license?filter_path=license.type&pretty=true] SSL: True Security: True
[2019-08-06T04:05:14,507][INFO ][Managed Elasticsearch    ]  {Call} [https://localhost:9200/_xpack/license?filter_path=license.status&pretty=true] SSL: True Security: True
[2019-08-06T04:05:14,564][INFO ][Managed Elasticsearch    ]  All good! kicking off [XPackCluster] tests now

Messages with section [Managed Elasticsearch ] originate from our test bootstrap.

After a bunch of tests have already run I see the following:

[2019-08-06T16:05:22,245][INFO ][o.e.x.s.s.SecurityIndexManager] [xpack-node-6a4cb29200] security index does not exist. Creating [.security-7] with alias [.security]
[2019-08-06T16:05:22,339][INFO ][o.e.c.m.MetaDataCreateIndexService] [xpack-node-6a4cb29200] [.watches] creating index, cause [auto(bulk api)], templates [.watches, nest_tests], shards [1]/[0], mappings [_doc]
[xUnit.net 00:10:47.3430471]     Tests.XPack.CrossClusterReplication.CrossClusterReplicationAutoFollowTests.GlobalStatsResponse [SKIP]
[xUnit.net 00:10:47.3441663]     Tests.XPack.CrossClusterReplication.CrossClusterReplicationAutoFollowTests.GetReturnsAndIsMapped [SKIP]
Skipped  Tests.XPack.CrossClusterReplication.CrossClusterReplicationAutoFollowTests.GlobalStatsResponse
Skipped  Tests.XPack.CrossClusterReplication.CrossClusterReplicationAutoFollowTests.GetReturnsAndIsMapped
[xUnit.net 00:10:47.3460528]     Tests.XPack.CrossClusterReplication.CrossClusterReplicationAutoFollowTests.DeleteIsAcked [SKIP]
[xUnit.net 00:10:47.3471173]     Tests.XPack.CrossClusterReplication.CrossClusterReplicationAutoFollowTests.CreateIsAcked [SKIP]
Skipped  Tests.XPack.CrossClusterReplication.CrossClusterReplicationAutoFollowTests.DeleteIsAcked
Skipped  Tests.XPack.CrossClusterReplication.CrossClusterReplicationAutoFollowTests.CreateIsAcked
[2019-08-06T16:05:23,679][INFO ][o.e.c.m.MetaDataMappingService] [xpack-node-6a4cb29200] [.security-7/h2cIl6qtR1qYPi-mU4976Q] update_mapping [_doc]
[2019-08-06T16:05:23,992][INFO ][o.e.c.r.a.AllocationService] [xpack-node-6a4cb29200] Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[.watches][0]] ...]).
[2019-08-06T16:05:24,382][INFO ][o.e.x.w.WatcherService   ] [xpack-node-6a4cb29200] reloading watcher, reason [new local watcher shard allocation ids], cancelled [0] queued tasks
[2019-08-06T16:05:24,395][INFO ][o.e.c.m.MetaDataMappingService] [xpack-node-6a4cb29200] [.security-7/h2cIl6qtR1qYPi-mU4976Q] update_mapping [_doc]
[2019-08-06T16:05:24,411][INFO ][o.e.c.m.MetaDataMappingService] [xpack-node-6a4cb29200] [.security-7/h2cIl6qtR1qYPi-mU4976Q] update_mapping [_doc]
[2019-08-06T16:05:24,770][INFO ][o.e.x.s.a.u.TransportPutUserAction] [xpack-node-6a4cb29200] added user [nest-fluentasync-43e6e58d]
[2019-08-06T16:05:24,889][INFO ][o.e.c.m.MetaDataMappingService] [xpack-node-6a4cb29200] [.watches/tu3M-9Y4SlKJnA6Xoual2w] update_mapping [_doc]
[2019-08-06T16:05:25,537][INFO ][o.e.x.s.a.u.TransportPutUserAction] [xpack-node-6a4cb29200] added user [nest-initializer-ec7fe32f]
[2019-08-06T16:05:26,521][INFO ][o.e.x.s.a.u.TransportPutUserAction] [xpack-node-6a4cb29200] added user [nest-initializerasync-0c722209]
[2019-08-06T16:05:26,974][INFO ][o.e.c.m.MetaDataMappingService] [xpack-node-6a4cb29200] [.security-7/h2cIl6qtR1qYPi-mU4976Q] update_mapping [_doc]
[2019-08-06T16:05:27,161][INFO ][o.e.x.s.a.r.TransportPutRoleAction] [xpack-node-6a4cb29200] added role [role-f-ee1d0ee8]
[2019-08-06T16:05:27,193][INFO ][o.e.c.m.MetaDataCreateIndexService] [xpack-node-6a4cb29200] [test] creating index, cause [auto(bulk api)], templates [nest_tests], shards [2]/[0], mappings []
[2019-08-06T16:05:27,333][INFO ][o.e.x.s.a.u.TransportPutUserAction] [xpack-node-6a4cb29200] added user [nest-fluent-1ca3f063]
[xUnit.net 00:10:51.8810514]     Tests.XPack.Info.XPackInfoApiTests.XPackUsageResponse [FAIL]

The first line indicate a new .secuity index being created but the tests does not fail till a a couple of seconds later.

This is a single node on 7.0.0, hardware unknown (azure devops worker node).

@Mpdreamz
Copy link
Member Author

Mpdreamz commented Aug 6, 2019

Should mention we wait for yellow once at the start of the test run after the node has started. We do not wait for yellow before every single test.

@jasontedor
Copy link
Member

I think that Elasticsearch is doing the right thing here. You are trying to collect usage stats. Collecting usage stats involves collecting usage from stats from the realms, including the native realm. That involves collecting user counts from the native users store. If the security index isn't available, then we can't return the native user count. I don't think we should return partial results, I think we should fail the request, as is being done here. I think that you need to adapt your CI appropriately.

Mpdreamz added a commit to elastic/elasticsearch-net that referenced this issue Aug 7, 2019
* Update abstractions to version that prints JAVA_HOME just before starting elasticsearch and forces it on the Process instance

(cherry picked from commit ee496ad)

* update abstractions

(cherry picked from commit 2b409ff)

* update to abstraction versions that takes prereleases into account again with SkipVersion

* Audit trail test assumed timespan is never TimeSpan.Zero but of course can be (#3989)


(cherry picked from commit 79a71ac)

* Rename SkipOnTeamCity to skip on CI

* skip xpack usage/info on CI, tad flakey due to elastic/elasticsearch#45250
@Mpdreamz
Copy link
Member Author

Mpdreamz commented Aug 7, 2019

I am happy to change my CI to deal with potential failures due to the security index not being available.

I do want to report this happens reasonably often and affects other API's too such as the Create API key API which internally does a _search first as well.

Would it be helpful for these to internally to wait for yellow?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants