
[CI] denying access as action [internal:admin/tasks/ban] is not an index or cluster action #54887

Closed
droberts195 opened this issue Apr 7, 2020 · 4 comments · Fixed by #59027
Labels
:ml Machine learning >test-failure Triaged test failures from CI

Comments

@droberts195
Contributor

droberts195 commented Apr 7, 2020

An ML test failed in https://gradle-enterprise.elastic.co/s/d5tipghjjliwc

The error was:

Failure at [mixed_cluster/90_ml_data_frame_analytics_crud:83]: expected [2xx] status code but api [ml.stop_data_frame_analytics] returned [409 Conflict]:

org.elasticsearch.ElasticsearchStatusException: cannot close data frame analytics [old_cluster_regression_job] because it failed, use force stop instead
    at org.elasticsearch.xpack.core.ml.utils.ExceptionsHelper.conflictStatusException(ExceptionsHelper.java:67)
    at org.elasticsearch.xpack.ml.action.TransportStopDataFrameAnalyticsAction.normalStop(TransportStopDataFrameAnalyticsAction.java:143)
    at org.elasticsearch.xpack.ml.action.TransportStopDataFrameAnalyticsAction.lambda$doExecute$0(TransportStopDataFrameAnalyticsAction.java:111)
    at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:63)
    at org.elasticsearch.xpack.ml.action.TransportStopDataFrameAnalyticsAction.lambda$expandIds$3(TransportStopDataFrameAnalyticsAction.java:131)
    at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:63)
    at org.elasticsearch.xpack.ml.dataframe.persistence.DataFrameAnalyticsConfigProvider.lambda$getMultiple$3(DataFrameAnalyticsConfigProvider.java:133)
    at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:63)
    at org.elasticsearch.action.support.ContextPreservingActionListener.onResponse(ContextPreservingActionListener.java:43)
    at org.elasticsearch.client.node.NodeClient.lambda$executeLocally$0(NodeClient.java:91)
    at org.elasticsearch.tasks.TaskManager$1.onResponse(TaskManager.java:158)
    at org.elasticsearch.tasks.TaskManager$1.onResponse(TaskManager.java:151)
    at org.elasticsearch.action.support.ContextPreservingActionListener.onResponse(ContextPreservingActionListener.java:43)
    at org.elasticsearch.xpack.ml.action.TransportGetDataFrameAnalyticsAction.lambda$doExecute$0(TransportGetDataFrameAnalyticsAction.java:63)
    at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:63)
    at org.elasticsearch.xpack.core.action.AbstractTransportGetResourcesAction$1.onResponse(AbstractTransportGetResourcesAction.java:124)
    at org.elasticsearch.xpack.core.action.AbstractTransportGetResourcesAction$1.onResponse(AbstractTransportGetResourcesAction.java:97)
    at org.elasticsearch.action.support.ContextPreservingActionListener.onResponse(ContextPreservingActionListener.java:43)
    at org.elasticsearch.client.node.NodeClient.lambda$executeLocally$0(NodeClient.java:91)
    at org.elasticsearch.tasks.TaskManager$1.onResponse(TaskManager.java:158)
    at org.elasticsearch.tasks.TaskManager$1.onResponse(TaskManager.java:151)
    at org.elasticsearch.action.support.ContextPreservingActionListener.onResponse(ContextPreservingActionListener.java:43)
    at org.elasticsearch.action.search.AbstractSearchAsyncAction.sendSearchResponse(AbstractSearchAsyncAction.java:545)
    at org.elasticsearch.action.search.ExpandSearchPhase.run(ExpandSearchPhase.java:117)
    at org.elasticsearch.action.search.AbstractSearchAsyncAction.executePhase(AbstractSearchAsyncAction.java:350)
    at org.elasticsearch.action.search.AbstractSearchAsyncAction.executeNextPhase(AbstractSearchAsyncAction.java:344)
    at org.elasticsearch.action.search.FetchSearchPhase.moveToNextPhase(FetchSearchPhase.java:231)
    at org.elasticsearch.action.search.FetchSearchPhase.lambda$innerRun$1(FetchSearchPhase.java:119)
    at org.elasticsearch.action.search.FetchSearchPhase.innerRun(FetchSearchPhase.java:125)
    at org.elasticsearch.action.search.FetchSearchPhase$1.doRun(FetchSearchPhase.java:95)
    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
    at org.elasticsearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:44)
    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:691)
    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.lang.Thread.run(Thread.java:834)

(Status: 409. The response body repeats the same exception and stack trace under both root_cause and the top-level error; the duplicate copy is omitted here.)

However, the server side logs (downloadable from https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+matrix-java-periodic/ES_RUNTIME_JAVA=adoptopenjdk11,nodes=general-purpose/616/gcsObjects/) show the following error was the root cause:

[2020-04-07T05:23:07,037][WARN ][o.e.x.s.a.AuthorizationService] [v8.0.0-1] denying access as action [internal:admin/tasks/ban] is not an index or cluster action

That message is from x-pack/qa/rolling-upgrade/build/testclusters/v8.0.0-1/logs/v8.0.0.log.

It looks like the ban action needs to be called from within the system security context, and this does not always happen (although it must happen most of the time by coincidence, since the test failure is quite rare).

@droberts195 droberts195 added >test-failure Triaged test failures from CI :Distributed Coordination/Task Management Issues for anything around the Tasks API - both persistent and node level. labels Apr 7, 2020
@elasticmachine
Collaborator

Pinging @elastic/es-distributed (:Distributed/Task Management)

@dnhatn dnhatn added :ml Machine learning and removed :Distributed Coordination/Task Management Issues for anything around the Tasks API - both persistent and node level. labels Apr 17, 2020
@elasticmachine
Collaborator

Pinging @elastic/ml-core (:ml)

@dnhatn
Member

dnhatn commented Apr 17, 2020

[2020-04-07T05:23:07,037][WARN ][o.e.x.s.a.AuthorizationService] [v8.0.0-1] denying access as action [internal:admin/tasks/ban] is not an index or cluster action

I took a look at the failure. The warning was logged when we were sending unban requests; thus, it should not be the cause of the failure. I have integrated #55404 to make sure that unban requests will be sent in the right context.
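To illustrate what "the right context" means here: internal actions such as ban/unban requests are only authorized when they run in the system security context, and that context travels with the thread, so a listener that fires on another thread must capture and restore it (this is what ContextPreservingActionListener does). The Python sketch below is an analogy only, not Elasticsearch code: contextvars stands in for ThreadContext, and authorize, send-style helpers, and handle_task_completion are illustrative names.

```python
import contextvars

# Stand-in for Elasticsearch's ThreadContext "system context" flag
# (an analogy, not the real API).
system_context = contextvars.ContextVar("system_context", default=False)

def authorize(action: str) -> bool:
    # Internal task-management actions are permitted only in the
    # system context; ordinary index/cluster actions pass regardless.
    if action.startswith("internal:"):
        return system_context.get()
    return True

def send_unban_unsafe() -> bool:
    # Dispatched without preserving the originating context: the
    # system flag set elsewhere is not visible, so authorization fails.
    return authorize("internal:admin/tasks/ban")

def send_unban_preserved(ctx: contextvars.Context) -> bool:
    # Run inside a captured copy of the originating context, mirroring
    # what ContextPreservingActionListener does for real listeners.
    return ctx.run(authorize, "internal:admin/tasks/ban")

def handle_task_completion():
    system_context.set(True)                 # enter the system context
    captured = contextvars.copy_context()    # capture it for later
    system_context.set(False)                # listener fires elsewhere
    return send_unban_unsafe(), send_unban_preserved(captured)

print(handle_task_completion())  # → (False, True)
```

The unsafe dispatch is denied while the context-preserving one succeeds, which is the same shape as the "denying access" warning above.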

@droberts195 Would you like to continue investigating this failure, or should we close it and wait for the next occurrence? I have removed my assignment and updated the labels, as I think the task framework issue is resolved. Thank you for reporting this :).

@dnhatn dnhatn removed their assignment Apr 17, 2020
@droberts195
Contributor Author

OK, I think the ML problem is probably that we need to wait for indices to have their replicas allocated before upgrading nodes during the rolling upgrade tests. The log for this test contains the message "Updated analytics task state to [failed] with reason [all shards failed]", and this happens in the mixed cluster.

So my guess at the sequence is: the old-cluster part of the failing ML test was randomly selected to run last in the old cluster; it created an index; the node holding that index's primary was then killed for the rolling upgrade; and in the mixed cluster the index was completely lost. We have had this problem in other tests in the past and have solved it by waiting, in the old-cluster tests, for green status on all indices that must survive the rolling upgrade, so that we can be sure each has both a primary and a replica allocated before a node is killed.
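The "wait for green" guard described above amounts to polling cluster health until every index that must survive the upgrade reports green (primary and replicas all assigned) before any node is killed. A minimal language-agnostic sketch of that loop, where get_index_health is a hypothetical stand-in for a call to Elasticsearch's _cluster/health/<index> API:

```python
import time

def wait_for_green(get_index_health, indices, timeout=70.0, interval=0.1):
    """Block until every index reports 'green', or raise on timeout.

    get_index_health is a stand-in for querying Elasticsearch's
    _cluster/health/<index> API and returning its 'status' field.
    """
    deadline = time.monotonic() + timeout
    statuses = {}
    while time.monotonic() < deadline:
        statuses = {index: get_index_health(index) for index in indices}
        if all(s == "green" for s in statuses.values()):
            return statuses
        time.sleep(interval)
    raise TimeoutError(f"indices not green before upgrade: {statuses}")

# Demo with a fake health check that turns green after a few calls.
calls = {"n": 0}
def fake_health(index):
    calls["n"] += 1
    return "green" if calls["n"] > 3 else "yellow"

print(wait_for_green(fake_health, [".ml-state", ".ml-stats"], timeout=5.0))
# → {'.ml-state': 'green', '.ml-stats': 'green'}
```

In the real rolling-upgrade YAML tests the same effect comes from a cluster.health step with wait_for_status: green and a timeout, run at the end of the old-cluster phase.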

droberts195 added a commit to droberts195/elasticsearch that referenced this issue Jul 3, 2020
There have been a few test failures that are likely caused by tests
performing actions that use ML indices immediately after the actions
that create those ML indices.  Currently this can result in attempts
to search the newly created index before its shards have initialized.

This change makes the method that creates the internal ML indices
that have been affected by this problem (state and stats) wait for
the shards to be initialized before returning.

Fixes elastic#54887
Fixes elastic#55221
Fixes elastic#55807
Fixes elastic#57102
Fixes elastic#58841
Fixes elastic#59011
droberts195 added a commit that referenced this issue Jul 6, 2020
…#59027)
