Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Failure of BasicDistributedJobsIT testFailOverBasics_withDataFeeder test #44566

Closed
przemekwitek opened this issue Jul 18, 2019 · 9 comments · Fixed by #49493
Closed

Failure of BasicDistributedJobsIT testFailOverBasics_withDataFeeder test #44566

przemekwitek opened this issue Jul 18, 2019 · 9 comments · Fixed by #49493
Assignees
Labels
:ml Machine learning >test-failure Triaged test failures from CI

Comments

@przemekwitek
Copy link
Contributor

FAILURE elastic/elasticsearch#master 51fb95e multijob+fast+part2 - 20190718054134-22C92686

org.elasticsearch.xpack.ml.integration.BasicDistributedJobsIT > testFailOverBasics_withDataFeeder FAILED
    java.lang.RuntimeException: Had to resort to force-stopping datafeed, something went wrong?

        Caused by:
        java.util.concurrent.ExecutionException: RemoteTransportException[[node_t1][127.0.0.1:36020][cluster:admin/persistent/update_status]]; nested: ResourceNotFoundException[the task with id datafeed-data_feed_id and allocation id 13 doesn't exist];

            Caused by:
            RemoteTransportException[[node_t1][127.0.0.1:36020][cluster:admin/persistent/update_status]]; nested: ResourceNotFoundException[the task with id datafeed-data_feed_id and allocation id 13 doesn't exist];

                Caused by:
                ResourceNotFoundException[the task with id datafeed-data_feed_id and allocation id 13 doesn't exist]

Log: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+multijob+fast+part2/304/console
Build Scans: https://gradle.com/s/ezjbyxaffinmk

@przemekwitek przemekwitek added >test-failure Triaged test failures from CI :ml Machine learning labels Jul 18, 2019
@elasticmachine
Copy link
Collaborator

Pinging @elastic/ml-core

@davidkyle davidkyle changed the title Failure of org.elasticsearch.xpack.ml.integration.BasicDistributedJobsIT.testFailOverBasics_withDataFeeder test Failure of BasicDistributedJobsIT testFailOverBasics_withDataFeeder test Jul 18, 2019
@davidkyle
Copy link
Member

The test fails because of the clean up to stop datafeeds at the end of the test throws an exception

ResourceNotFoundException[the task with id datafeed-data_feed_id and allocation id 13 doesn't exist]

This comes from PersistentTasksClusterService::updatePersistentTaskState because the task has already been removed. The part of the log file at the task clean up time is this:

[o.e.x.m.i.BasicDistributedJobsIT] [testFailOverBasics_withDataFeeder] Closing all datafeeds (using _all)
[o.e.x.m.d.DatafeedManager] [node_t2] [task has been removed, cancelling locally] attempt to stop datafeed [data_feed_id] [13]
[o.e.x.m.d.DatafeedManager] [node_t2] [task has been removed, cancelling locally] attempt to stop datafeed [data_feed_id] for job [fail-over-basics_with-data-feeder-job]
[o.e.x.m.d.DatafeedManager] [node_t2] [task has been removed, cancelling locally] try lock [0s] to stop datafeed [data_feed_id] for job [fail-over-basics_with-data-feeder-job]...
[o.e.x.m.d.DatafeedManager] [node_t2] [task has been removed, cancelling locally] stopping datafeed [data_feed_id] for job [fail-over-basics_with-data-feeder-job], acquired [true]...
[o.e.x.m.d.DatafeedManager] [node_t2] [task has been removed, cancelling locally] datafeed [data_feed_id] for job [fail-over-basics_with-data-feeder-job] has been stopped
[o.e.p.PersistentTasksClusterService] [node_t1] trying to update state on non-existing task datafeed-data_feed_id
[o.e.p.PersistentTasksClusterService] [node_t1] trying to update state on non-existing task datafeed-data_feed_id
[o.e.p.PersistentTasksNodeService] [node_t2] attempt to complete task [xpack/ml/datafeed[c]] with id [datafeed-data_feed_id] in the [COMPLETED] state

What is interesting is reason for stopping the datafeed task has been removed, cancelling locally this happens when the local executing node notices that a local task has been removed from the global cluster state and cancels the task accordingly. Possibly the task was removed by a call to PersistentTasksService::sendRemoveRequest but I'm not sure how.

The final line attempt to complete task [xpack/ml/datafeed[c]] with id [datafeed-data_feed_id] in the [COMPLETED] state indicates AllocatedPersistentTask::completeAndNotifyIfNeeded has been called more than once (either by AllocatedPersistentTask::markAsCompleted or AllocatedPersistentTask::markAsFailed)

I've attached the log file hopefully someone can make sense of this

testFailOverBasics_withDataFeeder.log

@ebadyano
Copy link
Contributor

@jakelandis
Copy link
Contributor

Another instance: https://gradle-enterprise.elastic.co/s/62ia4c3uyjpx6
17 times in past 30 days.
Can not find the full logs from GC upload (it is missing for some reason)

java.lang.RuntimeException: Had to resort to force-stopping datafeed, something went wrong?
	at __randomizedtesting.SeedInfo.seed([F5B875C8DA1CF6BC:1A0D29331D1BE831]:0)
	at org.elasticsearch.xpack.ml.support.BaseMlIntegTestCase.deleteAllDatafeeds(BaseMlIntegTestCase.java:282)
	at org.elasticsearch.xpack.ml.support.BaseMlIntegTestCase.cleanupWorkaround(BaseMlIntegTestCase.java:195)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:566)
	at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1750)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:996)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at org.apache.lucene.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:49)
	at org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:45)
	at org.apache.lucene.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:48)
	at org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:64)
	at org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:47)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:368)
	at com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:817)
	at com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:468)
	at com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:947)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:832)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:883)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:894)
	at org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:45)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at org.apache.lucene.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:41)
	at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
	at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at org.apache.lucene.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
	at org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:47)
	at org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:64)
	at org.apache.lucene.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:54)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:368)
	at java.base/java.lang.Thread.run(Thread.java:834)
Caused by: java.util.concurrent.ExecutionException: RemoteTransportException[[node_t0][127.0.0.1:44939][cluster:admin/persistent/update_status]]; nested: ResourceNotFoundException[the task with id datafeed-data_feed_id and allocation id 12 doesn't exist];
	at org.elasticsearch.common.util.concurrent.BaseFuture$Sync.getValue(BaseFuture.java:266)
	at org.elasticsearch.common.util.concurrent.BaseFuture$Sync.get(BaseFuture.java:253)
	at org.elasticsearch.common.util.concurrent.BaseFuture.get(BaseFuture.java:87)
	at org.elasticsearch.xpack.ml.support.BaseMlIntegTestCase.deleteAllDatafeeds(BaseMlIntegTestCase.java:270)
	... 35 more
Caused by: RemoteTransportException[[node_t0][127.0.0.1:44939][cluster:admin/persistent/update_status]]; nested: ResourceNotFoundException[the task with id datafeed-data_feed_id and allocation id 12 doesn't exist];
Caused by: ResourceNotFoundException[the task with id datafeed-data_feed_id and allocation id 12 doesn't exist]
	at org.elasticsearch.persistent.PersistentTasksClusterService$4.execute(PersistentTasksClusterService.java:234)
	at org.elasticsearch.cluster.ClusterStateUpdateTask.execute(ClusterStateUpdateTask.java:47)
	at org.elasticsearch.cluster.service.MasterService.executeTasks(MasterService.java:697)
	at org.elasticsearch.cluster.service.MasterService.calculateTaskOutputs(MasterService.java:319)
	at org.elasticsearch.cluster.service.MasterService.runTasks(MasterService.java:214)
	at org.elasticsearch.cluster.service.MasterService$Batcher.run(MasterService.java:151)
	at org.elasticsearch.cluster.service.TaskBatcher.runIfNotProcessed(TaskBatcher.java:150)
	at org.elasticsearch.cluster.service.TaskBatcher$BatchedTask.run(TaskBatcher.java:188)
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:699)
	at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:252)
	at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:215)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.lang.Thread.run(Thread.java:834)
```

@javanna
Copy link
Member

javanna commented Aug 28, 2019

@tvernum
Copy link
Contributor

tvernum commented Sep 4, 2019

Another one (6.8). The message is slightly different, but I'm assuming that it's the same cause.

1> [2019-09-04T05:38:07,687][WARN ][o.e.p.PersistentTasksNodeService] [node_t2] task job-fail-over-basics_with-data-feeder-job failed with an exception
--
1> org.elasticsearch.ResourceNotFoundException: No known job with id 'fail-over-basics_with-data-feeder-job'

jkakavas added a commit to jkakavas/elasticsearch that referenced this issue Sep 24, 2019
@jkakavas
Copy link
Member

This seems to keep failing once or twice over every few days and just failed master intake (https://gradle-enterprise.elastic.co/s/acmvfes3ous5q) so I muted it in #46997

@droberts195
Copy link
Contributor

ResourceNotFoundException[the task with id datafeed-data_feed_id and allocation id 13 doesn't exist]

This is the error I fixed in #49367. I will try unmuting the test that was muted as a result of this issue and see what happens.

droberts195 added a commit to droberts195/elasticsearch that referenced this issue Nov 22, 2019
The problem reported in elastic#44566 should be fixed by the change
that was made in elastic#49367, so the muted test can be unmuted.

Closes elastic#44566
@droberts195 droberts195 self-assigned this Nov 22, 2019
droberts195 added a commit that referenced this issue Nov 22, 2019
The problem reported in #44566 should be fixed by the change
that was made in #49367, so the muted test can be unmuted.

Closes #44566
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:ml Machine learning >test-failure Triaged test failures from CI
Projects
None yet
Development

Successfully merging a pull request may close this issue.

10 participants