[CI] TooManyJobsIT.testCloseFailedJob fails on 7.x #54162

Closed
tlrx opened this issue Mar 25, 2020 · 7 comments
Labels
:ml Machine learning >test-failure Triaged test failures from CI

Comments

@tlrx
Member

tlrx commented Mar 25, 2020

The test TooManyJobsIT.testCloseFailedJob failed today on 7.x with the following error:

org.elasticsearch.xpack.ml.integration.TooManyJobsIT > testCloseFailedJob FAILED
    java.util.concurrent.ExecutionException: java.util.NoSuchElementException
        at __randomizedtesting.SeedInfo.seed([8417034E9E7F8197:BA3760508E9D6C8C]:0)
        at org.elasticsearch.common.util.concurrent.BaseFuture$Sync.getValue(BaseFuture.java:266)
        at org.elasticsearch.common.util.concurrent.BaseFuture$Sync.get(BaseFuture.java:253)
        at org.elasticsearch.common.util.concurrent.BaseFuture.get(BaseFuture.java:87)
        at org.elasticsearch.xpack.ml.integration.TooManyJobsIT.testCloseFailedJob(TooManyJobsIT.java:38)

Build scan: https://gradle-enterprise.elastic.co/s/zzidgoauwly4c/

It does not reproduce with:

./gradlew ':x-pack:plugin:ml:internalClusterTest' --tests "org.elasticsearch.xpack.ml.integration.TooManyJobsIT.testCloseFailedJob" -Dtests.seed=360EA21AB8CEB6E1 -Dtests.security.manager=true -Dtests.locale=es-BO -Dtests.timezone=America/Matamoros -Dcompiler.java=13

Looking at the build stats, this test failed 3 times in the last 30 days, always on 7.x.

tlrx added the ">test-failure" (Triaged test failures from CI) and ":ml" (Machine learning) labels on Mar 25, 2020
@elasticmachine
Collaborator

Pinging @elastic/ml-core (:ml)

tlrx changed the title from "[CI] TooManyJobsIT.testSingleNode fails on 7.x" to "[CI] TooManyJobsIT.testCloseFailedJob fails on 7.x" on Mar 25, 2020
tlrx added a commit that referenced this issue Mar 25, 2020
@droberts195
Contributor

  1. https://gradle-enterprise.elastic.co/s/zzidgoauwly4c is "Timed out waiting for the ML templates to be installed"
  2. https://gradle-enterprise.elastic.co/s/5grsxebyw367a is "Error creating ML annotations index or aliases: org.elasticsearch.index.IndexNotFoundException: no such index [.ml-annotations-6]"
  3. https://gradle-enterprise.elastic.co/s/y7jxn5r7z64gg/ is a java.util.NoSuchElementException from
    client().execute(PutJobAction.INSTANCE, putJobRequest).get();

@davidkyle
Member

The telling stack trace is in the console log:


2> java.util.concurrent.ExecutionException: java.util.NoSuchElementException
    at __randomizedtesting.SeedInfo.seed([8417034E9E7F8197:BA3760508E9D6C8C]:0)
    at org.elasticsearch.common.util.concurrent.BaseFuture$Sync.getValue(BaseFuture.java:266)
    at org.elasticsearch.common.util.concurrent.BaseFuture$Sync.get(BaseFuture.java:253)
    at org.elasticsearch.common.util.concurrent.BaseFuture.get(BaseFuture.java:87)
    at org.elasticsearch.xpack.ml.integration.TooManyJobsIT.testCloseFailedJob(TooManyJobsIT.java:38)

Caused by: java.util.NoSuchElementException
    at com.carrotsearch.hppc.AbstractIterator.next(AbstractIterator.java:41)
    at org.elasticsearch.xpack.ml.job.persistence.JobResultsProvider.lambda$getLatestIndexMappingsAndAddTerms$4(JobResultsProvider.java:337)
    at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:63)
    at org.elasticsearch.action.support.ContextPreservingActionListener.onResponse(ContextPreservingActionListener.java:43)
    at org.elasticsearch.action.support.TransportAction$1.onResponse(TransportAction.java:70)
    at org.elasticsearch.action.support.TransportAction$1.onResponse(TransportAction.java:64)
    at org.elasticsearch.action.ActionListener$2.onResponse(ActionListener.java:89)
    at org.elasticsearch.action.admin.indices.mapping.get.TransportGetMappingsAction.doMasterOperation(TransportGetMappingsAction.java:81)
    at org.elasticsearch.action.admin.indices.mapping.get.TransportGetMappingsAction.doMasterOperation(TransportGetMappingsAction.java:42)
    at org.elasticsearch.action.support.master.info.TransportClusterInfoAction.masterOperation(TransportClusterInfoAction.java:51)
    at org.elasticsearch.action.support.master.info.TransportClusterInfoAction.masterOperation(TransportClusterInfoAction.java:33)
    at org.elasticsearch.action.support.master.TransportMasterNodeAction.masterOperation(TransportMasterNodeAction.java:99)
    at org.elasticsearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction.lambda$doStart$3(TransportMasterNodeAction.java:170)
    at org.elasticsearch.action.ActionRunnable$2.doRun(ActionRunnable.java:73)
    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
    at org.elasticsearch.common.util.concurrent.EsExecutors$DirectExecutorService.execute(EsExecutors.java:225)
    at org.elasticsearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction.doStart(TransportMasterNodeAction.java:170)
    at org.elasticsearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction.start(TransportMasterNodeAction.java:133)
    at org.elasticsearch.action.support.master.TransportMasterNodeAction.doExecute(TransportMasterNodeAction.java:110)
    at org.elasticsearch.action.support.master.TransportMasterNodeAction.doExecute(TransportMasterNodeAction.java:59)
    at org.elasticsearch.action.support.TransportAction$RequestFilterChain.proceed(TransportAction.java:153)
    at org.elasticsearch.action.support.TransportAction.execute(TransportAction.java:129)
    at org.elasticsearch.action.support.TransportAction.execute(TransportAction.java:64)
    at org.elasticsearch.client.node.NodeClient.executeLocally(NodeClient.java:83)
    at org.elasticsearch.client.node.NodeClient.doExecute(NodeClient.java:72)
    at org.elasticsearch.client.support.AbstractClient.execute(AbstractClient.java:396)
    at org.elasticsearch.client.support.AbstractClient$IndicesAdmin.execute(AbstractClient.java:1231)
    at org.elasticsearch.client.support.AbstractClient$IndicesAdmin.getMappings(AbstractClient.java:1437)
    at org.elasticsearch.xpack.core.ClientHelper.executeAsyncWithOrigin(ClientHelper.java:76)
    at org.elasticsearch.xpack.ml.job.persistence.JobResultsProvider.getLatestIndexMappingsAndAddTerms(JobResultsProvider.java:344)
    at org.elasticsearch.xpack.ml.job.persistence.JobResultsProvider.lambda$createJobResultIndex$2(JobResultsProvider.java:311)
    at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:63)
    at org.elasticsearch.action.support.ContextPreservingActionListener.onResponse(ContextPreservingActionListener.java:43)
    at org.elasticsearch.action.support.TransportAction$1.onResponse(TransportAction.java:70)
    at org.elasticsearch.action.support.TransportAction$1.onResponse(TransportAction.java:64)
    at org.elasticsearch.action.ActionListener$2.onResponse(ActionListener.java:89)
    at org.elasticsearch.action.ActionListener$4.onResponse(ActionListener.java:163)
    at org.elasticsearch.cluster.metadata.MetaDataCreateIndexService.lambda$createIndex$2(MetaDataCreateIndexService.java:275)
    at org.elasticsearch.action.support.ActiveShardsObserver$1.onNewClusterState(ActiveShardsObserver.java:84)
    at org.elasticsearch.cluster.ClusterStateObserver$ContextPreservingListener.onNewClusterState(ClusterStateObserver.java:311)
    at org.elasticsearch.cluster.ClusterStateObserver$ObserverClusterStateListener.clusterChanged(ClusterStateObserver.java:196)
    at org.elasticsearch.cluster.service.ClusterApplierService.lambda$callClusterStateListeners$6(ClusterApplierService.java:527)
    at java.util.concurrent.ConcurrentHashMap$KeySpliterator.forEachRemaining(ConcurrentHashMap.java:3527)
    at java.util.stream.Streams$ConcatSpliterator.forEachRemaining(Streams.java:743)
    at java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:580)
    at org.elasticsearch.cluster.service.ClusterApplierService.callClusterStateListeners(ClusterApplierService.java:523)
    at org.elasticsearch.cluster.service.ClusterApplierService.applyChanges(ClusterApplierService.java:498)
    at org.elasticsearch.cluster.service.ClusterApplierService.runTask(ClusterApplierService.java:432)
    at org.elasticsearch.cluster.service.ClusterApplierService.access$100(ClusterApplierService.java:73)
    at org.elasticsearch.cluster.service.ClusterApplierService$UpdateTask.run(ClusterApplierService.java:176)
    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:693)
    at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:252)
    at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:215)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

The exception is thrown here because GetMappings returned an empty index mapping.

MappingMetaData typeMappings = indexMappings.iterator().next().value;

The function is called by an action listener in response to the .ml-anomalies-X index being created. This code adds extra mappings to the index after it has been created from a template.
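
To illustrate the failure mode described above, here is a minimal sketch that uses plain java.util collections as a stand-in for the HPPC map seen in the stack trace. The class and method names are hypothetical and are not part of JobResultsProvider; the guarded variant only shows one way such an access could be made safe, and is not the fix that was applied for this issue.

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;
import java.util.NoSuchElementException;
import java.util.Optional;

// Hypothetical stand-in: index name -> mapping source, instead of the real
// HPPC-based map of MappingMetaData returned by the get-mappings call.
final class FirstMappingSketch {

    // Mirrors the shape of "indexMappings.iterator().next()": calling next()
    // on the iterator of an empty collection throws NoSuchElementException.
    static String firstMappingUnchecked(Map<String, String> indexMappings) {
        return indexMappings.values().iterator().next();
    }

    // Guarded variant: check hasNext() first and report absence explicitly.
    static Optional<String> firstMapping(Map<String, String> indexMappings) {
        Iterator<String> it = indexMappings.values().iterator();
        return it.hasNext() ? Optional.of(it.next()) : Optional.empty();
    }

    public static void main(String[] args) {
        Map<String, String> noMappings = new HashMap<>();
        try {
            firstMappingUnchecked(noMappings);
        } catch (NoSuchElementException e) {
            System.out.println("unchecked access failed on an empty mapping");
        }
        System.out.println("guarded access present: " + firstMapping(noMappings).isPresent());
    }
}
```

With the guarded variant the caller can turn an empty response into a descriptive error rather than a bare NoSuchElementException.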

What is nice is that the single node in the test is a master node and TransportGetMappingsAction executes in the calling thread, so we see the full stack trace: from MetaDataCreateIndexService responding to the index creation, through the call to TransportGetMappingsAction via the NodeClient, to the empty index mapping being iterated.

Could it be the index was created without any mappings?

From the log file:
1> [2020-03-20T11:42:23,587][INFO ][o.e.c.m.MetaDataCreateIndexService] [node_t1] [.ml-anomalies-shared] creating index, cause [api], templates [], shards [1]/[1], mappings []

Note the empty mappings. Compare with:

1> [2020-03-20T11:42:23,702][INFO ][o.e.c.m.MetaDataCreateIndexService] [node_t1] [.ml-annotations-6] creating index, cause [api], templates [], shards [1]/[1], mappings [_doc]

which has a mapping for _doc
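
The comparison of the two log lines above can be automated when scanning a long console log. The following is a rough sketch only: the regular expression is an assumption based on the two "creating index" lines quoted in this comment, and the class and method names are hypothetical.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for spotting indices that were created with no mappings.
final class CreateIndexLogSketch {

    // Group 1: index name, group 2: contents of the trailing "mappings [...]".
    // The pattern is inferred from the two log lines quoted in this issue.
    private static final Pattern CREATE_INDEX = Pattern.compile(
        "\\[([^\\]]+)\\] creating index,.*mappings \\[([^\\]]*)\\]");

    // Returns the index name if the line reports an index created with empty mappings.
    static Optional<String> indexCreatedWithoutMappings(String logLine) {
        Matcher m = CREATE_INDEX.matcher(logLine);
        if (m.find() && m.group(2).isEmpty()) {
            return Optional.of(m.group(1));
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        String line = "1> [2020-03-20T11:42:23,587][INFO ][o.e.c.m.MetaDataCreateIndexService] "
            + "[node_t1] [.ml-anomalies-shared] creating index, cause [api], templates [], "
            + "shards [1]/[1], mappings []";
        System.out.println(indexCreatedWithoutMappings(line));
    }
}
```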

@droberts195
Contributor

So all three failures are in some way related to templates even though the final error that causes failure is different in each.

Given that this problem is happening on 7.x but not 7.6, it is probably related to the changes of #51765. If so it will be a problem in 7.7 as well. Before 7.7 GA we need to assess whether it's purely a test setup problem or whether it's a problem that could affect users too.

@davidkyle
Member

The problem is in the test; it may have been exposed by #51765 or another change.

The test waits for the templates to be installed in a @Before method, but then the cluster is resized in startMlCluster(int, int).

In some cases the master node changes and the new master is a brand new node that does not yet have the templates installed. We don't wait for templates on the new master.

I don't know why this wasn't an issue before. Everything fails in CI eventually, but I haven't seen this before.
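
The general shape of the remedy, waiting again after the cluster has been resized, can be sketched as a polling loop. This is a plain JDK illustration under stated assumptions, not the test framework's own helper and not the code of the actual fix: waitUntil is a hypothetical name, and the condition passed to it stands in for "the ML templates are present on the current master".

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.BooleanSupplier;

// Hypothetical polling helper: re-check a condition until it holds or a timeout expires.
final class WaitForTemplatesSketch {

    // Returns true as soon as the condition holds, false on timeout or interruption.
    static boolean waitUntil(BooleanSupplier condition, long timeoutMillis, long pollMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (true) {
            if (condition.getAsBoolean()) {
                return true;
            }
            if (System.nanoTime() - deadline >= 0) {
                return false;
            }
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt flag
                return false;
            }
        }
    }

    public static void main(String[] args) {
        // Stand-in condition: pretend the templates appear on the third check.
        AtomicInteger checks = new AtomicInteger();
        boolean installed = waitUntil(() -> checks.incrementAndGet() >= 3, 1000, 10);
        System.out.println("templates installed: " + installed);
    }
}
```

The important point is where the wait is placed: it has to run after the cluster has been resized, since a wait performed only in setup says nothing about a master elected later.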

@davidkyle
Member

I pushed a fix to wait on the ML templates in the newly formed cluster in #54162, and the test is unmuted in the backport #54801.

Leaving this issue open for now, as I am not certain the fix addresses the root cause.

@droberts195
Contributor

Closing this since a fix was made and it hasn't been updated since then with more failures.
