
[CI] Timeout failure in class CloseWhileRelocatingShardsIT, test testCloseWhileRelocatingShards #53544

Closed
williamrandolph opened this issue Mar 13, 2020 · 1 comment · Fixed by #53981
Assignees: tlrx
Labels: :Distributed Indexing/Recovery (anything around constructing a new shard, either from a local or a remote source), >test-failure (triaged test failures from CI), v8.0.0-alpha1

Comments

@williamrandolph (Contributor)

I see three CI test failures on master that seem to share a pattern. I can't reproduce the failure locally.

There is a lot of failure output, including three distinct exceptions. The first is an AssertionError about a timeout:

org.elasticsearch.indices.state.CloseWhileRelocatingShardsIT > testCloseWhileRelocatingShards FAILED
  java.lang.AssertionError: timed out waiting for green state
  at org.junit.Assert.fail(Assert.java:88)
  at org.elasticsearch.test.ESIntegTestCase.ensureColor(ESIntegTestCase.java:887) 
  at org.elasticsearch.test.ESIntegTestCase.ensureGreen(ESIntegTestCase.java:840) 
  at org.elasticsearch.test.ESIntegTestCase.ensureGreen(ESIntegTestCase.java:829) 
  at org.elasticsearch.indices.state.CloseWhileRelocatingShardsIT.testCloseWhileRelocatingShards(CloseWhileRelocatingShardsIT.java:120)

The second is an assertion error about pending shard operations:

java.lang.AssertionError: shard [index-3][0] on node [node_sd1] has pending operations:
  --> BulkShardRequest [[index-3][0]] containing [5] requests
  at org.elasticsearch.index.shard.IndexShardOperationPermits.acquire(IndexShardOperationPermits.java:233)
  at org.elasticsearch.index.shard.IndexShard.acquirePrimaryOperationPermit(IndexShard.java:2689)
  at org.elasticsearch.action.support.replication.TransportReplicationAction.acquirePrimaryOperationPermit(TransportReplicationAction.java:835)
  at org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncPrimaryAction.doRun(TransportReplicationAction.java:292)
  at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
  at org.elasticsearch.action.support.replication.TransportReplicationAction.handlePrimaryRequest(TransportReplicationAction.java:255)
  at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:63)
  at org.elasticsearch.transport.TransportService$7.doRun(TransportService.java:696)
  at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:747)
  at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
  at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
  at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
  at java.base/java.lang.Thread.run(Thread.java:834)
  at org.elasticsearch.test.InternalTestCluster.lambda$assertNoPendingIndexOperations$11(InternalTestCluster.java:1157)
  at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:887)
  at org.elasticsearch.test.InternalTestCluster.assertNoPendingIndexOperations(InternalTestCluster.java:1148)
  at org.elasticsearch.test.InternalTestCluster.beforeIndexDeletion(InternalTestCluster.java:1142) 
  at org.elasticsearch.test.ESIntegTestCase.beforeIndexDeletion(ESIntegTestCase.java:575) 
  at org.elasticsearch.test.ESIntegTestCase.afterInternal(ESIntegTestCase.java:545)
  at org.elasticsearch.test.ESIntegTestCase.cleanUpCluster(ESIntegTestCase.java:1936)
  […]

Finally, there's a thread leak error:

org.elasticsearch.indices.state.CloseWhileRelocatingShardsIT > classMethod FAILED
    com.carrotsearch.randomizedtesting.ThreadLeakError: 3 threads leaked from SUITE scope at org.elasticsearch.indices.state.CloseWhileRelocatingShardsIT: 
       1) Thread[id=1114, name=Thread-6, state=RUNNABLE, group=TGRP-CloseWhileRelocatingShardsIT]
            at java.base@11.0.2/java.util.Spliterators$ArraySpliterator.tryAdvance(Spliterators.java:958)
            at java.base@11.0.2/java.util.Spliterators$1Adapter.hasNext(Spliterators.java:681)
            at app//org.elasticsearch.test.BackgroundIndexer$1.run(BackgroundIndexer.java:160)
       2) Thread[id=1109, name=Thread-5, state=WAITING, group=TGRP-CloseWhileRelocatingShardsIT]
            at java.base@11.0.2/jdk.internal.misc.Unsafe.park(Native Method)
            at java.base@11.0.2/java.util.concurrent.locks.LockSupport.park(LockSupport.java:194)
            at java.base@11.0.2/java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:885)
            at java.base@11.0.2/java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1039)
            at java.base@11.0.2/java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1345)
            at app//org.elasticsearch.common.util.concurrent.BaseFuture$Sync.get(BaseFuture.java:252)
            at app//org.elasticsearch.common.util.concurrent.BaseFuture.get(BaseFuture.java:87)
            at app//org.elasticsearch.common.util.concurrent.FutureUtils.get(FutureUtils.java:56)
            at app//org.elasticsearch.action.support.AdapterActionFuture.actionGet(AdapterActionFuture.java:37)
            at app//org.elasticsearch.action.ActionRequestBuilder.get(ActionRequestBuilder.java:52)
            at app//org.elasticsearch.test.BackgroundIndexer$1.run(BackgroundIndexer.java:159)
       3) Thread[id=1090, name=Thread-4, state=WAITING, group=TGRP-CloseWhileRelocatingShardsIT]
            at java.base@11.0.2/jdk.internal.misc.Unsafe.park(Native Method)
            at java.base@11.0.2/java.util.concurrent.locks.LockSupport.park(LockSupport.java:194)
            at java.base@11.0.2/java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:885)
            at java.base@11.0.2/java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1039)
            at java.base@11.0.2/java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1345)
            at app//org.elasticsearch.common.util.concurrent.BaseFuture$Sync.get(BaseFuture.java:252)
            at app//org.elasticsearch.common.util.concurrent.BaseFuture.get(BaseFuture.java:87)
            at app//org.elasticsearch.common.util.concurrent.FutureUtils.get(FutureUtils.java:56)
            at app//org.elasticsearch.action.support.AdapterActionFuture.actionGet(AdapterActionFuture.java:37)
            at app//org.elasticsearch.action.ActionRequestBuilder.get(ActionRequestBuilder.java:52)
            at app//org.elasticsearch.test.BackgroundIndexer$1.run(BackgroundIndexer.java:159)
        at __randomizedtesting.SeedInfo.seed([6350B3BE3F809D73]:0)

There is a lot of other log output in the build logs.

Links to build scans:

It looks like the last time we addressed a test failure here was #49403.

@williamrandolph added the >test-failure, :Distributed Indexing/Recovery, and v8.0.0 labels on Mar 13, 2020
@elasticmachine (Collaborator)

Pinging @elastic/es-distributed (:Distributed/Recovery)

@tlrx tlrx self-assigned this Mar 18, 2020
tlrx added a commit to tlrx/elasticsearch that referenced this issue Mar 23, 2020
tlrx added a commit that referenced this issue Mar 23, 2020
The test in CloseWhileRelocatingShardsIT recently failed multiple times (3) while waiting for the initial indices to become green. Looking at the execution logs from #53544, the failures appear at the very beginning of the test and when the WindowsFS file system is picked up (which is known to slow down tests).

This commit simply increases the timeout for the first ensureGreen() to 60 seconds. If the test continues to fail, we might want to try a larger timeout or disable WindowsFS for this test.

Closes #53544
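
For reference, a minimal sketch (not the actual diff from #53981) of how the first health check in CloseWhileRelocatingShardsIT#testCloseWhileRelocatingShards could be given a longer timeout, assuming the ensureGreen(TimeValue, String...) overload from ESIntegTestCase; the indices variable is a placeholder for the index names created by the test:

    import org.elasticsearch.common.unit.TimeValue;

    // Before: waits for a green cluster with the default ensureGreen() timeout,
    // which can be too short when the slow WindowsFS test file system is in use.
    // ensureGreen(indices);

    // After: allow the initial indices up to 60 seconds to reach green health.
    ensureGreen(TimeValue.timeValueSeconds(60L), indices);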
tlrx added a commit that referenced this issue Mar 23, 2020
tlrx added a commit that referenced this issue Mar 23, 2020