
[CI] Timeout failure in class CloseWhileRelocatingShardsIT, test testCloseWhileRelocatingShards #53544

Closed
williamrandolph opened this issue Mar 13, 2020 · 1 comment · Fixed by #53981
Assignees: tlrx
Labels: :Distributed Indexing/Recovery (anything around constructing a new shard, either from a local or a remote source), >test-failure (triaged test failures from CI), v8.0.0-alpha1

Comments

@williamrandolph (Contributor)

I see three CI test failures on master that seem to share a pattern. I can't reproduce the failure locally.

There is a lot of failure output, including three distinct exceptions. The first is an AssertionError about a timeout:

org.elasticsearch.indices.state.CloseWhileRelocatingShardsIT > testCloseWhileRelocatingShards FAILED
  java.lang.AssertionError: timed out waiting for green state
  at org.junit.Assert.fail(Assert.java:88)
  at org.elasticsearch.test.ESIntegTestCase.ensureColor(ESIntegTestCase.java:887) 
  at org.elasticsearch.test.ESIntegTestCase.ensureGreen(ESIntegTestCase.java:840) 
  at org.elasticsearch.test.ESIntegTestCase.ensureGreen(ESIntegTestCase.java:829) 
  at org.elasticsearch.indices.state.CloseWhileRelocatingShardsIT.testCloseWhileRelocatingShards(CloseWhileRelocatingShardsIT.java:120)

The second is an assertion error about pending shard operations:

java.lang.AssertionError: shard [index-3][0] on node [node_sd1] has pending operations:
  --> BulkShardRequest [[index-3][0]] containing [5] requests
  at org.elasticsearch.index.shard.IndexShardOperationPermits.acquire(IndexShardOperationPermits.java:233)
  at org.elasticsearch.index.shard.IndexShard.acquirePrimaryOperationPermit(IndexShard.java:2689)
  at org.elasticsearch.action.support.replication.TransportReplicationAction.acquirePrimaryOperationPermit(TransportReplicationAction.java:835)
  at org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncPrimaryAction.doRun(TransportReplicationAction.java:292)
  at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
  at org.elasticsearch.action.support.replication.TransportReplicationAction.handlePrimaryRequest(TransportReplicationAction.java:255)
  at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:63)
  at org.elasticsearch.transport.TransportService$7.doRun(TransportService.java:696)
  at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:747)
  at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
  at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
  at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
  at java.base/java.lang.Thread.run(Thread.java:834)
  at org.elasticsearch.test.InternalTestCluster.lambda$assertNoPendingIndexOperations$11(InternalTestCluster.java:1157)
  at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:887)
  at org.elasticsearch.test.InternalTestCluster.assertNoPendingIndexOperations(InternalTestCluster.java:1148)
  at org.elasticsearch.test.InternalTestCluster.beforeIndexDeletion(InternalTestCluster.java:1142) 
  at org.elasticsearch.test.ESIntegTestCase.beforeIndexDeletion(ESIntegTestCase.java:575) 
  at org.elasticsearch.test.ESIntegTestCase.afterInternal(ESIntegTestCase.java:545)
  at org.elasticsearch.test.ESIntegTestCase.cleanUpCluster(ESIntegTestCase.java:1936)
  […]

Finally, there's a thread leak error:

org.elasticsearch.indices.state.CloseWhileRelocatingShardsIT > classMethod FAILED
    com.carrotsearch.randomizedtesting.ThreadLeakError: 3 threads leaked from SUITE scope at org.elasticsearch.indices.state.CloseWhileRelocatingShardsIT: 
       1) Thread[id=1114, name=Thread-6, state=RUNNABLE, group=TGRP-CloseWhileRelocatingShardsIT]
            at java.base@11.0.2/java.util.Spliterators$ArraySpliterator.tryAdvance(Spliterators.java:958)
            at java.base@11.0.2/java.util.Spliterators$1Adapter.hasNext(Spliterators.java:681)
            at app//org.elasticsearch.test.BackgroundIndexer$1.run(BackgroundIndexer.java:160)
       2) Thread[id=1109, name=Thread-5, state=WAITING, group=TGRP-CloseWhileRelocatingShardsIT]
            at java.base@11.0.2/jdk.internal.misc.Unsafe.park(Native Method)
            at java.base@11.0.2/java.util.concurrent.locks.LockSupport.park(LockSupport.java:194)
            at java.base@11.0.2/java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:885)
            at java.base@11.0.2/java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1039)
            at java.base@11.0.2/java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1345)
            at app//org.elasticsearch.common.util.concurrent.BaseFuture$Sync.get(BaseFuture.java:252)
            at app//org.elasticsearch.common.util.concurrent.BaseFuture.get(BaseFuture.java:87)
            at app//org.elasticsearch.common.util.concurrent.FutureUtils.get(FutureUtils.java:56)
            at app//org.elasticsearch.action.support.AdapterActionFuture.actionGet(AdapterActionFuture.java:37)
            at app//org.elasticsearch.action.ActionRequestBuilder.get(ActionRequestBuilder.java:52)
            at app//org.elasticsearch.test.BackgroundIndexer$1.run(BackgroundIndexer.java:159)
       3) Thread[id=1090, name=Thread-4, state=WAITING, group=TGRP-CloseWhileRelocatingShardsIT]
            at java.base@11.0.2/jdk.internal.misc.Unsafe.park(Native Method)
            at java.base@11.0.2/java.util.concurrent.locks.LockSupport.park(LockSupport.java:194)
            at java.base@11.0.2/java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:885)
            at java.base@11.0.2/java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1039)
            at java.base@11.0.2/java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1345)
            at app//org.elasticsearch.common.util.concurrent.BaseFuture$Sync.get(BaseFuture.java:252)
            at app//org.elasticsearch.common.util.concurrent.BaseFuture.get(BaseFuture.java:87)
            at app//org.elasticsearch.common.util.concurrent.FutureUtils.get(FutureUtils.java:56)
            at app//org.elasticsearch.action.support.AdapterActionFuture.actionGet(AdapterActionFuture.java:37)
            at app//org.elasticsearch.action.ActionRequestBuilder.get(ActionRequestBuilder.java:52)
            at app//org.elasticsearch.test.BackgroundIndexer$1.run(BackgroundIndexer.java:159)
        at __randomizedtesting.SeedInfo.seed([6350B3BE3F809D73]:0)

There is a lot of other log output in the build logs.

Links to build scans:

It looks like the last time we addressed a test failure here was #49403.

@williamrandolph added the >test-failure, :Distributed Indexing/Recovery, and v8.0.0 labels on Mar 13, 2020
@elasticmachine (Collaborator)

Pinging @elastic/es-distributed (:Distributed/Recovery)

@tlrx tlrx self-assigned this Mar 18, 2020
tlrx added a commit to tlrx/elasticsearch that referenced this issue Mar 23, 2020
tlrx added a commit that referenced this issue Mar 23, 2020
The test in CloseWhileRelocatingShardsIT recently failed multiple times (3) while waiting for the initial indices to become green. Looking at the execution logs from #53544, the failures appear at the very beginning of the test and when the WindowsFS file system is picked up (which is known to slow down tests).

This commit simply increases the timeout for the first ensureGreen() to 60 seconds. If the test continues to fail, we might want to try a larger timeout or disable WindowsFS for this test.

Closes #53544
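
For reference, a minimal sketch (not the actual diff from #53981) of how the first health check in CloseWhileRelocatingShardsIT#testCloseWhileRelocatingShards could be given a longer timeout, assuming the ensureGreen(TimeValue, String...) overload from ESIntegTestCase; the indices variable is a placeholder for the index names created by the test:

    import org.elasticsearch.common.unit.TimeValue;

    // Before: waits for a green cluster with the default ensureGreen() timeout,
    // which can be too short when the slow WindowsFS test file system is in use.
    // ensureGreen(indices);

    // After: allow the initial indices up to 60 seconds to reach green health.
    ensureGreen(TimeValue.timeValueSeconds(60L), indices);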
tlrx added a commit that referenced this issue Mar 23, 2020
tlrx added a commit that referenced this issue Mar 23, 2020