Multiple CCR tests with uncaught exception on windows #44610

Closed
alpar-t opened this issue Jul 19, 2019 · 8 comments
Labels
:Distributed Indexing/CCR Issues around the Cross Cluster State Replication features >test-failure Triaged test failures from CI v7.2.2 v7.3.0 v7.4.0 v8.0.0-alpha1

Comments

@alpar-t
Contributor

alpar-t commented Jul 19, 2019

https://scans.gradle.com/s/vl3yktrv4xey4/tests/htwk6wzdfugzg-ntvlkoitzt6ms?openStackTraces=WzIsMSwwXQ

com.carrotsearch.randomizedtesting.UncaughtExceptionError: Captured an uncaught exception in thread: Thread[id=243, name=elasticsearch[follower0][generic][T#4], state=RUNNABLE, group=TGRP-AutoFollowIT]
at __randomizedtesting.SeedInfo.seed([DE1704EC3D6BE8E2:CD64108D18DC74F7]:0)
Caused by: java.lang.IllegalStateException: java.lang.InterruptedException
at __randomizedtesting.SeedInfo.seed([DE1704EC3D6BE8E2]:0)
at org.elasticsearch.transport.ConnectionManager.close(ConnectionManager.java:259)
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:699)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.lang.Thread.run(Thread.java:834)
Caused by: java.lang.InterruptedException: (No message provided)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1040)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1345)
at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:232)
at org.elasticsearch.transport.ConnectionManager.close(ConnectionManager.java:256)
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:699)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.lang.Thread.run(Thread.java:834)
@alpar-t alpar-t added >test-failure Triaged test failures from CI :Distributed Indexing/CCR Issues around the Cross Cluster State Replication features v8.0.0 labels Jul 19, 2019
@elasticmachine
Collaborator

Pinging @elastic/es-distributed

@alpar-t
Contributor Author

alpar-t commented Jul 19, 2019

It fails for other tests as well (in the same run as in the description):

:x-pack:plugin:ccr:internalClusterTest
org.elasticsearch.xpack.ccr.RestartIndexFollowingIT » classMethod (0.039s)
com.carrotsearch.randomizedtesting.UncaughtExceptionError: Captured an uncaught exception in thread: Thread[id=251, name=elasticsearch[followerm0][generic][T#4], state=RUNNABLE, group=TGRP-RestartIndexFollowingIT]
Caused by: java.lang.IllegalStateException: java.lang.InterruptedException
at __randomizedtesting.SeedInfo.seed([DE1704EC3D6BE8E2]:0)
at org.elasticsearch.transport.ConnectionManager.close(ConnectionManager.java:259)
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:699)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.lang.Thread.run(Thread.java:834)
Caused by: java.lang.InterruptedException: (No message provided)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1040)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1345)
at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:232)
at org.elasticsearch.transport.ConnectionManager.close(ConnectionManager.java:256)
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:699)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.lang.Thread.run(Thread.java:834)

alpar-t added a commit to alpar-t/elasticsearch that referenced this issue Jul 19, 2019
alpar-t added a commit to alpar-t/elasticsearch that referenced this issue Jul 19, 2019
@alpar-t alpar-t changed the title AutoFollowIT » testConflictingPatterns fails with uncaught exception on windows Multiple CCR tests with uncaught exception on windows Jul 19, 2019
@alpar-t
Contributor Author

alpar-t commented Jul 19, 2019

I have multiple runs on Windows on my local CI, and this failure seems to happen only in CCR tests, though in several different ones.

@andrershov
Contributor

@original-brownbear
Member

The problem here is that a transport service is not getting fully started:


1> [2019-07-18T17:26:02,087][WARN ][o.e.t.n.MockNioTransport ] [followerm1] Potentially blocked execution on network thread [elasticsearch[followerm1][transport_worker][T#1]] [WAITING] [2803 milliseconds]:
--
1> java.base@11.0.3/jdk.internal.misc.Unsafe.park(Native Method)
1> java.base@11.0.3/java.util.concurrent.locks.LockSupport.park(LockSupport.java:194)
1> java.base@11.0.3/java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:885)
1> java.base@11.0.3/java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1039)
1> java.base@11.0.3/java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1345)
1> java.base@11.0.3/java.util.concurrent.CountDownLatch.await(CountDownLatch.java:232)
1> app//org.elasticsearch.transport.TransportService.onRequestReceived(TransportService.java:916)
1> app//org.elasticsearch.transport.InboundHandler.handleRequest(InboundHandler.java:161)
1> app//org.elasticsearch.transport.InboundHandler.messageReceived(InboundHandler.java:121)
1> app//org.elasticsearch.transport.InboundHandler.inboundMessage(InboundHandler.java:105)
1> app//org.elasticsearch.transport.TcpTransport.inboundMessage(TcpTransport.java:661)
1> app//org.elasticsearch.transport.TcpTransport.consumeNetworkReads(TcpTransport.java:685)
1> app//org.elasticsearch.transport.nio.MockNioTransport$MockTcpReadWriteHandler.consumeReads(MockNioTransport.java:269)
1> app//org.elasticsearch.nio.SocketChannelContext.handleReadBytes(SocketChannelContext.java:199)
1> app//org.elasticsearch.nio.BytesChannelContext.read(BytesChannelContext.java:40)
1> app//org.elasticsearch.nio.EventHandler.handleRead(EventHandler.java:139)
1> app//org.elasticsearch.transport.nio.TestEventHandler.handleRead(TestEventHandler.java:151)
1> app//org.elasticsearch.nio.NioSelector.handleRead(NioSelector.java:420)

=> we get stuck on the latch, waiting for the transport service to start up properly. @cbuescher and I already investigated a similar issue in #41745 ... so @cbuescher, it seems there might be more to this problem, since we're now seeing it in an IT as well.
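
To illustrate the kind of start-up gate the trace above points at, here is a minimal sketch. It is not the actual `TransportService` code: the class `GatedServiceSketch` and the method `handleRequest` are hypothetical names, and the timed wait is an assumption made so the example terminates. The trace shows an untimed `CountDownLatch.await()`, which parks the thread until the gate opens or the thread is interrupted.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

/** Hypothetical stand-in for a service that gates request handling on start-up. */
final class GatedServiceSketch {
    // Released exactly once, when the service is ready to accept requests.
    private final CountDownLatch started = new CountDownLatch(1);

    /** Opens the gate; any thread parked in handleRequest may proceed. */
    void acceptIncomingRequests() {
        started.countDown();
    }

    /**
     * Waits up to timeoutMillis for the gate to open.
     * Returns true if the gate was open in time, false if the service
     * had not started by then or the waiting thread was interrupted.
     */
    boolean handleRequest(long timeoutMillis) {
        try {
            return started.await(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // keep the interrupt visible to the caller
            return false;
        }
    }

    public static void main(String[] args) {
        GatedServiceSketch neverStarted = new GatedServiceSketch();
        System.out.println("handled before start: " + neverStarted.handleRequest(50));

        GatedServiceSketch startedService = new GatedServiceSketch();
        startedService.acceptIncomingRequests();
        System.out.println("handled after start: " + startedService.handleRequest(50));
    }
}
```

With an untimed wait in place of the timeout, the first call would never return on a node that is never fully started, which matches the "Potentially blocked execution on network thread" warning.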

original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue Jul 19, 2019
* We shouldn't just swallow the interrupt here quietly and keep going on the IO thread
   * Currently an interrupt makes execution continue here in just the same way an invocation of `acceptIncomingRequests` would have
* Relates elastic#44610
@original-brownbear
Member

I opened #44622, which might surface some hidden issue here. It seems to me we're handling requests on integ-test nodes that never fully started, and we're swallowing interrupts, which would explain the situation here (the IO loop never stops cleanly on a stopped, then interrupted, node).
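
As a rough illustration of the difference between swallowing an interrupt and surfacing it, consider the sketch below. It is not the code from #44622: the class and method names are hypothetical, and restoring the flag before throwing is one common convention rather than a statement of what that change does. Wrapping the exception as `new IllegalStateException(e)` does produce the same `IllegalStateException: java.lang.InterruptedException` shape seen in the traces above.

```java
import java.util.concurrent.CountDownLatch;

/** Hypothetical sketch contrasting two ways of handling an interrupt while waiting on a latch. */
final class InterruptHandlingSketch {

    /** Anti-pattern: the interrupt is caught and dropped, so the caller cannot tell it happened. */
    static void awaitSwallowing(CountDownLatch latch) {
        try {
            latch.await();
        } catch (InterruptedException e) {
            // Swallowed: the interrupt flag stays cleared and execution simply continues.
        }
    }

    /** Alternative: restore the flag and fail loudly so an enclosing loop can stop. */
    static void awaitPropagating(CountDownLatch latch) {
        try {
            latch.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    /** Runs one variant with the interrupt flag pre-set and reports whether the flag survived. */
    static boolean flagSurvives(boolean swallow) {
        CountDownLatch neverReleased = new CountDownLatch(1);
        Thread.currentThread().interrupt(); // await() throws immediately when the flag is set
        try {
            if (swallow) {
                awaitSwallowing(neverReleased);
            } else {
                awaitPropagating(neverReleased);
            }
        } catch (IllegalStateException expected) {
            // Thrown only by the propagating variant.
        }
        return Thread.interrupted(); // reads and clears the flag
    }

    public static void main(String[] args) {
        System.out.println("flag survives when swallowed: " + flagSurvives(true));
        System.out.println("flag survives when propagated: " + flagSurvives(false));
    }
}
```

In the swallowing variant the thread carries on as if the latch had been released, which is the behaviour the commit message compares to an invocation of `acceptIncomingRequests`.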

original-brownbear added a commit that referenced this issue Jul 19, 2019
* We shouldn't just swallow the interrupt here quietly and keep going on the IO thread
   * Currently an interrupt makes execution continue here in just the same way an invocation of `acceptIncomingRequests` would have
* Relates #44610
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue Jul 19, 2019
…c#44622)

* We shouldn't just swallow the interrupt here quietly and keep going on the IO thread
   * Currently an interrupt makes execution continue here in just the same way an invocation of `acceptIncomingRequests` would have
* Relates elastic#44610
original-brownbear added a commit that referenced this issue Jul 19, 2019
#44627)

* We shouldn't just swallow the interrupt here quietly and keep going on the IO thread
   * Currently an interrupt makes execution continue here in just the same way an invocation of `acceptIncomingRequests` would have
* Relates #44610
alpar-t added a commit to alpar-t/elasticsearch that referenced this issue Jul 21, 2019
alpar-t added a commit to alpar-t/elasticsearch that referenced this issue Jul 21, 2019
alpar-t added a commit to alpar-t/elasticsearch that referenced this issue Jul 21, 2019
alpar-t added a commit that referenced this issue Jul 22, 2019
* Mute failing test

tracked in #44552

* mute EvilSecurityTests

tracking in #44558

* Fix line endings in ESJsonLayoutTests

* Mute failing ForecastIT  test on windows

Tracking in #44609

* mute AutoFollowIT.testConflictingPatterns

tracking in #44610

* mute BasicRenormalizationIT.testDefaultRenormalization

tracked in #44613

* Revert "mute AutoFollowIT.testConflictingPatterns"

This reverts commit 012de08.

* mute x-pack internal cluster test windows

tracking #44610

* Mute failure unconfigured node name

* fix mute testDefaultRenormalization

* Increase busyWait timeout windows is slow

* Mute JvmErgonomicsTests on windows

Tracking #44669

* mute SharedClusterSnapshotRestoreIT testParallelRestoreOperationsFromSingleSnapshot

Tracking #44671

* Mute NodeTests on Windows

Tracking #44256
alpar-t added a commit that referenced this issue Jul 22, 2019
* Mute failing test

tracked in #44552

* mute EvilSecurityTests

tracking in #44558

* Fix line endings in ESJsonLayoutTests

* Mute failing ForecastIT  test on windows

Tracking in #44609

* mute BasicRenormalizationIT.testDefaultRenormalization

tracked in #44613

* fix mute testDefaultRenormalization

* Increase busyWait timeout windows is slow

* Mute failure unconfigured node name

* mute x-pack internal cluster test windows

tracking #44610

* Mute JvmErgonomicsTests on windows

Tracking #44669

* mute SharedClusterSnapshotRestoreIT testParallelRestoreOperationsFromSingleSnapshot

Tracking #44671

* Mute NodeTests on Windows

Tracking #44256
alpar-t added a commit that referenced this issue Jul 22, 2019
* Mute failing test

tracked in #44552

* mute EvilSecurityTests

tracking in #44558

* Mute failing ForecastIT  test on windows

Tracking in #44609

* mute BasicRenormalizationIT.testDefaultRenormalization

tracked in #44613

* mute x-pack internal cluster test windows

tracking #44610

* Mute failure unconfigured node name

* fix mute testDefaultRenormalization

* Increase busyWait timeout windows is slow

* Mute JvmErgonomicsTests on windows

Tracking #44669

* mute SharedClusterSnapshotRestoreIT testParallelRestoreOperationsFromSingleSnapshot

Tracking #44671

* Mute NodeTests on Windows

Tracking #44256
alpar-t added a commit that referenced this issue Jul 22, 2019
* Mute failing test

tracked in #44552

* mute EvilSecurityTests

tracking in #44558

* Mute failing ForecastIT  test on windows

Tracking in #44609

* mute BasicRenormalizationIT.testDefaultRenormalization

tracked in #44613

* Disable testing conventions on Windows (#43532) (#44506)

Tests are disabled on Windows. Conventions also need to be disabled.

* mute x-pack internal cluster test windows

tracking #44610

* Mute failure unconfigured node name

* fix mute testDefaultRenormalization

* Increase busyWait timeout windows is slow

* Disable task for mute

* Mute JvmErgonomicsTests on windows

Tracking #44669

* mute SharedClusterSnapshotRestoreIT testParallelRestoreOperationsFromSingleSnapshot

Tracking #44671

* Mute NodeTests on Windows

Tracking #44256
@ywelsch
Contributor

ywelsch commented Jul 24, 2019

This could be fixed by #44805 as well.

ywelsch added a commit that referenced this issue Jul 25, 2019
…#44805)

The problem is that RemoteClusterConnection closes the connection manager asynchronously, which races with the thread pool being shut down at the end of the test.

Closes #44339
Closes #44610
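
The race described in this commit message can be sketched in isolation. The example below is an assumption-laden illustration, not the code changed in #44805: `AsyncCloseRaceSketch`, the single-thread pool standing in for the generic thread pool, and the latch standing in for "all connections closed" are all hypothetical. Waiting for the close to finish before shutting the pool down is shown only as one possible way to avoid the race.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical sketch of an asynchronous close racing with thread pool shutdown. */
final class AsyncCloseRaceSketch {

    /** Returns true if the close task was interrupted by the pool shutdown. */
    static boolean closeInterruptedByShutdown(boolean waitForCloseFirst) {
        ExecutorService generic = Executors.newSingleThreadExecutor();
        CountDownLatch closeStarted = new CountDownLatch(1);
        CountDownLatch connectionsClosed = new CountDownLatch(1);
        AtomicBoolean interrupted = new AtomicBoolean(false);

        // The "close" runs on a pool thread and waits for connections to finish closing.
        Future<?> close = generic.submit(() -> {
            closeStarted.countDown();
            try {
                connectionsClosed.await();
            } catch (InterruptedException e) {
                interrupted.set(true); // the situation reported in the stack traces
            }
        });

        try {
            closeStarted.await(); // make the interleaving deterministic for the demo
            if (waitForCloseFirst) {
                connectionsClosed.countDown();  // connections finish closing
                close.get(5, TimeUnit.SECONDS); // block until the close has completed
            }
            generic.shutdownNow(); // interrupts any task that is still running
            generic.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } catch (ExecutionException | TimeoutException e) {
            throw new IllegalStateException(e);
        }
        return interrupted.get();
    }

    public static void main(String[] args) {
        System.out.println("async close interrupted: " + closeInterruptedByShutdown(false));
        System.out.println("awaited close interrupted: " + closeInterruptedByShutdown(true));
    }
}
```

When the pool is shut down while the close is still parked on its latch, the task sees an `InterruptedException`. When the caller waits for the close first, the later shutdown only reaches an idle worker.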
@ywelsch ywelsch removed the v6.8.2 label Jul 25, 2019
@ywelsch
Contributor

ywelsch commented Jul 25, 2019

AFAICS this did not affect 6.8 and the tests have not been disabled there, which is why I've removed the label here. #44805 looks to fix the issue. It's only backported to 7.4, however, as it is based on other changes to 7.4.
