Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CI] NodeTests suite timeouts on new windows workers #44256

Closed
tvernum opened this issue Jul 12, 2019 · 4 comments · Fixed by #49202
Closed

[CI] NodeTests suite timeouts on new windows workers #44256

tvernum opened this issue Jul 12, 2019 · 4 comments · Fixed by #49202
Assignees
Labels
:Core/Infra/Core Core issues without another label >test-failure Triaged test failures from CI

Comments

@tvernum
Copy link
Contributor

tvernum commented Jul 12, 2019

On Windows 2016 and windows 2012 we getting suite timeouts in what seem to be the very first suite run in that build.

It appears to that the test takes a long time between finishing the test method, and being marked as "done" - possibly due to clean up tasks in @After?

Looking at the Windows-2016 failure:

  • We start :server:test at 4:43 (04:43:11 > Task :server:test)
  • We fail a little over 20 minutes (1,200,000 ms) later.
05:05:42 org.elasticsearch.node.NodeTests > testCloseOnLeakedStoreReference FAILED
05:05:42     java.lang.Exception: Test abandoned because suite timeout was reached
...
    java.lang.Exception: Suite timeout exceeded (>= 1200000 msec).
  • The node being tested was stopped early in the 20 mminutes (7:45 in the node's TZ)
1> [2019-07-12T07:45:46,034][INFO ][o.e.n.Node               ] [testCloseOnLeakedStoreReference] closed 
  • but the "after test" log doesn't come until we're killing off the suite (8:05)
1> [2019-07-12T08:05:47,854][INFO ][o.e.n.NodeTests          ] [testCloseOnLeakedStoreReference] after test

It seems that something on this new Ephemeral Windows CI worker is causing a problem moving this test method from "testing complete" to "test done"

@tvernum tvernum added :Core/Infra/Core Core issues without another label >test-failure Triaged test failures from CI labels Jul 12, 2019
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-core-infra

@alpar-t
Copy link
Contributor

alpar-t commented Jul 21, 2019

Saw this:

"TEST-NodeTests.testCloseOnOutstandingTask-seed#[5A77C82F64470C03]" ID=794 TIMED_WAITING on java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@6650632c
	at sun.misc.Unsafe.park(Native Method)
	- timed waiting on java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@6650632c
	at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078)
	at java.util.concurrent.ThreadPoolExecutor.awaitTermination(ThreadPoolExecutor.java:1475)
	at org.elasticsearch.threadpool.ThreadPool.awaitTermination(ThreadPool.java:421)
	at org.elasticsearch.threadpool.ThreadPool.awaitTermination(ThreadPool.java:775)
	at org.elasticsearch.threadpool.ThreadPool.terminate(ThreadPool.java:759)
	at org.elasticsearch.node.Node.awaitClose(Node.java:901)
	- locked org.elasticsearch.node.MockNode@1c36784d
	at org.elasticsearch.node.NodeTests.testCloseOnOutstandingTask(NodeTests.java:164)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccess

here: https://scans.gradle.com/s/rhypi43cmywgq/tests/nfxodyych4uas-o4ihtq27xgiqs
The code reads :

assertTrue(node.awaitClose(1, TimeUnit.DAYS));

alpar-t added a commit to alpar-t/elasticsearch that referenced this issue Jul 21, 2019
alpar-t added a commit to alpar-t/elasticsearch that referenced this issue Jul 21, 2019
alpar-t added a commit to alpar-t/elasticsearch that referenced this issue Jul 21, 2019
alpar-t added a commit to alpar-t/elasticsearch that referenced this issue Jul 21, 2019
alpar-t added a commit that referenced this issue Jul 22, 2019
* Mute failing test

tracked in #44552

* mute EvilSecurityTests

tracking in #44558

* Fix line endings in ESJsonLayoutTests

* Mute failing ForecastIT  test on windows

Tracking in #44609

* mute AutoFollowIT.testConflictingPatterns

tracking in #44610

* mute BasicRenormalizationIT.testDefaultRenormalization

tracked in #44613

* Revert "mute AutoFollowIT.testConflictingPatterns"

This reverts commit 012de08.

* mute x-pack internal cluster test windows

tracking #44610

* Mute failure unconfigured node name

* fix mute testDefaultRenormalization

* Increase busyWait timeout windows is slow

* Mute JvmErgonomicsTests on windows

Tracking #44669

* mute SharedClusterSnapshotRestoreIT testParallelRestoreOperationsFromSingleSnapshot

Tracking #44671

* Mute NodeTests on Windows

Tracking #44256
alpar-t added a commit that referenced this issue Jul 22, 2019
* Mute failing test

tracked in #44552

* mute EvilSecurityTests

tracking in #44558

* Fix line endings in ESJsonLayoutTests

* Mute failing ForecastIT  test on windows

Tracking in #44609

* mute BasicRenormalizationIT.testDefaultRenormalization

tracked in #44613

* fix mute testDefaultRenormalization

* Increase busyWait timeout windows is slow

* Mute failure unconfigured node name

* mute x-pack internal cluster test windows

tracking #44610

* Mute JvmErgonomicsTests on windows

Tracking #44669

* mute SharedClusterSnapshotRestoreIT testParallelRestoreOperationsFromSingleSnapshot

Tracking #44671

* Mute NodeTests on Windows

Tracking #44256
alpar-t added a commit that referenced this issue Jul 22, 2019
* Mute failing test

tracked in #44552

* mute EvilSecurityTests

tracking in #44558

* Mute failing ForecastIT  test on windows

Tracking in #44609

* mute BasicRenormalizationIT.testDefaultRenormalization

tracked in #44613

* mute x-pack internal cluster test windows

tracking #44610

* Mute failure unconfigured node name

* fix mute testDefaultRenormalization

* Increase busyWait timeout windows is slow

* Mute JvmErgonomicsTests on windows

Tracking #44669

* mute SharedClusterSnapshotRestoreIT testParallelRestoreOperationsFromSingleSnapshot

Tracking #44671

* Mute NodeTests on Windows

Tracking #44256
alpar-t added a commit that referenced this issue Jul 22, 2019
* Mute failing test

tracked in #44552

* mute EvilSecurityTests

tracking in #44558

* Mute failing ForecastIT  test on windows

Tracking in #44609

* mute BasicRenormalizationIT.testDefaultRenormalization

tracked in #44613

* Disable testing conventions on Windows (#43532) (#44506)

Tests are disabled on Windows. Conventions also need to be disabled.

* mute x-pack internal cluster test windows

tracking #44610

* Mute failure unconfigured node name

* fix mute testDefaultRenormalization

* Increase busyWait timeout windows is slow

* Disable task for mute

* Mute JvmErgonomicsTests on windows

Tracking #44669

* mute SharedClusterSnapshotRestoreIT testParallelRestoreOperationsFromSingleSnapshot

Tracking #44671

* Mute NodeTests on Windows

Tracking #44256
@jdconrad
Copy link
Contributor

@jaymode jaymode self-assigned this Nov 15, 2019
jaymode added a commit to jaymode/elasticsearch that referenced this issue Nov 15, 2019
The NodeTests class contains tests that check behavior when shutting
down a node. This involves starting a node, performing some operation,
stopping the node, and then awaiting the close of the node. Part of
closing a node is the termination of the node's ThreadPool. ThreadPool
termination semantics can be deceiving. The ThreadPool#terminate method
takes a timeout value and the first oddity is that the terminate method
can take two times the timeout value before returning. Internally this
method acts on the ExecutorService instances that are held by the
ThreadPool. First, an orderly shutdown is attempted and pending tasks
are allowed to execute while waiting for the timeout value. If any of
the ExecutorService instances have not terminated, a call is made to
attempt to stop all active tasks (usually using interrupts) and then
waits for up to the timeout value a second time for the termination of
the ExecutorService instances. This means that if use a large value
when waiting for a node to close, we may not attempt to interrupt any
threads that are in a blocking call before the test times out.

In order to avoid causing these tests to time out, this change reduces
the timeout passed to Node#awaitClose to 10 seconds from 1 day. This
will allow blocked threads to be interrupted before the test suite
fails due to the timeout.

Closes elastic#44256
Closes elastic#42350
jaymode added a commit that referenced this issue Nov 22, 2019
The NodeTests class contains tests that check behavior when shutting
down a node. This involves starting a node, performing some operation,
stopping the node, and then awaiting the close of the node. Part of
closing a node is the termination of the node's ThreadPool. ThreadPool
termination semantics can be deceiving. The ThreadPool#terminate method
takes a timeout value and the first oddity is that the terminate method
can take two times the timeout value before returning. Internally this
method acts on the ExecutorService instances that are held by the
ThreadPool. First, an orderly shutdown is attempted and pending tasks
are allowed to execute while waiting for the timeout value. If any of
the ExecutorService instances have not terminated, a call is made to
attempt to stop all active tasks (usually using interrupts) and then
waits for up to the timeout value a second time for the termination of
the ExecutorService instances. This means that if use a large value
when waiting for a node to close, we may not attempt to interrupt any
threads that are in a blocking call before the test times out.

In order to avoid causing these tests to time out, this change reduces
the timeout passed to Node#awaitClose to 10 seconds from 1 day. This
will allow blocked threads to be interrupted before the test suite
fails due to the timeout.

Closes #44256
Closes #42350
Closes #44435
jaymode added a commit that referenced this issue Nov 22, 2019
The NodeTests class contains tests that check behavior when shutting
down a node. This involves starting a node, performing some operation,
stopping the node, and then awaiting the close of the node. Part of
closing a node is the termination of the node's ThreadPool. ThreadPool
termination semantics can be deceiving. The ThreadPool#terminate method
takes a timeout value and the first oddity is that the terminate method
can take two times the timeout value before returning. Internally this
method acts on the ExecutorService instances that are held by the
ThreadPool. First, an orderly shutdown is attempted and pending tasks
are allowed to execute while waiting for the timeout value. If any of
the ExecutorService instances have not terminated, a call is made to
attempt to stop all active tasks (usually using interrupts) and then
waits for up to the timeout value a second time for the termination of
the ExecutorService instances. This means that if use a large value
when waiting for a node to close, we may not attempt to interrupt any
threads that are in a blocking call before the test times out.

In order to avoid causing these tests to time out, this change reduces
the timeout passed to Node#awaitClose to 10 seconds from 1 day. This
will allow blocked threads to be interrupted before the test suite
fails due to the timeout.

Closes #44256
Closes #42350
Closes #44435
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Core/Infra/Core Core issues without another label >test-failure Triaged test failures from CI
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants