Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CI] Timeouts in ml.integration.RegressionIT.testStopAndRestart #49680

Closed
przemekwitek opened this issue Nov 28, 2019 · 11 comments
Closed

[CI] Timeouts in ml.integration.RegressionIT.testStopAndRestart #49680

przemekwitek opened this issue Nov 28, 2019 · 11 comments
Labels
:ml Machine learning >test-failure Triaged test failures from CI

Comments

@przemekwitek
Copy link
Contributor

testStopAndRestart test cases causes the suite to timeout:
https://gradle-enterprise.elastic.co/s/5i6pq2myin76e

@przemekwitek przemekwitek added the :ml Machine learning label Nov 28, 2019
@przemekwitek przemekwitek self-assigned this Nov 28, 2019
@elasticmachine
Copy link
Collaborator

Pinging @elastic/ml-core (:ml)

@droberts195
Copy link
Contributor

There's more information about why in the jstack from the test suite timeout (see https://gradle-enterprise.elastic.co/s/5i6pq2myin76e/tests/zcquf3hc3eoda-uha67ucurjrey):

	- waiting on org.elasticsearch.common.util.concurrent.BaseFuture$Sync@2ed421c0
	at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
	at org.elasticsearch.common.util.concurrent.BaseFuture$Sync.get(BaseFuture.java:252)
	at org.elasticsearch.common.util.concurrent.BaseFuture.get(BaseFuture.java:87)
	at org.elasticsearch.common.util.concurrent.FutureUtils.get(FutureUtils.java:57)
	at org.elasticsearch.action.support.AdapterActionFuture.actionGet(AdapterActionFuture.java:34)
	at org.elasticsearch.xpack.ml.integration.MlNativeDataFrameAnalyticsIntegTestCase.stopAnalytics(MlNativeDataFrameAnalyticsIntegTestCase.java:118)
	at org.elasticsearch.xpack.ml.integration.RegressionIT.testStopAndRestart(RegressionIT.java:207)

So basically a request to stop the job never returned.

Logging from the client side of the test was simply:

[2019-11-28T09:40:46,337][INFO ][o.e.x.m.i.RegressionIT   ] [testStopAndRestart] before test
[2019-11-28T09:40:46,338][INFO ][o.e.x.m.i.RegressionIT   ] [testStopAndRestart] [RegressionIT#testStopAndRestart]: setting up test
[2019-11-28T09:40:46,340][INFO ][o.e.p.PluginsService     ] [testStopAndRestart] no modules loaded
[2019-11-28T09:40:46,340][INFO ][o.e.p.PluginsService     ] [testStopAndRestart] loaded plugin [org.elasticsearch.index.reindex.ReindexPlugin]
[2019-11-28T09:40:46,340][INFO ][o.e.p.PluginsService     ] [testStopAndRestart] loaded plugin [org.elasticsearch.transport.Netty4Plugin]
[2019-11-28T09:40:46,340][INFO ][o.e.p.PluginsService     ] [testStopAndRestart] loaded plugin [org.elasticsearch.xpack.core.XPackClientPlugin]
[2019-11-28T09:40:46,587][INFO ][o.e.t.ExternalTestCluster] [testStopAndRestart] Setup ExternalTestCluster [integTest] made of [3] nodes
[2019-11-28T09:40:47,373][INFO ][o.e.x.m.i.RegressionIT   ] [testStopAndRestart] [RegressionIT#testStopAndRestart]: all set up test

@droberts195 droberts195 added the >test-failure Triaged test failures from CI label Nov 28, 2019
@przemekwitek
Copy link
Contributor Author

przemekwitek commented Nov 29, 2019

I have one suspicion which I still need to validate:
We do not seem to call finishHandler.accept if the process had a failure:

(another issue to investigate is why did the process have this failure)

@przemekwitek
Copy link
Contributor Author

The comment above may not be relevant, as:

  1. we call task.updateState(DataFrameAnalyticsState.FAILED, processContext.getFailureReason()) instead of finishHandler.accept which seems to be doing roughly the same thing.
  2. This is in the scope of "start" call, not "stop" (which is the one hanging on CI).

@przemekwitek
Copy link
Contributor Author

Unfortunately, the test failed again on 7.x (the build contained the fix).
Requires further investigation.

https://build-stats.elastic.co/app/kibana#/doc/b646ed00-7efc-11e8-bf69-63c8ef516157/build-*/t?id=20191203105801-5DD051AE&_g=()

@przemekwitek
Copy link
Contributor Author

On the failed CI test, when the C++ process was closing, it failed with:

 ERROR][o.e.x.m.d.p.AnalyticsProcessManager] [integTest-0] [regression_stop_and_restart] Marking task failed; [regression_stop_and_restart] Error closing data frame analyzer process [null]

It looks like it's seen fatal error, but there is no associated message.
@droberts195 has an idea that such an error without a message may be a manifestation of unhandled fatal signal like SIGSEGV, SIGILL or SIGBUS

I'll try to figure out if there is something common among those failed runs. So far, I discovered that this particular failure only happens on 7.x, never on master.

@przemekwitek
Copy link
Contributor Author

There are still related CI failures (approx. 1 per day).
This time, the stack traces are:

 9731 [2019-12-30T23:42:45,599][INFO ][o.e.x.m.d.p.AnalyticsProcessManager] [integTest-2] [regression_stop_and_restart] Closing process
 9732 [2019-12-30T23:42:45,601][ERROR][o.e.x.m.d.p.AnalyticsProcessManager] [integTest-2] [regression_stop_and_restart] Error closing data frame analyzer process
 9733 java.nio.channels.ClosedChannelException: null
 9734     at sun.nio.ch.FileChannelImpl.ensureOpen(FileChannelImpl.java:110) ~[?:?]
 9735     at sun.nio.ch.FileChannelImpl.write(FileChannelImpl.java:199) ~[?:?]
 9736     at java.nio.channels.Channels.writeFullyImpl(Channels.java:78) ~[?:1.8.0_231]
 9737     at java.nio.channels.Channels.writeFully(Channels.java:101) ~[?:1.8.0_231]
 9738     at java.nio.channels.Channels.access$000(Channels.java:61) ~[?:1.8.0_231]
 9739     at java.nio.channels.Channels$1.write(Channels.java:174) ~[?:1.8.0_231]
 9740     at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82) ~[?:1.8.0_231]
 9741     at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140) ~[?:1.8.0_231]
 9742     at java.io.FilterOutputStream.close(FilterOutputStream.java:158) ~[?:1.8.0_231]
 9743     at org.elasticsearch.xpack.ml.process.AbstractNativeProcess.close(AbstractNativeProcess.java:166) ~[?:?]
 9744     at org.elasticsearch.xpack.ml.dataframe.process.AnalyticsProcessManager.closeProcess(AnalyticsProcessManager.java:310) ~[?:?]
 9745     at org.elasticsearch.xpack.ml.dataframe.process.AnalyticsProcessManager.processData(AnalyticsProcessManager.java:179) ~[?:?]
 9746     at org.elasticsearch.xpack.ml.dataframe.process.AnalyticsProcessManager.lambda$runJob$1(AnalyticsProcessManager.java:120) ~[?:?]
 9747     at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:633) [elasticsearch-7.6.0-SNAPSHOT.jar:7.6.0-SNAPSHOT]
 9748     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_231]
 9749     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_231]
 9750     at java.lang.Thread.run(Thread.java:748) [?:1.8.0_231]
 97

on Dec 31

and:

11536 [2020-01-01T01:36:43,243][ERROR][o.e.x.m.d.p.AnalyticsProcessManager] [integTest-1] [regression_stop_and_restart] Error closing data frame analyzer process^M
11537 java.io.IOException: Stream Closed^M
11538     at java.io.FileOutputStream.writeBytes(Native Method) ~[?:1.8.0_221]^M
11539     at java.io.FileOutputStream.write(FileOutputStream.java:326) ~[?:1.8.0_221]^M
11540     at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82) ~[?:1.8.0_221]^M
11541     at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140) ~[?:1.8.0_221]^M
11542     at java.io.FilterOutputStream.close(FilterOutputStream.java:158) ~[?:1.8.0_221]^M
11543     at org.elasticsearch.xpack.ml.process.AbstractNativeProcess.close(AbstractNativeProcess.java:166) ~[?:?]^M
11544     at org.elasticsearch.xpack.ml.dataframe.process.AnalyticsProcessManager.closeProcess(AnalyticsProcessManager.java:310) ~[?:?]^M
11545     at org.elasticsearch.xpack.ml.dataframe.process.AnalyticsProcessManager.processData(AnalyticsProcessManager.java:179) ~[?:?]^M
11546     at org.elasticsearch.xpack.ml.dataframe.process.AnalyticsProcessManager.lambda$runJob$1(AnalyticsProcessManager.java:120) ~[?:?]^M
11547     at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:633) [elasticsearch-7.6.0-SNAPSHOT.jar:7.6.0-SNAPSHOT]^M
11548     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_221]^M
11549     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_221]^M
11550     at java.lang.Thread.run(Thread.java:748) [?:1.8.0_221]^M
11551 [2020-01-01T01:36:43,244][ERROR][o.e.x.m.d.p.AnalyticsProcessManager] [integTest-1] [regression_stop_and_restart] Marking task failed; [regression_stop_and_restart] Error closing data frame analy      zer process [Stream Closed]^M

on Jan 1

It seems to me that there is a race between close() and kill() methods both trying to close the stream. The first one may succeed but the other one fails.

@przemekwitek
Copy link
Contributor Author

przemekwitek commented Jan 2, 2020

I've found this issue: https://bugs.openjdk.java.net/browse/JDK-8054565

It explains that in Java 8 FilterOutputStream::close method may not obey the contract of Closeable i.e. the second call to close() may throw the exception.
The issue is fixed in Java 9.
This issue describes the conditions that our tests suffers from perfectly (fails only in 7.x branch and only in 'compatibility' CI runs) but let me confirm that by looking at the FilterOutputStream implementation.
I'll post the investigation result here.
Another question is how does it work for anomaly detection and why it does not suffer from similar problem - to be investigated.

@przemekwitek
Copy link
Contributor Author

Looking at Java 8 source code it is clearly visible that the bug fix is not there (no additional closed and closeLock member variables in FilterOutputStream class):
https://hg.openjdk.java.net/jdk8/jdk8/jdk/file/687fd7c7986d/src/share/classes/java/io/FilterOutputStream.java

Additionally, the java home setting in the failing build is Java 8:
https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+7.x+multijob-windows-compatibility/os=windows-2012-r2/348/injectedEnvVars/

The combination of these two factors lets me conclude that the failed CI test run is indeed a result of the JDK issue mentioned in the previous post.

One way to fix it would be to add closeLock equivalent to AbstractNativeProcess class or some other means of synchronisation protecting against calling close() twice.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:ml Machine learning >test-failure Triaged test failures from CI
Projects
None yet
Development

No branches or pull requests

3 participants