
[ML] CI ForecastIT::testOverflowToDisk can fail due to index timing issues #31173

Closed
hendrikmuhs opened this issue Jun 7, 2018 · 5 comments · Fixed by #31187
Assignees
Labels
discuss · :ml Machine learning · >test Issues or PRs that are addressing/adding tests · v6.4.0 · v7.0.0-beta1

Comments


hendrikmuhs commented Jun 7, 2018

The failure has not been seen on the official CI yet, but on a private one:

 2> REPRODUCE WITH: ./gradlew :x-pack:qa:ml-native-tests:integTestRunner -Dtests.seed=E6CA9B79CBDD62D7 -Dtests.class=org.elasticsearch.xpack.ml.integration.ForecastIT -Dtests.method="testOverflowToDisk" -Dtests.security.manager=true -Dtests.locale=agq -Dtests.timezone=Europe/Malta
FAILURE 50.8s | ForecastIT.testOverflowToDisk <<< FAILURES!
Throwable #1: java.lang.AssertionError: 
Expected: <8000>
    but: was <7103>
	at __randomizedtesting.SeedInfo.seed([E6CA9B79CBDD62D7:A91A2523A2A36B67]:0)
	at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:20)
	at org.elasticsearch.xpack.ml.integration.ForecastIT.testOverflowToDisk(ForecastIT.java:248)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:564)
	at java.base/java.lang.Thread.run(Thread.java:844)
 1> [2018-06-06T14:20:28,079][INFO ][o.e.x.m.i.ForecastIT     ] [testNoData]: before test
 1> [2018-06-06T14:20:28,080][INFO ][o.e.x.m.i.ForecastIT     ] [ForecastIT#testNoData]: setting up test
 1> [2018-06-06T14:20:28,280][INFO ][o.e.x.m.i.ForecastIT     ] [ForecastIT#testNoData]: all set up test
 1> [2018-06-06T14:20:29,247][INFO ][o.e.x.m.i.ForecastIT     ] [ForecastIT#testNoData]: cleaning up after test
 1> [2018-06-06T14:20:29,557][INFO ][o.e.x.m.i.ForecastIT     ] [ForecastIT#testNoData]: cleaned up after test
 1> [2018-06-06T14:20:29,558][INFO ][o.e.x.m.i.ForecastIT     ] [testNoData]: after test

This is caused by an internal timing problem:

A document that marks the end of the forecast is indexed last, but due to near-real-time search behavior, sharding, and concurrency it can happen that not all forecast data points are searchable yet at the moment the status document becomes searchable.
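To make the race concrete, here is a minimal toy model (an illustration only, not Elasticsearch internals): each shard buffers writes and only makes them searchable on refresh, and shards refresh independently. If the forecast data points and the status document land on different shards, the status document can become searchable first.

```python
# Toy model of near-real-time visibility: writes become searchable only
# after a per-shard refresh, and shards refresh independently.
class Shard:
    def __init__(self):
        self._buffer = []       # indexed but not yet searchable
        self._searchable = []   # visible to searches

    def index(self, doc):
        self._buffer.append(doc)

    def refresh(self):
        self._searchable.extend(self._buffer)
        self._buffer.clear()

    def search(self):
        return list(self._searchable)

# Hypothetical layout: forecast data points on shard 0, status doc on shard 1.
shards = [Shard(), Shard()]
for i in range(8000):
    shards[0].index({"type": "forecast_point", "n": i})
shards[1].index({"type": "forecast_status", "state": "finished"})

# Only the status shard happens to refresh before the test searches.
shards[1].refresh()

visible = shards[0].search() + shards[1].search()
status_visible = any(d["type"] == "forecast_status" for d in visible)
points_visible = sum(d["type"] == "forecast_point" for d in visible)
print(status_visible, points_visible)  # → True 0
```

The test waits for the status document and then counts data points, so in this state it sees fewer than the expected 8000.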

The issue was introduced in #30969. Before that change the test closed the job before its assertions, which triggers an index refresh; after that change, closing the job was moved to the end. Therefore the test fix is as easy as closing the job again and reopening it before the 2nd forecast.

Besides the test issue, this is still a real problem. The UI only closes the job if it was already closed before running the forecast. Jobs that were open are kept open and can potentially hit this problem, although that is rather unlikely, as the UI side is slowed down by network round trips. Furthermore, the UI uses aggregations for charting, so missing a couple of data points likely does not produce a visual artifact.

@hendrikmuhs added the >test, discuss, v7.0.0, :ml, and v6.4.0 labels on Jun 7, 2018
@elasticmachine
Collaborator

Pinging @elastic/ml-core


@hendrikmuhs hendrikmuhs self-assigned this Jun 7, 2018
@droberts195
Contributor

Therefore the test fix is as easy as closing the job again and opening it for the 2nd test.

Another alternative would be to call the flush endpoint on the job. Then the same C++ process would keep running for the second forecast.
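For reference, the flush call in question is the ML anomaly detection flush API. A sketch of the request (the exact path prefix depends on the Elasticsearch version; around 6.x it was under `_xpack/ml`, later `_ml`, and the `calc_interim` parameter is shown only as an illustration):

```
POST _xpack/ml/anomaly_detectors/<job_id>/_flush
{
  "calc_interim": true
}
```

Flushing forces buffered results to be written and searchable without shutting down the native process, which is why the same C++ process keeps running afterwards.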

@droberts195
Contributor

In fact, since forecasting requires admin rights, the UI could use the same solution - if the job is not going to be closed after a forecast completes because it was already opened then flush it instead.

@hendrikmuhs
Author

Another alternative would be to call the flush endpoint on the job. Then the same C++ process would keep running for the second forecast.

👍 That's actually better, as this is a regression test that should run two forecasts on the same running job/process. Closing the job fixes the issue but would defeat the purpose of the test.

hendrikmuhs pushed a commit to hendrikmuhs/elasticsearch that referenced this issue Jun 7, 2018
hendrikmuhs pushed a commit that referenced this issue Jun 8, 2018
flush ml job to ensure all results have been written

fixes #31173