[CI] NetworkDisruptionIT.testJobRelocation failure in 6.7 #39858

Closed
hub-cap opened this issue Mar 8, 2019 · 4 comments · Fixed by #43441
Labels
:ml Machine learning >test-failure Triaged test failures from CI


@hub-cap
Contributor

hub-cap commented Mar 8, 2019

A different test entirely failed with the same error:

java.lang.IllegalStateException: cluster failed to form with expected nodes [{node_t3}{oBjDlzCCRWyVfZ-P_eRUHQ}{i0jt9hlDTNOsBLqxguBkmA}{127.0.0.1}{127.0.0.1:42657}{ml.machine_memory=63315337216, xpack.installed=true, ml.max_open_jobs=20, ml.enabled=true}, {node_t0}{DlUmj6TbRSiaq7G2tRgZVw}{dXStLUMXTvS1aYnecwto7g}{127.0.0.1}{127.0.0.1:41161}{ml.machine_memory=63315337216, xpack.installed=true, ml.max_open_jobs=20, ml.enabled=true}, {node_t4}{zgB7AcZ_RfG2_dlOhmpaSg}{oZTkIXK7RYKQLPwu49eAkQ}{127.0.0.1}{127.0.0.1:43415}{ml.machine_memory=63315337216, xpack.installed=true, ml.max_open_jobs=20, ml.enabled=true}, {node_t1}{lM32iLA9R2e3DNRYJyUPvQ}{SeYr2NeATHyeZY4ywLyW_w}{127.0.0.1}{127.0.0.1:40785}{ml.machine_memory=63315337216, xpack.installed=true, ml.max_open_jobs=20, ml.enabled=true}, {node_t2}{6bePpYnETHSKhA4qYCennQ}{w14d_pENQO6Vj9ZSWGsD9Q}{127.0.0.1}{127.0.0.1:46133}{ml.machine_memory=63315337216, xpack.installed=true, ml.max_open_jobs=20, ml.enabled=true}] and actual nodes nodes: 
   {node_t3}{oBjDlzCCRWyVfZ-P_eRUHQ}{i0jt9hlDTNOsBLqxguBkmA}{127.0.0.1}{127.0.0.1:42657}{ml.machine_memory=63315337216, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true}
   {node_t1}{lM32iLA9R2e3DNRYJyUPvQ}{SeYr2NeATHyeZY4ywLyW_w}{127.0.0.1}{127.0.0.1:40785}{ml.machine_memory=63315337216, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true}
   {node_t2}{6bePpYnETHSKhA4qYCennQ}{w14d_pENQO6Vj9ZSWGsD9Q}{127.0.0.1}{127.0.0.1:46133}{ml.machine_memory=63315337216, xpack.installed=true, ml.max_open_jobs=20, ml.enabled=true}, local, master

Log here: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.7+internalClusterTest/1123/console

Reproduction line is:

REPRODUCE WITH: ./gradlew :x-pack:plugin:ml:internalClusterTest \
  -Dtests.seed=AA25CE803F029E81 \
  -Dtests.class=org.elasticsearch.xpack.ml.integration.NetworkDisruptionIT \
  -Dtests.method="testJobRelocation" \
  -Dtests.security.manager=true \
  -Dtests.locale=und \
  -Dtests.timezone=Asia/Phnom_Penh \
  -Dcompiler.java=11 \
  -Druntime.java=8

The failure is at the start of the test, in internalCluster().ensureAtLeastNumDataNodes(5), so I don't think it's specific to the test itself.

Taken from

See #37462 for context on the original issue and the split up issue.

hub-cap added the >test-failure (Triaged test failures from CI) and :ml (Machine learning) labels on Mar 8, 2019
@elasticmachine
Collaborator

Pinging @elastic/ml-core

@davidkyle
Member

Another instance https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+intake/2535/console

Does not reproduce with:

./gradlew :x-pack:plugin:ml:internalClusterTest \
  -Dtests.seed=3AC410EEC55EE830 \
  -Dtests.class=org.elasticsearch.xpack.ml.integration.NetworkDisruptionIT \
  -Dtests.method="testJobRelocation" \
  -Dtests.security.manager=true \
  -Dtests.locale=be \
  -Dtests.timezone=America/Thunder_Bay \
  -Dcompiler.java=11 \
  -Druntime.java=8

test.log

This test has failed a number of times, so it is now muted on master (27346a0), 7.x (78a9754) and 7.0 (3499f3e).

@davidkyle
Member

davidkyle commented Jun 20, 2019

I unmuted this in #43268, expecting recent changes to have fixed the issue, but the test failed in CI fairly quickly, although with a different error.

https://scans.gradle.com/s/drxp7uehj4yhk/tests/lf2lfu4ufazso-4wtidedbn6cyi?openStackTraces=WzFd

java.lang.RuntimeException: Can't get master node null
Caused by: org.elasticsearch.discovery.MasterNotDiscoveredException: (No message provided)
at org.elasticsearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction$3.onTimeout(TransportMasterNodeAction.java:251)
at org.elasticsearch.cluster.ClusterStateObserver$ContextPreservingListener.onTimeout(ClusterStateObserver.java:325)
at org.elasticsearch.cluster.ClusterStateObserver$ObserverClusterStateListener.onTimeout(ClusterStateObserver.java:252)
at org.elasticsearch.cluster.service.ClusterApplierService$NotifyTimeout.run(ClusterApplierService.java:566)
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:687)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.lang.Thread.run(Thread.java:834)

./gradlew :x-pack:plugin:ml:internalClusterTest --tests "org.elasticsearch.xpack.ml.integration.NetworkDisruptionIT.testJobRelocation" -Dtests.seed=2E6AFD7100DF42E9 -Dtests.security.manager=true -Dtests.locale=de-LI -Dtests.timezone=Brazil/DeNoronha -Dcompiler.java=12 -Druntime.java=11

Log file attached:
network-disruption.log

Muted the test (again) in 21feeb0

@davidkyle
Member

The test creates a 5-node cluster, then partitions off one of those nodes. The 4-node side forms a new cluster, and the single node is left trying and failing to form a cluster by itself. Later a call is made to internalCluster().getMasterName(), which randomly selects one of the original 5 nodes to ask. Sometimes that node is the partitioned node, and the test fails. Pretty simple really.
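
To make that failure mode concrete, here is a minimal, self-contained Java sketch. It does not use the Elasticsearch test framework: `PartitionSketch`, `Node`, `visibleMaster`, `masterViaRandomNode` and `masterViaMajorityNode` are all hypothetical names invented for illustration, and a `null` return stands in for the "Can't get master node null" / MasterNotDiscoveredException outcome. The second lookup shows one possible way to avoid the problem (only asking nodes on the majority side); I'm not claiming that is what the eventual fix does.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical model of the scenario; none of these types exist in Elasticsearch.
class PartitionSketch {

    // Stand-in for a cluster node that is either isolated or on the majority side.
    static final class Node {
        final String name;
        final boolean isolated;

        Node(String name, boolean isolated) {
            this.name = name;
            this.isolated = isolated;
        }

        // An isolated node cannot form a cluster by itself, so it sees no master.
        String visibleMaster(String majorityMaster) {
            return isolated ? null : majorityMaster;
        }
    }

    // Builds node_t0..node_t4 with exactly one node cut off from the rest.
    static List<Node> fiveNodeCluster(int isolatedIndex) {
        List<Node> nodes = new ArrayList<>();
        for (int i = 0; i < 5; i++) {
            nodes.add(new Node("node_t" + i, i == isolatedIndex));
        }
        return nodes;
    }

    // Mirrors the flaky lookup: ask any of the original five nodes, chosen at random.
    static String masterViaRandomNode(List<Node> nodes, String master, Random random) {
        Node via = nodes.get(random.nextInt(nodes.size()));
        return via.visibleMaster(master);
    }

    // Assumed alternative: restrict the random choice to nodes on the majority side.
    static String masterViaMajorityNode(List<Node> nodes, String master, Random random) {
        List<Node> connected = new ArrayList<>();
        for (Node node : nodes) {
            if (!node.isolated) {
                connected.add(node);
            }
        }
        return connected.get(random.nextInt(connected.size())).visibleMaster(master);
    }

    public static void main(String[] args) {
        List<Node> nodes = fiveNodeCluster(0);   // node_t0 is partitioned off
        String master = "node_t1";               // master elected on the 4-node side
        Random random = new Random();
        int trials = 1000;
        int randomMisses = 0;
        int majorityMisses = 0;
        for (int i = 0; i < trials; i++) {
            if (masterViaRandomNode(nodes, master, random) == null) {
                randomMisses++;
            }
            if (masterViaMajorityNode(nodes, master, random) == null) {
                majorityMisses++;
            }
        }
        System.out.println("random node:   " + randomMisses + " of " + trials + " lookups found no master");
        System.out.println("majority node: " + majorityMisses + " of " + trials + " lookups found no master");
    }
}
```

In this model roughly one lookup in five goes through the isolated node, which fits an intermittent failure that does not reproduce from a given seed on every machine.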

I've no idea why the test started failing around March 19; I'll look for a breaking change.
