Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CI] SpecificMasterNodesIT#testElectOnlyBetweenMasterNodes fails #38331

Closed
cbuescher opened this issue Feb 4, 2019 · 7 comments · Fixed by #38432 or #42454
Closed

[CI] SpecificMasterNodesIT#testElectOnlyBetweenMasterNodes fails #38331

cbuescher opened this issue Feb 4, 2019 · 7 comments · Fixed by #38432 or #42454
Assignees
Labels
:Distributed Indexing/Distributed A catch all label for anything in the Distributed Area. Please avoid if you can. >test-failure Triaged test failures from CI v7.2.0

Comments

@cbuescher
Copy link
Member

cbuescher commented Feb 4, 2019

Seen at least three times in master in the last few days:

https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+internalClusterTest/497/console
https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+intake/1777/console
https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+intake/1710/console

Latest reproduce line:

./gradlew :server:integTest \
  -Dtests.seed=2CFBF83AF54213EC \
  -Dtests.class=org.elasticsearch.cluster.SpecificMasterNodesIT \
  -Dtests.method="testElectOnlyBetweenMasterNodes" \
  -Dtests.security.manager=true \
  -Dtests.locale=ar-DZ \
  -Dtests.timezone=Africa/Kinshasa \
  -Dcompiler.java=11 \
  -Druntime.java=8

Running repeatedly on Ubuntu didn't reproduce after 50 runs.

Logs contain NPEs:

14:52:15   1> java.lang.NullPointerException: null
14:52:15   1> 	at org.elasticsearch.test.InternalTestCluster.getMasterName(InternalTestCluster.java:1913) ~[framework-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
14:52:15   1> 	at org.elasticsearch.test.InternalTestCluster.getMasterName(InternalTestCluster.java:1902) ~[framework-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
14:52:15   1> 	at org.elasticsearch.test.InternalTestCluster.nonMasterClient(InternalTestCluster.java:776) ~[framework-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
14:52:15   1> 	at org.elasticsearch.cluster.SpecificMasterNodesIT.lambda$testElectOnlyBetweenMasterNodes$0(SpecificMasterNodesIT.java:122) ~[test/:?]
14:52:15   1> 	at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:846) [framework-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]

But there are also several earlier connection Exceptions like:

14:52:14   1> Caused by: java.net.ConnectException: Connection refused
@cbuescher cbuescher added >test-failure Triaged test failures from CI v7.0.0 :Distributed Indexing/Distributed A catch all label for anything in the Distributed Area. Please avoid if you can. labels Feb 4, 2019
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-distributed

@cbuescher
Copy link
Member Author

@ywelsch @DaveCTurner to mute or not to mute? 💀

@ywelsch
Copy link
Contributor

ywelsch commented Feb 4, 2019

mute :)

@cbuescher
Copy link
Member Author

Muted on master with 5ee7232.

DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this issue Feb 5, 2019
Today we throw a fatal `RuntimeException` if an exception occurs in
`getMasterName()`, and this includes the case where there is currently no
master. However, sometimes we call this method inside an `assertBusy()` in
order to allow for a cluster that is in the process of stabilising and electing
a master. The trouble is that `assertBusy()` only retries on an
`AssertionError` and not on a general `RuntimeException`, so the lack of a
master is immediately fatal.

This commit fixes the issue by asserting there is a master, triggering a retry
if there is not.

Fixes elastic#38331
DaveCTurner added a commit that referenced this issue Feb 5, 2019
Today we throw a fatal `RuntimeException` if an exception occurs in
`getMasterName()`, and this includes the case where there is currently no
master. However, sometimes we call this method inside an `assertBusy()` in
order to allow for a cluster that is in the process of stabilising and electing
a master. The trouble is that `assertBusy()` only retries on an
`AssertionError` and not on a general `RuntimeException`, so the lack of a
master is immediately fatal.

This commit fixes the issue by asserting there is a master, triggering a retry
if there is not.

Fixes #38331
@tlrx
Copy link
Member

tlrx commented Feb 27, 2019

Failed again today on 7.x:
https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+7.x+intake/337/consoleFull

with:

16:37:13 ERROR   3.47s J7 | SpecificMasterNodesIT.testElectOnlyBetweenMasterNodes <<< FAILURES!
16:37:13    > Throwable #1: java.lang.NullPointerException
16:37:13    > 	at __randomizedtesting.SeedInfo.seed([974B8C1608DDBAFC:DDD67F1D39FAAFDE]:0)
16:37:13    > 	at org.elasticsearch.cluster.SpecificMasterNodesIT.lambda$testElectOnlyBetweenMasterNodes$0(SpecificMasterNodesIT.java:123)
16:37:13    > 	at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:856)
16:37:13    > 	at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:842)
16:37:13    > 	at org.elasticsearch.cluster.SpecificMasterNodesIT.testElectOnlyBetweenMasterNodes(SpecificMasterNodesIT.java:121)
16:37:13    > 	at java.lang.Thread.run(Thread.java:748)

And it does not reproduce with

./gradlew :server:integTest -Dtests.seed=974B8C1608DDBAFC -Dtests.class=org.elasticsearch.cluster.SpecificMasterNodesIT -Dtests.method="testElectOnlyBetweenMasterNodes" -Dtests.security.manager=true -Dtests.locale=sk -Dtests.timezone=America/Belize -Dcompiler.java=11 -Druntime.java=8

I muted the test on master (f1d801c) and 7.x (983b5d1) since it failed on those two branches in the last 30 days.

@tlrx tlrx reopened this Feb 27, 2019
@ywelsch
Copy link
Contributor

ywelsch commented Mar 28, 2019

I've muted this as well in 7.0 branch

@ywelsch
Copy link
Contributor

ywelsch commented May 22, 2019

I've reeenabled this test on master to get more recent failures

DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this issue May 23, 2019
Today in `SpecificMasterNodesIT` we assert the name of the master and throw a
NPE if there is no master. This doesn't work within an `assertBusy()` because
the NPE triggers an immediate failure rather than the desired retry. This
commit addresses this by first asserting that the master is non-null.

Fixes elastic#38331
Relates elastic#38432
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this issue May 23, 2019
Today the `TransportClusterStateAction` ignores the state passed by the
`TransportMasterNodeAction` and obtains its state from the cluster applier.
This might be inconsistent, showing a different node as the master or maybe
even having no master.

This change adjusts the action to use the passed-in state directly, and adds
tests showing that the state returned is consistent with our expectations even
if there is a concurrent master failover.

Fixes elastic#38331
Relates elastic#38432
pull bot pushed a commit to fabriziofortino/elasticsearch that referenced this issue May 24, 2019
Today the `TransportClusterStateAction` ignores the state passed by the
`TransportMasterNodeAction` and obtains its state from the cluster applier.
This might be inconsistent, showing a different node as the master or maybe
even having no master.

This change adjusts the action to use the passed-in state directly, and adds
tests showing that the state returned is consistent with our expectations even
if there is a concurrent master failover.

Fixes elastic#38331
Relates elastic#38432
DaveCTurner added a commit that referenced this issue May 24, 2019
Today the `TransportClusterStateAction` ignores the state passed by the
`TransportMasterNodeAction` and obtains its state from the cluster applier.
This might be inconsistent, showing a different node as the master or maybe
even having no master.

This change adjusts the action to use the passed-in state directly, and adds
tests showing that the state returned is consistent with our expectations even
if there is a concurrent master failover.

Fixes #38331
Relates #38432
gurkankaymak pushed a commit to gurkankaymak/elasticsearch that referenced this issue May 27, 2019
Today the `TransportClusterStateAction` ignores the state passed by the
`TransportMasterNodeAction` and obtains its state from the cluster applier.
This might be inconsistent, showing a different node as the master or maybe
even having no master.

This change adjusts the action to use the passed-in state directly, and adds
tests showing that the state returned is consistent with our expectations even
if there is a concurrent master failover.

Fixes elastic#38331
Relates elastic#38432
DaveCTurner added a commit that referenced this issue Jul 17, 2019
Today the `TransportClusterStateAction` ignores the state passed by the
`TransportMasterNodeAction` and obtains its state from the cluster applier.
This might be inconsistent, showing a different node as the master or maybe
even having no master.

This change adjusts the action to use the passed-in state directly, and adds
tests showing that the state returned is consistent with our expectations even
if there is a concurrent master failover.

Fixes #38331
Relates #38432
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Distributed Indexing/Distributed A catch all label for anything in the Distributed Area. Please avoid if you can. >test-failure Triaged test failures from CI v7.2.0
Projects
None yet
7 participants