[CI] SpecificMasterNodesIT#testElectOnlyBetweenMasterNodes fails #38331
Pinging @elastic/es-distributed
@ywelsch @DaveCTurner to mute or not to mute? 💀
mute :)
Muted on master with 5ee7232.
Today we throw a fatal `RuntimeException` if an exception occurs in `getMasterName()`, and this includes the case where there is currently no master. However, sometimes we call this method inside an `assertBusy()` in order to allow for a cluster that is in the process of stabilising and electing a master. The trouble is that `assertBusy()` only retries on an `AssertionError` and not on a general `RuntimeException`, so the lack of a master is immediately fatal. This commit fixes the issue by asserting there is a master, triggering a retry if there is not. Fixes elastic#38331
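To illustrate the retry behaviour described above, here is a minimal sketch. It is not the Elasticsearch test framework's code: `AssertBusySketch`, its simplified `assertBusy(Runnable, int)` and `attemptsUntilMaster` are hypothetical stand-ins, and the simulated "master" is just a string that becomes non-null on the third attempt.

```java
import java.util.concurrent.atomic.AtomicInteger;

final class AssertBusySketch {

    /**
     * Hypothetical, simplified stand-in for assertBusy(): it retries the check
     * only when it fails with an AssertionError. Any other exception escapes
     * on the first attempt. A real helper would also wait between attempts.
     */
    static void assertBusy(Runnable check, int maxAttempts) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        AssertionError last = null;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            try {
                check.run();
                return; // check passed
            } catch (AssertionError e) {
                last = e; // retryable failure, try again
            }
        }
        throw last; // still failing after the final attempt
    }

    /**
     * Simulates a cluster whose master only appears on the third lookup and
     * returns how many lookups were needed.
     */
    static int attemptsUntilMaster(boolean failWithAssertion) {
        AtomicInteger calls = new AtomicInteger();
        assertBusy(() -> {
            String master = calls.incrementAndGet() < 3 ? null : "node-1";
            if (master == null) {
                if (failWithAssertion) {
                    throw new AssertionError("no master yet"); // retried
                }
                throw new RuntimeException("no master yet"); // fatal
            }
        }, 10);
        return calls.get();
    }

    public static void main(String[] args) {
        System.out.println("assertion-based check needed "
            + attemptsUntilMaster(true) + " attempts");
        try {
            attemptsUntilMaster(false);
        } catch (RuntimeException e) {
            System.out.println("RuntimeException escaped immediately: " + e.getMessage());
        }
    }
}
```

With the assertion-based check the loop keeps going until a master shows up; with the `RuntimeException` the first missing master ends the test.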
Failed again today on 7.x with:
And it does not reproduce with
I muted the test on master (f1d801c) and 7.x (983b5d1) since it failed on those two branches in the last 30 days.
I've muted this on the 7.0 branch as well.
I've re-enabled this test on master to get more recent failures.
Today in `SpecificMasterNodesIT` we assert the name of the master and throw an NPE if there is no master. This doesn't work within an `assertBusy()` because the NPE triggers an immediate failure rather than the desired retry. This commit addresses this by first asserting that the master is non-null. Fixes elastic#38331 Relates elastic#38432
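A rough sketch of the "assert non-null first" pattern, under stated assumptions: `MasterNameCheck`, its tiny `Node` type and the `outcome` helper are hypothetical and only stand in for the real Elasticsearch classes used by the test.

```java
final class MasterNameCheck {

    /** Hypothetical minimal node type; the real test uses Elasticsearch's own classes. */
    static final class Node {
        final String name;

        Node(String name) {
            this.name = name;
        }
    }

    /**
     * Checks the master's name, but asserts that a master exists first so a
     * missing master surfaces as an AssertionError rather than an NPE.
     */
    static void assertMasterNamed(Node master, String expectedName) {
        if (master == null) {
            throw new AssertionError("no master elected yet");
        }
        if (!expectedName.equals(master.name)) {
            throw new AssertionError(
                "expected master [" + expectedName + "] but was [" + master.name + "]");
        }
    }

    /** Classifies a check the way a retry loop that only catches AssertionError would. */
    static String outcome(Node master, String expectedName) {
        try {
            assertMasterNamed(master, expectedName);
            return "ok";
        } catch (AssertionError e) {
            return "retryable";
        }
    }

    public static void main(String[] args) {
        System.out.println(outcome(null, "node-1"));
        System.out.println(outcome(new Node("node-1"), "node-1"));
    }
}
```

Without the null check, `master.name` would throw a `NullPointerException`, which the `catch (AssertionError e)` above would not handle.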
Today the `TransportClusterStateAction` ignores the state passed by the `TransportMasterNodeAction` and obtains its state from the cluster applier. This might be inconsistent, showing a different node as the master or maybe even having no master. This change adjusts the action to use the passed-in state directly, and adds tests showing that the state returned is consistent with our expectations even if there is a concurrent master failover. Fixes elastic#38331 Relates elastic#38432
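The difference between the two approaches can be sketched roughly as follows. This is not the actual `TransportClusterStateAction` code: `PassedInStateSketch`, `State` and both `respond...` methods are hypothetical, and an `AtomicReference` stands in for whatever state the cluster applier currently exposes.

```java
import java.util.concurrent.atomic.AtomicReference;

final class PassedInStateSketch {

    /** Hypothetical immutable state snapshot; a null masterName means "no master". */
    static final class State {
        final long version;
        final String masterName;

        State(long version, String masterName) {
            this.version = version;
            this.masterName = masterName;
        }
    }

    /** Old shape: ignores the state it was handed and reads the shared state again. */
    static String respondFromApplier(State passedIn, AtomicReference<State> applierState) {
        return applierState.get().masterName;
    }

    /** New shape: answers from the snapshot the caller already validated. */
    static String respondFromPassedIn(State passedIn, AtomicReference<State> applierState) {
        return passedIn.masterName;
    }

    public static void main(String[] args) {
        // The caller validated this snapshot, in which node-1 is the master.
        State validated = new State(1, "node-1");
        // By the time the handler runs, the shared state shows no master at all.
        AtomicReference<State> applier = new AtomicReference<>(new State(2, null));

        System.out.println("from applier:   " + respondFromApplier(validated, applier));
        System.out.println("from passed-in: " + respondFromPassedIn(validated, applier));
    }
}
```

The second read is a separate observation of shared state, so it can disagree with the snapshot that was checked. Answering from the passed-in snapshot avoids that second read.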
Seen at least three times on master in the last few days:
https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+internalClusterTest/497/console
https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+intake/1777/console
https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+intake/1710/console
Latest reproduce line:
Running repeatedly on Ubuntu didn't reproduce the failure in 50 runs.
Logs contain NPEs:
But there are also several earlier connection exceptions like: