decider: endtoend test infrastructure + tests #6770
Conversation
Signed-off-by: deepthi <deepthi@planetscale.com>
Good work! Please see the inline comment for TestDownMaster(), which is a general observation on the possible approaches to running the tests, and why one may be more advantageous than the other.
for _, tablet := range shard0.Vttablets {
	// we know we have only two tablets, so the "other" one must be master
	if tablet.Alias != curMaster.Alias {
		checkMasterTablet(t, tablet)
Ah, so this is a busy loop that waits for the tablet to reach some status? I think I understand now. I'd rather:
- avoid the busy loop: insert a reasonable time.Sleep(time.Second) in the loop, or else these tests will be difficult to run on one's own computer and in CI, and set some timeout for failure; or
- really change the approach. orchestrator's system tests work the other way around: we set up the topology, we wait 15 seconds (by way of example), and then we expect the action to have taken place.

This approach makes sense for production, because in production you not only expect orchestrator to run a failover, you also expect it to accomplish the failover within a reasonable timeframe.
Also, in this test you wait until some state is known and then immediately test that state. But there might be intermediate steps to the recovery that promote one server, then another, so testing "as soon as there's some state" may land you in an intermediate situation.
IMHO waiting X seconds (as opposed to looping for some state) is the more correct approach, but I don't feel strongly about it; Sleep plus a timeout can also make sense in some situations.
It's not a busy loop. In fact, it implements something very similar to what you are suggesting: it sleeps 1 second between checks and times out after 60 seconds.
I had initially tried a 15-second sleep right after starting orchestrator, followed by a single check for the master tablet. In my testing it takes ~21 seconds on my local machine to get a new master elected with "RecoveryPeriodBlockSeconds": 1,
and ~41 seconds when I set that to 5.
Waiting for a fixed amount of time tends to be unreliable on CI, so it is actually better to poll for the desired condition.
You make a good point here:

> But there might be intermediate steps to the recovery that may promote one server, then another. So testing "as soon as there's some state" may land you in an intermediate situation.

Let us see if we do end up in that state in any of the scenarios we want to test; if we do, we can improve the checks to be aware of that possibility.
41 seconds is far too long. I suspect MySQL configuration. Are replicas configured with set global slave_net_timeout = 4 (see documentation)? This sets a short (2sec) heartbeat interval between a replica and its master and makes the replica recognize a failure quickly. Without this setting, some scenarios may take up to a minute to detect.
Also consider CHANGE MASTER TO MASTER_CONNECT_RETRY=1, MASTER_RETRY_COUNT=86400. In the event of replication failure, this makes the replica attempt reconnection every 1sec (the default is 60sec). With brief network issues this setting attempts a quick replication recovery and, if successful, avoids a general failure/recovery operation by orchestrator.
I have changed the config file used by tests (default-fast.cnf) to set the short heartbeat interval.
I see that MASTER_RETRY_COUNT defaults to 86400 anyway. Regarding MASTER_CONNECT_RETRY: we are bringing up clusters with no master (and no replication set up) and allowing orchestrator to elect the master. I think this means we should make a change here to add MASTER_CONNECT_RETRY=1:
https://github.com/vitessio/vitess/blob/master/go/vt/orchestrator/inst/instance_topology_dao.go#L677
On second thought, the changes to MASTER_RETRY_COUNT can go in a separate PR (since we will have more PRs with tests in them). I have added a task to the issue to keep track.
Partial fix for #6769
Signed-off-by: deepthi <deepthi@planetscale.com>