decider: endtoend test infrastructure + tests #6770
Conversation
Signed-off-by: deepthi <deepthi@planetscale.com>
Good work! Please see the inline comment for TestDownMaster(), which is a general observation on the possible approaches to running the tests, and why one may be more advantageous than the other.
for _, tablet := range shard0.Vttablets {
	// we know we have only two tablets, so the "other" one must be master
	if tablet.Alias != curMaster.Alias {
		checkMasterTablet(t, tablet)
Ah, so this is a busy loop that waits for the tablet to reach some status? I think I understand now. I'd rather:
- avoid the busy loop: insert a reasonable time.Sleep(time.Second) in the loop, or else these tests will be difficult to run on one's own computer and in CI, and set some timeout for failure; or
- really change the approach. orchestrator's system tests work the other way around: we set up the topology, we wait 15 seconds (by way of example), and then we expect the action to have taken place.

This approach makes sense for production, because in production you not only expect orchestrator to run a failover, you also expect it to accomplish the failover within a reasonable timeframe.
Also, in this test you wait until some state is known and then immediately test that state. But there might be intermediate steps to the recovery that promote one server, then another, so testing "as soon as there's some state" may land you in an intermediate situation.
IMHO waiting X seconds (as opposed to looping for some state) is the more correct approach, but I don't feel strongly about it; Sleep plus a timeout can also make sense in some situations.
It's not a busy loop. In fact, it implements something very similar to what you are suggesting: it sleeps 1 second between checks and times out after 60 seconds.
I had initially tried a 15-second sleep right after starting orchestrator, followed by a single check for the master tablet. In my testing it takes ~21 seconds on my local machine to get a new master elected with "RecoveryPeriodBlockSeconds": 1,
and ~41 seconds when I set that to 5.
Waiting for a fixed amount of time tends to be unreliable on CI, so it is actually better to poll for the desired condition.
You make a good point here:

> But there might be intermediate steps to the recovery that may promote one server, then another. So testing "as soon as there's some state" may land you in an intermediate situation.

Let us see if we do end up in that state in any of the scenarios we want to test; if we do, we can improve the checks to be aware of that possibility.
41 seconds is far too long. I suspect MySQL configuration. Are replicas configured with set global slave_net_timeout = 4 (see documentation)? This sets a short (2sec) heartbeat interval between a replica and its master and makes the replica recognize a failure quickly. Without this setting, some scenarios may take up to a minute to detect.
Also consider CHANGE MASTER TO MASTER_CONNECT_RETRY=1, MASTER_RETRY_COUNT=86400. In the event of replication failure, this makes the replica attempt reconnection every 1sec (the default is 60sec). With brief network issues this setting attempts a quick replication recovery and, if successful, avoids a general failure/recovery operation by orchestrator.
I have changed the config file used by tests (default-fast.cnf) to set the short heartbeat interval.
I see that MASTER_RETRY_COUNT defaults to 86400 anyway. Regarding MASTER_CONNECT_RETRY: we are bringing up clusters with no master (and no replication set up) and allowing orchestrator to elect the master. I think this means we should make a change here to add MASTER_CONNECT_RETRY=1:
https://github.com/vitessio/vitess/blob/master/go/vt/orchestrator/inst/instance_topology_dao.go#L677
On second thought, the changes to MASTER_RETRY_COUNT can go in a separate PR (since we will have more PRs with tests in them). I have added a task to the issue to keep track.
Partial fix for #6769
Signed-off-by: deepthi <deepthi@planetscale.com>