decider: endtoend test infrastructure + tests #6770

Merged
merged 6 commits into vitessio:master from ds-orc-tests on Sep 25, 2020

Conversation

@deepthi (Member) commented Sep 22, 2020

Partial fix for #6769

Signed-off-by: deepthi <deepthi@planetscale.com>

Signed-off-by: deepthi <deepthi@planetscale.com>
@shlomi-noach (Contributor) left a comment:

Good work! Please see the inline comment for TestDownMaster(); it is a general observation on the possible approaches to running the tests, and why one may be more advantageous than the other.

for _, tablet := range shard0.Vttablets {
    // we know we have only two tablets, so the "other" one must be master
    if tablet.Alias != curMaster.Alias {
        checkMasterTablet(t, tablet)
    }
}
Contributor:
ah, so this is a busy loop that waits for the tablet to reach some status? I think I understand now. I'd rather:

  • avoid the busy loop: insert some reasonable time.Sleep(time.Second) in the loop, or else these tests will be difficult to run on one's computer and in CI
  • set some timeout for failure?
  • or, really change the approach. orchestrator's system tests work the other way around: we set up the topology, we wait 15 seconds (by way of example), and then we expect the action to have taken place.
    This approach makes sense for production, because in production you not only expect orchestrator to run a failover, you also expect it to accomplish the failover within a reasonable timeframe.
    Also, in this test you wait until some state is known and then immediately test that state. But there might be intermediate steps to the recovery that promote one server, then another. So testing "as soon as there's some state" may land you in an intermediate situation.
    IMHO waiting X seconds (as opposed to looping for some state) is the more correct approach, but I don't feel strongly about it; if you add Sleep+timeout instead, that can also make sense in some situations.

Member Author (@deepthi):

It's not a busy loop. In fact, it implements something very similar to what you are suggesting: sleep 1 second between checks and time out after 60 seconds.
I had initially tried a 15-second sleep right after starting orchestrator, then checking for the master tablet once. According to my testing, it takes ~21 seconds on my local machine to get a new master elected when I have "RecoveryPeriodBlockSeconds": 1, and ~41 seconds when I set that to 5.
Waiting for a fixed amount of time tends to be unreliable in CI, so it is actually better to poll for the desired condition.
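
For reference, a minimal sketch of the poll-with-timeout pattern described above. The helper name waitForCondition and the isMaster check are illustrative placeholders, not the exact helpers used in orc_test.go:

    package orchestrator

    import (
        "testing"
        "time"
    )

    // waitForCondition polls cond roughly once per second and fails the test
    // if cond does not become true within timeout: the "sleep 1 second between
    // checks, time out after 60 seconds" approach rather than a busy loop.
    func waitForCondition(t *testing.T, timeout time.Duration, msg string, cond func() bool) {
        t.Helper()
        deadline := time.Now().Add(timeout)
        for time.Now().Before(deadline) {
            if cond() {
                return
            }
            time.Sleep(time.Second) // back off between checks
        }
        t.Fatalf("timed out after %v waiting for %s", timeout, msg)
    }

A caller would then write something like waitForCondition(t, 60*time.Second, "new master elected", func() bool { return isMaster(tablet) }), where isMaster stands in for whatever tablet-state check the test actually performs.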

Member Author (@deepthi) commented Sep 23, 2020:

you make a good point here:

But there might be intermediate steps to the recovery that may promote one server, then another. So testing "as soon as there's some state" may land you in an intermediate situation.

Let's see whether we end up in that state in any of the scenarios we want to test; if we do, we can improve the checks to account for that possibility.

Contributor:

41 seconds is far too long. I suspect the MySQL configuration. Are the replicas configured with:

  • set global slave_net_timeout = 4, see the documentation. This sets a short (2sec) heartbeat interval between a replica and its master, and will make the replica recognize a failure quickly. Without this setting, some scenarios may take up to a minute to detect.
  • CHANGE MASTER TO MASTER_CONNECT_RETRY=1, MASTER_RETRY_COUNT=86400. In the event of a replication failure, this makes the replica attempt reconnection every 1sec (the default is 60sec). With brief network issues this setting attempts a quick replication recovery and, if successful, avoids a general failure/recovery operation by orchestrator.

See https://github.com/openark/orchestrator/blob/master/docs/configuration-failure-detection.md#mysql-configuration
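
For illustration only, a sketch of applying those two settings to a replica from Go, assuming a plain database/sql connection and the go-sql-driver/mysql driver (this is not code from this PR, and the function name and DSN are placeholders):

    package main

    import (
        "database/sql"
        "log"

        _ "github.com/go-sql-driver/mysql" // MySQL driver assumed for this sketch
    )

    // applyReplicaSettings applies the replication settings recommended above
    // to a single replica.
    func applyReplicaSettings(dsn string) error {
        db, err := sql.Open("mysql", dsn)
        if err != nil {
            return err
        }
        defer db.Close()

        // Short heartbeat (slave_net_timeout/2 = 2s) so the replica notices a
        // dead master quickly instead of taking up to a minute.
        if _, err := db.Exec("SET GLOBAL slave_net_timeout = 4"); err != nil {
            return err
        }
        // Retry the master connection every second instead of the 60s default.
        // Note: the replication threads must be stopped (STOP SLAVE) before
        // CHANGE MASTER TO will be accepted; that step is omitted here.
        _, err = db.Exec("CHANGE MASTER TO MASTER_CONNECT_RETRY = 1, MASTER_RETRY_COUNT = 86400")
        return err
    }

    func main() {
        if err := applyReplicaSettings("root@tcp(127.0.0.1:3306)/"); err != nil {
            log.Fatal(err)
        }
    }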

Member Author (@deepthi):

I have changed the config file used by tests (default-fast.cnf) to set the short heartbeat interval.
I see that MASTER_RETRY_COUNT defaults to 86400 anyway. Regarding MASTER_CONNECT_RETRY, we are bringing up clusters with no master (and no replication set up) and allowing orc to elect the master. I think this means we should make a change here to add MASTER_CONNECT_RETRY=1:
https://github.com/vitessio/vitess/blob/master/go/vt/orchestrator/inst/instance_topology_dao.go#L677

Member Author (@deepthi):

On second thought, the MASTER_CONNECT_RETRY change can go in a separate PR (since we will have more PRs with tests in them). I have added a task to the issue to keep track.

Signed-off-by: deepthi <deepthi@planetscale.com>
…in DownMaster case

Signed-off-by: deepthi <deepthi@planetscale.com>
Signed-off-by: deepthi <deepthi@planetscale.com>
Signed-off-by: deepthi <deepthi@planetscale.com>
@deepthi deepthi marked this pull request as ready for review September 24, 2020 22:50
@deepthi deepthi requested a review from morgo as a code owner September 24, 2020 22:50
Signed-off-by: deepthi <deepthi@planetscale.com>
@sougou sougou merged commit ac7040a into vitessio:master Sep 25, 2020
@shlomi-noach shlomi-noach deleted the ds-orc-tests branch September 29, 2020 06:40
@askdba askdba added this to the v8.0 milestone Oct 6, 2020
@shlomi-noach shlomi-noach mentioned this pull request May 31, 2021