Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

raft node notifies configure when confChanged #15708

Merged
merged 2 commits into from
Jun 28, 2023

Conversation

chaochn47
Copy link
Member

@chaochn47 chaochn47 force-pushed the confchange_raft_node_notifies_apply branch from 20a9dba to eba61de Compare April 13, 2023 07:12
@ahrtr
Copy link
Member

ahrtr commented Apr 13, 2023

Can #15528 be easily reproduced? If not, then we need to add an integration test; we can intentionally inject a sleep failpoint before r.Advanced()

@chaochn47
Copy link
Member Author

Can #15528 be easily reproduced? If not, then we need to add an integration test; we can intentionally inject a sleep failpoint before r.Advanced()

Not easily. Inject a sleep should do the trick.

@fuweid
Copy link
Member

fuweid commented Apr 13, 2023

It's hard to reproduce in few requests. It needs to send request one by one without pause.
FYI: You can use the following test code to reproduce it.

package verify

import (
        "context"
        "strings"
        "testing"
        "time"

        integration2 "go.etcd.io/etcd/tests/v3/framework/integration"
        "go.etcd.io/etcd/tests/v3/integration"
)

var lazyCluster = integration.NewLazyClusterWithConfig(
        integration2.ClusterConfig{
                Size:                        3,
                WatchProgressNotifyInterval: 200 * time.Millisecond,
                DisableStrictReconfigCheck:  true})

func TestVerify(t *testing.T) {
        defer lazyCluster.Terminate()

        clus := lazyCluster.Cluster()
        leaderIdx := clus.WaitLeader(t)
        members := clus.Members

        followerIdx := (leaderIdx + 1) % 3
        cli := clus.Client(leaderIdx)

        // promoting any of the voting members in cluster should fail
        expectedErrKeywords := "can only promote a learner member"
        var err error
        for i := 0; i < 10000; i++ {
                ctx, cancel := context.WithTimeout(context.TODO(), 1*time.Minute)
                _, err = cli.MemberPromote(ctx, uint64(members[followerIdx].ID()))
                cancel()
                if err == nil {
                        t.Fatalf("expect promoting voting member to fail, got no error")
                }

                if strings.Contains(err.Error(), expectedErrKeywords) {
                        continue
                }

                t.Fatalf("unexpected error: %v", err)
        }
}

@serathius
Copy link
Member

serathius commented Apr 13, 2023

This PR not only changes MemberPromote, but introduces a big change for all membership changes. We should double check that all configuration changes are properly tested. If not we should add a large set of e2e tests that:

  • Grow cluster from 1 to 3 members (without learners)
  • Grow cluster from 1 to 3 members (with learners) in both 1 at the time and two at one time scenario
  • Reduce cluster from 3 to 1 member
    etc

I would not want to rush this change, so I would recommend #15528 independently.

@chaochn47
Copy link
Member Author

Can #15528 be easily reproduced? If not, then we need to add an integration test; we can intentionally inject a sleep failpoint before r.Advanced()

Hi @ahrtr could you please help clarify how would integration test inject a sleep failpoint? I know with etcd binary built with gofail enabled it is achievable.

func (f *BinaryFailpoints) Setup(ctx context.Context, failpoint, payload string) error {
host := fmt.Sprintf("127.0.0.1:%d", f.member.Config().GoFailPort)
failpointUrl := url.URL{
Scheme: "http",
Host: host,
Path: failpoint,
}
r, err := http.NewRequestWithContext(ctx, "PUT", failpointUrl.String(), bytes.NewBuffer([]byte(payload)))
if err != nil {
return err
}
resp, err := httpClient.Do(r)
if err != nil {
return err
}
defer resp.Body.Close()
if resp.StatusCode != http.StatusNoContent {
return fmt.Errorf("bad status code: %d", resp.StatusCode)
}
return nil
}
Thanks to @serathius as part of the robustness test work.

@ahrtr
Copy link
Member

ahrtr commented May 5, 2023

Hi @ahrtr could you please help clarify how would integration test inject a sleep failpoint? I know with etcd binary built with gofail enabled it is achievable.

Please refer to the followings,

  1. the section "Test" in https://docs.google.com/document/d/1U9hAcZQp3Y36q_JFiw2VBJXVAo2dK2a-8Rsbqv3GgDo/edit#heading=h.wwktogcu1ekk.
  2. https://github.com/etcd-io/bbolt/blob/97d2cc34ac943a03ca596d15e0629b163468f59f/Makefile#L77

@chaochn47
Copy link
Member Author

chaochn47 commented May 9, 2023

This PR not only changes MemberPromote, but introduces a big change for all membership changes. We should double check that all configuration changes are properly tested. If not we should add a large set of e2e tests that:

* Grow cluster from 1  to 3 members (without learners)

* Grow cluster from 1 to 3 members (with learners) in both 1 at the time and two at one time scenario

* Reduce cluster from 3 to 1 member
  etc

I would not want to rush this change, so I would recommend #15528 independently.

Hi @serathius I audited all the membership reconfiguration tests.

  1. e2e/ctl_v3_member_test.go
  2. tests/common/member_test.go
  3. tests/integration/clientv3/cluster_test.go
  4. tests/integration/clientv3/examples/example_cluster_test.go

Number 1 & 2 tests does not expose issue mentioned in #15528 because it applies configuration change only once. common tests cover all the e2e test cases so there are some duplicates.

Test case TestMemberPromoteMemberNotLearner in Number 3 test covers the scenario that back to back member promote will timeout while it does not cover MemberAdd, MemberUpdate, MemberRemove.

Number 4 example test cases share a lazy cluster that could have back to back membership reconfigurations. So that's why #12983 and #14040 is flaky.

So My proposal is to enhance tests/integration/clientv3/cluster_test.go with failpoint injected with sleep 100ms on

r.Advance()
and exercise MemberAdd, MemberUpdate, MemberRemove, MemberPromote repeatedly, what do you think? @ahrtr @serathius

… to client

Signed-off-by: Benjamin Wang <wachao@vmware.com>
@chaochn47 chaochn47 force-pushed the confchange_raft_node_notifies_apply branch 2 times, most recently from 1f4c4b8 to 2d4c8ec Compare June 26, 2023 23:11
1. rename confChangeCh to raftAdvancedC
2. rename waitApply to confChanged
3. add comments and test assertion

Signed-off-by: Chao Chen <chaochn@amazon.com>
@chaochn47 chaochn47 force-pushed the confchange_raft_node_notifies_apply branch from 2d4c8ec to 6cdc9ae Compare June 27, 2023 05:56
@chaochn47 chaochn47 changed the title raft node notifies apply when confChanged raft node notifies configure when confChanged Jun 27, 2023
@chaochn47
Copy link
Member Author

Ping @serathius @ahrtr @fuweid @tjungblu @jmhbnz for review ~

Copy link
Member

@jmhbnz jmhbnz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A comment to respond to review ping. I've read through this pr a few times and at a code level it looks good, however I'm abstaining from hitting approve as I don't understand overall system implications well enough.

Copy link
Contributor

@tjungblu tjungblu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm (non-binding)

Thanks for finally fixing this!

Copy link
Member

@fuweid fuweid left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Thanks!

@ahrtr ahrtr merged commit 22f9dac into etcd-io:main Jun 28, 2023
@chaochn47 chaochn47 deleted the confchange_raft_node_notifies_apply branch June 28, 2023 16:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

Flaky Test TestMemberPromoteMemberNotLearner
6 participants