Commit transaction for each configuration change #17119

Closed
wants to merge 1 commit into from

Conversation

ahrtr
Member

@ahrtr ahrtr commented Dec 14, 2023

Currently, when etcd removes a member, it just removes it from the bbolt db but doesn't update the cache. On the other hand, etcd commits each bbolt transaction periodically. When etcd processes the next conf change request, it might not have finished committing the previous member-remove request into bbolt. Accordingly, it breaks linearizability.
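Roughly, the change forces a commit of the pending backend transaction when a member is deleted from the backend, instead of waiting for the periodic batch commit. A hedged sketch of the shape of the change (the receiver, helper, and bucket names below are assumptions, not the literal diff):

	// Sketch only: membershipBackend, BackendMemberKey, Members and
	// MembersRemoved are assumed names, not copied from the actual diff.
	func (s *membershipBackend) MustDeleteMemberFromBackend(id types.ID) {
		mkey := BackendMemberKey(id)

		tx := s.be.BatchTx()
		tx.LockInsideApply()
		tx.UnsafeDelete(Members, mkey)
		tx.UnsafePut(MembersRemoved, mkey, []byte("removed"))
		tx.Unlock()

		// Force-commit the pending bbolt transaction so that the next conf
		// change request reads the post-delete membership state from bbolt
		// rather than a stale, not-yet-committed view.
		s.be.ForceCommit()
	}

The key point is the trailing ForceCommit: without it, the member deletion may still be sitting in the uncommitted batch transaction when the next conf change is validated.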

Please read https://github.com/etcd-io/etcd/blob/main/CONTRIBUTING.md#contribution-flow.

@ahrtr
Member Author

ahrtr commented Dec 14, 2023

Confirmed that this PR fixes both #17100 and #17052.

@serathius Please feel free to resurrect #17017 after this PR is merged. From what we understand so far, #17017 does not introduce a regression; it just uncovered an existing issue.

@ahrtr
Member Author

ahrtr commented Dec 14, 2023

The workflow pull-etcd-e2e-amd64 actually passed, but it is displayed as failed due to "Pod got deleted unexpectedly". Could this be a prow issue? @jmhbnz

@serathius serathius added the priority/important-soon label Dec 14, 2023
@serathius
Member

I'm a little confused about the naming and the proposed solution. The call to s.KV().Commit() when applying a membership change seems weirdly asymmetrical compared to a normal change. Please elaborate on why this needs to be flushed to disk.

Currently, when etcd removes a member, it just removes it from the bbolt db but doesn't update the cache. On the other hand, etcd commits each bbolt transaction periodically. When etcd processes the next conf change request, it might not have finished committing the previous member-remove request into bbolt. Accordingly, it breaks linearizability.

As for naming, what do you mean by "cache" in "when etcd removes a member, it just removes it from the bbolt db but doesn't update the cache"? Do you mean the RaftCluster fields? In

// RemoveMember removes a member from the store.
// The given id MUST exist, or the function panics.
func (c *RaftCluster) RemoveMember(id types.ID, shouldApplyV3 ShouldApplyV3) {
	c.Lock()
	defer c.Unlock()
	if c.v2store != nil {
		mustDeleteMemberFromStore(c.lg, c.v2store, id)
	}
	if c.be != nil && shouldApplyV3 {
		c.be.MustDeleteMemberFromBackend(id)
		m, ok := c.members[id]
		delete(c.members, id)
		c.removed[id] = true
		c.updateMembershipMetric(id, false)
		if ok {
			c.lg.Info(
				"removed member",
				zap.String("cluster-id", c.cid.String()),
				zap.String("local-member-id", c.localID.String()),
				zap.String("removed-remote-peer-id", id.String()),
				zap.Strings("removed-remote-peer-urls", m.PeerURLs),
				zap.Bool("removed-remote-peer-is-learner", m.IsLearner),
			)
		} else {
			c.lg.Warn(
				"skipped removing already removed member",
				zap.String("cluster-id", c.cid.String()),
				zap.String("local-member-id", c.localID.String()),
				zap.String("removed-remote-peer-id", id.String()),
			)
		}
	} else {
		c.lg.Info(
			"ignore already removed member",
			zap.String("cluster-id", c.cid.String()),
			zap.String("local-member-id", c.localID.String()),
			zap.String("removed-remote-peer-id", id.String()),
		)
	}
}
things seem ok.

When etcd processes the next conf change request, it might not have finished committing the previous member-remove request into bbolt. Accordingly, it breaks linearizability.

Shouldn't it be ok as long as the consistent index in bbolt is consistent with the conf change? I was recently surprised by the differences in the apply code for updating the consistent index between apply normal

var ar *apply.Result
if shouldApplyV3 {
	defer func() {
		// The txPostLockInsideApplyHook will not get called in some cases,
		// in which we should move the consistent index forward directly.
		newIndex := s.consistIndex.ConsistentIndex()
		if newIndex < e.Index {
			s.consistIndex.SetConsistentIndex(e.Index, e.Term)
		}
	}()
}
and apply config change
lg := s.Logger()
if err := s.cluster.ValidateConfigurationChange(cc); err != nil {
	lg.Error("Validation on configuration change failed", zap.Bool("shouldApplyV3", bool(shouldApplyV3)), zap.Error(err))
	cc.NodeID = raft.None
	s.r.ApplyConfChange(cc)
	// The txPostLock callback will not get called in this case,
	// so we should set the consistent index directly.
	if s.consistIndex != nil && membership.ApplyBoth == shouldApplyV3 {
		applyingIndex, applyingTerm := s.consistIndex.ConsistentApplyingIndex()
		s.consistIndex.SetConsistentIndex(applyingIndex, applyingTerm)
	}
	return false, err
}

I will need some time to dig into and understand your finding. Thanks for your understanding.

@serathius
Member

@serathius Please feel free to resurrect #17017 after this PR is merged. From what we understand so far, #17017 does not introduce a regression; it just uncovered an existing issue.

Awesome finding, thanks for looking into this! After we address this issue we should have a follow-up discussion about our testing gaps, because here we just got lucky that I had added this test.

@jmhbnz
Member

jmhbnz commented Dec 15, 2023

The workflow pull-etcd-e2e-amd64 actually passed, but it is displayed as failed due to "Pod got deleted unexpectedly". Could this be a prow issue? @jmhbnz

I'm not certain why the pod was deleted. It may have been killed due to resource pressure or preemption on the Kubernetes test-infra cluster.

Below are the final logs for the prowjob:

 {"Time":"2023-12-14T15:42:36.887712595Z","Action":"output","Package":"go.etcd.io/etcd/tests/v3/common","Test":"TestUserAdd_Simple/PeerAutoTLS/no_password_without_noPassword_set","Output":"/home/prow/go/src/github.com/etcd-{"component":"entrypoint","file":"k8s.io/test-infra/prow/entrypoint/run.go:173","func":"k8s.io/test-infra/prow/entrypoint.Options.ExecuteProcess","level":"error","msg":"Entrypoint received interrupt: terminated","severity":"error","time":"2023-12-14T15:42:38Z"}
++ early_exit_handler
++ '[' -n 17 ']'
++ kill -TERM 17
++ cleanup_dind
++ [[ false == \t\r\u\e ]]
+ EXIT_VALUE=143
+ set +o xtrace
make: *** [Makefile:53: test-e2e-release] Terminated 

For now I will retry it. If we see this error again I will raise an issue in k/test-infra.

/retest

Review comment on server/etcdserver/server.go (outdated, resolved)
@ahrtr
Member Author

ahrtr commented Dec 15, 2023

Just as I mentioned in #17100 (comment), there are two issues. Let me provide more context & details.

/members requests do not go through raft

When the newly added member starts, it gets the member list from one of its peers, but the /members request doesn't go through raft, so that peer might not have finished applying the previous conf change. As a result the new member may run into the "member count is unequal" issue. It's already tracked in #16687.

Note this isn't a critical issue, not even a major one:

  • Conf changes (i.e. adding/removing members) usually aren't frequent operations in a production environment;
  • The probability of hitting this issue in production is low;
  • In the worst case, when it happens, simply restarting the new member resolves it;
  • From an operator's perspective, it's also easy to avoid, i.e. wait for the conf change to be synced to all existing members before starting the newly added member.

In our test case, I added a 100ms sleep before starting the newly added member to bypass this issue.
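For illustration, a hypothetical sketch of that workaround in test code (addMember and startNewMember are placeholder helpers, not the actual e2e framework API):

	// Placeholder helpers; the real test uses the e2e test framework instead.
	m, err := addMember(ctx, clus, newPeerURL)
	require.NoError(t, err)

	// The /members response served to the new member does not go through raft,
	// so give the existing members a moment to apply the conf change first.
	time.Sleep(100 * time.Millisecond)

	require.NoError(t, startNewMember(ctx, clus, m))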

Issue "Peer URLs already exists"

I just copied my comment inside the method MustDeleteMemberFromBackend to explain the root cause:

// We need to forcibly commit the transaction, otherwise etcd might
// run into a situation where it hasn't finished committing the data
// into the backend storage (note: etcd periodically commits the bbolt
// transactions instead of committing on each request) when it receives
// the next confChange request. Accordingly, etcd may still read stale
// data from bbolt when processing the next confChange request. So it
// breaks linearizability.
//
// Note we don't need to forcibly commit the transaction for other
// kinds of requests (e.g. normal key/value operations, member add
// or update requests), because there is a buffer on top of bbolt.
// Each time etcd reads data from the backend storage, it reads
// data from both bbolt and the buffer. But there is no such buffer
// for member delete requests.
s.be.ForceCommit()

More context from the code perspective: the buffers mentioned above are

  • the write buffer: buf txWriteBuffer
  • the read buffer: buf txReadBuffer
  • data is synced from the write buffer to the read buffer at the end of the etcd TXN (note: it isn't a bbolt TXN, FYI https://drive.google.com/file/d/1GAVg206rqrL4at14Dpr837SrtFCrJ-hJ/view?usp=drive_link): t.buf.writeback(&t.backend.readTx.buf)
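To make the buffer interplay concrete, below is a toy, self-contained model (this is not etcd's actual backend code; all names are invented for illustration) of why a pending delete stays invisible to readers until the bbolt transaction commits, while puts do not have this problem:

	package main

	import "fmt"

	// backend is a toy stand-in for etcd's storage backend: committed data in
	// bbolt, a read buffer for pending puts, and pending deletes that are only
	// tracked inside the open bbolt transaction.
	type backend struct {
		committed  map[string]string // data already committed to bbolt
		buffered   map[string]string // puts written back into the read buffer
		pendingDel map[string]bool   // deletes staged only in the pending bbolt tx
	}

	// Put is mirrored into the read buffer, so it is immediately visible to reads.
	func (b *backend) Put(k, v string) { b.buffered[k] = v }

	// Delete is only staged in the pending bbolt transaction; nothing in the
	// read buffer records it.
	func (b *backend) Delete(k string) { b.pendingDel[k] = true }

	// Read models a range read: it consults the read buffer and the committed
	// bbolt data, but knows nothing about pending deletes.
	func (b *backend) Read(k string) (string, bool) {
		if v, ok := b.buffered[k]; ok {
			return v, true
		}
		v, ok := b.committed[k]
		return v, ok
	}

	// Commit models ForceCommit: flush pending puts and deletes into bbolt.
	func (b *backend) Commit() {
		for k, v := range b.buffered {
			b.committed[k] = v
		}
		for k := range b.pendingDel {
			delete(b.committed, k)
		}
		b.buffered = map[string]string{}
		b.pendingDel = map[string]bool{}
	}

	func main() {
		b := &backend{
			committed:  map[string]string{"members/8e9e05c52164694d": "http://peer:2380"},
			buffered:   map[string]string{},
			pendingDel: map[string]bool{},
		}

		b.Delete("members/8e9e05c52164694d")
		_, ok := b.Read("members/8e9e05c52164694d")
		fmt.Println("member visible after delete, before commit:", ok) // true -> stale read

		b.Commit()
		_, ok = b.Read("members/8e9e05c52164694d")
		fmt.Println("member visible after commit:", ok) // false
	}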

About consistent index

Note that the issues mentioned above have nothing to do with consistent-index. If anyone has any concern or suggestion about consistent-index, let's discuss it in a separate thread.

Currently, when etcd removes a member, it just removes it from the
bbolt db but doesn't update the cache. On the other hand, etcd
commits each bbolt transaction periodically. When etcd processes the
next conf change request, it might not have committed the previous
member-remove request into bbolt. Accordingly, it breaks linearizability.

Signed-off-by: Benjamin Wang <benjamin.ahrtr@gmail.com>
@serathius
Member

Thanks for the explanation; for me this is scarier than just "Peer URLs already exists". For me this is etcd's transaction buffers breaking the "read committed" guarantee when deleting keys. The impact is limited because etcd doesn't do a lot of key deletions, as the KV layer adds a tombstone instead of deleting. However, the problem is that it breaks basic assumptions, which can pop up in other places.

I have added a simple test to reproduce it: #17124.
I think we should fix the underlying issue (read committed broken on delete) and not the side effect "Peer URLs already exists".
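For anyone who wants to see the stale read directly at the backend level, here is a hedged sketch of such a reproduction test (it is not the actual #17124 test; the import paths, the betesting helpers, and the schema.Test bucket are assumptions based on the current repository layout, and the final assertion is expected to fail on code without the fix, which is the point):

	package backend_test

	import (
		"testing"
		"time"

		betesting "go.etcd.io/etcd/server/v3/storage/backend/testing"
		"go.etcd.io/etcd/server/v3/storage/schema"
	)

	func TestDeleteNotVisibleBeforeCommit(t *testing.T) {
		// Use a very long batch interval so nothing is committed automatically.
		b, _ := betesting.NewTmpBackend(t, time.Hour, 10000)
		defer betesting.Close(t, b)

		tx := b.BatchTx()
		tx.Lock()
		tx.UnsafeCreateBucket(schema.Test)
		tx.UnsafePut(schema.Test, []byte("foo"), []byte("bar"))
		tx.Unlock()
		b.ForceCommit() // "foo" is now committed to bbolt

		tx.Lock()
		tx.UnsafeDelete(schema.Test, []byte("foo"))
		tx.Unlock()

		// Without a ForceCommit here, the read transaction still returns "foo":
		// the read buffer only learns about puts (via writeback), never deletes,
		// and the committed bbolt data still contains the key.
		b.ReadTx().RLock()
		keys, _ := b.ReadTx().UnsafeRange(schema.Test, []byte("foo"), nil, 0)
		b.ReadTx().RUnlock()
		if len(keys) != 0 {
			t.Fatalf("deleted key is still readable before the batch tx commits")
		}
	}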

@ahrtr
Member Author

ahrtr commented Dec 15, 2023

Please feel free to let me know if you can reproduce any other issue on top of this PR.

@ahrtr
Member Author

ahrtr commented Dec 15, 2023

It turns out to be a more generic issue: #17124 (comment)

@ahrtr
Member Author

ahrtr commented Dec 15, 2023

Extracted the change for the test case TestMemberReplace into a separate PR #17125, so that it can be reviewed & merged separately.

@fuweid fuweid left a comment
Member

LGTM

//
// Please also refer to
// https://github.com/etcd-io/etcd/pull/17119#issuecomment-1857547158
s.be.ForceCommit()
Member

Should we be applying the same fix to MustSaveMemberToBackend?

Member Author

No, actually the comment right above s.be.ForceCommit() already explains the reason. Please feel free to ping me if you need further clarification.

Also note this PR only resolves the issue for conf changes. We need to think about a more generic solution for all similar changes to buckets other than key, e.g. auth, lease, etc.

@ahrtr ahrtr marked this pull request as draft December 18, 2023 09:45
@jmhbnz jmhbnz left a comment
Member

LGTM - Good find @ahrtr and thanks for the detailed explanation.

@ahrtr
Member Author

ahrtr commented Jan 11, 2024

Resolved in #17195

@ahrtr ahrtr closed this Jan 11, 2024
Labels
backport/v3.4 backport/v3.5 do-not-merge/work-in-progress priority/important-soon