Commit transaction for each configuration change #17119

ahrtr · 2023-12-14T14:41:05Z

Current when etcd remove a member, it just removes it from bbolt db, but doesn't update the cache. On the other hands, etcd periodically commit each bbolt transaction. When etcd processes the next conf change request, it might not have finished committing the previous member remove request into bbolt. Accordingly it breaks the linerizability.

Please read https://github.com/etcd-io/etcd/blob/main/CONTRIBUTING.md#contribution-flow.

ahrtr · 2023-12-14T15:21:48Z

Confirmed that this PR can fix #17100 and also #17052

@serathius Please feel free to resurrect #17017 after this PR is merged. From what we understood so far, #17017 does not introduce regression. Instead, it just discovered an existing issue.

ahrtr · 2023-12-14T15:45:07Z

The workflow pull-etcd-e2e-amd64 actually passed, but it is displayed as failed due to "Pod got deleted unexpectedly". It might be an issue of prow? @jmhbnz

serathius · 2023-12-14T18:11:41Z

Little confused about the naming and the proposed solution. Call to s.KV().Commit() when applying membership change seems weirdly asymmetrical to normal change. Please elaborate why this needs to be flushed to disk.

Current when etcd remove a member, it just removes it from bbolt db, but doesn't update the cache. On the other hands, etcd periodically commit each bbolt transaction. When etcd processes the next conf change request, it might not have finished committing the previous member remove request into bbolt. Accordingly it breaks the linerizability.

As for naming, what do you mean by cache in when etcd remove a member, it just removes it from bbolt db, but doesn't update the cache? Do you mean RaftCluster fields? In

etcd/server/etcdserver/api/membership/cluster.go

Lines 421 to 462 in 2cf112f

    
           // RemoveMember removes a member from the store. 
        
           // The given id MUST exist, or the function panics. 
        
           func (c *RaftCluster) RemoveMember(id types.ID, shouldApplyV3 ShouldApplyV3) { 
        
           	c.Lock() 
        
           	defer c.Unlock() 
        
           	if c.v2store != nil { 
        
           		mustDeleteMemberFromStore(c.lg, c.v2store, id) 
        
           	} 
        
           	if c.be != nil && shouldApplyV3 { 
        
           		c.be.MustDeleteMemberFromBackend(id) 
        
           		m, ok := c.members[id] 
        
           		delete(c.members, id) 
        
           		c.removed[id] = true 
        
           		c.updateMembershipMetric(id, false) 
        
           		if ok { 
        
           			c.lg.Info( 
        
           				"removed member", 
        
           				zap.String("cluster-id", c.cid.String()), 
        
           				zap.String("local-member-id", c.localID.String()), 
        
           				zap.String("removed-remote-peer-id", id.String()), 
        
           				zap.Strings("removed-remote-peer-urls", m.PeerURLs), 
        
           				zap.Bool("removed-remote-peer-is-learner", m.IsLearner), 
        
           			) 
        
           		} else { 
        
           			c.lg.Warn( 
        
           				"skipped removing already removed member", 
        
           				zap.String("cluster-id", c.cid.String()), 
        
           				zap.String("local-member-id", c.localID.String()), 
        
           				zap.String("removed-remote-peer-id", id.String()), 
        
           			) 
        
           		} 
        
           	} else { 
        
           		c.lg.Info( 
        
           			"ignore already removed member", 
        
           			zap.String("cluster-id", c.cid.String()), 
        
           			zap.String("local-member-id", c.localID.String()), 
        
           			zap.String("removed-remote-peer-id", id.String()), 
        
           		) 
        
           	} 
        
           }

things seem ok.

When etcd processes the next conf change request, it might not have finished committing the previous member remove request into bbolt. Accordingly it breaks the linerizability.

Shouldn't it be ok as long consistent index in bbolt is consistent with the conf change? When I recently surprised by the differences of the apply code for updating consistent index between apply normal

etcd/server/etcdserver/server.go

Lines 1831 to 1841 in 2cf112f

    
           var ar *apply.Result 
        
           if shouldApplyV3 { 
        
           	defer func() { 
        
           		// The txPostLockInsideApplyHook will not get called in some cases, 
        
           		// in which we should move the consistent index forward directly. 
        
           		newIndex := s.consistIndex.ConsistentIndex() 
        
           		if newIndex < e.Index { 
        
           			s.consistIndex.SetConsistentIndex(e.Index, e.Term) 
        
           		} 
        
           	}() 
        
           }

and apply config change

etcd/server/etcdserver/server.go

Lines 1980 to 1993 in 2cf112f

    
           lg := s.Logger() 
        
           if err := s.cluster.ValidateConfigurationChange(cc); err != nil { 
        
           	lg.Error("Validation on configuration change failed", zap.Bool("shouldApplyV3", bool(shouldApplyV3)), zap.Error(err)) 
        
           	cc.NodeID = raft.None 
        
           	s.r.ApplyConfChange(cc) 
        
           	// The txPostLock callback will not get called in this case, 
        
           	// so we should set the consistent index directly. 
        
           	if s.consistIndex != nil && membership.ApplyBoth == shouldApplyV3 { 
        
           		applyingIndex, applyingTerm := s.consistIndex.ConsistentApplyingIndex() 
        
           		s.consistIndex.SetConsistentIndex(applyingIndex, applyingTerm) 
        
           	} 
        
           	return false, err 
        
           }

I will need some time to deep dive and understand your finding. It might take me some time. Thanks for your understanding.

serathius · 2023-12-14T18:13:25Z

@serathius Please feel free to resurrect #17017 after this PR is merged. From what we understood so far, #17017 does not introduce regression. Instead, it just discovered an existing issue.

Thanks, awesome finding! Thanks for looking into this. After we address this issue we should have a followup discussion on our testing gaps, because here we just got lucky that I added this test.

jmhbnz · 2023-12-15T02:46:48Z

The workflow pull-etcd-e2e-amd64 actually passed, but it is displayed as failed due to "Pod got deleted unexpectedly". It might be an issue of prow? @jmhbnz

I'm not certain why the pod was deleted. It may have been killed for a resource or preemption reason on the kubernetes test infra cluster.

Below are the final logs for the prowjob:

 {"Time":"2023-12-14T15:42:36.887712595Z","Action":"output","Package":"go.etcd.io/etcd/tests/v3/common","Test":"TestUserAdd_Simple/PeerAutoTLS/no_password_without_noPassword_set","Output":"/home/prow/go/src/github.com/etcd-{"component":"entrypoint","file":"k8s.io/test-infra/prow/entrypoint/run.go:173","func":"k8s.io/test-infra/prow/entrypoint.Options.ExecuteProcess","level":"error","msg":"Entrypoint received interrupt: terminated","severity":"error","time":"2023-12-14T15:42:38Z"}
++ early_exit_handler
++ '[' -n 17 ']'
++ kill -TERM 17
++ cleanup_dind
++ [[ false == \t\r\u\e ]]
+ EXIT_VALUE=143
+ set +o xtrace
make: *** [Makefile:53: test-e2e-release] Terminated

For now I will retry it. If we see this error again I will raise an issue in k/test-infra.

/retest

server/etcdserver/server.go

server/storage/schema/membership.go

ahrtr · 2023-12-15T09:22:43Z

Just as I mentioned in #17100 (comment), there are two issues. Let me provide more context & details.

`/members` requests do not go through raft

When starting the newly added member, it gets members from one of its peer, but the /members doesn't go through raft, so the peer might haven't finish applying the previous conf change. So the new member may run into issue "member count is unequal". It's already tracked #16687.

Note this isn't a critical issue, not even a major issue,

Usually Conf changes (i.e. add/remove members) aren't frequent operations in production environment;
It is low possibility to see this issue in production.
In worst case, when it happens, just restarting the new member can resolve it.
From operators' perspectively, it also easy to avoid this issue, i.e. wait for the conf change be synced to all existing members before starting a new added member.

In our test case, I added a 100ms sleep before starting the new added member to bypass this issue.

Issue "Peer URLs already exists"

Just copied my comment inside the method MustDeleteMemberFromBackend to explain the root cause,

// We need to forcibly commit the transaction, otherwise etcd might
// run into a situation that it haven't finished committing the data
// into backend storage (note: etcd periodically commits the bbolt
// transactions instead of on each request) when it receives next
// confChange request. Accordingly, etcd may still reads the stale
// data from bbolt when processing next confChange request. So it
// breaks linearizability.
//
// Note we don't need to forcibly commit the transaction for other
// kinds of request (e.g. normal key/value operations, member add
// or update requests), because there is a buffer on top of the bbolt.
// Each time when etcd reads data from backend storage, it will read
// data from both bbolt and the buffer. But there is no such a buffer
// for member delete requests.
s.be.ForceCommit()

More context from code perspective, the buffer mentioned above are

write buffer

etcd/server/storage/backend/batch_tx.go

Line 291 in 2cf112f

buf txWriteBuffer

read buffer

etcd/server/storage/backend/read_tx.go

Line 43 in 2cf112f

buf txReadBuffer

syncs data from write buffer to read buffer at the end of the etcd TXN (Note: it isn't bbolt TXN, FYI https://drive.google.com/file/d/1GAVg206rqrL4at14Dpr837SrtFCrJ-hJ/view?usp=drive_link)

etcd/server/storage/backend/batch_tx.go

Line 310 in 2cf112f

t.buf.writeback(&t.backend.readTx.buf)

about Consistent index

Note that the issues mentioned above have nothing to do with consistent-index. If anyone has any concern or suggestion on consistent-index, Please let's discuss it in a separate session.

Current when etcd remove a member, it just removes it from bbolt db, but doesn't update the cache. On the other hands, etcd periodically commit each bbolt transaction. When etcd processes the next conf change request, it might not have commit the previous member remove request into bbolt. Accordingly it breaks the linerizability. Signed-off-by: Benjamin Wang <benjamin.ahrtr@gmail.com>

serathius · 2023-12-15T10:21:18Z

Thanks for explanation, for me this is more scary than just "Peer URLs already exists". For me this is etcd transaction buffers breaking "readCommited" guarantee when deleting keys. Impact is limited because etcd doesn't do a lot of key deletion, as KV adds a tombstone instead of deletion. However the problem is that it breaks basic assumptions that can pop out in other places.

I have added a simple test to reproduce it #17124
I think we should fix the underlying issue (readCommited broken on delete) and not a side effect "Peer URLs already exists".

ahrtr · 2023-12-15T10:28:56Z

Please feel free to let me know if you can reproduce whatever other issue on top of this PR.

ahrtr · 2023-12-15T11:06:21Z

It turns out to be more generic issue #17124 (comment)

ahrtr · 2023-12-15T12:39:44Z

Extracted the change for the test case TestMemberReplace into a separate PR #17125, so that it can be reviewed & merged separately.

fuweid

LGTM

jmhbnz · 2023-12-17T18:13:11Z

server/storage/schema/membership.go

+	//
+	// Please also refer to
+	// https://github.com/etcd-io/etcd/pull/17119#issuecomment-1857547158
+	s.be.ForceCommit()


Should we be applying the same fix to MustSaveMemberToBackend?

No, actually the comment right above s.be.ForceCommit() already explains the reason. Please feel free to ping me if you need further clarification.

Also note this PR only resolves the issue for conf change. We need to think about a more generic solution for all similar changes on bucket other than key, i.e auth, lease, etc.

jmhbnz

LGTM - Good find @ahrtr and thanks for the detailed explanation.

ahrtr · 2024-01-11T08:23:35Z

Resolved in #17195

ahrtr force-pushed the commit_conf_20231214 branch 2 times, most recently from a1e06dd to d60a1b2 Compare December 14, 2023 14:51

ahrtr mentioned this pull request Dec 14, 2023

Reproduce issues caused by conf change validation #17114

Closed

ahrtr force-pushed the commit_conf_20231214 branch 2 times, most recently from a147643 to 247d081 Compare December 14, 2023 15:06

ahrtr mentioned this pull request Dec 14, 2023

Draft: Add member replace e2e test to repro issue 17052 #17100

Closed

ahrtr mentioned this pull request Dec 14, 2023

Add member replace e2e test #17079

Merged

ahrtr added backport/v3.4 backport/v3.5 labels Dec 14, 2023

serathius added the priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. label Dec 14, 2023

fuweid reviewed Dec 15, 2023

View reviewed changes

server/etcdserver/server.go Outdated Show resolved Hide resolved

ahrtr force-pushed the commit_conf_20231214 branch from 247d081 to d0d762a Compare December 15, 2023 08:56

serathius reviewed Dec 15, 2023

View reviewed changes

server/storage/schema/membership.go Show resolved Hide resolved

ahrtr force-pushed the commit_conf_20231214 branch from d0d762a to b6ded5c Compare December 15, 2023 09:29

ahrtr mentioned this pull request Dec 15, 2023

[3.5] member replace e2e test #17123

Merged

serathius mentioned this pull request Dec 15, 2023

Revise backend transaction testing #17126

Open

fuweid approved these changes Dec 17, 2023

View reviewed changes

jmhbnz reviewed Dec 17, 2023

View reviewed changes

ahrtr marked this pull request as draft December 18, 2023 09:45

k8s-ci-robot added the do-not-merge/work-in-progress label Dec 18, 2023

This was referenced Dec 21, 2023

Commit bbolt transaction if there is any pending deleting operations #17157

Closed

(*batchTx) UnsafeDelete may break the linearizability #17158

Closed

jmhbnz approved these changes Dec 26, 2023

View reviewed changes

serathius mentioned this pull request Jan 9, 2024

Fix delete inconsistencies in read buffer #17195

Merged

ahrtr closed this Jan 11, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Commit transaction for each configuration change #17119

Commit transaction for each configuration change #17119

ahrtr commented Dec 14, 2023

ahrtr commented Dec 14, 2023 •

edited

Loading

ahrtr commented Dec 14, 2023

serathius commented Dec 14, 2023

serathius commented Dec 14, 2023

jmhbnz commented Dec 15, 2023 •

edited

Loading

ahrtr commented Dec 15, 2023

serathius commented Dec 15, 2023

ahrtr commented Dec 15, 2023

ahrtr commented Dec 15, 2023

ahrtr commented Dec 15, 2023

fuweid left a comment

jmhbnz Dec 17, 2023

ahrtr Dec 18, 2023

jmhbnz left a comment

ahrtr commented Jan 11, 2024

Commit transaction for each configuration change #17119

Commit transaction for each configuration change #17119

Conversation

ahrtr commented Dec 14, 2023

ahrtr commented Dec 14, 2023 • edited Loading

ahrtr commented Dec 14, 2023

serathius commented Dec 14, 2023

serathius commented Dec 14, 2023

jmhbnz commented Dec 15, 2023 • edited Loading

ahrtr commented Dec 15, 2023

/members requests do not go through raft

Issue "Peer URLs already exists"

about Consistent index

serathius commented Dec 15, 2023

ahrtr commented Dec 15, 2023

ahrtr commented Dec 15, 2023

ahrtr commented Dec 15, 2023

fuweid left a comment

Choose a reason for hiding this comment

jmhbnz Dec 17, 2023

Choose a reason for hiding this comment

ahrtr Dec 18, 2023

Choose a reason for hiding this comment

jmhbnz left a comment

Choose a reason for hiding this comment

ahrtr commented Jan 11, 2024

ahrtr commented Dec 14, 2023 •

edited

Loading

jmhbnz commented Dec 15, 2023 •

edited

Loading

`/members` requests do not go through raft