
Revision inconsistency caused by panic during defrag #14685

Merged — 2 commits merged into etcd-io:main from linearizability-revision on Nov 14, 2022

Conversation

@serathius (Member) commented Nov 4, 2022

When working on adding revision support to robustness testing, CI found 2 linearizability issues in https://github.com/etcd-io/etcd/actions/runs/3418916665. Possibly the model is too strict, but I would like to discuss it openly to make sure I didn't miss anything. I'm working on reproducing them locally.

First case: in the single-node cluster test, etcd skipped 9 revisions even though there were no other writes. This breaks the assumption that the revision increases only on a successful write. [screenshot]
I would really love to see the db file, so I think the follow-up to this issue is to implement uploading the etcd data directory on a failed test.

Second case: the 3-node cluster is even stranger, etcd reported different data for the same revision. There was a long-running Get transaction that returned old data with a much newer revision.
[screenshot]

Conclusion: there seems to be an issue with the revision returned in the Get response header. cc @ahrtr

@serathius (Member Author)

cc @ahrtr @spzala @ptabor @chaochn47

@codecov-commenter commented Nov 6, 2022

Codecov Report

Merging #14685 (525a2ea) into main (f482f7b) will decrease coverage by 0.13%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##             main   #14685      +/-   ##
==========================================
- Coverage   75.56%   75.42%   -0.14%     
==========================================
  Files         457      457              
  Lines       37308    37308              
==========================================
- Hits        28191    28141      -50     
- Misses       7351     7382      +31     
- Partials     1766     1785      +19     
Flag Coverage Δ
all 75.42% <ø> (-0.14%) ⬇️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
client/pkg/v3/tlsutil/tlsutil.go 83.33% <0.00%> (-8.34%) ⬇️
server/lease/lease.go 94.87% <0.00%> (-5.13%) ⬇️
client/v3/concurrency/session.go 88.63% <0.00%> (-4.55%) ⬇️
server/etcdserver/txn/util.go 75.47% <0.00%> (-3.78%) ⬇️
server/etcdserver/api/v3rpc/util.go 70.96% <0.00%> (-3.23%) ⬇️
server/etcdserver/api/v3rpc/member.go 93.54% <0.00%> (-3.23%) ⬇️
server/etcdserver/cluster_util.go 70.35% <0.00%> (-3.17%) ⬇️
server/etcdserver/api/v3rpc/interceptor.go 73.43% <0.00%> (-3.13%) ⬇️
client/v3/concurrency/election.go 79.68% <0.00%> (-2.35%) ⬇️
server/etcdserver/api/v3rpc/auth.go 79.34% <0.00%> (-2.18%) ⬇️
... and 12 more


@serathius (Member Author)

ping @ahrtr

Comment on lines 119 to 122
if state.LastRevision < response.revision {
	state.Value = request.putData
	state.LastRevision = response.revision
	return true, state
}
Member

My previous comment https://github.com/etcd-io/etcd/pull/14685/files#r1014471776 might still be valid.

Since response.err is nil, we should save request.putData and response.revision anyway.

Suggested change

-if state.LastRevision < response.revision {
-	state.Value = request.putData
-	state.LastRevision = response.revision
-	return true, state
-}
+lastRev := state.LastRevision
+state.Value = request.putData
+state.LastRevision = response.revision
+if lastRev < response.revision {
+	return true, state
+}

Member Author

The commit suggestion you proposed doesn't change the logic at all, you just save the data before you validate that the state transition is correct.

We might need to update state.LastRevision as long as response.revision != 0

Please remember that revision is a positive number, so checking that the revision is increasing, as in state.LastRevision < response.revision, already ensures that response.revision != 0.

Member

As long as response.err is nil, we need to save request.putData and response.revision to state, while your code only saves them when state.LastRevision < response.revision (and, of course, response.err is nil).

Member Author

As long as response.err is nil, we need to save request.putData and response.revision to state

That's not true. The model should only allow operations that would be correct for etcd. The etcd revision should never decrease. That's why I put the state.LastRevision < response.revision check.
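For illustration, a minimal sketch of the constraint being discussed, written against assumed types (EtcdState, EtcdRequest, EtcdResponse and their fields are placeholders here, not the actual tests/linearizability/model.go definitions):

package model

// Illustrative types only; the real model is defined differently.
type EtcdState struct {
	Value        string
	LastRevision int64
}

type EtcdRequest struct {
	putData string
}

type EtcdResponse struct {
	revision int64
	err      error
}

// stepPut sketches how the model can accept a successful Put: the reported
// revision must be strictly greater than the last observed revision, because
// etcd's revision never decreases and every successful write increases it.
// Since revisions are positive, LastRevision < revision also implies
// revision != 0.
func stepPut(state EtcdState, request EtcdRequest, response EtcdResponse) (bool, EtcdState) {
	if response.err != nil {
		// Failed puts are handled separately: they may or may not have persisted.
		return true, state
	}
	if state.LastRevision < response.revision {
		state.Value = request.putData
		state.LastRevision = response.revision
		return true, state
	}
	// A successful put whose revision did not increase is not valid etcd
	// behaviour, so the history is rejected.
	return false, state
}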

tests/linearizability/model.go (outdated review thread, resolved)
@serathius (Member Author)

The current model found 2 linearizability issues in CI run https://github.com/etcd-io/etcd/actions/runs/3418916665. Possibly the model is too strict, but I would like to discuss it openly to make sure I didn't miss anything. I'm working on reproducing them locally.


@serathius serathius force-pushed the linearizability-revision branch 4 times, most recently from 1b9542c to 63825db Compare November 8, 2022 15:35
tests/linearizability/model.go (outdated review thread, resolved)
Comment on lines 123 to 128
if state.Value == response.getData && state.LastRevision <= response.revision {
	return true, state
}
if state.LastRevision >= response.revision {
	return false, state
}
Member

This seems incorrect, because both if statements cover the condition state.LastRevision == response.revision.

Member Author

This should be correct, let me explain.

The first if statement validates that the Get read the result of the last successful put: the value stored in the state needs to equal the value from that put. The second and third statements handle a put that returned an error to the client but was nevertheless persisted. To "restore" a lost put, there are two requirements:

  • the revision must have increased (second if statement), and
  • there must be a failed request that stored that data (third if statement).

Let me rewrite the code to make this clearer.
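To make the three checks concrete, here is a rough sketch of the Get handling described above, again with assumed, simplified types; FailedPuts in particular is a made-up bookkeeping field for this example, not something from the actual model:

package model

// Illustrative types only; the real model is defined differently.
type EtcdState struct {
	Value        string
	LastRevision int64
	// FailedPuts records values written by puts that returned an error to
	// the client but may still have been persisted by etcd.
	FailedPuts map[string]bool
}

type EtcdResponse struct {
	getData  string
	revision int64
}

func stepGet(state EtcdState, response EtcdResponse) (bool, EtcdState) {
	// 1. The read observed the last successful put.
	if state.Value == response.getData && state.LastRevision <= response.revision {
		return true, state
	}
	// 2. Otherwise the only valid explanation is a "lost" (failed but
	//    persisted) put, which requires the revision to have increased.
	if state.LastRevision >= response.revision {
		return false, state
	}
	// 3. ...and requires some failed put to have written exactly this value.
	if state.FailedPuts[response.getData] {
		state.Value = response.getData
		state.LastRevision = response.revision
		return true, state
	}
	return false, state
}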

@ahrtr (Member) commented Nov 9, 2022

Conclusion: there seems to be an issue with the revision returned in the Get response header. cc @ahrtr

Please note that the revision in RangeResponse.ResponseHeader comes from the API layer instead of raft. We should ignore the revision in RangeResponse; instead we should only check the revision in PutResponse.

@serathius (Member Author)

Please note that the revision in the RangeResponse.ResponseHeader is coming from API instead of raft.

You are right, but in both cases the revision should come from the bbolt database. It is expected to be the revision of the latest write. For put requests that means it is the revision of the write itself. For get requests it is not the revision of the request itself, but the revision of the latest write that happened before it. I still think it should be monotonically increasing, and this is what the model tests.
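The monotonicity property itself is tiny to state in code. A sketch of the kind of check being described, over the sequence of revisions observed in response headers (the helper and its name are illustrative, not from this PR):

package model

import "fmt"

// checkRevisionsMonotonic verifies that revisions taken from successive
// response headers (Put or Range) never decrease.
func checkRevisionsMonotonic(revisions []int64) error {
	var last int64
	for i, rev := range revisions {
		if rev < last {
			return fmt.Errorf("revision decreased at response %d: %d -> %d", i, last, rev)
		}
		last = rev
	}
	return nil
}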

@ahrtr (Member) commented Nov 9, 2022

I think the linearizability test only guarantees that a read request never reads stale data (of course, the revision in the RangeResponse header will be >= the revision in the previous write request). What you have done so far (not including this PR) is good.

But the revision in the RangeResponse header is volatile (although it's always increasing), even if there is no write request on the key you are reading. As long as there are write requests on other keys, the revision in the RangeResponse header will change. It means that when two read requests read the same key, one request might get new data but an old revision. Please note that in a range request, reading the revision and creating a "snapshot" of the data isn't an atomic operation. See kvstore_txn.go#L46-L54.

Probably you can use CreateRevision instead. But per my immediate feeling, it might not be a good idea to test revision in the linearizability test. Instead, we should verify the revision at the end of the test: as long as all members converge to the same revision, it's OK. Note: I was planning to add such a checker for the functional test.
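A sketch of the end-of-test check suggested here, assuming the standard clientv3 API (the helper and its name are hypothetical, not part of this PR or of etcd's test framework): each member is queried with a serializable read, and the revisions from the response headers are compared.

package verify

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// checkRevisionsConverge reads the current revision from every member and
// fails if they do not all report the same value once traffic has stopped.
func checkRevisionsConverge(ctx context.Context, endpoints []string) error {
	var want int64
	for _, ep := range endpoints {
		cli, err := clientv3.New(clientv3.Config{
			Endpoints:   []string{ep},
			DialTimeout: 5 * time.Second,
		})
		if err != nil {
			return err
		}
		// A serializable read is served locally by the member, so the
		// header reflects that member's own backend revision.
		resp, err := cli.Get(ctx, "key", clientv3.WithSerializable())
		cli.Close()
		if err != nil {
			return err
		}
		rev := resp.Header.Revision
		if want == 0 {
			want = rev
		} else if rev != want {
			return fmt.Errorf("member %s reports revision %d, expected %d", ep, rev, want)
		}
	}
	return nil
}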

@serathius (Member Author) commented Nov 9, 2022

Implemented #14714 to dump the data directory after the tests. I got interesting results; as mentioned before, the revisions in each instance do not match. The same PUT request put("key", "1093") has a different revision in different etcd instances: 8662 in the first member, 8669 in the second member, 8656 in the third member.

I got the result by downloading the artifact from https://github.com/etcd-io/etcd/actions/runs/3426561978 and analysing the files using etcd-dump-db iterate-bucket ./TestLinearizability_ClusterOfSize3/TestLinearizabilityClusterOfSize3-test-0/member/snap/db key --decode | head. Results:

Member 0:

rev={main:8662 sub:0}, value=[key "key" | val "1093" | created 2 | mod 8662 | ver 8661]
rev={main:8661 sub:0}, value=[key "key" | val "6001081" | created 2 | mod 8661 | ver 8660]
rev={main:8660 sub:0}, value=[key "key" | val "7001068" | created 2 | mod 8660 | ver 8659]
rev={main:8659 sub:0}, value=[key "key" | val "5001144" | created 2 | mod 8659 | ver 8658]
rev={main:8658 sub:0}, value=[key "key" | val "3001079" | created 2 | mod 8658 | ver 8657]
rev={main:8657 sub:0}, value=[key "key" | val "2001142" | created 2 | mod 8657 | ver 8656]
rev={main:8656 sub:0}, value=[key "key" | val "1001060" | created 2 | mod 8656 | ver 8655]
rev={main:8655 sub:0}, value=[key "key" | val "4001071" | created 2 | mod 8655 | ver 8654]
rev={main:8654 sub:0}, value=[key "key" | val "1092" | created 2 | mod 8654 | ver 8653]
rev={main:8653 sub:0}, value=[key "key" | val "6001080" | created 2 | mod 8653 | ver 8652]

Member 1:

rev={main:8669 sub:0}, value=[key "key" | val "1093" | created 2 | mod 8669 | ver 8668]
rev={main:8668 sub:0}, value=[key "key" | val "6001081" | created 2 | mod 8668 | ver 8667]
rev={main:8667 sub:0}, value=[key "key" | val "7001068" | created 2 | mod 8667 | ver 8666]
rev={main:8666 sub:0}, value=[key "key" | val "5001144" | created 2 | mod 8666 | ver 8665]
rev={main:8665 sub:0}, value=[key "key" | val "3001079" | created 2 | mod 8665 | ver 8664]
rev={main:8664 sub:0}, value=[key "key" | val "2001142" | created 2 | mod 8664 | ver 8663]
rev={main:8663 sub:0}, value=[key "key" | val "1001060" | created 2 | mod 8663 | ver 8662]
rev={main:8662 sub:0}, value=[key "key" | val "4001071" | created 2 | mod 8662 | ver 8661]
rev={main:8661 sub:0}, value=[key "key" | val "1092" | created 2 | mod 8661 | ver 8660]
rev={main:8660 sub:0}, value=[key "key" | val "6001080" | created 2 | mod 8660 | ver 8659]

Member 2:

rev={main:8656 sub:0}, value=[key "key" | val "1093" | created 2 | mod 8656 | ver 8655]
rev={main:8655 sub:0}, value=[key "key" | val "6001081" | created 2 | mod 8655 | ver 8654]
rev={main:8654 sub:0}, value=[key "key" | val "7001068" | created 2 | mod 8654 | ver 8653]
rev={main:8653 sub:0}, value=[key "key" | val "5001144" | created 2 | mod 8653 | ver 8652]
rev={main:8652 sub:0}, value=[key "key" | val "3001079" | created 2 | mod 8652 | ver 8651]
rev={main:8651 sub:0}, value=[key "key" | val "2001142" | created 2 | mod 8651 | ver 8650]
rev={main:8650 sub:0}, value=[key "key" | val "1001060" | created 2 | mod 8650 | ver 8649]
rev={main:8649 sub:0}, value=[key "key" | val "4001071" | created 2 | mod 8649 | ver 8648]
rev={main:8648 sub:0}, value=[key "key" | val "1092" | created 2 | mod 8648 | ver 8647]
rev={main:8647 sub:0}, value=[key "key" | val "6001080" | created 2 | mod 8647 | ver 8646]

@ahrtr (Member) commented Nov 10, 2022

I reproduced this issue locally, and it seems like a real issue. Let me have a deep dive on this. Note: please ignore my previous comment; part of it (the "two read requests" part) isn't correct.

@serathius (Member Author) commented Nov 10, 2022

@ahrtr Thanks for looking into this. Did you do anything specific to reproduce the issue locally? With a local reproduction, can you validate whether v3.4 and v3.5 are affected? If so, we should delay the v3.5.6 release.

@ahrtr (Member) commented Nov 10, 2022

@ahrtr Thanks for looking into this. Did you do anything specific to reproduce the issue locally? With a local reproduction, can you validate whether v3.4 and v3.5 are affected? If so, we should delay the v3.5.6 release.

FYI. #14724

@serathius serathius force-pushed the linearizability-revision branch 2 times, most recently from 525a2ea to ac78db2 Compare November 10, 2022 12:59
@serathius (Member Author) commented Nov 10, 2022

OK, I got a local repro. I ran a different etcd version that is not affected.
Looks like the cause is a panic during etcd defrag. The issue was found thanks to the defragBeforeRename and defragBeforeCopy failpoints added in #14687.

@serathius (Member Author) commented Nov 10, 2022

Confirmed that the issue is reproducible on the main branch and on the v3.5.5 release. Looks like v3.4.22 is not affected. Will look into bisecting the git history when I have time.

@serathius serathius added the priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. label Nov 10, 2022
@ahrtr (Member) commented Nov 10, 2022

Looks like the cause is a panic during etcd defrag. The issue was found thanks to defragBeforeRename and defragBeforeCopy

Confirmed. I have the same finding.

@serathius (Member Author)

Git history bisection has pointed out 2dbecea as the cause.

@ahrtr (Member) commented Nov 10, 2022

The reason is that etcd reapplied duplicated entries:
[screenshot: duplicated_items]

Right now I can exclude raft from the list of suspects. Most likely it's still caused by the consistent index not being updated in time:
[screenshot: wal_entries]

ahrtr added a commit to ahrtr/etcd that referenced this pull request Nov 11, 2022
`unsafeCommit` is called by both `(*batchTxBuffered) commit` and
`(*backend) defrag`. When users perform the defragmentation
operation, etcd doesn't update the consistent index. If etcd
crashes (e.g. panics) in the process for whatever reason, then
etcd replays the WAL entries starting from the latest snapshot,
accordingly it may re-apply entries which might have already been
applied, eventually the revision isn't consistent with other members.

Refer to discussion in etcd-io#14685

Signed-off-by: Benjamin Wang <wachao@vmware.com>
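As an aside, a deliberately over-simplified, hypothetical sketch (not etcd's actual apply code) of why a stale consistent index leads to re-applied entries and an inflated revision: on restart, only entries above the consistent index persisted in the backend are applied, so if the backend was committed (e.g. by a defrag) without bumping that index before a panic, entries that were already applied fall above it and are applied again.

package sketch

// Entry stands in for a raft/WAL entry; only the index matters here.
type Entry struct {
	Index uint64
	Data  []byte
}

// replay applies every WAL entry above the consistent index persisted in the
// backend. If the backend contains the effects of entries up to index N but
// its consistent index still says M < N (because a defrag committed the data
// without updating the index and the process then panicked), entries M+1..N
// are applied a second time, bumping the revision again and diverging from
// the other members.
func replay(entries []Entry, consistentIndex uint64, apply func(Entry)) {
	for _, e := range entries {
		if e.Index <= consistentIndex {
			continue // assumed to be already reflected in the backend
		}
		apply(e)
	}
}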
@ahrtr (Member) commented Nov 11, 2022

Eventually I figured out the root cause. Please refer to #14730.

@serathius serathius changed the title Linearizability revision Data inconsistency caused by panic during defrag Nov 11, 2022
@serathius serathius changed the title Data inconsistency caused by panic during defrag Revision inconsistency caused by panic during defrag Nov 13, 2022
Signed-off-by: Marek Siarkowicz <siarkowicz@google.com>
@serathius (Member Author)

With the fix merged we can also merge this change. ping @ahrtr

if state.Value == response.getData && state.LastRevision <= response.revision {
	return true, state
}
if state.LastRevision >= response.revision {
Member

Two successive read requests may have the same revision.

Suggested change

-if state.LastRevision >= response.revision {
+if state.LastRevision > response.revision {
Member Author

I explained it above; this if statement doesn't cover successive reads. It covers cases where a PUT request failed before the read.

@ahrtr (Member) left a comment

Overall looks good to me.

I'll leave you to decide whether/how to resolve #14685 (comment)

Signed-off-by: Marek Siarkowicz <siarkowicz@google.com>
@serathius (Member Author)

I'll leave you to decide whether/how to resolve #14685 (comment)

Resolved the comment and added tests for it.

@serathius serathius merged commit ca8baeb into etcd-io:main Nov 14, 2022
@serathius serathius mentioned this pull request Nov 14, 2022
@spzala (Member) left a comment

Thanks @serathius lgtm

@ahrtr (Member) commented Nov 14, 2022

All the existing etcd 3.5 releases are affected, including 3.5.0 ~ 3.5.5.

Note: etcd 3.4 doesn't have this issue. There is also no impact on etcd 3.5 if the defragmentation operation is performed offline using etcdutl.

FYI: https://github.com/ahrtr/etcd-issues/tree/master/issues/revision_inconsistency#how-to-workaround-this-issue

cc @dims @liggitt @deads2k @lavalamp

@ptabor ptabor mentioned this pull request Dec 6, 2022
@serathius serathius deleted the linearizability-revision branch June 15, 2023 20:37
tjungblu pushed a commit to tjungblu/etcd that referenced this pull request Jul 26, 2023
Labels
priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. stage/merge-when-tests-green