
Documentation: Create a data inconsistency postmortem #13967

Merged

3 commits merged into etcd-io:main on Apr 24, 2022

Conversation

@serathius (Member) commented Apr 20, 2022

cc @ahrtr @spzala @ptabor
Very drafty; if there are no obvious mistakes we can merge and iterate on it.

@codecov-commenter commented Apr 20, 2022

Codecov Report

Merging #13967 (d87cca1) into main (4555fc3) will decrease coverage by 0.35%.
The diff coverage is 100.00%.

❗ Current head d87cca1 differs from pull request most recent head 7fe1bf5. Consider uploading reports for the commit 7fe1bf5 to get more accurate results

```diff
@@            Coverage Diff             @@
##             main   #13967      +/-   ##
==========================================
- Coverage   72.67%   72.32%   -0.36%
==========================================
  Files         469      469
  Lines       38413    38413
==========================================
- Hits        27918    27783     -135
- Misses       8727     8838     +111
- Partials     1768     1792      +24
```

| Flag | Coverage Δ |
|------|------------|
| all  | 72.32% <100.00%> (-0.36%) ⬇️ |

Flags with carried forward coverage won't be shown.

| Impacted Files | Coverage Δ |
|----------------|------------|
| server/etcdserver/server.go | 84.14% <100.00%> (-0.81%) ⬇️ |
| client/v3/leasing/util.go | 88.33% <0.00%> (-10.00%) ⬇️ |
| client/v3/namespace/watch.go | 87.87% <0.00%> (-6.07%) ⬇️ |
| server/storage/mvcc/watchable_store.go | 85.14% <0.00%> (-5.80%) ⬇️ |
| client/v3/concurrency/session.go | 88.63% <0.00%> (-4.55%) ⬇️ |
| client/v3/leasing/cache.go | 87.77% <0.00%> (-3.89%) ⬇️ |
| server/etcdserver/api/v3rpc/watch.go | 84.22% <0.00%> (-3.70%) ⬇️ |
| server/etcdserver/api/rafthttp/msgappv2_codec.go | 71.30% <0.00%> (-3.48%) ⬇️ |
| server/etcdserver/api/v3rpc/util.go | 70.96% <0.00%> (-3.23%) ⬇️ |
| server/etcdserver/api/v3rpc/member.go | 93.54% <0.00%> (-3.23%) ⬇️ |
| ... and 21 more | |

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 4555fc3...7fe1bf5.

@spzala (Member) left a comment

lgtm, just a couple of small comments. Thanks!


## Background

Etcd v3 state is preserved on disk in two forms: a write-ahead log (WAL) and database state (DB).
Contributor

etcd instead of Etcd :P. Conventionally, we use etcd even when it is the first word of a sentence.

Member

:) +1 @xiang90 Good catch!

Member

I now noticed it in a few other places :), so we should do a clean sweep @serathius. Thanks!

| Action Item | Type | Priority | Bug |
|-------------------------------------------------------------------------------------|----------|----------|----------------------------------------------|
| Etcd testing can reproduce historical data inconsistency issues | Prevent | P0 | |
| Etcd detects data corruption by default | Detect | P0 | |
Contributor

  • Etcd corruption check is linearizable/negotiates common revision
  • The recovery procedures are documented and tested:
    - bootstrapping from another member's snapshot
    - bootstrapping from a backup
    - bootstrapping from a backup + the WAL log
  • Snapshots (exchanged between the leader and lagging member) are checked for consistency
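The recovery procedures listed above map onto the standard etcdctl/etcdutl snapshot workflow. A hedged sketch of bootstrapping a member from another member's snapshot (endpoints, member names, and paths below are placeholders, not taken from this thread):

```
# Take a snapshot from a healthy member (placeholder endpoint).
etcdctl --endpoints=https://healthy-member:2379 snapshot save /tmp/snapshot.db

# Restore it into a fresh data dir for the member being rebuilt.
etcdutl snapshot restore /tmp/snapshot.db \
  --name member-3 \
  --initial-cluster 'member-1=https://m1:2380,member-2=https://m2:2380,member-3=https://m3:2380' \
  --initial-advertise-peer-urls https://m3:2380 \
  --data-dir /var/lib/etcd-restored
```

The restored data dir is then used to start the rebuilt member; the exact flags depend on the cluster's configuration.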

Member Author

I tried to generalize "Etcd corruption check is linearizable/negotiates common revision" into "etcd can reliably detect data corruption".

I'm not sure how "Snapshots (exchanged between the leader and lagging member) are checked for consistency" relates to improving data corruption detection. I haven't seen this discussed; could you expand on it or file an issue?
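At its core, the corruption check discussed here compares a deterministic hash of the keyspace across members at the same revision (etcd exposes this via the Maintenance HashKV RPC). A minimal self-contained sketch of the idea, with hypothetical member states represented as plain maps rather than real etcd stores:

```go
package main

import (
	"fmt"
	"hash/crc32"
	"sort"
)

// hashKeyspace computes a deterministic checksum over a key-value map by
// hashing entries in sorted key order. etcd's HashKV works on a similar
// principle, checksumming the keyspace up to a given revision with
// CRC-32 (Castagnoli).
func hashKeyspace(kv map[string]string) uint32 {
	keys := make([]string, 0, len(kv))
	for k := range kv {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	h := crc32.New(crc32.MakeTable(crc32.Castagnoli))
	for _, k := range keys {
		h.Write([]byte(k))
		h.Write([]byte(kv[k]))
	}
	return h.Sum32()
}

func main() {
	leader := map[string]string{"/a": "1", "/b": "2"}
	follower := map[string]string{"/a": "1", "/b": "2"}
	corrupted := map[string]string{"/a": "1", "/b": "999"} // a diverged apply result

	fmt.Println("leader==follower:", hashKeyspace(leader) == hashKeyspace(follower))   // true
	fmt.Println("leader==corrupted:", hashKeyspace(leader) == hashKeyspace(corrupted)) // false
}
```

Making this check linearizable is the hard part the action item refers to: members must agree on which revision the hash covers before comparing.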

Contributor

Done: #13973

* The etcd v3.5 release was not qualified as thoroughly as previous ones. Older maintainers ran a manual qualification process that is no longer known or executed.
* The etcd apply code is so complicated that fixing the data inconsistency took almost 2 weeks and multiple tries (). The fix was so involved that we needed to develop automatic validation for it (https://github.com/etcd-io/etcd/pull/13885).
* While fixing the main data inconsistency we found multiple other edge cases that could lead to data corruption (https://github.com/etcd-io/etcd/issues/13514, https://github.com/etcd-io/etcd/issues/13922, https://github.com/etcd-io/etcd/issues/13937).

Contributor

  • Not clearly something that went wrong... but the problem is that, as an OSS community, we have no insight into the production adoption of different etcd versions and thus their maturity. We mark a version as PROD-ready after some internal feedback... to get diverse usage, but users hold off until someone else discovers the issues. I have no recommendation here other than transparency between maintainers.

Member Author

I think the lack of a feedback loop here is a persistent issue. Without a feedback loop about feature usage we cannot make good production recommendations.

What do you think we can do to improve the situation? Could we maybe reach out to CNCF about collecting usage data from users?

Contributor

I think we can probably start by collecting version and uptime information, with opt-out by default. I know the Kubernetes community tried to do similar things. Do you know how well it went?

@serathius serathius changed the title Documentation: Create a draft data inconsistency postmortem Documentation: Create a data inconsistency postmortem Apr 22, 2022
@serathius serathius merged commit c3ef240 into etcd-io:main Apr 24, 2022
@serathius serathius deleted the pm branch June 15, 2023 20:36