Documentation: Create a data inconsistency postmortem #13967
Codecov Report
@@ Coverage Diff @@
## main #13967 +/- ##
==========================================
- Coverage 72.67% 72.32% -0.36%
==========================================
Files 469 469
Lines 38413 38413
==========================================
- Hits 27918 27783 -135
- Misses 8727 8838 +111
- Partials 1768 1792 +24
LGTM, just a couple of small comments. Thanks!
## Background

Etcd v3 state is preserved on disk in two forms: write ahead log (WAL) and database state (DB).
etcd instead of Etcd :P. By convention, we write etcd in lowercase even when it is the first word of a sentence.
:) +1 @xiang90 Good catch!
I've now noticed it in a few other places :) so we should do a clean sweep, @serathius. Thanks!
| Action Item | Type | Priority | Bug |
|-----------------------------------------------------------------|---------|----------|-----|
| Etcd testing can reproduce historical data inconsistency issues | Prevent | P0 | |
| Etcd detects data corruption by default | Detect | P0 | |
- Etcd corruption check is linearizable/negotiates common revision
- The recovery procedures are documented and tested:
  - bootstrapping from another member's snapshot
  - bootstrapping from backup
  - bootstrapping from backup + WAL log
- Snapshots (exchanged between the leader and lagging member) are checked for consistency
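To illustrate what "negotiates common revision" could mean in the corruption-check suggestion above, here is a minimal sketch in Python. It is not etcd's actual implementation: `Member`, `hash_at`, and `check_consistency` are hypothetical names, and the in-memory history is an assumption made to keep the example self-contained. The idea shown is that members compare state hashes at the highest revision all of them have applied, so a member that is merely lagging is not reported as corrupted.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class Member:
    """Hypothetical in-memory stand-in for a cluster member (not the etcd API)."""
    name: str
    # history[i] holds the full key-value state after revision i + 1 was applied.
    history: list = field(default_factory=list)

    def apply(self, key, value):
        # Each write produces a new revision with a copy of the previous state.
        state = dict(self.history[-1]) if self.history else {}
        state[key] = value
        self.history.append(state)

    @property
    def revision(self):
        return len(self.history)

    def hash_at(self, rev):
        # Hash the state at a given revision in a deterministic key order.
        payload = repr(sorted(self.history[rev - 1].items())).encode()
        return hashlib.sha256(payload).hexdigest()


def check_consistency(members):
    """Return (common_revision, consistent) for the given members."""
    # Negotiate the highest revision that every member has already applied.
    common_rev = min(m.revision for m in members)
    if common_rev == 0:
        return 0, True  # nothing applied everywhere yet, nothing to compare
    hashes = {m.hash_at(common_rev) for m in members}
    return common_rev, len(hashes) == 1


a, b, c = Member("a"), Member("b"), Member("c")
for m in (a, b, c):
    m.apply("foo", "1")
a.apply("bar", "2")
b.apply("bar", "2")  # c lags one revision behind a and b

print(check_consistency([a, b, c]))  # (1, True): lagging is not corruption

c.apply("bar", "999")  # c diverges at revision 2
print(check_consistency([a, b, c]))  # (2, False)
```

Comparing hashes at each member's latest revision instead would flag `c` in the first check even though its data is correct, which is the kind of false positive a negotiated revision avoids.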
I tried to generalize "Etcd corruption check is linearizable/negotiates common revision" as "etcd can reliably detect data corruption".
I'm not sure how "Snapshots (exchanged between the leader and lagging member) are checked for consistency" relates to data corruption. I haven't seen this discussed; could you expand on it or file an issue?
Done: #13973
* Etcd v3.5 release was not qualified as well as previous ones. Older maintainers ran a manual qualification process that is no longer known or executed.
* Etcd apply code is so complicated that fixing the data inconsistency took almost 2 weeks and multiple tries (). The fix had to be so complicated that we needed to develop automatic validation for it (https://github.com/etcd-io/etcd/pull/13885).
* When fixing the main data inconsistency, we found multiple other edge cases that could lead to data corruption (https://github.com/etcd-io/etcd/issues/13514, https://github.com/etcd-io/etcd/issues/13922, https://github.com/etcd-io/etcd/issues/13937).
- This is not clearly something that went wrong, but the problem is that, as an OSS community, we have no insight into the production adoption of different etcd versions, and thus into their maturity. We mark a version as PROD-ready after some internal feedback in order to get diverse usage, but users hold off until someone else discovers the issues. I have no recommendation here beyond transparency between maintainers.
I think the lack of a feedback loop here is a persistent issue. Without a feedback loop about feature usage, we cannot make a good production recommendation.
What do you think we can do to improve the situation? Could we maybe reach out to the CNCF about collecting usage data from users?
I think we can probably start by collecting version and uptime information, with opt-out by default. I know the Kubernetes community tried to do something similar. Do you know how well that went?
cc @ahrtr @spzala @ptabor
Very drafty; if there are no obvious mistakes, we can merge and iterate on it.