Proposals should include a merkle root #13839
Comments
cc @jpbetz, whom I've tried to sell on this in the past :)
Looks like Joe previously proposed this in #10893, but that went stale :(
Yep, I pointed out the problem of this issue going stale in #13775 and wanted to propose implementing #10893 as part of the graduation of the corruption check in #9190. Let's avoid tracking the same issue in multiple places. @lavalamp do you want to propose implementing Merkle trees as graduation criteria in #9190 and close this one?
That sounds great to me! How do I do that?
+1. I see 3 options:

b) We somehow arbitrarily partition the etcd key-space into a Merkle tree. It's hard to represent the tree structure of keys, since etcd is agnostic to it. For debugging it would be nice to preserve key-space continuity (i.e. neighbouring sorted keys frequently share the same Merkle-tree node), but that seems difficult. If we instead hash each key individually and build the Merkle tree on the hashes of the keys (e.g. bit groups define the node structure), this would work but would not help isolate the problem for debugging.

c) We don't need a Merkle tree. We maintain a hash of the snapshot plus a chain of hashes over all the proposals that get applied at the MVCC layer (see the sketch below). This would guarantee determinism of the actions performed on MVCC, but the inconsistency could still originate from inside the MVCC implementation.
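A minimal sketch of the hash chain in option (c), assuming SHA-256 and treating each applied proposal as an opaque byte string; the key paths and snapshot label are made up for illustration:

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// applyState folds a running hash over every proposal applied at the MVCC
// layer, seeded with the hash of the last snapshot, as in option (c).
type applyState struct {
	chain [32]byte
}

// Apply extends the chain: chain = SHA-256(chain || proposal). Two replicas
// that start from the same snapshot and apply the same proposals in the same
// order must end up with the same chain value.
func (s *applyState) Apply(proposal []byte) [32]byte {
	s.chain = sha256.Sum256(append(s.chain[:], proposal...))
	return s.chain
}

func main() {
	snap := sha256.Sum256([]byte("snapshot-at-rev-100")) // stand-in for a real snapshot hash
	s := &applyState{chain: snap}
	fmt.Printf("%x\n", s.Apply([]byte("put /registry/pods/a v1")))
	fmt.Printf("%x\n", s.Apply([]byte("del /registry/pods/b")))
}
```

As the next comment points out, a chain like this only proves the apply sequence was deterministic; it cannot efficiently localize or repair corruption that happens inside the MVCC layer itself.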
Just leave a comment on #9190
re: a): I do not think it is a good idea to do this in the storage layer, since it has a bunch of stuff that is irrelevant to the state of the db at a given revision (e.g., it depends on what has been compacted). We should not make the etcd replicas' hash computation depend on anything historical; it should be stateless, purely a function of the db state at a given revision. That permits, e.g., efficiently un-corrupting a replica: it arrives at the correct state, but by an unconventional path.

re: b): I expect the easiest thing to do is a two-step process:
Given an existing such structure and a hypothetical operation, it's easyish to figure out an expected root hash (see the sketch below).

re: c): AFAICT if you do it that way, there's no way to efficiently un-corrupt the database. Additionally, even if you have a correct snapshot and a correct replay log, you can have a bug where a transaction gets applied wrong, and that doesn't get detected until the next snapshot is checked. This also requires all etcd replicas to take snapshots at the exact same time; I think that shouldn't be part of the state for correctness checking.
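To make the (b) approach concrete, here is a minimal Go sketch of a fixed-depth Merkle map over hashed keys, where a single put rehashes only one root-to-leaf path (so the root tracks every change at ~log N hashing cost). The bucket depth, separators, and hashing choices are assumptions for illustration, not an etcd design:

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"fmt"
	"sort"
)

const depth = 3 // 2^depth leaf buckets; a real implementation would size this by key count

// merkleMap keeps a hash tree over hashed keys so a single Put only
// recomputes the hashes on one root-to-leaf path.
type merkleMap struct {
	nodes   [1 << (depth + 1)][32]byte      // 1-indexed heap layout; nodes[1] is the root
	buckets [1 << depth]map[string][32]byte // per-leaf: key -> hash(key, value)
}

func newMerkleMap() *merkleMap {
	m := &merkleMap{}
	for i := range m.buckets {
		m.buckets[i] = map[string][32]byte{}
	}
	return m
}

// Put records the pair and returns the new root hash.
func (m *merkleMap) Put(key, value string) [32]byte {
	kh := sha256.Sum256([]byte(key))
	// The top bits of the key hash pick the bucket, so the tree layout is
	// independent of insertion order and identical on every replica.
	leaf := binary.BigEndian.Uint32(kh[:4]) >> (32 - depth)
	m.buckets[leaf][key] = sha256.Sum256([]byte(key + "\x00" + value))

	// Recompute the leaf hash from its sorted entries.
	hs := make([][32]byte, 0, len(m.buckets[leaf]))
	for _, eh := range m.buckets[leaf] {
		hs = append(hs, eh)
	}
	sort.Slice(hs, func(i, j int) bool { return string(hs[i][:]) < string(hs[j][:]) })
	h := sha256.New()
	for _, e := range hs {
		h.Write(e[:])
	}
	idx := int(leaf) + (1 << depth)
	copy(m.nodes[idx][:], h.Sum(nil))

	// Walk up to the root, rehashing only the touched path.
	for idx > 1 {
		idx /= 2
		m.nodes[idx] = sha256.Sum256(append(m.nodes[2*idx][:], m.nodes[2*idx+1][:]...))
	}
	return m.nodes[1]
}

func main() {
	m := newMerkleMap()
	fmt.Printf("root after put: %x\n", m.Put("/registry/pods/a", "v1"))
	fmt.Printf("root after put: %x\n", m.Put("/registry/pods/b", "v2"))
}
```

Applying a hypothetical operation to a copy of the touched path (without committing it) would give the "expected root hash" mentioned above, and comparing subtree hashes layer by layer would localize a corrupted region.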
OK, I commented there; if that's sufficient we can close this.
Based on the comments above, there are non-trivial decisions to be made; let's keep the issue open and continue the discussion on the design.
Thought about this a long time ago. Never got time to implement though :P Hope it still helps.

Step 1
Step 2
Step 3

Some other follow-up steps to enable the incremental hash in a safe way. Initially, it should be just a checking mechanism, and should not stop the cluster from functioning. As we build more confidence in this hash checking, we can start to rely on it to do hand-braking (i.e. halting the cluster), etc.
A related topic: we should keep the practice of running failure injection tests before any minor release. (Maybe we still do this today?) We used to run at least 3 clusters for about 1 month. It almost always catches inconsistency bugs :P
+1. I vaguely remember that for one of the inconsistency bugs we found a few years back, we improved the injection testing to check for failures around some of the restart functionality, as a way of trying to prevent future occurrences of that class of issue.
+1 for more failure injection testing. Unfortunately, depending on manual testing leads to inconsistency in release qualification; we need to invest more in automation. I'm planning to write a public postmortem that will go into the actions we should take to prevent such issues in the future.
How about automated testing based on Jepsen? Is that something that we run today regularly or before a release? I have some experience with it, so I can help set it up.
Automated testing is good, but the problem is not just etcd bugs that are outside of our hypothesis space (which testing is already iffy at finding) -- it's actually not even sufficient for the etcd code to be 100% correct. I'm thinking in particular about RAM, disk, or network-based corruption. The real task is to have the replicas arrive at the same state in the face of this kind of error.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.
This is proposed as a postmortem action item and planned for v3.7.
I've thought some more about this and I think significant gains can be had just from storing a hash of each {key, value} pair, so that corruption of individual values can be detected before building on them. This should also be simpler than, and a prerequisite for, a global hash of some sort anyway.
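A minimal sketch of per-{key, value} hashing, assuming SHA-256 and a toy in-memory store; the record layout and names are made up, and in etcd the hash would presumably be persisted next to the value in the backend:

```go
package main

import (
	"crypto/sha256"
	"errors"
	"fmt"
)

// record keeps the value together with a hash of the {key, value} pair so
// corruption of an individual value can be detected before it is used.
type record struct {
	value []byte
	sum   [32]byte
}

// hashKV hashes key and value with a separator so ("ab","c") and ("a","bc")
// do not collide.
func hashKV(key, value []byte) [32]byte {
	h := sha256.New()
	h.Write(key)
	h.Write([]byte{0})
	h.Write(value)
	var out [32]byte
	copy(out[:], h.Sum(nil))
	return out
}

type store map[string]record

func (s store) Put(key, value []byte) {
	s[string(key)] = record{value: value, sum: hashKV(key, value)}
}

// Get re-hashes the pair and refuses to return a value whose hash no longer
// matches what was recorded at write time.
func (s store) Get(key []byte) ([]byte, error) {
	r, ok := s[string(key)]
	if !ok {
		return nil, errors.New("not found")
	}
	if hashKV(key, r.value) != r.sum {
		return nil, errors.New("kv hash mismatch: value corrupted")
	}
	return r.value, nil
}

func main() {
	s := store{}
	s.Put([]byte("/registry/pods/a"), []byte("v1"))
	v, err := s.Get([]byte("/registry/pods/a"))
	fmt.Println(string(v), err)
}
```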
Instead of checking single KV pairs, I would look into adding hashes to bbolt pages. bbolt implements a B+ tree, which should be extendable to include hash values like a Merkle tree, allowing the consistency of the whole database to be checked.
The benefit of hashing every KV pair:
Hashing bbolt pages is better than what we do now, but I don't think it's guaranteed that every replica ends up with the same bbolt layout? Also, on a hash failure, recovery is very complicated, no?
In response to #13766 and past issues (e.g. #11613) that get into the same condition:
It is possible for etcd to get into a condition where the databases on different replicas have different contents, and then proceed to commit changes. There have been other routes to this in the past. I don't think it's possible to prevent all db corruption issues, but it is certainly possible to make etcd stop and recover when the db doesn't match.
The reason etcd can proceed is that the protocol has replicas agree on the contents of changes but not on the state of the db after applying the change. The easiest fix for this is to adjust things to compute a hash covering the entire db. This can be done efficiently (log N hashing operations per db change) by using a Merkle tree. If proposed changes also included a proposed new Merkle root hash, a replica with a differing db would not be able to accept a change, and this condition would be caught the instant it happened.
(And it's recoverable at that point by reading the correct state from the other replicas. Moreover, depending on how you implement it, the Merkle tree could be examined a layer at a time to find what is corrupted instead of needing to copy the entire state, which could be large.)
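As a rough illustration of the accept/reject behavior described above (not etcd's actual raft or apply path), a sketch where each proposal carries the root hash the proposer expects after the change, and a replica whose own post-apply root differs refuses the change; the rootHash here naively hashes the whole sorted key space, and the Merkle tree is what would bring this down to ~log N hashing operations per change:

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"sort"
)

// proposal carries the change plus the root hash the proposer expects the db
// to have after the change is applied.
type proposal struct {
	key, value   string
	proposedRoot [32]byte
}

// rootHash naively hashes the entire sorted key space; a Merkle tree would
// produce the same kind of digest incrementally instead of by full scan.
func rootHash(db map[string]string) [32]byte {
	keys := make([]string, 0, len(db))
	for k := range db {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	h := sha256.New()
	for _, k := range keys {
		h.Write([]byte(k))
		h.Write([]byte{0})
		h.Write([]byte(db[k]))
		h.Write([]byte{0})
	}
	var out [32]byte
	copy(out[:], h.Sum(nil))
	return out
}

// apply rejects the change the instant the replica's post-apply state would
// differ from what the proposer expects.
func apply(db map[string]string, p proposal) error {
	prev, existed := db[p.key]
	db[p.key] = p.value
	if got := rootHash(db); got != p.proposedRoot {
		// Undo the write; a real implementation would instead halt and
		// recover the correct state from the other replicas.
		if existed {
			db[p.key] = prev
		} else {
			delete(db, p.key)
		}
		return fmt.Errorf("state divergence: computed root %x, proposal expected %x", got, p.proposedRoot)
	}
	return nil
}

func main() {
	leader := map[string]string{"a": "1"}
	follower := map[string]string{"a": "1"}

	// The proposer computes the expected root for the change and attaches it.
	leader["b"] = "2"
	p := proposal{key: "b", value: "2", proposedRoot: rootHash(leader)}

	// A healthy follower accepts; a silently diverged replica fails immediately.
	fmt.Println(apply(follower, p))
	corrupted := map[string]string{"a": "999"}
	fmt.Println(apply(corrupted, p))
}
```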
The Merkle tree technique is very common in cryptocurrencies. etcd is basically a cryptocurrency with no token and a non-Byzantine-resistant coordination mechanism. These are both correct tradeoffs given what etcd does, but continuing to accept changes to a corrupted state is extremely bad -- any cryptocurrency with a corresponding bug would totally go to zero.