Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Possible split-brain & loss of committed write with loss of un-fsynced data #14143

Open
aphyr opened this issue Jun 22, 2022 · 17 comments
Open
Labels
help wanted priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. release/v3.5 stage/tracked type/bug

Comments

@aphyr
Copy link

aphyr commented Jun 22, 2022

What happened?

With a five-node etcd 3.5.3 cluster running on Debian stable, with a data directory mounted on the lazyfs filesystem, killing nodes and losing their un-fsynced writes appears to infrequently cause the loss of committed writes. Here's an example test run where we appended 4 to key 233 via a kv transaction guarded by a revision comparison:

{:type :ok, :f :txn, :value [[:append 233 4] [:r 232 [1 2 9 11 16 24]] [:r 233 [4]] [:r 234 nil]], :time 238730989477, :process 647, :index 12142}

This transaction completed successfully, and its write of 4 was visible to reads for at least 690 milliseconds. Then we killed every node in the cluster and restarted them:

{:type :info, :f :kill, :value :all, :time 239418563385, :process :nemesis, :index 12224}
{:type :info, :f :kill, :value {"n1" :done, "n2" :done, "n3" :done, "n4" :done, "n5" :done}, :time 241472326880, :process :nemesis, :index 12243}
{:type :info, :f :start, :value :all, :time 263489342471, :process :nemesis, :index 12412}
{:type :info, :f :start, :value {"n1" :started, "n2" :started, "n3" :started, "n4" :started, "n5" :started}, :time 263573656645, :process :nemesis, :index 12413}

Upon restarting, 4 was no longer visible. Instead readers observed that key 233 did not exist, and writers began appending new values starting with 18:

{:type :ok, :f :txn, :value [[:r 237 nil] [:r 233 nil] [:r 237 nil] [:r 237 nil]], :time 266669045101, :process 823, :index 12447}
{:type :ok, :f :txn, :value [[:append 233 18] [:r 235 nil]], :time 266738029770, :process 813, :index 12454}

Here are all reads of key 233, arranged in chronological order from top to bottom:

Screenshot from 2022-06-22 09-14-09

This history cannot be serializable: every append transaction involves a read of any key to be written, followed by a write transaction which compares those keys to make sure all their revisions are the same as the versions which were originally read. In such a history no writes should be lost--and yet here, etcd appears to have lost the write of 4.

As with previous lazyfs issues, this may be due to a bug in lazyfs, or it might represent a failure of etcd to fsync values on a majority of nodes before committing them.

What did you expect to happen?

Even with the loss of un-fsynced writes, etcd should not lose committed transactions.

How can we reproduce it (as minimally and precisely as possible)?

With https://github.com/jepsen-io/etcd 181656bb551bbc10cdc3d959866637574cdc9e17 and Jepsen 9e40a61d89de1343e06b9e8d1f77fe2c0be2e6ec, run:

lein run test-all -w append --rate 1000 --concurrency 4n --lazyfs --nemesis kill --time-limit 300 --test-count 50 --nemesis-interval 15

Anything else we need to know?

No response

Etcd version (please run commands below)

etcd Version: 3.5.3
Git SHA: 0452feec7
Go Version: go1.16.15
Go OS/Arch: linux/amd64

$ etcdctl version
etcdctl version: 3.5.3
API version: 3.5

Etcd configuration (command line flags or environment variables)

`etcd --enable-v2 --log-outputs stderr --logger zap --name n1 --listen-peer-urls http://192.168.122.101:2380 --listen-client-urls http://192.168.122.101:2379 --advertise-client-urls http://192.168.122.101:2379 --initial-cluster-state new --initial-advertise-peer-urls http://192.168.122.101:2380 --initial-cluster n1=http://192.168.122.101:2380,n2=http://192.168.122.102:2380,n3=http://192.168.122.103:2380,n4=http://192.168.122.104:2380,n5=http://192.168.122.105:2380 --snapshot-count 100 --experimental-initial-corrupt-check`

Etcd debug information (please run commands blow, feel free to obfuscate the IP address or FQDN in the output)

root@n1:/opt/etcd# ./etcdctl member list -w table --endpoints=n1:2379
+------------------+---------+------+-----------------------------+-----------------------------+------------+
|        ID        | STATUS  | NAME |         PEER ADDRS          |        CLIENT ADDRS         | IS LEARNER |
+------------------+---------+------+-----------------------------+-----------------------------+------------+
| 1153c9690d2b2284 | started |   n3 | http://192.168.122.103:2380 | http://192.168.122.103:2379 |      false |
| 4824313a421b2502 | started |   n5 | http://192.168.122.105:2380 | http://192.168.122.105:2379 |      false |
| 4d6e27d122507e9c | started |   n4 | http://192.168.122.104:2380 | http://192.168.122.104:2379 |      false |
| a1ffd5acd6a88a6a | started |   n2 | http://192.168.122.102:2380 | http://192.168.122.102:2379 |      false |
| afa39e55dee6dc2e | started |   n1 | http://192.168.122.101:2380 | http://192.168.122.101:2379 |      false |
+------------------+---------+------+-----------------------------+-----------------------------+------------+

root@n1:/opt/etcd# ./etcdctl --endpoints=http://192.168.122.105:2379,http://192.168.122.103:2379,http://192.168.122.104:2379,http://192.168.122.102:2379,http://192.168.122.101:2379 endpoint status -w table
+-----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|          ENDPOINT           |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+-----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| http://192.168.122.105:2379 | 4824313a421b2502 |   3.5.3 |  4.2 MB |      true |      false |        20 |      60226 |              60226 |        |
| http://192.168.122.103:2379 | 1153c9690d2b2284 |   3.5.3 |  4.2 MB |     false |      false |        20 |      60226 |              60226 |        |
| http://192.168.122.104:2379 | 4d6e27d122507e9c |   3.5.3 |  4.2 MB |     false |      false |        20 |      60228 |              60228 |        |
| http://192.168.122.102:2379 | a1ffd5acd6a88a6a |   3.5.3 |  4.2 MB |     false |      false |        20 |      60228 |              60228 |        |
| http://192.168.122.101:2379 | afa39e55dee6dc2e |   3.5.3 |  4.2 MB |     false |      false |        20 |      60228 |              60228 |        |
+-----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+

Relevant log output

No response

@aphyr aphyr added the type/bug label Jun 22, 2022
@serathius serathius added the priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. label Jun 22, 2022
@aphyr aphyr changed the title Possible Loss of committed write with loss of un-fsynced data Possible loss of committed write with loss of un-fsynced data Jun 22, 2022
@aphyr
Copy link
Author

aphyr commented Jun 22, 2022

Oof, here's an even worse run. Here we appear to have gotten into some kind of full split-brain where values on different replicas diverged at some point, and the cluster proceeded in two independent timelines. Here you can see reads on node n3 (processes 382, 392, and 397, modulo 5 nodes, are all 2, which is the third node in the cluster) saw a version of key 150 which went [1 2 3 13 20 30], whereas other nodes saw [1 4 8 12 18 19 21 22].

Screenshot from 2022-06-22 13-08-54

Here's the full logs from this run, including etcd data files.

At 12:39:28 we killed every node, and at 12:39:49 we restarted them all.

2022-06-22 12:39:28,541{GMT}    INFO    [jepsen worker nemesis] jepsen.util: :nemesis   :info   :kill   :all
2022-06-22 12:39:30,591{GMT}    INFO    [jepsen worker nemesis] jepsen.util: :nemesis   :info   :kill   {"n1" :done, "n2" :done, "n3" :done, "n4" :done, "n5" :done}
2022-06-22 12:39:49,206{GMT}    INFO    [jepsen worker nemesis] jepsen.util: :nemesis   :info   :start  :all
2022-06-22 12:39:49,289{GMT}    INFO    [jepsen worker nemesis] jepsen.util: :nemesis   :info   :start  {"n1" :started, "n2" :started, "n3" :started, "n4" :started, "n5" :started}

Immediately on startup n3 repaired a WAL file, noting an unexpected EOF:

{"level":"info","ts":"2022-06-22T12:39:49.279-0400","caller":"wal/repair.go:40","msg":"repairing","path":"n3.etcd/member/wal/0000000000000000-0000000000000000.wal"}
{"level":"info","ts":"2022-06-22T12:39:50.386-0400","caller":"wal/repair.go:96","msg":"repaired","path":"n3.etcd/member/wal/0000000000000000-0000000000000000.wal","error":"unexpected EOF"}
{"level":"info","ts":"2022-06-22T12:39:50.386-0400","caller":"etcdserver/storage.go:109","msg":"repaired WAL","error":"unexpected EOF"}

n3 clearly knew about the other members in the cluster, because it recovered them from store at 12:39:50

{"level":"info","ts":"2022-06-22T12:39:50.963-0400","caller":"membership/cluster.go:278","msg":"recovered/added member from store","cluster-id":"76f85bbb28f6ace1","local-member-id":"1153c9690d2b2284","recovered-remote-peer-id":"4824313a421b2502","recovered-remote-peer-urls":["http://192.168.122.105:2380"]}
{"level":"info","ts":"2022-06-22T12:39:50.963-0400","caller":"membership/cluster.go:278","msg":"recovered/added member from store","cluster-id":"76f85bbb28f6ace1","local-member-id":"1153c9690d2b2284","recovered-remote-peer-id":"4d6e27d122507e9c","recovered-remote-peer-urls":["http://192.168.122.104:2380"]}
{"level":"info","ts":"2022-06-22T12:39:50.963-0400","caller":"membership/cluster.go:278","msg":"recovered/added member from store","cluster-id":"76f85bbb28f6ace1","local-member-id":"1153c9690d2b2284","recovered-remote-peer-id":"a1ffd5acd6a88a6a","recovered-remote-peer-urls":["http://192.168.122.102:2380"]}
{"level":"info","ts":"2022-06-22T12:39:50.963-0400","caller":"membership/cluster.go:278","msg":"recovered/added member from store","cluster-id":"76f85bbb28f6ace1","local-member-id":"1153c9690d2b2284","recovered-remote-peer-id":"afa39e55dee6dc2e","recovered-remote-peer-urls":["http://192.168.122.101:2380"]}
{"level":"info","ts":"2022-06-22T12:39:50.963-0400","caller":"membership/cluster.go:278","msg":"recovered/added member from store","cluster-id":"76f85bbb28f6ace1","local-member-id":"1153c9690d2b2284","recovered-remote-peer-id":"1153c9690d2b2284","recovered-remote-peer-urls":["http://192.168.122.103:2380"]}

n3 recorded that its peers were active around 12:40:17

{"level":"info","ts":"2022-06-22T12:40:17.830-0400","caller":"rafthttp/peer_status.go:53","msg":"peer became active","peer-id":"a1ffd5acd6a88a6a"}

And noted that it had a conflict in the log at index 906 between a local entry with term 4 and a conflicting entry at term 5:

{"level":"info","ts":"2022-06-22T12:40:18.043-0400","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"1153c9690d2b2284 became follower at term 5"}
{"level":"info","ts":"2022-06-22T12:40:18.043-0400","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"1153c9690d2b2284 [logterm: 4, index: 907, vote: 0] rejected MsgVote from 4d6e27d122507e9c [logterm: 4, index: 905] at term 5"}
{"level":"info","ts":"2022-06-22T12:40:18.044-0400","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"found conflict at index 906 [existing term: 4, conflicting term: 5]"}
{"level":"info","ts":"2022-06-22T12:40:18.044-0400","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"replace the unstable entries from index 906"}

n3 went on to serve client requests at 12:40:18:

{"level":"info","ts":"2022-06-22T12:40:18.047-0400","caller":"embed/serve.go:98","msg":"ready to serve client requests"}

The very first requests to execute after this point show that n3 diverged from the rest of the cluster. Inconsistencies appeared in hundreds of keys. n1 and n4 believed key 78 was [19 30 31], when n3 thought it was [8 14 18 26 29].

2022-06-22 12:40:23,095{GMT}    INFO    [jepsen worker 10] jepsen.util: 410     :ok     :txn    [[:r 78 [19]] [:r 79 [1 2]] [:r 78 [19]] [:append 79 5]]
2022-06-22 12:40:23,099{GMT}    INFO    [jepsen worker 8] jepsen.util: 388      :ok     :txn    [[:append 78 31] [:r 78 [19 30 31]] [:append 75 23] [:append 79 7]]
2022-06-22 12:40:23,100{GMT}    INFO    [jepsen worker 12] jepsen.util: 392     :ok     :txn    [[:r 78 [8 14 18 26 29]]]

Screenshot from 2022-06-22 14-32-02

Here's another case: key 87 diverged into [1 2] and [1 3 4]. Again, node n3 appears to have been acting independently from the other nodes:

2022-06-22 12:40:23,335{GMT}    INFO    [jepsen worker 12] jepsen.util: 392     :ok     :txn    [[:r 87 [1 2]] [:r 87 [1 2]]]
2022-06-22 12:40:23,338{GMT}    INFO    [jepsen worker 10] jepsen.util: 410     :ok     :txn    [[:append 87 4] [:r 87 [1 3 4]] [:r 87 [1 3 4]] [:r 87 [1 3 4]]]

Screenshot from 2022-06-22 14-32-26

At 12:40:32 we killed all nodes, and at 12:41:01 we restarted them.

022-06-22 12:40:32,662{GMT}    INFO    [jepsen worker nemesis] jepsen.util: :nemesis   :info   :kill   :all
2022-06-22 12:40:34,759{GMT}    INFO    [jepsen worker nemesis] jepsen.util: :nemesis   :info   :kill   {"n1" :done, "n2" :done, "n3" :done, "n4" :done, "n5" :done}
2022-06-22 12:41:01,673{GMT}    INFO    [jepsen worker nemesis] jepsen.util: :nemesis   :info   :start  :all
2022-06-22 12:41:01,756{GMT}    INFO    [jepsen worker nemesis] jepsen.util: :nemesis   :info   :start  {"n1" :started, "n2" :started, "n3" :started, "n4" :started, "n5" :started}

At 12:41:30.322, node n3 logged that it had detected an inconsistency with a peer, and appears to have shut down:

{"level":"error","ts":"2022-06-22T12:41:30.320-0400","caller":"embed/etcd.go:259","msg":"checkInitialHashKV failed","error":"1153c9690d2b2284 found data inconsistency with peers","stacktrace":"go.etcd.io/etcd/server/v3/embed.StartEtcd\n\t/go/src/go.etcd.io/etcd/release/etcd/server/embed/etcd.go:259\ngo.etcd.io/etcd/server/v3/etcdmain.startEtcd\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdmain/etcd.go:228\ngo.etcd.io/etcd/server/v3/etcdmain.startEtcdOrProxyV2\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdmain/etcd.go:123\ngo.etcd.io/etcd/server/v3/etcdmain.Main\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdmain/main.go:40\nmain.main\n\t/go/src/go.etcd.io/etcd/release/etcd/server/main.go:32\nruntime.main\n\t/go/gos/go1.16.15/src/runtime/proc.go:225"}
...
{"level":"info","ts":"2022-06-22T12:41:30.322-0400","caller":"embed/etcd.go:370","msg":"closed etcd server","name":"n3","data-dir":"n3.etcd","advertise-peer-urls":["http://192.168.122.103:2380"],"advertise-client-urls":["http://192.168.122.103:2379"]}

At 12:31:35 Jepsen restarted the crashed n3:

2022-06-22 12:41:35,166{GMT}    INFO    [jepsen worker nemesis] jepsen.util: :nemesis   :info   :start  :all
2022-06-22 12:41:35,248{GMT}    INFO    [jepsen worker nemesis] jepsen.util: :nemesis   :info   :start  {"n1" :already-running, "n2" :already-running, "n3" :started, "n4" :already-running, "n5" :already-running}

And n3 crashed again, complaining of data inconsistency:

{"level":"error","ts":"2022-06-22T12:41:35.283-0400","caller":"embed/etcd.go:259","msg":"checkInitialHashKV failed","error":"1153c9690d2b2284 found data inconsistency with peers","stacktrace":"go.etcd.io/etcd/server/v3/embed.StartEtcd\n\t/go/src/go.etcd.io/etcd/release/etcd/server/embed/etcd.go:259\ngo.etcd.io/etcd/server/v3/etcdmain.startEtcd\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdmain/etcd.go:228\ngo.etcd.io/etcd/server/v3/etcdmain.startEtcdOrProxyV2\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdmain/etcd.go:123\ngo.etcd.io/etcd/server/v3/etcdmain.Main\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdmain/main.go:40\nmain.main\n\t/go/src/go.etcd.io/etcd/release/etcd/server/main.go:32\nruntime.main\n\t/go/gos/go1.16.15/src/runtime/proc.go:225"}

At 12:41:38 Jepsen killed every node, and restarted them all at 12:42:01:

2022-06-22 12:41:38,199{GMT}    INFO    [jepsen worker nemesis] jepsen.util: :nemesis   :info   :kill   :all
2022-06-22 12:41:40,254{GMT}    INFO    [jepsen worker nemesis] jepsen.util: :nemesis   :info   :kill   {"n1" :done, "n2" :done, "n3" :done, "n4" :done, "n5" :done}
2022-06-22 12:42:01,589{GMT}    INFO    [jepsen worker nemesis] jepsen.util: :nemesis   :info   :start  :all
2022-06-22 12:42:01,674{GMT}    INFO    [jepsen worker nemesis] jepsen.util: :nemesis   :info   :start  {"n1" :started, "n2" :started, "n3" :started, "n4" :started, "n5" :started}

However, this time etcd did not crash, and did not detect data inconsistency for the rest of the test.

@aphyr aphyr changed the title Possible loss of committed write with loss of un-fsynced data Possible split-brain & loss of committed write with loss of un-fsynced data Jun 22, 2022
@aphyr
Copy link
Author

aphyr commented Jun 23, 2022

I've got a good dozen of these cases now. Also seeing a couple new behaviors! In this case, clients started returning errors like DATA_LOSS: etcdserver: corrupt cluster, and servers also logged:

{"level":"info","ts":"2022-06-23T05:23:40.237-0400","caller":"etcdserver/backend.go:81","msg":"opened backend db","path":"n3.etcd/member/snap/db","took":"50.021389ms"}
{"level":"warn","ts":"2022-06-23T05:23:40.238-0400","caller":"wal/util.go:90","msg":"ignored file in WAL directory","path":"0000000000000000-0000000000000000.wal.broken"}
{"level":"info","ts":"2022-06-23T05:23:40.394-0400","caller":"etcdserver/server.go:508","msg":"recovered v2 store from snapshot","snapshot-index":88136,"snapshot-size":"26 kB"}
{"level":"warn","ts":"2022-06-23T05:23:40.396-0400","caller":"snap/db.go:88","msg":"failed to find [SNAPSHOT-INDEX].snap.db","snapshot-index":88136,"snapshot-file-path":"n3.etcd/member/snap/0000000000015848.snap.db","error":"snap: snapshot file doesn't exist"}
{"level":"panic","ts":"2022-06-23T05:23:40.396-0400","caller":"etcdserver/server.go:515","msg":"failed to recover v3 backend from snapshot","error":"failed to find database snapshot file (snap: snapshot file doesn't exist)","stacktrace":"go.etcd.io/etcd/server/v3/etcdserver.NewServer\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdserver/server.go:515\ngo.etcd.io/etcd/server/v3/embed.StartEtcd\n\t/go/src/go.etcd.io/etcd/release/etcd/server/embed/etcd.go:245\ngo.etcd.io/etcd/server/v3/etcdmain.startEtcd\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdmain/etcd.go:228\ngo.etcd.io/etcd/server/v3/etcdmain.startEtcdOrProxyV2\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdmain/etcd.go:123\ngo.etcd.io/etcd/server/v3/etcdmain.Main\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdmain/main.go:40\nmain.main\n\t/go/src/go.etcd.io/etcd/release/etcd/server/main.go:32\nruntime.main\n\t/go/gos/go1.16.15/src/runtime/proc.go:225"}
panic: failed to recover v3 backend from snapshot

goroutine 1 [running]:
go.uber.org/zap/zapcore.(*CheckedEntry).Write(0xc000266480, 0xc0005e8380, 0x1, 0x1)
	/go/pkg/mod/go.uber.org/zap@v1.17.0/zapcore/entry.go:234 +0x58d
go.uber.org/zap.(*Logger).Panic(0xc0005b8000, 0x12346a6, 0x2a, 0xc0005e8380, 0x1, 0x1)
	/go/pkg/mod/go.uber.org/zap@v1.17.0/logger.go:227 +0x85
go.etcd.io/etcd/server/v3/etcdserver.NewServer(0x7fffd59479ee, 0x2, 0x0, 0x0, 0x0, 0x0, 0xc0005ea000, 0x1, 0x1, 0xc0005ea240, ...)
	/go/src/go.etcd.io/etcd/release/etcd/server/etcdserver/server.go:515 +0x1656
go.etcd.io/etcd/server/v3/embed.StartEtcd(0xc0005e2000, 0xc0005e2600, 0x0, 0x0)
	/go/src/go.etcd.io/etcd/release/etcd/server/embed/etcd.go:245 +0xef8
go.etcd.io/etcd/server/v3/etcdmain.startEtcd(0xc0005e2000, 0x120893e, 0x6, 0xc000128101, 0x2)
	/go/src/go.etcd.io/etcd/release/etcd/server/etcdmain/etcd.go:228 +0x32
go.etcd.io/etcd/server/v3/etcdmain.startEtcdOrProxyV2(0xc00003a1a0, 0x19, 0x1a)
	/go/src/go.etcd.io/etcd/release/etcd/server/etcdmain/etcd.go:123 +0x257a
go.etcd.io/etcd/server/v3/etcdmain.Main(0xc00003a1a0, 0x19, 0x1a)
	/go/src/go.etcd.io/etcd/release/etcd/server/etcdmain/main.go:40 +0x13f
main.main()
	/go/src/go.etcd.io/etcd/release/etcd/server/main.go:32 +0x45

This was accompanied by basically every single type of anomaly we know how to detect, including lost updates, split-brain, aborted read, write cycles, cyclic information flow, etc etc.

            :anomaly-types (:G-nonadjacent
                            :G-nonadjacent-realtime
                            :G-single
                            :G-single-realtime
                            :G0
                            :G0-realtime
                            :G1a
                            :G1c
                            :G1c-realtime
                            :G2-item
                            :G2-item-realtime
                            :cycle-search-timeout
                            :dirty-update
                            :incompatible-order
                            :lost-update),

@ahrtr
Copy link
Member

ahrtr commented Jun 24, 2022

@aphyr Thanks for raising this issue.

There was an issue in 3.5.3 , and it's fixed in 3.5.4 in 13946 .

So could you reproduce this issue on 3.5.4 instead of 3.5.3?

@ahrtr
Copy link
Member

ahrtr commented Jul 4, 2022

@aphyr can you still reproduce this issue on 3.5.4?

@endocrimes did you ever see this issue on 3.5.4?

cc @serathius

@aphyr
Copy link
Author

aphyr commented Jul 4, 2022

Hi @ahrtr! We've been tracking down bugs in lazyfs itself--still don't have all of them sanded off yet, so it's possible this may not be etcd's fault. I'm afraid my own contract working on etc analysis ended last week, so I won't be able to devote much time to this, but I'll try to come back and do a second pass once we have lazyfs in a better spot.

@endocrimes
Copy link
Contributor

I'm trying to see if this still replicates on 3.5.4 and master - I think I have it repro-ing but it also depends on lazyfs rn

@serathius serathius mentioned this issue Jul 6, 2022
16 tasks
@serathius
Copy link
Member

serathius commented Jul 12, 2022

Based on m private discussion with @endocrimes this is reproducible on v3.4

@ramses
Copy link

ramses commented Jul 28, 2022

Hi @aphyr,

Just to be sure ... what's your LazyFS-toml-config for these tests?

@aphyr
Copy link
Author

aphyr commented Aug 1, 2022

Hey there! Sorry I don't have this available off the top of my head, and I'm completely swamped right now with other work stuff--no longer working on etcd testing. I did include repro instructions in the issue though, and that'll spit out a file in... I think /opt/etcd/data.lazyfs or something like that!

@endocrimes
Copy link
Contributor

I think I've managed to accidentally confirm the existence of this bug.

When starting to investigate I short circuited the MustSync method used as part of Raft transactions to always return true, with this, the issue became significantly harder to reproduce (managed to have 1 failure still over many runs).

This somewhat implies that there is a path that is writing data and failing to fsync it. Next on my list is properly tracing through the data path for kv writes to both get a deeper understanding of where that might be. Probably some time next week with my current backlog in k/k.

@ramses
Copy link

ramses commented Aug 2, 2022

I was looking into paths to MustSync; but, it did not occurred to me to return true. Thanks for pruning the search space, @endocrimes .

@stale
Copy link

stale bot commented Nov 1, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

@CaojiamingAlan
Copy link
Contributor

CaojiamingAlan commented Jul 3, 2023

My guess: MustSync does not check commit index, and the updates of commit index is not fsync-ed. This violates the usage of raft:

Write Entries, HardState and Snapshot to persistent storage in order.

Example:

  1. Start three nodes 1, 2, 3
  2. node 1 elected as leader in term 1
  3. node 1 propose a write key:value
  4. node 1 sends MsgApp to node 2 and 3
  5. node 3 is down
  6. node 2 replies MsgAppResp to node 1
  7. node 1 receives the MsgAppResp from node 2, marks entry as already replicated on a majority of nodes and committed, the raft library exposes this by updating the Commit field in HardState and CommittedEntries of Ready, but the Entries is nil since no new entries are proposed.
  8. node 1 save() this to WAL, but MustSync gives a false, since Entries is nil, and Vote and Term are unchanged.
  9. WAL is not fsync-ed to disk, commit index in on-disk WAL is stale, however in the user's view, the transaction is successful and entry is committed.
  10. node 1 is down.
  11. node 1 restarts, recover its WAL which is not fsync-ed.
  12. node 3 restarts, start a new round of election, and is elected as leader. This is legal since the commit index in node 1's WAL is stale, and node 3's commit index is no less than node 1's.
  13. node 3 force the cluster to accept its log, and the committed entry is lost.

The scenario is so tricky and I am not sure how to reproduce it. We may need an integration of LazyFS, the interaction test suite in the raft library, and the e2e test suite.

Anyone please evaluate whether the above case makes sense.
@serathius @ahrtr

=================================================================
Update:
I'm wrong about this. Have just written a test to confirm.
12 is not correct. Section 3.8 of the raft thesis has concluded that it is totally ok to lose the commit index.

@serathius
Copy link
Member

serathius commented Jul 4, 2023

@CaojiamingAlan, what you are describing sounds like an issue with raft implementation. Let's cc the experts
cc @tbg @pavelkalinnikov

I have some old PoC of using LazyFS for robustness tests. However scenario you described requires multiple nodes to down, which robustness testing doesn't support yet. I don't have time this week, but if you want I think you should be able to create an e2e test that simulates this.

@ahrtr
Copy link
Member

ahrtr commented Jul 4, 2023

  1. node 3 restarts, start a new round of election, and is elected as leader. This is legal since the commit index in node 1's WAL is stale, and node 3's commit index is no less than node 1's.

This should not be true. In your example, node 3 isn't able to be elected as the leader, as its local entries lag behind other nodes. Note that when a node campaigns, it uses the lastIndex instead of commitIndex.

@serathius
Copy link
Member

serathius commented Jul 4, 2023

@CaojiamingAlan if you are interested into digging deeper, I encourage you to rerun the Jepsen tests. @endocrimes managed to fix the jepsen docker setup and reproduce this issue. There were a lot of changes to etcd since, including addressing multiple data inconsistencies, so need to confirm the reproduction. Not sure she contributed the docker setup code back. Ofc the main blocker is knowing/learning closure.

@tbg
Copy link
Contributor

tbg commented Jul 5, 2023

Agreed with @ahrtr.

It's useful to make a mental distinction between "known committed" and "actually committed". An entry is committed if it is durable on a quorum of followers. But it may not be known as committed to everyone for some time. The HardState.Commit field tracks known committed, so regressions in it should have no effect on what's actually committed.

Sadly, due to the way etcd-io/raft implements configuration changes, what's "known committed" matters as nodes can1 only activate pending configuration changes that they know are committed.

As for reproducing things like this, I can recommend trying this via TestInteraction in etcd-io/raft which allows writing these interactions in a deterministic unit-test fashion.

Footnotes

  1. I haven't looked into this, but suspect that the official "raft way" of activating them upon receipt in the log might also work, assuming the user integrates around this consistently, but some checks in our library might get in the way.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
help wanted priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. release/v3.5 stage/tracked type/bug
Development

No branches or pull requests

7 participants