
Inconsistent revision and data after --force-new-cluster #14009

Closed
brandond opened this issue May 4, 2022 · 7 comments

brandond (Contributor) commented May 4, 2022

What happened?

After starting etcd with --force-new-cluster, removing the database files from the secondary nodes, and rejoining them to the cluster, the cluster is now in a split-brain state. Reads from the first node (the one started with --force-new-cluster) return different data for some keys than reads from the nodes that were wiped and subsequently rejoined to the cluster.
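
The divergence is also visible from clients if you force serializable reads, which are served from each member's local store rather than being forwarded to the leader. A sketch, reusing this cluster's endpoints and the key examined below:

$ etcdctl --endpoints=https://172.31.30.121:2379 get --consistency=s /registry/services/endpoints/default/kubernetes
$ etcdctl --endpoints=https://172.31.17.205:2379 get --consistency=s /registry/services/endpoints/default/kubernetes
# in the broken state, these two commands return different values for the same key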

The end result feels identical to #13766, but this can be reproduced with a fairly trivial amount of traffic in conjunction with --force-new-cluster.

Examining the nodes' data with etcd-dump-db and etcd-dump-logs shows the same event sequence in the WAL, but the db itself contains different values in the key bucket. I'm not pasting the WAL dump here, but I will attach the data-dir from both cluster members.

brandond@dev01:~/etcd-split-brain$ etcd-dump-db iterate-bucket etcd-1 key | grep /registry/services/endpoints/default/kubernetes
key="\x00\x00\x00\x00\x00\x00\nR_\x00\x00\x00\x00\x00\x00\x00\x00", value="\n//registry/services/endpoints/default/kubernetes\x10\xd2\x01\x18\xd2\x14 \x04*\xe2\x02k8s\x00\n\x0f\n\x02v1\x12\tEndpoints\x12\xc6\x02\n\x9c\x02\n\nkubernetes\x12\x00\x1a\adefault\"\x00*$fa88d66d-7bdc-4302-bc82-6c850ff4b85e2\x008\x00B\b\b\x98\xde˓\x06\x10\x00Z/\n'endpointslice.kubernetes.io/skip-mirror\x12\x04truez\x00\x8a\x01\x98\x01\n\x0ekube-apiserver\x12\x06Update\x1a\x02v1\"\b\b\x98\xde˓\x06\x10\x002\bFieldsV1:d\nb{\"f:metadata\":{\"f:labels\":{\".\":{},\"f:endpointslice.kubernetes.io/skip-mirror\":{}}},\"f:subsets\":{}}B\x00\x12%\n\x12\n\x0e18.219.153.245\x1a\x00\x1a\x0f\n\x05https\x10\xab2\x1a\x03TCP\x1a\x00\"\x00"

brandond@dev01:~/etcd-split-brain$ etcd-dump-db iterate-bucket etcd-2 key | grep /registry/services/endpoints/default/kubernetes
key="\x00\x00\x00\x00\x00\x00\f\xb9_\x00\x00\x00\x00\x00\x00\x00\x00", value="\n//registry/services/endpoints/default/kubernetes\x10\xd2\x01\x18\xb9\x19 \a*\xe2\x02k8s\x00\n\x0f\n\x02v1\x12\tEndpoints\x12\xc6\x02\n\x9c\x02\n\nkubernetes\x12\x00\x1a\adefault\"\x00*$fa88d66d-7bdc-4302-bc82-6c850ff4b85e2\x008\x00B\b\b\x98\xde˓\x06\x10\x00Z/\n'endpointslice.kubernetes.io/skip-mirror\x12\x04truez\x00\x8a\x01\x98\x01\n\x0ekube-apiserver\x12\x06Update\x1a\x02v1\"\b\b\x98\xde˓\x06\x10\x002\bFieldsV1:d\nb{\"f:metadata\":{\"f:labels\":{\".\":{},\"f:endpointslice.kubernetes.io/skip-mirror\":{}}},\"f:subsets\":{}}B\x00\x12%\n\x12\n\x0e18.219.188.103\x1a\x00\x1a\x0f\n\x05https\x10\xab2\x1a\x03TCP\x1a\x00\"\x00"

Also, the datastore on the two nodes shows different values in the members and members_removed buckets. I'm not sure whether this is normal:

brandond@dev01:~/etcd-split-brain$ etcd-dump-db iterate-bucket etcd-1 members
key="3e8594789d62d712", value="{\"id\":4505170248011536146,\"peerURLs\":[\"https://172.31.17.205:2380\"],\"isLearner\":true}"
key="3c0e71035ef2e3ca", value="{\"id\":4327520551241442250,\"peerURLs\":[\"https://172.31.30.121:2380\"],\"name\":\"ip-172-31-30-121-53c44a92\"}"

brandond@dev01:~/etcd-split-brain$ etcd-dump-db iterate-bucket etcd-1 members_removed
key="77aa3673d9e0e2", value="removed"

brandond@dev01:~/etcd-split-brain$ etcd-dump-db iterate-bucket etcd-2 members
key="77aa3673d9e0e2", value="{\"id\":33682673077182690,\"peerURLs\":[\"https://172.31.17.205:2380\"],\"isLearner\":true}"
key="3e8594789d62d712", value="{\"id\":4505170248011536146,\"peerURLs\":[\"https://172.31.17.205:2380\"],\"isLearner\":true}"
key="3c0e71035ef2e3ca", value="{\"id\":4327520551241442250,\"peerURLs\":[\"https://172.31.30.121:2380\"],\"name\":\"ip-172-31-30-121-53c44a92\"}"

brandond@dev01:~/etcd-split-brain$ etcd-dump-db iterate-bucket etcd-2 members_removed

brandond@dev01:~/etcd-split-brain$

What did you expect to happen?

Consistent data returned by both cluster members.

How can we reproduce it (as minimally and precisely as possible)?

Start etcd with --force-new-cluster while a running Kubernetes apiserver is pointed at the etcd server, then wipe and rejoin the other members (a sketch of the sequence follows). I have not been able to reproduce this ad hoc with direct writes to a single key.
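
Roughly, the sequence that triggers it looks like the sketch below; the data-dir comes from the configuration further down, <name> and <peer-ip> are placeholders, and ... stands for each node's normal startup flags:

# on the node being kept: restart etcd as a one-member cluster
$ etcd --force-new-cluster --data-dir /var/lib/rancher/rke2/server/db/etcd ...

# on each secondary node: wipe local state, re-add it via the kept member, then start it
$ rm -rf /var/lib/rancher/rke2/server/db/etcd
$ etcdctl member add <name> --peer-urls=https://<peer-ip>:2380
$ etcd --initial-cluster-state existing --data-dir /var/lib/rancher/rke2/server/db/etcd ...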

Anything else we need to know?

No response

Etcd version (please run commands below)

ubuntu@ip-172-31-17-205:~$ etcd --version
etcd Version: 3.5.4
Git SHA: 08407ff76
Go Version: go1.16.15
Go OS/Arch: linux/amd64

ubuntu@ip-172-31-17-205:~$ etcdctl version
etcdctl version: 3.5.0
API version: 3.5

Etcd configuration (command line flags or environment variables)

advertise-client-urls: https://172.31.17.205:2379
client-transport-security:
  cert-file: /var/lib/rancher/rke2/server/tls/etcd/server-client.crt
  client-cert-auth: true
  key-file: /var/lib/rancher/rke2/server/tls/etcd/server-client.key
  trusted-ca-file: /var/lib/rancher/rke2/server/tls/etcd/server-ca.crt
data-dir: /var/lib/rancher/rke2/server/db/etcd
election-timeout: 5000
heartbeat-interval: 500
initial-cluster: ip-172-31-30-121-0cbb6287=https://172.31.30.121:2380,ip-172-31-17-205-72ea150e=https://172.31.17.205:2380
initial-cluster-state: existing
listen-client-urls: https://172.31.17.205:2379,https://127.0.0.1:2379
listen-metrics-urls: http://127.0.0.1:2381
listen-peer-urls: https://172.31.17.205:2380
log-outputs:
- stderr
logger: zap
name: ip-172-31-17-205-72ea150e
peer-transport-security:
  cert-file: /var/lib/rancher/rke2/server/tls/etcd/peer-server-client.crt
  client-cert-auth: true
  key-file: /var/lib/rancher/rke2/server/tls/etcd/peer-server-client.key
  trusted-ca-file: /var/lib/rancher/rke2/server/tls/etcd/peer-ca.crt

Etcd debug information (please run commands below, feel free to obfuscate the IP address or FQDN in the output)

$ etcdctl member list -w table
# paste output here

$ etcdctl --endpoints=<member list> endpoint status -w table
# paste output here

Relevant log output

data-dir from both cluster members:
etcd-split-brain.zip

brandond (Contributor, Author) commented May 5, 2022

I've done some additional hacking at this and eliminated a couple of possibilities:

  • Ensuring that there are no etcd clients (the Kubernetes apiserver, for example) or other peer nodes running when the node is restarted with --force-new-cluster does not make any difference.
  • Enabling the startup and periodic corruption checks does not show anything interesting (flags sketched below).
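
For reference, the corruption checks here are 3.5's experimental ones; a sketch of enabling them (the 5m interval is an arbitrary choice):

$ etcd --experimental-initial-corrupt-check=true --experimental-corrupt-check-time=5m ...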

However, one interesting finding: it appears to be possible to take a snapshot from any node in an affected cluster, restore that snapshot, start with --force-new-cluster, and then rejoin the other nodes, after which the cluster returns consistent results.
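
A sketch of that recovery path with the stock tooling (paths and the endpoint are placeholders, and the restore would normally also take --name/--initial-cluster flags, elided here):

$ etcdctl --endpoints=https://127.0.0.1:2379 snapshot save /tmp/etcd.snapshot
$ etcdutl snapshot restore /tmp/etcd.snapshot --data-dir /var/lib/rancher/rke2/server/db/etcd-new
# start the first member from the restored data-dir with --force-new-cluster, then rejoin the others
$ etcd --force-new-cluster --data-dir /var/lib/rancher/rke2/server/db/etcd-new ...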

ahrtr (Member) commented May 6, 2022

I could not reproduce this issue with 3.5.4. Please provide the detailed steps and the commands you executed at each step.

brandond (Contributor, Author) commented May 6, 2022

I have not been able to reproduce it without a Kubernetes cluster pointed at the etcd cluster. Not sure if it is a load issue, or something to do with the way Kubernetes uses transactions for its create/update operations. Is there a suggested load simulation tool that I could test with?

ahrtr (Member) commented May 8, 2022

> I have not been able to reproduce it without a Kubernetes cluster pointed at the etcd cluster. Not sure if it is a load issue, or something to do with the way Kubernetes uses transactions for its create/update operations. Is there a suggested load simulation tool that I could test with?

You can try the benchmark tool. See an example command below:

./benchmark txn-put --endpoints="http://127.0.0.1:2379" --clients=200 --conns=200 --key-space-size=500000000 --key-size=128 --val-size=10240  --total=100000 --rate=10000

ahrtr (Member) commented Sep 8, 2022

> brandond@dev01:~/etcd-split-brain$ etcd-dump-db iterate-bucket etcd-1 members
> key="3e8594789d62d712", value="{\"id\":4505170248011536146,\"peerURLs\":[\"https://172.31.17.205:2380\"],\"isLearner\":true}"
> key="3c0e71035ef2e3ca", value="{\"id\":4327520551241442250,\"peerURLs\":[\"https://172.31.30.121:2380\"],\"name\":\"ip-172-31-30-121-53c44a92\"}"
>
> brandond@dev01:~/etcd-split-brain$ etcd-dump-db iterate-bucket etcd-1 members_removed
> key="77aa3673d9e0e2", value="removed"
>
> brandond@dev01:~/etcd-split-brain$ etcd-dump-db iterate-bucket etcd-2 members
> key="77aa3673d9e0e2", value="{\"id\":33682673077182690,\"peerURLs\":[\"https://172.31.17.205:2380\"],\"isLearner\":true}"
> key="3e8594789d62d712", value="{\"id\":4505170248011536146,\"peerURLs\":[\"https://172.31.17.205:2380\"],\"isLearner\":true}"
> key="3c0e71035ef2e3ca", value="{\"id\":4327520551241442250,\"peerURLs\":[\"https://172.31.30.121:2380\"],\"name\":\"ip-172-31-30-121-53c44a92\"}"
>
> brandond@dev01:~/etcd-split-brain$ etcd-dump-db iterate-bucket etcd-2 members_removed
>
> brandond@dev01:~/etcd-split-brain$

It's interesting that there are two learners on etcd-2 but only one learner on etcd-1. Could you provide detailed steps (with the exact commands) on how to reproduce this issue?

brandond (Contributor, Author) commented Sep 8, 2022

I really wish I knew how to reproduce it using just etcd and a bare etcd3 client or the benchmark tool. At the moment I can only reproduce it when I have a Kubernetes cluster pointed at etcd and use the --force-new-cluster option to reset the cluster membership back to a single node.

stale bot commented Dec 31, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

stale bot added the stale label on Dec 31, 2022
stale bot closed this as completed on Apr 2, 2023