
Inconsistent revision and data after --force-new-cluster #14009

Closed
brandond opened this issue May 4, 2022 · 7 comments

brandond (Contributor) commented May 4, 2022

What happened?

After starting etcd with --force-new-cluster, removing the database files from the secondary nodes, and rejoining them to the cluster, the cluster is now in a split-brain state. Reads from the first node (the one started with --force-new-cluster) return different data for some keys than reads from the nodes that were wiped and subsequently rejoined to the cluster.
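
The divergence is also visible from clients if you force serializable reads, which are served from each member's local store rather than being forwarded to the leader. A sketch, reusing this cluster's endpoints and the key examined below:

$ etcdctl --endpoints=https://172.31.30.121:2379 get --consistency=s /registry/services/endpoints/default/kubernetes
$ etcdctl --endpoints=https://172.31.17.205:2379 get --consistency=s /registry/services/endpoints/default/kubernetes
# in the broken state, these two commands return different values for the same key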

The end result feels identical to #13766, but this can be reproduced with a fairly trivial amount of traffic in conjunction with --force-new-cluster.

Examining the nodes' data with etcd-dump-db and etcd-dump-logs shows the same event sequence in the WAL, but the db itself contains different values in the key bucket. I'm not pasting the WAL dump here, but I will attach the data-dir from both cluster members.

brandond@dev01:~/etcd-split-brain$ etcd-dump-db iterate-bucket etcd-1 key | grep /registry/services/endpoints/default/kubernetes
key="\x00\x00\x00\x00\x00\x00\nR_\x00\x00\x00\x00\x00\x00\x00\x00", value="\n//registry/services/endpoints/default/kubernetes\x10\xd2\x01\x18\xd2\x14 \x04*\xe2\x02k8s\x00\n\x0f\n\x02v1\x12\tEndpoints\x12\xc6\x02\n\x9c\x02\n\nkubernetes\x12\x00\x1a\adefault\"\x00*$fa88d66d-7bdc-4302-bc82-6c850ff4b85e2\x008\x00B\b\b\x98\xde˓\x06\x10\x00Z/\n'endpointslice.kubernetes.io/skip-mirror\x12\x04truez\x00\x8a\x01\x98\x01\n\x0ekube-apiserver\x12\x06Update\x1a\x02v1\"\b\b\x98\xde˓\x06\x10\x002\bFieldsV1:d\nb{\"f:metadata\":{\"f:labels\":{\".\":{},\"f:endpointslice.kubernetes.io/skip-mirror\":{}}},\"f:subsets\":{}}B\x00\x12%\n\x12\n\x0e18.219.153.245\x1a\x00\x1a\x0f\n\x05https\x10\xab2\x1a\x03TCP\x1a\x00\"\x00"

brandond@dev01:~/etcd-split-brain$ etcd-dump-db iterate-bucket etcd-2 key | grep /registry/services/endpoints/default/kubernetes
key="\x00\x00\x00\x00\x00\x00\f\xb9_\x00\x00\x00\x00\x00\x00\x00\x00", value="\n//registry/services/endpoints/default/kubernetes\x10\xd2\x01\x18\xb9\x19 \a*\xe2\x02k8s\x00\n\x0f\n\x02v1\x12\tEndpoints\x12\xc6\x02\n\x9c\x02\n\nkubernetes\x12\x00\x1a\adefault\"\x00*$fa88d66d-7bdc-4302-bc82-6c850ff4b85e2\x008\x00B\b\b\x98\xde˓\x06\x10\x00Z/\n'endpointslice.kubernetes.io/skip-mirror\x12\x04truez\x00\x8a\x01\x98\x01\n\x0ekube-apiserver\x12\x06Update\x1a\x02v1\"\b\b\x98\xde˓\x06\x10\x002\bFieldsV1:d\nb{\"f:metadata\":{\"f:labels\":{\".\":{},\"f:endpointslice.kubernetes.io/skip-mirror\":{}}},\"f:subsets\":{}}B\x00\x12%\n\x12\n\x0e18.219.188.103\x1a\x00\x1a\x0f\n\x05https\x10\xab2\x1a\x03TCP\x1a\x00\"\x00"

Also, the datastore on the two nodes shows different values in the members and members_removed buckets. I'm not sure whether this is normal:

brandond@dev01:~/etcd-split-brain$ etcd-dump-db iterate-bucket etcd-1 members
key="3e8594789d62d712", value="{\"id\":4505170248011536146,\"peerURLs\":[\"https://172.31.17.205:2380\"],\"isLearner\":true}"
key="3c0e71035ef2e3ca", value="{\"id\":4327520551241442250,\"peerURLs\":[\"https://172.31.30.121:2380\"],\"name\":\"ip-172-31-30-121-53c44a92\"}"

brandond@dev01:~/etcd-split-brain$ etcd-dump-db iterate-bucket etcd-1 members_removed
key="77aa3673d9e0e2", value="removed"

brandond@dev01:~/etcd-split-brain$ etcd-dump-db iterate-bucket etcd-2 members
key="77aa3673d9e0e2", value="{\"id\":33682673077182690,\"peerURLs\":[\"https://172.31.17.205:2380\"],\"isLearner\":true}"
key="3e8594789d62d712", value="{\"id\":4505170248011536146,\"peerURLs\":[\"https://172.31.17.205:2380\"],\"isLearner\":true}"
key="3c0e71035ef2e3ca", value="{\"id\":4327520551241442250,\"peerURLs\":[\"https://172.31.30.121:2380\"],\"name\":\"ip-172-31-30-121-53c44a92\"}"

brandond@dev01:~/etcd-split-brain$ etcd-dump-db iterate-bucket etcd-2 members_removed

brandond@dev01:~/etcd-split-brain$

What did you expect to happen?

Consistent data returned by both cluster members.

How can we reproduce it (as minimally and precisely as possible)?

Start etcd with --force-new-cluster while a running Kubernetes apiserver is pointed at the etcd server, then wipe and rejoin the other members (a sketch of the sequence follows). I have not been able to reproduce this ad hoc with direct writes to a single key.
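
Roughly, the sequence that triggers it looks like the sketch below; the data-dir comes from the configuration further down, <name> and <peer-ip> are placeholders, and ... stands for each node's normal startup flags:

# on the node being kept: restart etcd as a one-member cluster
$ etcd --force-new-cluster --data-dir /var/lib/rancher/rke2/server/db/etcd ...

# on each secondary node: wipe local state, re-add it via the kept member, then start it
$ rm -rf /var/lib/rancher/rke2/server/db/etcd
$ etcdctl member add <name> --peer-urls=https://<peer-ip>:2380
$ etcd --initial-cluster-state existing --data-dir /var/lib/rancher/rke2/server/db/etcd ...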

Anything else we need to know?

No response

Etcd version (please run commands below)

ubuntu@ip-172-31-17-205:~$ etcd --version
etcd Version: 3.5.4
Git SHA: 08407ff76
Go Version: go1.16.15
Go OS/Arch: linux/amd64

ubuntu@ip-172-31-17-205:~$ etcdctl version
etcdctl version: 3.5.0
API version: 3.5

Etcd configuration (command line flags or environment variables)

advertise-client-urls: https://172.31.17.205:2379
client-transport-security:
  cert-file: /var/lib/rancher/rke2/server/tls/etcd/server-client.crt
  client-cert-auth: true
  key-file: /var/lib/rancher/rke2/server/tls/etcd/server-client.key
  trusted-ca-file: /var/lib/rancher/rke2/server/tls/etcd/server-ca.crt
data-dir: /var/lib/rancher/rke2/server/db/etcd
election-timeout: 5000
heartbeat-interval: 500
initial-cluster: ip-172-31-30-121-0cbb6287=https://172.31.30.121:2380,ip-172-31-17-205-72ea150e=https://172.31.17.205:2380
initial-cluster-state: existing
listen-client-urls: https://172.31.17.205:2379,https://127.0.0.1:2379
listen-metrics-urls: http://127.0.0.1:2381
listen-peer-urls: https://172.31.17.205:2380
log-outputs:
- stderr
logger: zap
name: ip-172-31-17-205-72ea150e
peer-transport-security:
  cert-file: /var/lib/rancher/rke2/server/tls/etcd/peer-server-client.crt
  client-cert-auth: true
  key-file: /var/lib/rancher/rke2/server/tls/etcd/peer-server-client.key
  trusted-ca-file: /var/lib/rancher/rke2/server/tls/etcd/peer-ca.crt

Etcd debug information (please run commands below, feel free to obfuscate the IP address or FQDN in the output)

$ etcdctl member list -w table
# paste output here

$ etcdctl --endpoints=<member list> endpoint status -w table
# paste output here

Relevant log output

data-dir from both cluster members:
etcd-split-brain.zip

brandond (Contributor, Author) commented May 5, 2022

I've done some additional hacking at this and eliminated a couple of possibilities:

  • Ensuring that there are no etcd clients (the Kubernetes apiserver, for example) or other peer nodes running when the node is restarted with --force-new-cluster does not make any difference.
  • Enabling the startup and periodic corruption checks does not show anything interesting (flags sketched below).
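
For reference, the corruption checks here are 3.5's experimental ones; a sketch of enabling them (the 5m interval is an arbitrary choice):

$ etcd --experimental-initial-corrupt-check=true --experimental-corrupt-check-time=5m ...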

However, one interesting finding: it appears to be possible to take a snapshot from any node in an affected cluster, restore that snapshot, start with --force-new-cluster, and then rejoin the other nodes, after which the cluster returns consistent results.
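
A sketch of that recovery path with the stock tooling (paths and the endpoint are placeholders, and the restore would normally also take --name/--initial-cluster flags, elided here):

$ etcdctl --endpoints=https://127.0.0.1:2379 snapshot save /tmp/etcd.snapshot
$ etcdutl snapshot restore /tmp/etcd.snapshot --data-dir /var/lib/rancher/rke2/server/db/etcd-new
# start the first member from the restored data-dir with --force-new-cluster, then rejoin the others
$ etcd --force-new-cluster --data-dir /var/lib/rancher/rke2/server/db/etcd-new ...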

ahrtr (Member) commented May 6, 2022

I could not reproduce this issue with 3.5.4. Please provide the detailed steps and the commands you executed at each step.

brandond (Contributor, Author) commented May 6, 2022

I have not been able to reproduce it without a Kubernetes cluster pointed at the etcd cluster. Not sure if it is a load issue, or something to do with the way Kubernetes uses transactions for its create/update operations. Is there a suggested load simulation tool that I could test with?

ahrtr (Member) commented May 8, 2022

> I have not been able to reproduce it without a Kubernetes cluster pointed at the etcd cluster. Not sure if it is a load issue, or something to do with the way Kubernetes uses transactions for its create/update operations. Is there a suggested load simulation tool that I could test with?

You can try the benchmark tool. See an example command below:

./benchmark txn-put --endpoints="http://127.0.0.1:2379" --clients=200 --conns=200 --key-space-size=500000000 --key-size=128 --val-size=10240  --total=100000 --rate=10000

ahrtr (Member) commented Sep 8, 2022

> brandond@dev01:~/etcd-split-brain$ etcd-dump-db iterate-bucket etcd-1 members
> key="3e8594789d62d712", value="{\"id\":4505170248011536146,\"peerURLs\":[\"https://172.31.17.205:2380\"],\"isLearner\":true}"
> key="3c0e71035ef2e3ca", value="{\"id\":4327520551241442250,\"peerURLs\":[\"https://172.31.30.121:2380\"],\"name\":\"ip-172-31-30-121-53c44a92\"}"
>
> brandond@dev01:~/etcd-split-brain$ etcd-dump-db iterate-bucket etcd-1 members_removed
> key="77aa3673d9e0e2", value="removed"
>
> brandond@dev01:~/etcd-split-brain$ etcd-dump-db iterate-bucket etcd-2 members
> key="77aa3673d9e0e2", value="{\"id\":33682673077182690,\"peerURLs\":[\"https://172.31.17.205:2380\"],\"isLearner\":true}"
> key="3e8594789d62d712", value="{\"id\":4505170248011536146,\"peerURLs\":[\"https://172.31.17.205:2380\"],\"isLearner\":true}"
> key="3c0e71035ef2e3ca", value="{\"id\":4327520551241442250,\"peerURLs\":[\"https://172.31.30.121:2380\"],\"name\":\"ip-172-31-30-121-53c44a92\"}"
>
> brandond@dev01:~/etcd-split-brain$ etcd-dump-db iterate-bucket etcd-2 members_removed
>
> brandond@dev01:~/etcd-split-brain$

It's interesting that there are two learners on etcd-2 but only one learner on etcd-1. Could you provide detailed steps (with the exact commands) on how to reproduce this issue?

brandond (Contributor, Author) commented Sep 8, 2022

I really wish I knew how to reproduce it using just etcd and a bare etcd3 client or the benchmark tool. At the moment I can only reproduce it when I have a Kubernetes cluster pointed at etcd and use the --force-new-cluster option to reset the cluster membership back to a single node.

stale bot commented Dec 31, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

stale bot added the stale label on Dec 31, 2022
stale bot closed this as completed on Apr 2, 2023