ETCD Unstable under high sustained load #14072

Closed

JamesMurkin opened this issue May 25, 2022 · 2 comments

@JamesMurkin

What happened?

From two machines, I ran while [ 1 ]; do dd if=/dev/urandom bs=1024 count=1024 | etcdctl put key || break; done against my etcd cluster, and it became unstable.

On the client side I get:
"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"} Error: context deadline exceeded

On the etcd side, the whole cluster becomes unstable for 1-2 minutes, then one of several things happens:

  • Full recovery, all nodes rejoin
  • Two nodes rejoin and become stable; one node gets into a broken state and never rejoins
  • No nodes rejoin and the cluster remains offline
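
A minimal sketch of how each member's state can be checked after such an event; the endpoint addresses and cert paths are placeholders, not the actual cluster values:

```sh
# Check whether each member answers and whether all members are still listed;
# endpoints and cert paths below are placeholders.
ENDPOINTS=https://etcd-0:2379,https://etcd-1:2379,https://etcd-2:2379

etcdctl --endpoints="$ENDPOINTS" \
        --cacert=/path/to/ca.crt --cert=/path/to/client.crt --key=/path/to/client.key \
        endpoint health

etcdctl --endpoints="$ENDPOINTS" \
        --cacert=/path/to/ca.crt --cert=/path/to/client.crt --key=/path/to/client.key \
        member list -w table
```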

What did you expect to happen?

I expected etcd to handle this gracefully, either by rate limiting the client or at least by always fully recovering when it does become overloaded.

Having the whole cluster become indefinitely unstable isn't very desirable.

How can we reproduce it (as minimally and precisely as possible)?

I used the following machines (all VMs):

  • 3 ETCD nodes - 8 cores, 32Gi RAM, 50Gi disk
  • 2 Kubernetes master nodes - 36 cores, 125Gi RAM, 120Gi disk
  • Kube-api etcdCompactionInterval=2m30s (see the flag sketch just after this list)
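
A minimal sketch of how that compaction interval is presumably wired in, assuming it maps to the kube-apiserver flag below; the etcd endpoints are placeholders and the many other flags kube-apiserver needs are omitted:

```sh
# Compaction interval of 2m30s instead of the 5m default (sketch; other required flags omitted).
kube-apiserver \
  --etcd-servers=https://etcd-0:2379,https://etcd-1:2379,https://etcd-2:2379 \
  --etcd-compaction-interval=2m30s
```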

I ran the equivalent of the following (I had to pass various flags for the certs and addresses; a fuller sketch follows the notes below):

  • while [ 1 ]; do dd if=/dev/urandom bs=1024 count=1024 | etcdctl put key || break; done

From both masters.

  • A note: to prevent hitting the etcd limit, I sometimes have to manually stop running it on one master until we hit a compaction period, and then start it again
  • You can get the timeout even with a single master, but etcd usually recovers gracefully from that. I used two masters as it causes the issue faster
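
A fuller sketch of the loop, with the kind of cert and endpoint flags mentioned above filled in; the paths and addresses are illustrative placeholders, not my actual values:

```sh
# Sustained write load: ~1 MiB random values written to a single key, looped until a put fails.
# Endpoints and cert paths are placeholders; substitute the real ones.
ENDPOINTS=https://etcd-0:2379,https://etcd-1:2379,https://etcd-2:2379

while [ 1 ]; do
  dd if=/dev/urandom bs=1024 count=1024 | \
    etcdctl --endpoints="$ENDPOINTS" \
            --cacert=/path/to/ca.crt \
            --cert=/path/to/client.crt \
            --key=/path/to/client.key \
            put key || break
done
```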

Anything else we need to know?

I was originally trying to test how etcd handled different values of --quota-backend-bytes when I ran into this.

  • I was trying to fill the database to see if performance was still good when it was nearly full, and to measure recovery time

The easiest way I've found to replicate it is to set --quota-backend-bytes to 25Gi, turn off compaction on k8s, and run the above commands.

  • I never got it to hit the --quota-backend-bytes limit, as it would always break the cluster before getting that far

However, I have also replicated it with --quota-backend-bytes set to 8Gi; it just required a bit more effort to make sure you don't hit the limit (as described above), but it still regularly occurs within 15 minutes. It seems harder to have it fail badly at the 8Gi limit, but not impossible.
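
A minimal sketch of how the database size can be watched against the quota while the loop runs, and of how a quota breach would show up; endpoints and cert paths are placeholders again:

```sh
# The DB SIZE column shows how close the backend is to --quota-backend-bytes;
# endpoint and cert flags are the same placeholders as in the load loop above.
ENDPOINTS=https://etcd-0:2379,https://etcd-1:2379,https://etcd-2:2379

etcdctl --endpoints="$ENDPOINTS" \
        --cacert=/path/to/ca.crt --cert=/path/to/client.crt --key=/path/to/client.key \
        endpoint status -w table

# If the quota were actually reached, etcd would raise a NOSPACE alarm.
etcdctl --endpoints="$ENDPOINTS" \
        --cacert=/path/to/ca.crt --cert=/path/to/client.crt --key=/path/to/client.key \
        alarm list
```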

Generally it seems that histogram_quantile(0.99, sum(rate(etcd_disk_backend_commit_duration_seconds_bucket[5m])) by (le)) just continues to increase until it hits an inflection point and the cluster falls over.
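
The raw histograms behind that query can also be pulled straight from etcd. A sketch, assuming metrics are exposed over a plain-HTTP listener via --listen-metrics-urls (otherwise the client URL plus TLS flags would be needed); the host and port are placeholders:

```sh
# Backend commit latency buckets feeding the p99 query above.
curl -s http://etcd-0:2381/metrics | grep etcd_disk_backend_commit_duration_seconds

# WAL fsync latency is the other disk metric usually watched alongside it.
curl -s http://etcd-0:2381/metrics | grep etcd_disk_wal_fsync_duration_seconds
```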

Etcd version (please run commands below)

We are using v3.5.3 from https://quay.io/repository/coreos/etcd?tab=tags&tag=latest

$ etcdctl version
etcdctl version: 3.5.0
API version: 3.5

Etcd configuration (command line flags or environment variables)

We just use the standard entrypoint for the image mentioned above, with no extra command-line flags.

I'll need to check whether I can share the full configuration, but it is pretty standard; shout if it is needed.

I guess the main difference is:
ETCD_QUOTA_BACKEND_BYTES=8589934592
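
For reference, that value is exactly 8GiB, and the environment variable maps onto the corresponding command-line flag (sketch only):

```sh
# 8 * 1024^3 bytes = 8589934592 bytes = 8GiB
echo $((8 * 1024 * 1024 * 1024))

# The env var form and the flag form are equivalent ways to set the backend quota.
ETCD_QUOTA_BACKEND_BYTES=8589934592 etcd
# etcd --quota-backend-bytes=8589934592
```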

Etcd debug information (please run commands below, feel free to obfuscate the IP address or FQDN in the output)

I'll need to check whether I can share this, but it is a pretty standard configuration; shout if this is needed.

Relevant log output

No response

@chaochn47
Member

@JamesMurkin would you mind providing some etcd log files from when the issue happened?
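
How those logs get collected depends on how etcd is run; a sketch for two common cases, with the unit and container names as placeholders:

```sh
# If etcd runs under systemd on each VM:
journalctl -u etcd --since "2022-05-25 00:00" > etcd-$(hostname).log

# If it runs as a container from the quay.io image:
docker logs etcd > etcd-$(hostname).log 2>&1
```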

@stale

stale bot commented Dec 31, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Dec 31, 2022
@stale stale bot closed this as completed Apr 2, 2023