ETCD Unstable under high sustained load #14072

Closed

JamesMurkin opened this issue May 25, 2022 · 2 comments

@JamesMurkin

What happened?

From two machines, I ran while [ 1 ]; do dd if=/dev/urandom bs=1024 count=1024 | etcdctl put key || break; done against my etcd cluster, and it became unstable.

On the client side I get:
"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"} Error: context deadline exceeded

On the etcd side, the whole cluster becomes unstable for 1-2 minutes, then one of several things happens:

  • Full recovery, all nodes rejoin
  • Two nodes rejoin and become stable; one node gets into a broken state and never rejoins
  • No nodes rejoin and the cluster remains offline
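
A minimal sketch of how each member's state can be checked after such an event; the endpoint addresses and cert paths are placeholders, not the actual cluster values:

```sh
# Check whether each member answers and whether all members are still listed;
# endpoints and cert paths below are placeholders.
ENDPOINTS=https://etcd-0:2379,https://etcd-1:2379,https://etcd-2:2379

etcdctl --endpoints="$ENDPOINTS" \
        --cacert=/path/to/ca.crt --cert=/path/to/client.crt --key=/path/to/client.key \
        endpoint health

etcdctl --endpoints="$ENDPOINTS" \
        --cacert=/path/to/ca.crt --cert=/path/to/client.crt --key=/path/to/client.key \
        member list -w table
```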

What did you expect to happen?

I expected etcd to handle this gracefully, either by rate limiting the client or at least by always fully recovering when it does become overloaded.

Having the whole cluster become indefinitely unstable isn't very desirable.

How can we reproduce it (as minimally and precisely as possible)?

I used the following machines (all VMs):

  • 3 ETCD nodes - 8 cores, 32Gi RAM, 50Gi disk
  • 2 Kubernetes master nodes - 36 cores, 125Gi RAM, 120Gi disk
  • Kube-api etcdCompactionInterval=2m30s (see the flag sketch just after this list)
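
A minimal sketch of how that compaction interval is presumably wired in, assuming it maps to the kube-apiserver flag below; the etcd endpoints are placeholders and the many other flags kube-apiserver needs are omitted:

```sh
# Compaction interval of 2m30s instead of the 5m default (sketch; other required flags omitted).
kube-apiserver \
  --etcd-servers=https://etcd-0:2379,https://etcd-1:2379,https://etcd-2:2379 \
  --etcd-compaction-interval=2m30s
```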

I ran the equivalent of the following (I had to pass various flags for the certs and addresses; a fuller sketch follows the notes below):

  • while [ 1 ]; do dd if=/dev/urandom bs=1024 count=1024 | etcdctl put key || break; done

From both masters.

  • A note: to prevent hitting the etcd limit, I sometimes have to manually stop running it on one master until we hit a compaction period, and then start it again
  • You can get the timeout even with a single master, but etcd usually recovers gracefully from that. I used two masters as it causes the issue faster
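
A fuller sketch of the loop, with the kind of cert and endpoint flags mentioned above filled in; the paths and addresses are illustrative placeholders, not my actual values:

```sh
# Sustained write load: ~1 MiB random values written to a single key, looped until a put fails.
# Endpoints and cert paths are placeholders; substitute the real ones.
ENDPOINTS=https://etcd-0:2379,https://etcd-1:2379,https://etcd-2:2379

while [ 1 ]; do
  dd if=/dev/urandom bs=1024 count=1024 | \
    etcdctl --endpoints="$ENDPOINTS" \
            --cacert=/path/to/ca.crt \
            --cert=/path/to/client.crt \
            --key=/path/to/client.key \
            put key || break
done
```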

Anything else we need to know?

I was originally trying to test how etcd handled different values of --quota-backend-bytes when I ran into this.

  • I was trying to fill the database to see if performance was still good when it was nearly full, and to measure recovery time

The easiest way I've found to replicate it is to set --quota-backend-bytes to 25Gi, turn off compaction on k8s, and run the above commands.

  • I never got it to hit the --quota-backend-bytes limit, as it would always break the cluster before getting that far

However, I have also replicated it with --quota-backend-bytes set to 8Gi; it just required a bit more effort to make sure you don't hit the limit (as described above), but it still regularly occurs within 15 minutes. It seems harder to have it fail badly at the 8Gi limit, but not impossible.
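
A minimal sketch of how the database size can be watched against the quota while the loop runs, and of how a quota breach would show up; endpoints and cert paths are placeholders again:

```sh
# The DB SIZE column shows how close the backend is to --quota-backend-bytes;
# endpoint and cert flags are the same placeholders as in the load loop above.
ENDPOINTS=https://etcd-0:2379,https://etcd-1:2379,https://etcd-2:2379

etcdctl --endpoints="$ENDPOINTS" \
        --cacert=/path/to/ca.crt --cert=/path/to/client.crt --key=/path/to/client.key \
        endpoint status -w table

# If the quota were actually reached, etcd would raise a NOSPACE alarm.
etcdctl --endpoints="$ENDPOINTS" \
        --cacert=/path/to/ca.crt --cert=/path/to/client.crt --key=/path/to/client.key \
        alarm list
```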

Generally it seems that histogram_quantile(0.99, sum(rate(etcd_disk_backend_commit_duration_seconds_bucket[5m])) by (le)) just continues to increase until it hits an inflection point and the cluster falls over.
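
The raw histograms behind that query can also be pulled straight from etcd. A sketch, assuming metrics are exposed over a plain-HTTP listener via --listen-metrics-urls (otherwise the client URL plus TLS flags would be needed); the host and port are placeholders:

```sh
# Backend commit latency buckets feeding the p99 query above.
curl -s http://etcd-0:2381/metrics | grep etcd_disk_backend_commit_duration_seconds

# WAL fsync latency is the other disk metric usually watched alongside it.
curl -s http://etcd-0:2381/metrics | grep etcd_disk_wal_fsync_duration_seconds
```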

Etcd version (please run commands below)

We are using v3.5.3 from https://quay.io/repository/coreos/etcd?tab=tags&tag=latest

$ etcdctl version
etcdctl version: 3.5.0
API version: 3.5

Etcd configuration (command line flags or environment variables)

We just use the standard entrypoint for the image mentioned above, with no extra command-line flags.

I'll need to check whether I can share the full configuration, but it is pretty standard; shout if it is needed.

I guess the main difference is:
ETCD_QUOTA_BACKEND_BYTES=8589934592
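
For reference, that value is exactly 8GiB, and the environment variable maps onto the corresponding command-line flag (sketch only):

```sh
# 8 * 1024^3 bytes = 8589934592 bytes = 8GiB
echo $((8 * 1024 * 1024 * 1024))

# The env var form and the flag form are equivalent ways to set the backend quota.
ETCD_QUOTA_BACKEND_BYTES=8589934592 etcd
# etcd --quota-backend-bytes=8589934592
```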

Etcd debug information (please run commands below, feel free to obfuscate the IP address or FQDN in the output)

I'll need to check whether I can share this, but it is a pretty standard configuration; shout if this is needed.

Relevant log output

No response

@chaochn47
Member

@JamesMurkin would you mind providing some etcd log files from when the issue happened?
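
How those logs get collected depends on how etcd is run; a sketch for two common cases, with the unit and container names as placeholders:

```sh
# If etcd runs under systemd on each VM:
journalctl -u etcd --since "2022-05-25 00:00" > etcd-$(hostname).log

# If it runs as a container from the quay.io image:
docker logs etcd > etcd-$(hostname).log 2>&1
```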

@stale

stale bot commented Dec 31, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Dec 31, 2022
@stale stale bot closed this as completed Apr 2, 2023