
Bug 1850057: stage OS updates (nicely) while etcd is still running #1897

Closed
cgwalters opened this issue Jul 2, 2020 · 37 comments
Labels
lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@cgwalters (Member) commented Jul 2, 2020

See https://bugzilla.redhat.com/show_bug.cgi?id=1850057

(This content is canonically stored at https://github.com/cgwalters/workboard/tree/master/openshift/bz1850057-etcd-osupdate)

OS upgrade I/O competes with etcd

Currently the MCD does:

  • drain
  • apply updates
  • reboot

Now "drain" keeps both daemonsets and static pods running. Of those two, etcd is a static pod today. When we're applying OS updates, that can be a lot of I/O and (reportedly) compete with etcd.

We have two options:

  1. kill etcd after draining (or even more strongly, stop kubelet)
  2. "stage" updates gracefully while everything is running

I like option 2) better because we've put a whole lot of work into making the ostree stack support this "stage updates while system is running" and it'd be cool if OpenShift used it 😄 Another way to say this is - I think we want to minimize the time window in which the etcd cluster is missing a member, so the more work we can do while etcd is still running the better!

Links

Workboard:

Pull request in progress: #1957

Example failing jobs:

Note that both of those jobs jumped from RHEL 8.1 to 8.2.

Prometheus queries

histogram_quantile(0.99, sum(irate(etcd_disk_wal_fsync_duration_seconds_bucket[5m])) by (instance, le)) 
histogram_quantile(0.99, sum(irate(etcd_disk_backend_commit_duration_seconds_bucket[5m])) by (instance, le))
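If it's useful to pull these numbers programmatically (e.g. to compare runs in CI), here's a minimal sketch using the Prometheus Go client; the address is a placeholder and in-cluster auth/TLS wiring is omitted:

```go
package osupdate

import (
	"context"
	"fmt"
	"time"

	"github.com/prometheus/client_golang/api"
	promv1 "github.com/prometheus/client_golang/api/prometheus/v1"
)

// queryP99WALFsync runs the p99 WAL-fsync query above against Prometheus.
// promAddr is a placeholder; in-cluster you'd also need auth/TLS wiring.
func queryP99WALFsync(ctx context.Context, promAddr string) error {
	client, err := api.NewClient(api.Config{Address: promAddr})
	if err != nil {
		return err
	}
	v1api := promv1.NewAPI(client)
	const q = `histogram_quantile(0.99, sum(irate(etcd_disk_wal_fsync_duration_seconds_bucket[5m])) by (instance, le))`
	result, warnings, err := v1api.Query(ctx, q, time.Now())
	if err != nil {
		return err
	}
	if len(warnings) > 0 {
		fmt.Println("query warnings:", warnings)
	}
	fmt.Println(result) // a model.Vector of per-instance p99 fsync latencies
	return nil
}
```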

@cgwalters (Member, Author)

We also had problems in the past with applying config changes to /etc (including the static pod manifest) while the static pod was still running; I tried looking at git log --grep=etcd but didn't see it. If we change the flow to ensure we're stopping etcd before we apply changes on the filesystem in /etc, that would also have fixed that earlier problem.

(Bigger picture we're being pulled in two different directions here - openshift/enhancements#159 is all about "apply changes without rebooting!" whereas this direction is more "make updates more transactional as part of the reboot!")

@smarterclayton (Contributor)

Apply updates without rebooting is not a system design goal yet :)

The design goal is “upgrade is completely non disruptive to the system”. Right now it’s possible that static pods like etcd might actually have to be shut down prior to any disruption. Note that it’s not clear what is going on yet.

@cgwalters (Member, Author)

Yep, we're in agreement there! I am hopeful we can minimize the upgrade disruption - we aren't making any attempt to do so today, and I would not be at all surprised to discover the oscontainer/rpm-ostree upgrade bit is disruptive as far as block I/O goes.

@cgwalters (Member, Author)

To flesh this out a bit: today we implement a lot of sophisticated logic as part of zincati that the MCO can use too. The basic idea is - if we happen to be interrupted after "staging" an update, we only want it to actually take effect at the reboot at the very end. If we e.g. fail to drain, or the kernel happens to crash and we get rebooted, we don't want that pending update to take effect!

See coreos/rpm-ostree#1814
and
https://github.com/coreos/zincati/blob/50d51ce97c63fd35fac9bf5c702e4fe299c156c6/src/rpm_ostree/cli_finalize.rs#L21

IOW the flow is like this:

if diff.oscontainer != "" {
  // Stage the new OS image, but hold off on making it the default
  // boot entry (--lock-finalization).
  rpm_ostree_upgrade(diff.oscontainer, "--lock-finalization")
}
drain()
write_config_changes(diff.filesystem)
stop_kubelet()
// Only now flip the bootloader to the staged deployment.
finalize_deployment(diff.oscontainer)

Where only at that very last step do we perform the swap of the bootloader config (and that's basically it).
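A rough Go sketch of that flow as a daemon might drive it by shelling out to rpm-ostree (function names and error handling here are illustrative, and the exact CLI spelling of the finalize step may differ by rpm-ostree version):

```go
package osupdate

import (
	"fmt"
	"os/exec"
)

// stageOSUpdate pulls and stages the new OS image while etcd and the other
// workloads are still running; --lock-finalization ensures an unexpected
// reboot will not boot into the half-finished update.
func stageOSUpdate(ref string) error {
	out, err := exec.Command("rpm-ostree", "rebase", "--lock-finalization", ref).CombinedOutput()
	if err != nil {
		return fmt.Errorf("staging %s: %v: %s", ref, err, out)
	}
	return nil
}

// finalizeOSUpdate flips the bootloader to the staged deployment; this is the
// only step that has to happen after draining and stopping kubelet/etcd.
func finalizeOSUpdate(checksum string) error {
	out, err := exec.Command("rpm-ostree", "finalize-deployment", checksum).CombinedOutput()
	if err != nil {
		return fmt.Errorf("finalizing %s: %v: %s", checksum, err, out)
	}
	return nil
}
```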

@hexfusion (Contributor) commented Jul 7, 2020

Update after triage from etcd team.

It appears there is a direct correlation between the chain of events that happens during upgrade and increased I/O latencies, which can lead to control-plane disruption.

Steps performed for the data experiment:

  • manually created GCP IPI cluster 4.4.11
  • deployed systemd unit for atop monitoring via MachineConfig https://gist.github.com/hexfusion/400d43d511a967814f5ccd8a7e09db8c
  • performed upgrade to 4.5-rc-6
  • to ensure the atop time-series datafile was preserved over the upgrade, I copied it to ~root/ right before the node reboot
  • observed metrics from cluster noting timestamp for I/O spike
  • stepped through atop timeseries and observed load during spike.

Results as observed from sbatsche-atop10-2020-vrpt6-master-1:

  • approx. 8:49: p99 for the etcd metrics etcd_disk_wal_fsync_duration_seconds_bucket and etcd_disk_backend_commit_duration_seconds_bucket spikes.

Peak I/O activity during this time series (in general, I/O remains elevated):

podman pull (update payload pull?)

  • atop 8:48:55: podman pull -q ... starts; disks report 61% busy, r/0 w/1847, MBw/s 266.3, avio 0.55 ms

kworker/u8:4+fl

  • atop 8:48:27: kworker/u8:4+fl starts; disks report 78% busy, r/0 w/14900, MBw/s 242.1, avio 0.32 ms

rpm-ostree

  • atop 8:49:21: /usr/bin/rpm-ostree start-daemon starts; disks instantly report 77% busy, r/981 w/57, avio 0.74 ms
  • atop 8:49:25: /usr/bin/rpm-ostree start-daemon running; disks report 78% busy, r/404 w/1234, MBw/s 257.0, avio 0.42 ms
  • atop 8:49:26: /usr/bin/rpm-ostree start-daemon running; disks report 91% busy, r/853 w/1268, MBw/s 229.7, avio 0.41 ms
  • atop 8:49:27: /usr/bin/rpm-ostree start-daemon running; disks report 95% busy, r/1162 w/75, MBw/s 229.7, avio 0.69 ms

crio

  • atop 8:49:27: /usr/bin/crio wipe starts; disks report 87% busy, r/896 w/281, MBw/s 8.9, avio 0.74 ms

kube scheduler

  • atop 8:52:21: kube-scheduler spikes; disks report 73% busy, r/1124 w/0, MBr/s 27.3 | MBw/s 3.1, avio 0.64 ms

Conclusion/open question: can we perform these actions in a way that is less disruptive to the filesystem?

jlebon added a commit to jlebon/rpm-ostree that referenced this issue Jul 7, 2020
Both `upgrade` and `deploy` already support this. There's no reason why
all the remaining "deployment-creating" commands shouldn't. Prompted by
openshift/machine-config-operator#1897 which
will need this specifically for `rebase`.
@jlebon (Member) commented Jul 7, 2020

coreos/rpm-ostree#2158

openshift-merge-robot pushed a commit to coreos/rpm-ostree that referenced this issue Jul 8, 2020
Both `upgrade` and `deploy` already support this. There's no reason why
all the remaining "deployment-creating" commands shouldn't. Prompted by
openshift/machine-config-operator#1897 which
will need this specifically for `rebase`.
@cgwalters (Member, Author)

OK so I'm playing with things like IOWeight= and IOSchedulingClass=idle etc. (see e.g. https://www.freedesktop.org/software/systemd/man/systemd.resource-control.html http://0pointer.de/blog/projects/resources.html ) and...none of it is having any effect.

I'm playing with this etcd fio job: etcd-io/etcd#10577 (comment)

(Does etcd really use O_DIRECT? I need to verify that)

Also, as far as I can tell, we're basically not exposing or using filesystem/block I/O cgroups for pods, right? (Basically just cpu/memory.)
So this issue of balancing etcd with ostree is just a subset of generic I/O control for workloads. I think this is related to cgroups v2: https://andrestc.com/post/cgroups-io/ - ostree definitely uses buffered writes, so we may have to just manually tune down our write speed.
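A minimal sketch of the "manually tune down our write speed" idea (not ostree's actual code): fsync periodically as data is written so dirty pages never accumulate into one giant flush that competes with etcd's fdatasync:

```go
package osupdate

import (
	"io"
	"os"
)

// throttledCopy writes src to dst in small chunks and fsyncs every
// flushEvery bytes. The periodic fsync acts as backpressure: dirty pages
// never pile up into one giant flush that stalls etcd's WAL fdatasync.
func throttledCopy(dst *os.File, src io.Reader, flushEvery int64) error {
	buf := make([]byte, 128*1024)
	var unsynced int64
	for {
		n, err := src.Read(buf)
		if n > 0 {
			if _, werr := dst.Write(buf[:n]); werr != nil {
				return werr
			}
			unsynced += int64(n)
			if unsynced >= flushEvery {
				if serr := dst.Sync(); serr != nil {
					return serr
				}
				unsynced = 0
			}
		}
		if err == io.EOF {
			return dst.Sync()
		}
		if err != nil {
			return err
		}
	}
}
```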

@mrunalp (Member) commented Jul 9, 2020

We have discussed throttling image pulls as well in podman/crio till we can use cgroups v2 features.

cgwalters added a commit to cgwalters/ostree that referenced this issue Jul 9, 2020
We need to expose more control over how fast we write to disk
in order to allow OS vendors to more finely tune latency vs
throughput for OS updates.

See openshift/machine-config-operator#1897

This is a crude hack I'm using for testing; need to also
expose this via API, document it etc.
@cgwalters (Member, Author)

I am working on a writeup around this in https://hackmd.io/WeqiDWMAQP2sNtuPRul9QA

@jlebon (Member) commented Jul 10, 2020

Run a HTTP server (kube service) that mounts the oscontainer

This makes a lot of sense to me and clearly scales better with the size of the cluster. (We could even go fancier in the future and have the service generate static deltas to trade CPU cycles on one node for even more I/O efficiency across the cluster; and at least CPU cycles are more easily accountable/schedulable.)

@jlebon (Member) commented Jul 10, 2020

Another advantage of that approach: given that network speed is normally slower than disk speed, you also automatically get less demanding disk I/O.

cgwalters added a commit to cgwalters/ostree that referenced this issue Jul 15, 2020
We need to expose more control over how fast we write to disk
in order to allow OS vendors to more finely tune latency vs
throughput for OS updates.

See openshift/machine-config-operator#1897

This `OSTREE_PULL_MAX_OUTSTANDING_WRITES` is a crude hack
I'm using for testing; need to also expose this via API, document it etc.

Second, add a `per-object-fsync` option which goes back to invoking
`fsync()` on each object just after writing it.

Combined, these two ensure natural "backpressure" against trying
to fsync a huge amount of data all at once, which I believe is
what is leading to the huge etcd fsync latency.
@cgwalters (Member, Author)

More updates in the doc, but to summarize: it seems that what ostree is doing today can introduce e.g. 2 seconds of latency for etcd fdatasync. We can rework things so we fsync as we go, avoiding a single big flush. This slows down the update a lot, but also greatly reduces the max latency (from ~2s to 0.46s).

In other words, if we take this from:

  fsync/fdatasync/sync_file_range:
    sync (usec): min=732, max=1935.7k, avg=2594.14, stdev=19014.96
    sync percentiles (usec):
     |  1.00th=[   832],  5.00th=[   881], 10.00th=[   914], 20.00th=[   971],
     | 30.00th=[  1020], 40.00th=[  1090], 50.00th=[  2933], 60.00th=[  3032],
     | 70.00th=[  3064], 80.00th=[  3130], 90.00th=[  3228], 95.00th=[  3425],
     | 99.00th=[  6194], 99.50th=[  7963], 99.90th=[ 17433], 99.95th=[ 95945],
     | 99.99th=[843056]

to:

  fsync/fdatasync/sync_file_range:
    sync (usec): min=741, max=461949, avg=3970.41, stdev=6696.28
    sync percentiles (usec):
     |  1.00th=[   824],  5.00th=[   898], 10.00th=[   947], 20.00th=[  1029],
     | 30.00th=[  2073], 40.00th=[  2868], 50.00th=[  3032], 60.00th=[  3163],
     | 70.00th=[  4113], 80.00th=[  6194], 90.00th=[  8586], 95.00th=[ 10028],
     | 99.00th=[ 13042], 99.50th=[ 15139], 99.90th=[ 35390], 99.95th=[ 87557],
     | 99.99th=[459277]

is that sufficiently better to call this "fixed"?

@cgwalters (Member, Author) commented Jul 15, 2020

Now that I look more closely, we've mostly traded a huge stdev (and maximum) for a much worse average, and we're actually pushing the 99th percentile over 10ms. Argh.

@hexfusion (Contributor)

TBH, the 10ms threshold is more or less arbitrary in this case; the comparison where we felt it held true was on clusters without load. So don't let these results dampen hope.

@cgwalters (Member, Author)

Can you speak to how bad those potential huge 2s stalls would be for etcd versus a longer period of increased latency? Does it sound likely that a huge spike or two like that is the root cause of the disruption?

@hexfusion (Contributor)

When we look at "why are we doing this" and "why does it matter", we need to consider the following: disk latency is directly related to election timeouts (leader elections), because it determines how long it takes to persist raft transactions to the WAL file. So if we hit 2s fsyncs on, for example, Azure, where the election timeout is 2500ms, we are pushing the cluster towards a timeout.

If instead we have lower (but still elevated) latency, even over a longer time series, we might see some API slowness, but the chances are far lower that we trigger an election as a result. So the net gain is less disruption. Also consider the operators who use leader election: they must update their TTL every 10s or go through the leader election process, so a huge latency spike in etcd could mean an operator isn't able to update its configmap in time.

In short, it's easier for etcd to tolerate elevated latency over a period of time than a very heavy spike.
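To make that arithmetic concrete, a tiny illustrative check (the 0.5 warning threshold is an arbitrary assumption; 2500ms is the Azure election timeout mentioned above):

```go
package osupdate

import "time"

// electionHeadroom reports what fraction of the election timeout a single
// p99 WAL fsync consumes. A 2s fsync against a 2500ms timeout consumes 0.8
// of the budget, i.e. the cluster is close to a leader election.
func electionHeadroom(p99Fsync, electionTimeout time.Duration) (fraction float64, risky bool) {
	fraction = float64(p99Fsync) / float64(electionTimeout)
	// The 0.5 warning threshold is an arbitrary illustrative choice.
	return fraction, fraction > 0.5
}
```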

@cgwalters (Member, Author) commented Jul 15, 2020

Here's another sub-thread though: Since we know we're very shortly going to be rebooting this machine, would it help to tell etcd "if this node is a leader, please re-elect a different one"? Or to rephrase, are latency spikes much worse of a problem on the leader?

@ironcladlou (Contributor)

Here's another sub-thread though: Since we know we're very shortly going to be rebooting this machine, would it help to tell etcd "if this node is a leader, please re-elect a different one"? Or to rephrase, are latency spikes much worse of a problem on the leader?

This is a good question, I think it has been discussed before elsewhere in more depth. I think @hexfusion has some input here re: actively designating a leader intelligently (if we can predict the reboot order)

@cgwalters (Member, Author)

Moved that to openshift/cluster-etcd-operator#392

@hexfusion (Contributor)

Yeah, the net result of rolling reboots will be a leader change - it is going to happen; the leader will go down. When etcd gets SIGTERM it will try to gracefully transfer leadership rather than force an election, but the result is still a "leader change", which causes a fail-fast to clients to retry (TTL resets etc.), because technically there is a period with no leader, even if just a few ms. So while we expect one leader change, what we see as a result here can be multiple.

@cgwalters (Member, Author) commented Jul 15, 2020

So while we expect one leader change, what we see as a result here can be multiple.

Hmm...actually there can be anywhere between 1 and 3 elections - if the MCO happens to choose the leader each time to upgrade. Right? So I think perhaps instead of openshift/cluster-etcd-operator#392 the MCO should prefer upgrading followers first...right?

@hexfusion (Contributor)

Hmm...actually there can be anywhere between 1 and 3 elections - if the MCO happens to choose the leader each time to upgrade. Right? So I think perhaps instead of openshift/cluster-etcd-operator#392 the MCO should prefer upgrading followers first...right?

I don't disagree that chance is currently involved. Logically choosing non-leader to upgrade first would help hedge those bets.

@cgwalters (Member, Author)

I don't disagree that chance is currently involved. Logically choosing non-leader to upgrade first would help hedge those bets.

I can say for sure that the MCO code today makes no attempt to apply any kind of ordering/priority to the nodes it's going to update. It could in fact be effectively random.

Adding a little bit of logic here for the control plane to avoid the etcd leader should be pretty easy...is the current leader reflected anywhere in the API? I am not seeing it in oc describe clusteroperator/etcd or oc get etcd etc.
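For what it's worth, a hedged sketch of how the current leader could be discovered with the etcd v3 Go client (endpoints and error handling are placeholders; TLS configuration is omitted):

```go
package osupdate

import (
	"context"
	"errors"
	"time"

	"go.etcd.io/etcd/clientv3"
)

// findLeaderEndpoint returns the client endpoint whose member is currently
// the etcd leader, so an updater could schedule that node's reboot last.
func findLeaderEndpoint(ctx context.Context, endpoints []string) (string, error) {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   endpoints,
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		return "", err
	}
	defer cli.Close()

	for _, ep := range endpoints {
		// Status reports both this member's ID and the current leader's ID.
		st, err := cli.Status(ctx, ep)
		if err != nil {
			return "", err
		}
		if st.Header.MemberId == st.Leader {
			return ep, nil
		}
	}
	return "", errors.New("no leader found among endpoints")
}
```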

cgwalters added a commit to cgwalters/coreos-assembler that referenced this issue Aug 5, 2020
We have "real" OS update tests which look at a previous build;
this is generally good, but I want to be able to reliably
test e.g. a "large" upgrade in some CI scenarios, and it's OK
if the upgrade isn't "real".

This command takes an ostree commit and adds a note to a percentage
of ELF binaries.  This way one can generate a "large" update by
specifying e.g. `--percentage=80` or so.

I plan to use this for testing etcd performance during
large updates; see
openshift/machine-config-operator#1897
cgwalters added a commit to cgwalters/machine-config-operator that referenced this issue Aug 6, 2020
Part of solving openshift#1897
A lot more details in https://hackmd.io/WeqiDWMAQP2sNtuPRul9QA

The TL;DR is that the `bfq` I/O scheduler better respects IO priorities,
and also does a better job of handling latency sensitive processes
like `etcd` versus bulk/background I/O.
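For illustration, switching the scheduler at the sysfs level looks like the sketch below (the MCO change may ship this as a MachineConfig/udev rule instead; the device name is an assumption):

```go
package osupdate

import (
	"fmt"
	"os"
)

// setIOScheduler selects the I/O scheduler for a block device, e.g.
// setIOScheduler("sda", "bfq") is equivalent to:
//   echo bfq > /sys/block/sda/queue/scheduler
func setIOScheduler(device, scheduler string) error {
	path := fmt.Sprintf("/sys/block/%s/queue/scheduler", device)
	return os.WriteFile(path, []byte(scheduler), 0644)
}
```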
cgwalters added a commit to cgwalters/machine-config-operator that referenced this issue Aug 6, 2020
We switched rpm-ostree to do this when applying updates, but
it also makes sense to do when extracting the oscontainer.

Part of: openshift#1897
Which is about staging OS updates more nicely when etcd is running.
cgwalters added a commit to cgwalters/machine-config-operator that referenced this issue Aug 6, 2020
The way we're talking to etcd is a bit hacky; I ended up cargo-culting
some code. This would be much cleaner if the etcd operator did it.

But it's critical that we update the etcd followers first, because
leader elections are disruptive events and we can easily minimize
that.

Closes: openshift#1897
@ironcladlou (Contributor)

The tricky thing is: it seems useful to measure success (at least partially) by number of leader elections. And if we do that we need the MCO to be doing the right thing in the first place.

Anyways like I said, I think we can just hack this into the MCO code directly for now, so we're not blocked.

Would it help if we were able to see the p99 fsync and commit metrics aggregated across a meaningfully sized set of CI runs that included the nontrivial OS upgrade? Absolute election count is one important metric, but it seems to me that we're also interested in knowing how much headroom we're providing to assess the risk of inducing election cycles in the near future even if an election didn't actually happen during the test.

Can/should we come up with some clear SLOs against these metrics for some specific test suites?

cgwalters added a commit to cgwalters/coreos-assembler that referenced this issue Aug 20, 2020
We have "real" OS update tests which look at a previous build;
this is generally good, but I want to be able to reliably
test e.g. a "large" upgrade in some CI scenarios, and it's OK
if the upgrade isn't "real".

This command takes an ostree commit and adds a note to a percentage
of ELF binaries.  This way one can generate a "large" update by
specifying e.g. `--percentage=80` or so.

I plan to use this for testing etcd performance during
large updates; see
openshift/machine-config-operator#1897
cgwalters added a commit to cgwalters/machine-config-operator that referenced this issue Aug 25, 2020
Part of solving openshift#1897
A lot more details in https://hackmd.io/WeqiDWMAQP2sNtuPRul9QA

The TL;DR is that the `bfq` I/O scheduler better respects IO priorities,
and also does a better job of handling latency sensitive processes
like `etcd` versus bulk/background I/O.
cgwalters added a commit to cgwalters/machine-config-operator that referenced this issue Aug 25, 2020
We switched rpm-ostree to do this when applying updates, but
it also makes sense to do when extracting the oscontainer.

Part of: openshift#1897
Which is about staging OS updates more nicely when etcd is running.
cgwalters added a commit to cgwalters/coreos-assembler that referenced this issue Aug 30, 2020
We have "real" OS update tests which look at a previous build;
this is generally good, but I want to be able to reliably
test e.g. a "large" upgrade in some CI scenarios, and it's OK
if the upgrade isn't "real".

This command takes an ostree commit and adds a note to a percentage
of ELF binaries.  This way one can generate a "large" update by
specifying e.g. `--percentage=80` or so.

Building on that, also add a wrapper which generates an update
from an oscontainer.

I plan to use this for testing etcd performance during
large updates; see
openshift/machine-config-operator#1897
openshift-merge-robot pushed a commit to coreos/coreos-assembler that referenced this issue Sep 8, 2020
We have "real" OS update tests which look at a previous build;
this is generally good, but I want to be able to reliably
test e.g. a "large" upgrade in some CI scenarios, and it's OK
if the upgrade isn't "real".

This command takes an ostree commit and adds a note to a percentage
of ELF binaries.  This way one can generate a "large" update by
specifying e.g. `--percentage=80` or so.

Building on that, also add a wrapper which generates an update
from an oscontainer.

I plan to use this for testing etcd performance during
large updates; see
openshift/machine-config-operator#1897
@openshift-bot (Contributor)

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci-robot added the lifecycle/stale label (Denotes an issue or PR has remained open with no activity and has become stale.) on Nov 19, 2020
@openshift-bot (Contributor)

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

@openshift-ci-robot added the lifecycle/rotten label (Denotes an issue or PR that has aged beyond stale and will be auto-closed.) and removed the lifecycle/stale label on Dec 19, 2020
@openshift-bot (Contributor)

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

@openshift-ci-robot (Contributor)

@openshift-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
