
FCOS upgrade tests often flaking on AWS #1301

Closed

jlebon opened this issue Mar 31, 2020 · 14 comments

jlebon (Member) commented Mar 31, 2020

We're seeing cosa kola --upgrades flake pretty often on AWS:

=== RUN   fcos.upgrade.basic
=== RUN   fcos.upgrade.basic/upgrade-from-current
--- FAIL: fcos.upgrade.basic (214.28s)
    --- PASS: fcos.upgrade.basic/setup (15.13s)
    --- FAIL: fcos.upgrade.basic/upgrade-from-previous (120.47s)
            basic.go:228: failed waiting for machine reboot: timed out after 2m0s waiting for machine to reboot
    --- SKIP: fcos.upgrade.basic/upgrade-from-current (0.00s)
            cluster.go:50: A previous test has already failed
FAIL, output in tmp/kola-upgrade

At first I thought it was due to AWS just taking longer to reboot. But the logs show that rpm-ostree is actually failing:

Mar 31 06:51:30.767361 zincati[1346]: [TRACE] request to stage release: Release { version: "31.20200331.94.0", checksum: "1cee7b6fda43ac7a22ce1a76feb50cb530fa91e8f5a8f1004e1c667ecf91bb7d", age_index: Some(1) }
Mar 31 06:51:55.803428 zincati[1346]: [ERROR] failed to stage deployment: rpm-ostree deploy failed:
Mar 31 06:51:55.803428 zincati[1346]:     error: Timeout was reached
Mar 31 06:51:55.803428 zincati[1346]:
Mar 31 06:51:55.804386 zincati[1346]: [TRACE] scheduling next agent refresh in 312 seconds
Mar 31 06:52:06.866091 rpm-ostree[916]: Failed to GetConnectionUnixUser for client :1.127: GDBus.Error:org.freedesktop.DBus.Error.NameHasNoOwner: The connection does not exist
Mar 31 06:52:06.866995 rpm-ostree[916]: Failed to GetConnectionUnixProcessID for client :1.127: GDBus.Error:org.freedesktop.DBus.Error.NameHasNoOwner: The connection does not exist
Mar 31 06:52:06.867138 rpm-ostree[916]: Initiated txn Deploy for client(dbus:1.127 uid:<unknown>): /org/projectatomic/rpmostree1/fedora_coreos

Hmm, yeah OK, so this looks like Zincati initiated the transaction, something along the rpm-ostree client -> D-Bus -> rpm-ostree daemon chain hung, the client gave up, and then unsurprisingly the daemon got NameHasNoOwner when it later tried to look up the (now gone) caller.
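
If this reproduces, one quick way to tell whether the daemon is merely slow or fully wedged is to poke it directly over the system bus. A hypothetical diagnostic (not from this thread), reusing the bus name and object path from the journal snippet above:

# Ask the rpm-ostree daemon to introspect itself; if even this stalls for
# tens of seconds, the daemon is blocked before it can dispatch the Deploy
# request (bus/object names taken from the journal above).
busctl introspect org.projectatomic.rpmostree1 \
    /org/projectatomic/rpmostree1/fedora_coreos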

jlebon changed the title from "kola aws upgrade tests failing" to "FCOS upgrade tests often flaking on AWS" on Mar 31, 2020
jlebon (Member, Author) commented Mar 31, 2020

This may not actually be a test issue directly, in which case we could transfer this to the tracker.

lucab (Contributor) commented Mar 31, 2020

@jlebon are those 25 seconds a hardcoded timeout somewhere in rpm-ostree? I can imagine that a small node with an underprovisioned network, CPU, and IOPS may take quite some time to stage a remote commit, but I haven't encountered this timeout before.

arithx (Contributor) commented Apr 1, 2020

Aside from the issue itself: if either rpm-ostree or Zincati is throwing anything onto the console log, it's probably worth extending the console checks in mantle so that it calls this out more verbosely, without needing to dig into the logs.

lucab (Contributor) commented Apr 1, 2020

@arithx the output snippet above is from the journal, from what I can see. On such errors, Zincati will keep retrying (after ~5 minutes by default). I think the test is correct in timing out aggressively, and the journal has the details.
One possible improvement, though, would be to also snapshot the metrics details.

@jlebon did you turn on Zincati trace logs after the fact? If not, should we add that config fragment directly to kola?
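
For reference, a sketch of the kind of fragment that could do that: a systemd drop-in raising Zincati's verbosity to trace level. This assumes zincati.service passes a $ZINCATI_VERBOSITY environment variable through to the agent, as the FCOS debugging docs describe; adjust if the unit differs.

# Hypothetical drop-in at /etc/systemd/system/zincati.service.d/10-verbosity.conf
# (assumes the unit's ExecStart honors $ZINCATI_VERBOSITY).
[Service]
Environment=ZINCATI_VERBOSITY="-vvvv"

On a live node this would be followed by `systemctl daemon-reload && systemctl restart zincati.service`; for test instances the same fragment could be shipped via Ignition.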

dustymabe (Member) commented:

Note that the kola AWS run for 31.20200323.3.0 passed this test, which leads me to believe it has something to do with the new Zincati. Everything after our 20200323 snapshot (where the new Zincati was introduced, I believe) now fails this test.

https://jenkins-fedora-coreos.apps.ci.centos.org/job/fedora-coreos/job/fedora-coreos-fedora-coreos-pipeline-kola-aws/360/

lucab (Contributor) commented Apr 8, 2020

I was looking at the kola output on QEMU for the latest testing release attempt (31.20200407.2.1), which contains the latest Zincati (0.0.9-1), and it seems to correctly go through the following chain of OS versions:

  1. 31.20200323.2.1 (booted image, current tip of the stream)
  2. 31.20200407.2.1 (new release being attempted on the stream)
  3. 31.20200407.2.1.kola (a synthetic release only for update testing)

However, once in the final OS, zincati fails to start due to the following:

zincati[698]: Error: Error("missing field `coreos-assembler.basearch`", line: 12, column: 7)

Possibly we are forgetting (at least) the above commit metadata when synthesizing the new release commit.
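
A sketch of what stamping that key onto the synthesized commit could look like (hypothetical; $repo, $ref, and $parent are placeholders, and the key name comes from the error above):

# Hypothetical fix sketch: carry the commit metadata Zincati expects onto the
# synthesized .kola update commit.
ostree --repo="$repo" commit \
    --branch="$ref" \
    --tree=ref="$parent" \
    --add-metadata-string=version="31.20200407.2.1.kola" \
    --add-metadata-string=coreos-assembler.basearch="$(arch)"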

However, this does not explain the failure in the original ticket, which seems to happen only intermittently and only on AWS.

lucab (Contributor) commented Apr 8, 2020

I've been trying to follow the code, and I believe the error is coming from this glib D-Bus wrapper, while 25s is the default D-Bus timeout, which is also mirrored in GLib.

I have no idea why it takes 36 seconds to start processing the client request (small AWS machine? CPU/RAM contention?). The default timeout, however, is smaller than that.

This may or may not be a problem on real nodes, though, as Zincati keeps retrying and may eventually hit a faster daemon.
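
For illustration only (not Zincati's or rpm-ostree's actual code), here is how a Go D-Bus client can carry its own deadline on a method call instead of relying on a stock 25-second library default; the bus name and object path are taken from the journal snippet above, and the call is a plain introspection so no rpm-ostree method names are assumed.

// Hypothetical sketch using github.com/godbus/dbus/v5.
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/godbus/dbus/v5"
)

func main() {
	conn, err := dbus.SystemBus()
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	obj := conn.Object("org.projectatomic.rpmostree1",
		dbus.ObjectPath("/org/projectatomic/rpmostree1/fedora_coreos"))

	// Give the daemon up to two minutes to answer; a busy daemon that would
	// blow past a 25s default still gets a chance to respond.
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Minute)
	defer cancel()

	var xml string
	call := obj.CallWithContext(ctx, "org.freedesktop.DBus.Introspectable.Introspect", 0)
	if err := call.Store(&xml); err != nil {
		panic(err)
	}
	fmt.Println(xml)
}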

cgwalters (Member) commented:

The D-Bus timeout has indeed been a huge historical pain. Since the main goal of D-Bus was to wire bits of the OS together, including system services and the desktop, I think the idea behind having a timeout was to avoid a frozen system or userspace process ending up blocking everything.

Of course, what one really wants to do is use asynchronous calls always, but writing 100% asynchronous code in general (particularly in C) is...hard, even though e.g. GLib provides a lot of facilities for it.

It might be that we're hitting issues because zincati is starting during the rest of the OS boot so there's a lot going on.

From a quick look at the sd-bus sources, it looks like they don't do timeouts by default; I vaguely recall this being a notable change w/sd-bus.

jlebon (Member, Author) commented Apr 8, 2020

OK, so I think what's happening here is that the new readonly code is forcing the kernel to flush to disk when rpm-ostree/libostree needs to remount /sysroot as rw. And the way the upgrade test works, we download a huge tarball and then extract it, so there's a lot of pending data waiting to be synced.

When the rpm-ostree transaction is initialized, we do `ostree_sysroot_try_lock` at the very start, which triggers the rw remount. And if that takes long enough, we don't get back to the client quickly enough and D-Bus times out. As to why we don't hit this in QEMU: I guess we have fast enough I/O?

Anyway, testing that theory now with this patch:

diff --git a/mantle/kola/tests/upgrade/basic.go b/mantle/kola/tests/upgrade/basic.go
index ee64b696..15cce3cf 100644
--- a/mantle/kola/tests/upgrade/basic.go
+++ b/mantle/kola/tests/upgrade/basic.go
@@ -125,7 +125,7 @@ func fcosUpgradeBasic(c cluster.TestCluster) {
                        c.Fatal(err)
                }

-               c.MustSSHf(m, "tar -xf %s -C %s", kola.CosaBuild.Meta.BuildArtifacts.Ostree.Path, ostreeRepo)
+               c.MustSSHf(m, "tar -xf %s -C %s && sync", kola.CosaBuild.Meta.BuildArtifacts.Ostree.Path, ostreeRepo)

                graph.seedFromMachine(c, m)
                graph.addUpdate(c, m, kola.CosaBuild.Meta.OstreeVersion, kola.CosaBuild.Meta.OstreeCommit)

But assuming this is indeed the cause, I don't think it's a showstopping issue in practice, because the kernel does eventually flush and Zincati keeps retrying (though I guess on a node with lots of continuous I/O we could get unlucky each time and be stuck waiting for lots of data to sync). Still, I think we should accommodate this in rpm-ostree/libostree, e.g. by delaying the rw remount until after we ack the client on D-Bus.
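
A rough way to eyeball that theory on an affected node (hypothetical commands, placeholder paths; not part of the test suite): extract the tarball, check how much dirty page cache is pending, then time the ro->rw remount that libostree performs.

# Hypothetical repro sketch for the theory above.
tar -xf "$ostree_tarball" -C "$ostree_repo"    # leaves a pile of dirty pages behind
grep -E 'Dirty|Writeback' /proc/meminfo        # how much data is still waiting to be flushed
time sudo mount -o remount,rw /sysroot         # stalls if the remount forces a flush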

jlebon added a commit to jlebon/coreos-assembler that referenced this issue Apr 9, 2020
This works around a subtle issue in rpm-ostree/libostree dropping D-Bus
transactions due to `mount` causing a cache flush and hanging for a
while due to slow I/O.

As mentioned in the comment there, we should drop this in the future and
rework things in the stack proper instead so we're not susceptible to
this.

See coreos#1301.
jlebon (Member, Author) commented Apr 9, 2020

OK, confirmed working. Unblocked the releases. Let's also just apply that patch for now as a short-term fix: #1331

openshift-merge-robot pushed a commit that referenced this issue Apr 9, 2020
This works around a subtle issue in rpm-ostree/libostree dropping D-Bus
transactions due to `mount` causing a cache flush and hanging for a
while due to slow I/O.

As mentioned in the comment there, we should drop this in the future and
rework things in the stack proper instead so we're not susceptible to
this.

See #1301.
cgwalters (Member) commented:

Wow, awesome job investigating this! I don't understand why the remount is forcing a flush.

But the best fix here, I think, is to move all of the transaction bits to a subprocess in rpm-ostree. See coreos/rpm-ostree#1680.

dustymabe (Member) commented:

Wow, awesome job investigating this!

Absolutely. Nice work @jlebon!

miabbott (Member) commented Apr 9, 2020

bravo

dustymabe (Member) commented:

I think we can close this, since the AWS upgrade test hasn't been failing on us.

dustymabe added a commit to dustymabe/coreos-assembler that referenced this issue Sep 24, 2024
The referenced issue in coreos#1301 has long since been fixed.

dustymabe added a commit that referenced this issue Sep 25, 2024
The referenced issue in #1301 has long since been fixed.