
FCOS upgrade tests often flaking on AWS #1301

Closed

jlebon opened this issue Mar 31, 2020 · 14 comments

jlebon (Member) commented Mar 31, 2020

We're seeing cosa kola --upgrades flake pretty often on AWS:

=== RUN   fcos.upgrade.basic
=== RUN   fcos.upgrade.basic/upgrade-from-current
--- FAIL: fcos.upgrade.basic (214.28s)
    --- PASS: fcos.upgrade.basic/setup (15.13s)
    --- FAIL: fcos.upgrade.basic/upgrade-from-previous (120.47s)
            basic.go:228: failed waiting for machine reboot: timed out after 2m0s waiting for machine to reboot
    --- SKIP: fcos.upgrade.basic/upgrade-from-current (0.00s)
            cluster.go:50: A previous test has already failed
FAIL, output in tmp/kola-upgrade

At first I thought it was due to AWS just taking longer to reboot. But the logs show that rpm-ostree is actually failing:

Mar 31 06:51:30.767361 zincati[1346]: [TRACE] request to stage release: Release { version: "31.20200331.94.0", checksum: "1cee7b6fda43ac7a22ce1a76feb50cb530fa91e8f5a8f1004e1c667ecf91bb7d", age_index: Some(1) }
Mar 31 06:51:55.803428 zincati[1346]: [ERROR] failed to stage deployment: rpm-ostree deploy failed:
Mar 31 06:51:55.803428 zincati[1346]:     error: Timeout was reached
Mar 31 06:51:55.803428 zincati[1346]:
Mar 31 06:51:55.804386 zincati[1346]: [TRACE] scheduling next agent refresh in 312 seconds
Mar 31 06:52:06.866091 rpm-ostree[916]: Failed to GetConnectionUnixUser for client :1.127: GDBus.Error:org.freedesktop.DBus.Error.NameHasNoOwner: The connection does not exist
Mar 31 06:52:06.866995 rpm-ostree[916]: Failed to GetConnectionUnixProcessID for client :1.127: GDBus.Error:org.freedesktop.DBus.Error.NameHasNoOwner: The connection does not exist
Mar 31 06:52:06.867138 rpm-ostree[916]: Initiated txn Deploy for client(dbus:1.127 uid:<unknown>): /org/projectatomic/rpmostree1/fedora_coreos

Hmm, yeah OK, so this looks like Zincati initiated the transaction, something along the rpm-ostree client -> D-Bus -> rpm-ostree daemon chain hung, the client gave up, and then unsurprisingly the daemon got NameHasNoOwner when it later tried to look up the (now gone) caller.
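
If this reproduces, one quick way to tell whether the daemon is merely slow or fully wedged is to poke it directly over the system bus. A hypothetical diagnostic (not from this thread), reusing the bus name and object path from the journal snippet above:

# Ask the rpm-ostree daemon to introspect itself; if even this stalls for
# tens of seconds, the daemon is blocked before it can dispatch the Deploy
# request (bus/object names taken from the journal above).
busctl introspect org.projectatomic.rpmostree1 \
    /org/projectatomic/rpmostree1/fedora_coreos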

jlebon changed the title from "kola aws upgrade tests failing" to "FCOS upgrade tests often flaking on AWS" on Mar 31, 2020
jlebon (Member, Author) commented Mar 31, 2020

This may not actually be a test issue directly, in which case we could transfer this to the tracker.

lucab (Contributor) commented Mar 31, 2020

@jlebon are those 25 seconds a hardcoded timeout somewhere in rpm-ostree? I can imagine that a small node with an underprovisioned network, CPU, and IOPS may take quite some time to stage a remote commit, but I haven't encountered this timeout before.

arithx (Contributor) commented Apr 1, 2020

Aside from the issue itself: if either rpm-ostree or Zincati is throwing anything onto the console log, it's probably worth extending the console checks in mantle so that it calls this out more verbosely, without needing to dig into the logs.

lucab (Contributor) commented Apr 1, 2020

@arithx the output snippet above is from the journal, from what I can see. On such errors, Zincati will keep retrying (after ~5 minutes by default). I think the test is correct in timing out aggressively, and the journal has the details.
One possible improvement, though, would be to also snapshot the metrics details.

@jlebon did you turn on Zincati trace logs after the fact? If not, should we add that config fragment directly to kola?
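
For reference, a sketch of the kind of fragment that could do that: a systemd drop-in raising Zincati's verbosity to trace level. This assumes zincati.service passes a $ZINCATI_VERBOSITY environment variable through to the agent, as the FCOS debugging docs describe; adjust if the unit differs.

# Hypothetical drop-in at /etc/systemd/system/zincati.service.d/10-verbosity.conf
# (assumes the unit's ExecStart honors $ZINCATI_VERBOSITY).
[Service]
Environment=ZINCATI_VERBOSITY="-vvvv"

On a live node this would be followed by `systemctl daemon-reload && systemctl restart zincati.service`; for test instances the same fragment could be shipped via Ignition.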

dustymabe (Member) commented:

Note that the kola AWS run for 31.20200323.3.0 passed this test, which leads me to believe it has something to do with the new Zincati. Everything after our 20200323 snapshot (where the new Zincati was introduced, I believe) now fails this test.

https://jenkins-fedora-coreos.apps.ci.centos.org/job/fedora-coreos/job/fedora-coreos-fedora-coreos-pipeline-kola-aws/360/

lucab (Contributor) commented Apr 8, 2020

I was looking at the kola output on QEMU for the latest testing release attempt (31.20200407.2.1), which contains the latest Zincati (0.0.9-1), and it seems to correctly go through the following chain of OS versions:

  1. 31.20200323.2.1 (booted image, current tip of the stream)
  2. 31.20200407.2.1 (new release being attempted on the stream)
  3. 31.20200407.2.1.kola (a synthetic release only for update testing)

However, once in the final OS, zincati fails to start due to the following:

zincati[698]: Error: Error("missing field `coreos-assembler.basearch`", line: 12, column: 7)

Possibly we are forgetting (at least) the above commit metadata when synthesizing the new release commit.
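
A sketch of what stamping that key onto the synthesized commit could look like (hypothetical; $repo, $ref, and $parent are placeholders, and the key name comes from the error above):

# Hypothetical fix sketch: carry the commit metadata Zincati expects onto the
# synthesized .kola update commit.
ostree --repo="$repo" commit \
    --branch="$ref" \
    --tree=ref="$parent" \
    --add-metadata-string=version="31.20200407.2.1.kola" \
    --add-metadata-string=coreos-assembler.basearch="$(arch)"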

However, this does not explain the failure in the original ticket, which seems to happen only intermittently and only on AWS.

lucab (Contributor) commented Apr 8, 2020

I've been trying to follow the code, and I believe the error is coming from this glib D-Bus wrapper, while 25s is the default D-Bus timeout, which is also mirrored in GLib.

I have no idea why it takes 36 seconds to start processing the client request (small AWS machine? CPU/RAM contention?). The default timeout, however, is smaller than that.

This may or may not be a problem on real nodes, though, as Zincati keeps retrying and may eventually hit a faster daemon.
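
For illustration only (not Zincati's or rpm-ostree's actual code), here is how a Go D-Bus client can carry its own deadline on a method call instead of relying on a stock 25-second library default; the bus name and object path are taken from the journal snippet above, and the call is a plain introspection so no rpm-ostree method names are assumed.

// Hypothetical sketch using github.com/godbus/dbus/v5.
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/godbus/dbus/v5"
)

func main() {
	conn, err := dbus.SystemBus()
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	obj := conn.Object("org.projectatomic.rpmostree1",
		dbus.ObjectPath("/org/projectatomic/rpmostree1/fedora_coreos"))

	// Give the daemon up to two minutes to answer; a busy daemon that would
	// blow past a 25s default still gets a chance to respond.
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Minute)
	defer cancel()

	var xml string
	call := obj.CallWithContext(ctx, "org.freedesktop.DBus.Introspectable.Introspect", 0)
	if err := call.Store(&xml); err != nil {
		panic(err)
	}
	fmt.Println(xml)
}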

cgwalters (Member) commented:

The D-Bus timeout has indeed been a huge historical pain. Since the main goal of D-Bus was to wire bits of the OS together, including system services and the desktop, I think the idea behind having a timeout was to avoid a frozen system or userspace process ending up blocking everything.

Of course, what one really wants to do is use asynchronous calls always, but writing 100% asynchronous code in general (particularly in C) is...hard, even though e.g. GLib provides a lot of facilities for it.

It might be that we're hitting issues because zincati is starting during the rest of the OS boot so there's a lot going on.

From a quick look at the sd-bus sources, it looks like they don't do timeouts by default; I vaguely recall this being a notable change w/sd-bus.

jlebon (Member, Author) commented Apr 8, 2020

OK, so I think what's happening here is that the new readonly code is forcing the kernel to flush to disk when rpm-ostree/libostree needs to remount /sysroot as rw. And the way the upgrade test works, we download a huge tarball and then extract it, so there's a lot of pending data waiting to be synced.

When the rpm-ostree transaction is initialized, we do `ostree_sysroot_try_lock` at the very start, which triggers the rw remount. And if that takes long enough, we don't get back to the client quickly enough and D-Bus times out. As to why we don't hit this in QEMU: I guess we have fast enough I/O?

Anyway, testing that theory now with this patch:

diff --git a/mantle/kola/tests/upgrade/basic.go b/mantle/kola/tests/upgrade/basic.go
index ee64b696..15cce3cf 100644
--- a/mantle/kola/tests/upgrade/basic.go
+++ b/mantle/kola/tests/upgrade/basic.go
@@ -125,7 +125,7 @@ func fcosUpgradeBasic(c cluster.TestCluster) {
                        c.Fatal(err)
                }

-               c.MustSSHf(m, "tar -xf %s -C %s", kola.CosaBuild.Meta.BuildArtifacts.Ostree.Path, ostreeRepo)
+               c.MustSSHf(m, "tar -xf %s -C %s && sync", kola.CosaBuild.Meta.BuildArtifacts.Ostree.Path, ostreeRepo)

                graph.seedFromMachine(c, m)
                graph.addUpdate(c, m, kola.CosaBuild.Meta.OstreeVersion, kola.CosaBuild.Meta.OstreeCommit)

But assuming this is indeed the cause, I don't think it's a showstopping issue in practice, because the kernel does eventually flush and Zincati keeps retrying (though I guess on a node with lots of continuous I/O we could get unlucky each time and be stuck waiting for lots of data to sync). Still, I think we should accommodate this in rpm-ostree/libostree, e.g. by delaying the rw remount until after we ack the client on D-Bus.
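
A rough way to eyeball that theory on an affected node (hypothetical commands, placeholder paths; not part of the test suite): extract the tarball, check how much dirty page cache is pending, then time the ro->rw remount that libostree performs.

# Hypothetical repro sketch for the theory above.
tar -xf "$ostree_tarball" -C "$ostree_repo"    # leaves a pile of dirty pages behind
grep -E 'Dirty|Writeback' /proc/meminfo        # how much data is still waiting to be flushed
time sudo mount -o remount,rw /sysroot         # stalls if the remount forces a flush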

jlebon added a commit to jlebon/coreos-assembler that referenced this issue Apr 9, 2020
This works around a subtle issue in rpm-ostree/libostree dropping D-Bus
transactions due to `mount` causing a cache flush and hanging for a
while due to slow I/O.

As mentioned in the comment there, we should drop this in the future and
rework things in the stack proper instead so we're not susceptible to
this.

See coreos#1301.
jlebon (Member, Author) commented Apr 9, 2020

OK, confirmed working. Unblocked the releases. Let's also just apply that patch for now as a short-term fix: #1331

openshift-merge-robot pushed a commit that referenced this issue Apr 9, 2020
This works around a subtle issue in rpm-ostree/libostree dropping D-Bus
transactions due to `mount` causing a cache flush and hanging for a
while due to slow I/O.

As mentioned in the comment there, we should drop this in the future and
rework things in the stack proper instead so we're not susceptible to
this.

See #1301.
cgwalters (Member) commented:

Wow, awesome job investigating this! I don't understand why the remount is forcing a flush.

But the best fix here, I think, is to move all of the transaction bits to a subprocess in rpm-ostree. See coreos/rpm-ostree#1680.

dustymabe (Member) commented:

Wow, awesome job investigating this!

Absolutely. Nice work @jlebon!

miabbott (Member) commented Apr 9, 2020

bravo

dustymabe (Member) commented:

I think we can close this, since the AWS upgrade test hasn't been failing on us.

dustymabe added a commit to dustymabe/coreos-assembler that referenced this issue Sep 24, 2024
The referenced issue in coreos#1301 has long since been fixed.

dustymabe added a commit that referenced this issue Sep 25, 2024
The referenced issue in #1301 has long since been fixed.