FCOS upgrade tests often flaking on AWS #1301

We're seeing `cosa kola --upgrades` flake pretty often on AWS:

At first I thought it was due to AWS just taking longer to reboot. But the logs show that rpm-ostree is actually failing:

Comments

Hmm, yeah OK, so this looks like zincati initiated the transaction, something along the rpm-ostree client -> D-Bus -> rpm-ostree daemon chain hung, then the client gave up, and then unsurprisingly the daemon gets `NameHasNoOwner`.
This may not actually be a test issue directly, in which case we could transfer this to the tracker.
@jlebon are those 25 seconds a hardcoded timeout somewhere in rpm-ostree? I can imagine that a small node with underprovisioned network+CPU+IOPS may take quite some time to stage a remote commit, but I haven't encountered this timeout before.
Aside from the issue itself: if either rpm-ostree or zincati is throwing anything on the console log, it's probably worth extending the console checks in mantle so it calls that out more verbosely, without needing to dig into the logs.
@arithx the output snippet above is from the journal, from what I can see. On such errors, Zincati will keep retrying (after ~5 minutes by default). I think the test is correct in timing out aggressively, and the journal has the details. @jlebon did you turn on Zincati trace logs after the fact? If not, should we add that config fragment directly to kola?
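For illustration, such a fragment could be a systemd drop-in like the sketch below; the drop-in path, the `ZINCATI_VERBOSITY` variable, and the `-vv` level are assumptions, not verified against Zincati's actual unit file:

```ini
# Hypothetical drop-in: /etc/systemd/system/zincati.service.d/10-verbosity.conf
# Assumes zincati's unit expands $ZINCATI_VERBOSITY in its ExecStart line,
# and that repeated -v flags raise the log level toward trace.
[Service]
Environment=ZINCATI_VERBOSITY="-vv"
```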
Note that the kola aws test from the
I was looking at the kola output on
However, once in the final OS, zincati fails to start due to the following:
Possibly we are forgetting (at least) the above commit metadata when synthesizing the new release commit. However, this does not explain the failure in the initial ticket, which somehow seems to happen only sometimes and on AWS.
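For reference, carrying metadata forward when synthesizing a commit could look roughly like the sketch below; the key name and refs are placeholders, not necessarily the metadata the test actually drops:

```sh
# Sketch only: synthesize a new commit from an existing tree while
# re-attaching a metadata key. $repo, $ref, $parent, $version are placeholders.
ostree commit --repo="$repo" --branch="$ref" \
  --tree=ref="$parent" \
  --add-metadata-string=version="$version"
```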
I've been trying to follow the code, and I believe the error is coming from this glib dbus wrapper, while 25s is the default D-Bus timeout, which is also mirrored in GLib. I have no idea why it takes 36 seconds to start processing the client request (small AWS machine? CPU/RAM contention?). The default timeout, however, is smaller than that. This may or may not be a problem on real nodes, though, as Zincati keeps retrying and may eventually hit a faster daemon.
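To make the timeout mechanics concrete, here's a minimal standalone sketch (not rpm-ostree's actual client code; the object path is an assumption) of how GDBus applies its default 25 s timeout to synchronous calls, and where a caller could raise it:

```c
#include <gio/gio.h>

int main(void) {
  GError *error = NULL;
  GDBusConnection *bus = g_bus_get_sync(G_BUS_TYPE_SYSTEM, NULL, &error);
  if (bus == NULL) {
    g_printerr("bus: %s\n", error->message);
    return 1;
  }

  /* timeout_msec = -1 selects GDBus's default call timeout (25 seconds).
   * If the daemon takes ~36 s to even start handling the call, the client
   * errors out first; pass e.g. 60000, or G_MAXINT for no timeout, to wait
   * longer. Peer.Ping is a standard method every D-Bus service answers. */
  GVariant *reply = g_dbus_connection_call_sync(
      bus,
      "org.projectatomic.rpmostree1",                /* rpm-ostree bus name */
      "/org/projectatomic/rpmostree1/fedora_coreos", /* object path: assumed */
      "org.freedesktop.DBus.Peer", "Ping",
      NULL, NULL, G_DBUS_CALL_FLAGS_NONE,
      -1 /* default: ~25 s */, NULL, &error);
  if (reply == NULL) {
    g_printerr("call failed: %s\n", error->message); /* e.g. a timeout */
    g_object_unref(bus);
    return 1;
  }

  g_variant_unref(reply);
  g_object_unref(bus);
  return 0;
}
```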
The DBus timeout indeed has been a huge historical pain. Since the main goal of DBus was to wire bits of the OS together, including system services and the desktop, I think the idea behind having a timeout was to avoid a frozen system or userspace process ending up blocking everything. Of course, what one really wants to do is always use asynchronous calls, but writing 100% asynchronous code in general (particularly in C) is...hard, even though e.g. GLib provides a lot of facilities for it. It might be that we're hitting issues because zincati is starting during the rest of the OS boot, so there's a lot going on. From a quick look at the sd-bus sources, it looks like they don't do timeouts by default; I vaguely recall this being a notable change w/ sd-bus.
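As a companion to the sync sketch above, the fully-asynchronous shape looks like this: `g_dbus_connection_call()` returns immediately and the reply lands in a callback on the main loop, so a slow daemon never wedges the caller (same assumed object path as before):

```c
#include <gio/gio.h>

static void on_ping_done(GObject *source, GAsyncResult *res, gpointer user_data) {
  GMainLoop *loop = user_data;
  GError *error = NULL;
  GVariant *reply =
      g_dbus_connection_call_finish(G_DBUS_CONNECTION(source), res, &error);
  if (reply == NULL) {
    g_printerr("async call failed: %s\n", error->message);
    g_error_free(error);
  } else {
    g_variant_unref(reply);
  }
  g_main_loop_quit(loop);
}

int main(void) {
  GMainLoop *loop = g_main_loop_new(NULL, FALSE);
  GDBusConnection *bus = g_bus_get_sync(G_BUS_TYPE_SYSTEM, NULL, NULL);
  if (bus == NULL)
    return 1;

  /* Same Ping as before, but non-blocking: the main loop keeps running
   * while we wait. G_MAXINT disables the client-side timeout entirely. */
  g_dbus_connection_call(bus,
                         "org.projectatomic.rpmostree1",
                         "/org/projectatomic/rpmostree1/fedora_coreos", /* assumed */
                         "org.freedesktop.DBus.Peer", "Ping",
                         NULL, NULL, G_DBUS_CALL_FLAGS_NONE,
                         G_MAXINT, NULL, on_ping_done, loop);

  g_main_loop_run(loop);
  g_object_unref(bus);
  g_main_loop_unref(loop);
  return 0;
}
```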
OK, so I think what's happening here is that the new readonly code is forcing the kernel to flush to disk when rpm-ostree/libostree needs to remount the sysroot read-write. When the rpm-ostree transaction is initialized, we do that remount. Anyway, testing that theory now with this patch:

```diff
diff --git a/mantle/kola/tests/upgrade/basic.go b/mantle/kola/tests/upgrade/basic.go
index ee64b696..15cce3cf 100644
--- a/mantle/kola/tests/upgrade/basic.go
+++ b/mantle/kola/tests/upgrade/basic.go
@@ -125,7 +125,7 @@ func fcosUpgradeBasic(c cluster.TestCluster) {
 		c.Fatal(err)
 	}
-	c.MustSSHf(m, "tar -xf %s -C %s", kola.CosaBuild.Meta.BuildArtifacts.Ostree.Path, ostreeRepo)
+	c.MustSSHf(m, "tar -xf %s -C %s && sync", kola.CosaBuild.Meta.BuildArtifacts.Ostree.Path, ostreeRepo)
 	graph.seedFromMachine(c, m)
 	graph.addUpdate(c, m, kola.CosaBuild.Meta.OstreeVersion, kola.CosaBuild.Meta.OstreeCommit)
```

But assuming this is indeed it, I don't think this is a showstopping issue in practice, because the kernel does eventually flush and zincati keeps retrying (though I guess on a node with lots of continuous I/O we could get unlucky each time and be stuck waiting for lots of data to sync). Still, I think we should accommodate this in rpm-ostree/libostree, e.g. by delaying the rw remount until after we ack the client on D-Bus.
OK, confirmed working. Unblocked the releases. Let's also just apply that patch for now as a short-term fix: #1331
This works around a subtle issue in rpm-ostree/libostree dropping D-Bus transactions due to `mount` causing a cache flush and hanging for a while due to slow I/O. As mentioned in the comment there, we should drop this in the future and rework things in the stack proper instead so we're not susceptible to this. See #1301.
Wow, awesome job investigating this! I don't understand why the remount is forcing a flush. But the best fix here, I think, is to move the whole transaction bits to a subprocess in rpm-ostree. See coreos/rpm-ostree#1680
Absolutely. Nice work @jlebon!
I think we can close this since the aws upgrade test hasn't been failing on us.
The referenced issue in coreos#1301 has long since been fixed.