ci: Freeze kernel at 5.6.7 due to loop regression breaking blackbox test #976

Merged
1 commit merged into coreos:master from pr/override-ci-kernel on May 8, 2020

Conversation

jlebon
Member

@jlebon jlebon commented May 7, 2020

It seems like there's a regression in the 5.6.8 kernel causing our
blackbox tests to fail with e.g.:

```
blackbox_test.go:114: failed: "mkfs.vfat": exit status 1
    mkfs.vfat: unable to open /dev/loop0p1: No such file or directory
    mkfs.fat 4.1 (2017-01-24)
```

And looking at dmesg, one can see the partition rescan is failing with
`-EBUSY`:

```
__loop_clr_fd: partition scan of loop3 failed (rc=-16)
loop_reread_partitions: partition scan of loop0 (/var/tmp/ignition-blackbox-570148150/hd0) failed (rc=-16)
loop_reread_partitions: partition scan of loop1 (/var/tmp/ignition-blackbox-134124829/hd0) failed (rc=-16)
loop_reread_partitions: partition scan of loop2 (/var/tmp/ignition-blackbox-492917208/hd0) failed (rc=-16)
loop_reread_partitions: partition scan of loop3 (/var/tmp/ignition-blackbox-966528855/hd0) failed (rc=-16)
```

Looking at the 5.6.8 notes, the only commit that jumps out is
https://lkml.org/lkml/2019/5/6/1059, though it seems focused on loop
devices backed by block devices.

The only other report I found of this is:
https://bugs.archlinux.org/task/66526

Anyway, I don't think this is a serious enough regression to hold the
kernel in FCOS. But I really want blackbox tests to work in CoreOS CI
where it's easy for everyone to inspect results, download, retry, etc.
So let's override the kernel for now.

According to the Arch Linux bug, it seems like it's partially fixed in
5.7 (though I haven't tried it), so we should be able to unfreeze it
then (or if we want, fast-track it once there's a build for f32).
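
For context, a rough sketch of the kind of sequence the blackbox tests depend on, and where 5.6.8 trips up. The image size, partitioning tool, and flags below are illustrative, not what the test harness actually runs:

```
# Illustrative reproduction sketch, not the actual test harness code.
truncate -s 1G hd0                                  # file-backed disk image
sgdisk -n 1:0:+128M -t 1:ef00 hd0                   # one FAT/ESP-style partition
dev=$(sudo losetup --partscan --find --show hd0)    # e.g. /dev/loop0; --partscan asks the kernel to scan partitions
sudo mkfs.vfat "${dev}p1"                           # fails on 5.6.8 when the loop0p1 node never appears
sudo losetup -d "$dev"                              # detach; the -EBUSY rescan failures above show up around here in dmesg
```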
Contributor

@arithx arithx left a comment

This is awesome! Thanks for tracking it down.

Looks like CI had an unrelated networking issue.

@jlebon
Member Author

jlebon commented May 7, 2020

Hmm weird, it's still hitting:

Downloading from 'fedora-coreos-pool'...done
error: Cannot download Packages/k/kernel-core-5.6.7-200.fc31.x86_64.rpm: All mirrors were tried; Last error: Curl error (6): Couldn't resolve host name for https://kojipkgs.fedoraproject.org/repos-dist/coreos-pool/latest/x86_64/Packages/k/kernel-5.6.7-200.fc31.x86_64.rpm [Could not resolve host: kojipkgs.fedoraproject.org]

But I can't reproduce this locally after coreos/coreos-assembler#1432. Looking.

@jlebon jlebon force-pushed the pr/override-ci-kernel branch from 5d48013 to 7edf7a3 on May 7, 2020 at 20:48
@jlebon
Member Author

jlebon commented May 7, 2020

Wow, that took a while to figure out. So, here we're using the cosa buildroot image. I thought this was fine though, because we automatically rebuild it whenever a cosa image is built. And the latest cosa buildroot image has:

io.openshift.build.commit.author=Jonathan Lebon <jonathan@jlebon.com>
io.openshift.build.commit.date=Thu May 7 12:51:02 2020 -0400
io.openshift.build.commit.id=4e6056029a258d3cb08bacffa1e4014e0daa0294
io.openshift.build.commit.message=cmdlib: Lower cost of cosa RPM overrides repo
io.openshift.build.commit.ref=master
io.openshift.build.name=cosa-buildroot-266
io.openshift.build.namespace=coreos
io.openshift.build.source-location=https://github.com/coreos/coreos-assembler

So one would think that it has that commit. Yet, adding `cat /cosa/coreos-assembler-git.json` to CI here shows:

"commit": "5a07e8a5aef1bdca2272e22cbd9aaed142819f8b",

That's coreos/coreos-assembler@5a07e8a, the parent commit of coreos/coreos-assembler@4e60560.

I thought maybe the CentOS CI cluster had downloaded a stale version of the image when running the tests here. Yet, running `oc describe pod pod-ba8ac29e-4eef-4359-bf58-d2a500eaf5f8-p4f42-2c42l` shows:

Containers:
  worker:
    Container ID:       docker://c7b0eb81a499932eb102ced6f83da41001e326d0ab49a61e8c2220b70eddc92f
    Image:              registry.svc.ci.openshift.org/coreos/cosa-buildroot:latest
    Image ID:           docker-pullable://registry.svc.ci.openshift.org/coreos/cosa-buildroot@sha256:fc947ef06299984f290a58515cb2d9b5bc8d5e28e3662c62da2a9685627b89ca

which matched the image ID of the latest buildroot image. So it definitely was pulling the latest.

What was actually happening was that somehow the cosa-buildroot buildconfig in the cluster had lost this bit:

https://github.com/openshift/release/blob/09eaeb175f82b8e71838a39c45f17dc199505852/services/coreos/cosa-buildroot.yaml#L28-L30

That meant the Dockerfile.buildroot's `FROM quay.io/coreos-assembler/coreos-assembler:latest` wasn't being replaced by the imagestream that triggered the build in the first place. And since there's no pull policy to always pull from upstream, we were using a cached version of the image as the FROM layer.

I re-added those lines, restarted another cosa-buildroot build, and then restarted CI here. I'm still not sure how those lines went missing, though (it's not the first time things have somehow ended up out of sync).
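
For anyone following along, the missing bit is the BuildConfig's dockerStrategy `from:` override, which is what substitutes the Dockerfile's FROM with the cluster's imagestream tag. A hedged sketch of re-adding it with `oc patch` follows; the namespace, buildconfig name, and imagestream tag are assumptions based on the linked YAML, and the actual fix was re-adding the lines to the file rather than necessarily running this command:

```
# Assumed names; the authoritative values live in openshift/release's
# services/coreos/cosa-buildroot.yaml.
oc -n coreos patch buildconfig/cosa-buildroot --type=merge -p \
  '{"spec":{"strategy":{"dockerStrategy":{"from":{"kind":"ImageStreamTag","name":"coreos-assembler:latest"}}}}}'

# Kick off a fresh build and watch it, then re-check the image's
# /cosa/coreos-assembler-git.json as above.
oc -n coreos start-build cosa-buildroot --follow
```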

@cgwalters
Member

> That meant the Dockerfile.buildroot's `FROM quay.io/coreos-assembler/coreos-assembler:latest` wasn't being replaced by the imagestream that triggered the build in the first place. And since there's no pull policy to always pull from upstream, we were using a cached version of the image as the FROM layer.

Oh man, that's just evil.

@cgwalters cgwalters merged commit ffc74f4 into coreos:master May 8, 2020
@ashcrow
Member

ashcrow commented May 8, 2020

> That meant the Dockerfile.buildroot's `FROM quay.io/coreos-assembler/coreos-assembler:latest` wasn't being replaced by the imagestream that triggered the build in the first place. And since there's no pull policy to always pull from upstream, we were using a cached version of the image as the FROM layer.
>
> Oh man, that's just evil.

Wow 😦 Agreed.

Great debugging work @jlebon!

jlebon added a commit to jlebon/coreos-ci-lib that referenced this pull request May 8, 2020
I had done this temporarily in the `.cci.jenkinsfile` of Ignition in
coreos/ignition#976 to help debugging and I
found it really useful.

Let's just always write it out.
jlebon added a commit to jlebon/coreos-ci-lib that referenced this pull request May 8, 2020
I had done this temporarily in the `.cci.jenkinsfile` of Ignition in
coreos/ignition#976 to help debugging and I
found it really useful.

Let's just always write it out. But don't error out if it somehow
doesn't exist.
jlebon added a commit to coreos/coreos-ci-lib that referenced this pull request May 8, 2020
I had done this temporarily in the `.cci.jenkinsfile` of Ignition in
coreos/ignition#976 to help debugging and I
found it really useful.

Let's just always write it out. But don't error out if it somehow
doesn't exist.
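
The gist of that coreos-ci-lib change, sketched as shell (the real implementation is a pipeline step in the linked commits, not this exact snippet):

```
# Always print which coreos-assembler commit the CI image was built from,
# but don't fail the job if the metadata file is missing.
if [ -f /cosa/coreos-assembler-git.json ]; then
    cat /cosa/coreos-assembler-git.json
fi
```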