ci: Freeze kernel at 5.6.7 due to loop regression breaking blackbox test #976
Conversation
It seems like there's a regression in the 5.6.8 kernel causing our blackbox tests to fail with e.g.:

```
blackbox_test.go:114: failed: "mkfs.vfat": exit status 1
mkfs.vfat: unable to open /dev/loop0p1: No such file or directory
mkfs.fat 4.1 (2017-01-24)
```

And looking at dmesg, one can see the partition rescan is failing with `-EBUSY`:

```
__loop_clr_fd: partition scan of loop3 failed (rc=-16)
loop_reread_partitions: partition scan of loop0 (/var/tmp/ignition-blackbox-570148150/hd0) failed (rc=-16)
loop_reread_partitions: partition scan of loop1 (/var/tmp/ignition-blackbox-134124829/hd0) failed (rc=-16)
loop_reread_partitions: partition scan of loop2 (/var/tmp/ignition-blackbox-492917208/hd0) failed (rc=-16)
loop_reread_partitions: partition scan of loop3 (/var/tmp/ignition-blackbox-966528855/hd0) failed (rc=-16)
```

Looking at the 5.6.8 notes, the only commit that jumps out is https://lkml.org/lkml/2019/5/6/1059, though it seems focused on loop devices backed by block devices. The only other report I found of this is: https://bugs.archlinux.org/task/66526

Anyway, I don't think this is a serious enough regression to hold the kernel in FCOS. But I really want blackbox tests to work in CoreOS CI where it's easy for everyone to inspect results, download, retry, etc. So let's override the kernel for now.

According to the Arch Linux bug, it seems like it's partially fixed in 5.7 (though I haven't tried it), so we should be able to unfreeze it then (or if we want, fast-track it once there's a build for f32).
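The failure mode above is a race: `mkfs.vfat` runs before the partition node exists because the kernel's rescan failed. A generic mitigation in such harnesses is to poll for the node before formatting it. This is purely a hypothetical sketch (not Ignition's actual test code, and the helper name is made up):

```shell
# Hypothetical helper, not part of the real blackbox tests: poll until a
# device node (e.g. /dev/loop0p1) shows up, instead of racing the kernel's
# partition rescan.
wait_for_device() {
    path=$1
    tries=${2:-40}   # 40 * 50ms = 2s default timeout
    while [ "$tries" -gt 0 ]; do
        [ -e "$path" ] && return 0
        sleep 0.05
        tries=$((tries - 1))
    done
    echo "device $path did not appear" >&2
    return 1
}
```

Note that polling would only paper over a timing race; it wouldn't help here, since the rescan itself fails with `-EBUSY` and the node never appears, which is why pinning the kernel is the more reliable fix.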
This is awesome! Thanks for tracking it down.
Looks like CI had an unrelated networking issue.
Hmm weird, it's still hitting:
But I can't reproduce this locally after coreos/coreos-assembler#1432. Looking.
Force-pushed from 5d48013 to 7edf7a3.
Wow, that took a while to figure out. So, here we're using the cosa buildroot image. I thought this was fine though, because we automatically rebuild it whenever a cosa image is built. And the latest cosa buildroot image has:
So one would think that it has that commit. Yet, adding
Which is coreos/coreos-assembler@5a07e8a, which is the parent commit of coreos/coreos-assembler@4e60560. I thought maybe the CentOS CI cluster downloaded a stale version of the image when running the tests here. Yet, doing an
which matched the image ID of the latest buildroot image. So it definitely was pulling the latest. What was actually happening was that somehow the Which meant that the I re-added those lines and restarted another

Oh man, that's just evil.

Wow 😦 Agreed. Great debugging work @jlebon!
I had done this temporarily in the `.cci.jenkinsfile` of Ignition in coreos/ignition#976 to help debugging and I found it really useful. Let's just always write it out. But don't error out if it somehow doesn't exist.
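What "always write it out" could look like, as a hypothetical sketch (the real jenkinsfile change may differ; the function and file names here are illustrative): record which coreos-assembler build a run used, so stale-image surprises like the one debugged above are visible in the artifacts, without failing the run if the version can't be determined.

```shell
# Hypothetical sketch, not the actual .cci.jenkinsfile change: write the
# coreos-assembler version into an artifact file, falling back to a
# placeholder instead of erroring out if it can't be queried.
record_cosa_version() {
    out=${1:-cosa-version.txt}
    if command -v rpm >/dev/null 2>&1 && rpm -q coreos-assembler >/dev/null 2>&1; then
        rpm -q coreos-assembler > "$out"
    else
        echo "coreos-assembler version unknown" > "$out"
    fi
}
```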
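Once a fixed kernel (5.7+) is available, a simple version comparison could tell CI that the freeze is safe to drop. A hypothetical sketch (not part of any actual config) using GNU `sort -V`:

```shell
# Hypothetical helper: succeeds if version $1 >= $2 under version-sort
# ordering, e.g. to flag that the pinned 5.6.7 kernel can be unfrozen
# once a >= 5.7 build is running.
version_ge() {
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n1)" = "$2" ]
}

# Example: version_ge "$(uname -r)" 5.7 would indicate the pin is obsolete.
```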