This repository has been archived by the owner on May 7, 2021. It is now read-only.

Kola tests flake because reboots fail randomly #768

Open
ajeddeloh opened this issue Nov 21, 2017 · 4 comments

Comments

@ajeddeloh
Contributor

kola tests frequently cause the nightly to fail because the machine goes down for a reboot and, for unknown reasons, never comes back up. console.txt shows the machine going down for the reboot and then ends.

Example output of $ tail console.txt from a failed coreos.locksmith.reboot run:

[   14.296903] EXT4-fs (vda9): re-mounted. Opts: data=ordered
[   14.297540] systemd-shutdown[1]: Remounting '/usr' read-only with options 'seclabel,block_validity,delalloc,barrier,user_xattr,acl'.
[   14.298362] EXT4-fs (dm-0): re-mounted. Opts: block_validity,delalloc,barrier,user_xattr,acl
[   14.299057] systemd-shutdown[1]: Unmounting /usr.
[   14.299377] systemd-shutdown[1]: Could not unmount /usr: Device or resource busy
[   14.299857] systemd-shutdown[1]: Remounting '/' read-only with options 'seclabel,data=ordered'.
[   14.300472] EXT4-fs (vda9): re-mounted. Opts: data=ordered
[   14.306517] Unregister pv shared memory for cpu 0
[   14.307076] reboot: Restarting system
[   14.307332] reboot: machine restart

I've only observed this with the nightly builds. I'm currently trying to reproduce it locally and will update this bug once I can do so reliably.

My guess is that qemu itself is dying for some reason.
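For triaging many failed runs, the symptom described above (console.txt ending right at the reboot messages) can be checked mechanically. The following is only a rough sketch in Python, not part of kola: the function name `console_ends_at_reboot` is hypothetical, and the marker strings are copied from the log excerpt above on the assumption that other failed runs end the same way.

```python
import re

# Kernel messages seen at the end of the failed console.txt excerpt.
# Assumption: other failed runs end with one of these same lines.
REBOOT_MARKERS = ("reboot: Restarting system", "reboot: machine restart")

# Matches a leading kernel timestamp such as "[   14.307332] ".
_TIMESTAMP = re.compile(r"^\[\s*\d+\.\d+\]\s*")


def console_ends_at_reboot(console_text):
    """Return True if the last non-empty console line is a reboot marker.

    A True result means the log stops at the reboot, with no output
    from a subsequent boot.
    """
    lines = [line.strip() for line in console_text.splitlines() if line.strip()]
    if not lines:
        return False
    last = _TIMESTAMP.sub("", lines[-1])
    return last.startswith(REBOOT_MARKERS)
```

Calling it on the contents of a console.txt would flag logs that stop at the reboot. It says nothing about why the machine did not come back.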

@bgilbert
Contributor

Is this a qemu-specific problem?

@ajeddeloh
Contributor Author

It looks like it - I can't find any failures of this kind on other platforms. qemu_uefi is hard to tell since those time out half the time (woo, another bug).

I'm fairly sure this is the reason we get a lot of weird test flakes, especially the coreos.verity.* ones.

@ajeddeloh
Contributor Author

Looking at console output from failed nightly builds now that #775 has merged, it appears qemu is either exiting with status 0, being killed in error, or getting OOM-killed. Logs from the jenkins worker do not show it being OOM-killed.
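To tell a clean exit apart from a kill, the parent process can inspect how the child ended. This is a minimal Python sketch of the general POSIX idea, not how kola (which is written in Go) handles qemu. The helper name `describe_exit` is hypothetical.

```python
import signal


def describe_exit(returncode):
    """Describe a subprocess return code as reported by Python on POSIX."""
    if returncode == 0:
        return "exited with status 0"
    if returncode < 0:
        # subprocess reports death by signal as the negated signal number.
        try:
            name = signal.Signals(-returncode).name
        except ValueError:
            name = f"signal {-returncode}"
        return f"killed by {name}"
    return f"exited with status {returncode}"
```

An OOM kill and a manual `kill -9` both show up as SIGKILL here, so the exit status alone cannot separate them. That distinction still needs the kernel log on the host.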

@ajeddeloh
Contributor Author

Confirmed kola does log stderr from qemu to stderr as well.
