Zuul Job failures #65
Related: #42.
For 1, I have pushed https://review.openstack.org/602627 to double the timeout from one hour to two hours. We time out the job run itself independently of the log collection so that we can debug things (which is why you see all of the kata logs); unfortunately, due to the way ARA works, it has a hard time when we stop Ansible underneath it. The good news is you can see the text version of the log at http://logs.openstack.org/06/706/225e10cfc4bb99722b6f5734a1e840138bcea8a0/third-party-check/kata-runsh/e318be6/job-output.txt.gz. My read of that is that the job was simply taking its time and the timeout is too short, but let us know if that isn't how you read it.

For 2, I pushed https://review.openstack.org/602628, which will globally configure some generic throwaway git identity details.

For 3, we'll need more information to be sure of what is happening. Can you provide a link to the log files? One potential reason is that Vexxhost turned on a new region which we are now using; it should have nested virt enabled everywhere, but perhaps a hypervisor is misconfigured. If you can give us the log link, @mnaser can use those details to check this.
Managed to catch an example of 3 myself. @mnaser http://logs.openstack.org/74/74/a77fc6b82dab0368f1cdc4d4d39ef6390c7a9526/third-party-check/kata-runsh/f17b1f9/job-output.txt.gz#_2018-09-14_14_53_28_559500 ran in sjc1, and if you scroll to http://logs.openstack.org/74/74/a77fc6b82dab0368f1cdc4d4d39ef6390c7a9526/third-party-check/kata-runsh/f17b1f9/job-output.txt.gz#_2018-09-14_14_57_59_437331 it claims not to have vmx. Let me know if the instance info there isn't enough for you and I will find the instance UUID on our end.
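For anyone checking a node by hand, a minimal sketch of how to confirm whether hardware virtualization is exposed to the guest (run on the test instance itself; nothing here is specific to the kata jobs):

```sh
# Count CPU flags advertising hardware virtualization:
# vmx = Intel VT-x, svm = AMD-V. A result of 0 means nested virt is unavailable.
grep -c -E '(vmx|svm)' /proc/cpuinfo

# Kata/KVM also needs the kvm device node to exist and be usable.
ls -l /dev/kvm
```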
As indicated at kata-containers/ci#65, our existing hour-long timeout is not sufficient. Bump it up to two hours to get plenty of headroom. Change-Id: I39a706fe70f0f552a7bb986765acef065bbbace1
The kata test jobs apply patches to git repos with git. This creates commits, which requires a user identity to be configured in git. Set up a global git config with generic Zuul identity info in it to address this. More details at kata-containers/ci#65 Change-Id: I08a6a13501fad92cd290f0a9e5559f61b11d7fab
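For reference, a minimal sketch of that kind of throwaway global identity (the exact name and email strings below are placeholders, not necessarily what the review above configures):

```sh
# Give git a generic identity so commit-creating commands such as git am
# don't fail on freshly provisioned CI nodes.
git config --global user.name "Zuul CI"
git config --global user.email "zuul@example.com"
```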
@cboylan this should be resolved.
Got this error on the last run:
@mnaser, our cri-o tests require a block device.
Is there a way to work around this in your CI to remove that expectation? Maybe using a loopback device?
IIRC we have disabled support for loopback devices, @sboeuf?
That was about running the CRI-O tests in a stable environment. CRI-O was not stable using loopback devices, which is why we moved away from them.
Kata currently runs its own custom flavour in the Jenkins CI, while it uses our normal flavours in SJC1 (Zuul). I really suggest that we come up with a solution for this together, as not having one means that no one can run those tests on their own VMs. @cboylan any workaround ideas?
It might help to have more info about what the block device is used for, but generally my suggestion would be to use a loopback device. At least when testing Swift and Cinder we've used them with success: with Swift to provide an XFS filesystem regardless of the host system, and with Cinder to provide a dedicated VG out of which Cinder can provision LVs with LVM.
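As an illustration of that approach, a minimal sketch of backing a block device with a loopback file (the size, paths, and the XFS-vs-LVM choice are assumptions for the example, not what the kata jobs actually need):

```sh
# Create a sparse file to back the loopback device.
truncate -s 10G /var/tmp/kata-loop.img

# Attach it to the first free loop device; losetup prints the device name.
LOOP_DEV=$(sudo losetup --find --show /var/tmp/kata-loop.img)

# Option A: format it as XFS and mount it where the tests expect storage.
sudo mkfs.xfs "$LOOP_DEV"
sudo mkdir -p /mnt/kata-storage
sudo mount "$LOOP_DEV" /mnt/kata-storage

# Option B: hand it to LVM instead, so logical volumes can be carved from it.
# sudo pvcreate "$LOOP_DEV"
# sudo vgcreate kata-vg "$LOOP_DEV"
```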
So I have tested the use of loopback devices locally and it still works.
@chavafg increasing by 5 min is a lot... Also, my main concern is the stability of the CI. I want to make sure we don't end up with inconsistent failures from the CRI-O tests because we're using a loopback device.
To clarify, this isn't really a Zuul vs. Jenkins issue so much as a Vexxhost cloud region A vs. region B issue. Zuul (and Nodepool) are able to speak to the new Vexxhost region, where mnaser would prefer not to set up Kata-specific flavors to run the jobs (at least that is my understanding). The reason for not doing that is to use something a bit more generic, which ensures others can run these tests too. The upside to using multiple regions is more resources overall, but also more availability, as we can lose an entire cloud region and keep running test jobs.
@chavafg - can we close this one now?
Yes, let's close this one. The issues described here are already solved.
This will check that it is possible to perform a yum update inside a container. Fixes kata-containers#65 Signed-off-by: Gabriela Cervantes <gabriela.cervantes.tellez@intel.com>
Recently, the Zuul jobs have been failing. I see different failures:
Job timeout
For some reason, some of the jobs hit a timeout and, unfortunately, it seems that there is no way to check why (or in which step) the job timed out.
For example: http://logs.openstack.org/06/706/225e10cfc4bb99722b6f5734a1e840138bcea8a0/third-party-check/kata-runsh/e318be6/ara-report/
shows that the run.yaml could not be executed and no logs of that task are available. On the other hand, I see that the post.yaml was executed, which collects the kata logs, meaning that the machine didn't hang, so I was wondering if there could be a way to know the reason for this timeout.
Unable to apply a git patch
It seems that sometimes we are unable to apply a patch with git:
For this one, I think we have two options: 1. add a git config to the Zuul jobs before running the setup, or 2. switch from git am to patch.
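A minimal sketch of option 2, assuming the patches are plain diff/mbox files (the file name below is a placeholder): unlike git am, patch does not create a commit, so no git identity is required.

```sh
# Apply the diff to the working tree without creating a commit,
# so no git user.name/user.email configuration is needed.
patch -p1 < 0001-example-change.patch
```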