Zuul Job failures #65
Related: #42.
For 1, I have pushed https://review.openstack.org/602627 to double the timeout from one hour to two hours. We time out the job run itself independently of the log collection so that we can debug things (which is why you see all of the kata logs); unfortunately, due to the way ARA works, it has a hard time when we stop Ansible underneath it. The good news is you can see the text version of the log at http://logs.openstack.org/06/706/225e10cfc4bb99722b6f5734a1e840138bcea8a0/third-party-check/kata-runsh/e318be6/job-output.txt.gz. My read of that is that the job was simply taking its time and the timeout is too short, but let us know if that isn't how you read it.

For 2, I pushed https://review.openstack.org/602628, which will globally configure some generic throwaway git identity details.

For 3, we'll need more information to be sure of what is happening. Can you provide a link to the log files? One potential reason is that Vexxhost turned on a new region which we are now using; it should have nested virt enabled everywhere, but perhaps a hypervisor is misconfigured. If you can give us the log link, @mnaser can use those details to check this.
Managed to catch an example of 3 myself. @mnaser http://logs.openstack.org/74/74/a77fc6b82dab0368f1cdc4d4d39ef6390c7a9526/third-party-check/kata-runsh/f17b1f9/job-output.txt.gz#_2018-09-14_14_53_28_559500 ran in sjc1, and if you scroll to http://logs.openstack.org/74/74/a77fc6b82dab0368f1cdc4d4d39ef6390c7a9526/third-party-check/kata-runsh/f17b1f9/job-output.txt.gz#_2018-09-14_14_57_59_437331 it claims not to have vmx. Let me know if the instance info there isn't enough for you and I will find the instance UUID on our end.
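For anyone checking a node by hand, a minimal sketch of how to confirm whether hardware virtualization is exposed to the guest (run on the test instance itself; nothing here is specific to the kata jobs):

```sh
# Count CPU flags advertising hardware virtualization:
# vmx = Intel VT-x, svm = AMD-V. A result of 0 means nested virt is unavailable.
grep -c -E '(vmx|svm)' /proc/cpuinfo

# Kata/KVM also needs the kvm device node to exist and be usable.
ls -l /dev/kvm
```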
As indicated at kata-containers/ci#65, our existing hour-long timeout is not sufficient. Bump it up to two hours to get plenty of headroom. Change-Id: I39a706fe70f0f552a7bb986765acef065bbbace1
The kata test jobs apply patches to git repos with git. This creates commits, which requires a user identity to be configured in git. Set up a global git config with generic Zuul identity info in it to address this. More details at kata-containers/ci#65 Change-Id: I08a6a13501fad92cd290f0a9e5559f61b11d7fab
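For reference, a minimal sketch of that kind of throwaway global identity (the exact name and email strings below are placeholders, not necessarily what the review above configures):

```sh
# Give git a generic identity so commit-creating commands such as git am
# don't fail on freshly provisioned CI nodes.
git config --global user.name "Zuul CI"
git config --global user.email "zuul@example.com"
```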
@cboylan this should be resolved.
Got this error on the last run:
@mnaser, our cri-o tests require a block device.
Is there a way to work around this in your CI to remove that expectation? Maybe using a loopback device?
IIRC we have disabled support for loopback devices, @sboeuf?
That was about running the CRI-O tests in a stable environment. CRI-O was not stable using loopback devices, which is why we moved away from them.
Kata currently runs its own custom flavour in the Jenkins CI, while it uses our normal flavours in SJC1 (Zuul). I really suggest that we come up with a solution for this together, as not having one means that no one can run those tests on their own VMs. @cboylan any workaround ideas?
It might help to have more info about what the block device is used for, but generally my suggestion would be to use a loopback device. At least when testing Swift and Cinder we've used them with success: with Swift to provide an XFS filesystem regardless of the host system, and with Cinder to provide a dedicated VG out of which Cinder can provision LVs with LVM.
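As an illustration of that approach, a minimal sketch of backing a block device with a loopback file (the size, paths, and the XFS-vs-LVM choice are assumptions for the example, not what the kata jobs actually need):

```sh
# Create a sparse file to back the loopback device.
truncate -s 10G /var/tmp/kata-loop.img

# Attach it to the first free loop device; losetup prints the device name.
LOOP_DEV=$(sudo losetup --find --show /var/tmp/kata-loop.img)

# Option A: format it as XFS and mount it where the tests expect storage.
sudo mkfs.xfs "$LOOP_DEV"
sudo mkdir -p /mnt/kata-storage
sudo mount "$LOOP_DEV" /mnt/kata-storage

# Option B: hand it to LVM instead, so logical volumes can be carved from it.
# sudo pvcreate "$LOOP_DEV"
# sudo vgcreate kata-vg "$LOOP_DEV"
```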
So I have tested the use of loopback devices locally and it still works.
@chavafg increasing by 5 min is a lot... Also, my main concern is the stability of the CI. I want to make sure we don't end up with inconsistent failures from the CRI-O tests because we're using a loopback device.
To clarify, this isn't really a Zuul vs. Jenkins issue so much as a Vexxhost cloud region A vs. region B issue. Zuul (and Nodepool) are able to speak to the new Vexxhost region, where mnaser would prefer not to set up Kata-specific flavors to run the jobs (at least that is my understanding). The reason for not doing that is to use something a bit more generic, which ensures others can run these tests too. The upside to using multiple regions is more resources overall, but also more availability, as we can lose an entire cloud region and keep running test jobs.
@chavafg - can we close this one now?
Yes, let's close this one. The issues described here are already solved.
This will check that it is possible to perform a yum update inside a container. Fixes kata-containers#65 Signed-off-by: Gabriela Cervantes <gabriela.cervantes.tellez@intel.com>
Recently, the Zuul jobs have been failing. I see different failures:
Job timeout
For some reason, some of the jobs hit a timeout and, unfortunately, it seems that there is no way to check why (or in which step) the job timed out.
For example: http://logs.openstack.org/06/706/225e10cfc4bb99722b6f5734a1e840138bcea8a0/third-party-check/kata-runsh/e318be6/ara-report/
shows that the run.yaml could not be executed and no logs of that task are available. On the other hand, I see that the post.yaml was executed, which collects the kata logs, meaning that the machine didn't hang, so I was wondering if there could be a way to know the reason for this timeout.
Unable to apply a git patch
It seems that sometimes we are unable to apply a patch with git:
For this one, I think we have two options: 1. add a git config to the Zuul jobs before running the setup, or 2. switch from git am to patch.
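A minimal sketch of option 2, assuming the patches are plain diff/mbox files (the file name below is a placeholder): unlike git am, patch does not create a commit, so no git identity is required.

```sh
# Apply the diff to the working tree without creating a commit,
# so no git user.name/user.email configuration is needed.
patch -p1 < 0001-example-change.patch
```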