
Jenkins CI Infrastructure on ARM for Kata-container #516

Closed
kalyxin02 opened this issue Jul 20, 2018 · 21 comments

@kalyxin02

We are working on the public Jenkins CI setup on the ARM platform. The first step should be making one ARM server recognized by the current Jenkins master so that it can be scheduled as one of its Jenkins slaves. Then we can run a number of "safe" CI jobs which we have already tested on it. In parallel, we will continue working to make more and more CI jobs succeed. See #472. But anyway, setting up the physical infrastructure is the first step.
@grahamwhaley @chavafg Thanks for the help.

@grahamwhaley
Contributor

Thanks @kalyxin02. The first thing we probably need to work out is the connection method between the master (http://jenkins.katacontainers.io/ - which is an always-on cloud-hosted VM I believe, with access to 'the internet') and your slave machine.
(as I suspect you know already) Jenkins has a number of methods to achieve this.
If your slave machine is publicly visible then maybe we can use an SSH connection for launching the agent on the slave machine.
If your slave is not publicly visible - maybe behind a proxy or NAT or similar - then I think we'd need to set up a 'JNLP' type connection (where iirc the slave calls out to the master) (https://wiki.jenkins.io/display/JENKINS/Distributed+builds).
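For reference, a rough sketch of what each launch style looks like from the slave's point of view (the hostname, node name and secret below are placeholders, not our actual setup):

# SSH launch: the master connects out to the slave and starts the agent there,
# roughly equivalent to:
ssh jenkins@<slave-host> 'cd /home/jenkins && java -jar slave.jar'

# JNLP launch: the slave calls out to the master, so it also works from behind NAT/a proxy.
# The node name and secret come from the node's page on the Jenkins master.
java -jar slave.jar \
  -jnlpUrl http://jenkins.katacontainers.io/computer/<node-name>/slave-agent.jnlp \
  -secret <secret-from-node-page>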

Let us know your slave connectivity setup, and then we can provide/swap the appropriate keys etc., and let's get that slave online.

We'll also have to work out how we filter/assign jobs to just that slave (and stop it trying to process all/any jobs for now) - previously I have used Jenkins labels for this - @chavafg, do we have any job/slave allocation filtering in place at present?

I think @kalyxin02 and @chavafg can work together to set this up? thx!

@kalyxin02
Author

Hi, @grahamwhaley @chavafg Thanks for the explanation. Our server is on packet.net, which is publicly visible I believe, so we can use an SSH connection. The information we can currently provide about the slave machine is: IP address, username, and the private key for SSH access with that username. I have already sent an email with this information to you all. Hope this helps. If anything else is needed, please let us know. Thanks!

For filtering the job allocation, @Pennyzct has a pull request at #514 that adds a relationship between the architecture and the filtered jobs. Maybe you could have a look and comment. Thanks!

One more thing we want to know about is the software installation needed on the slave. Any configuration files? Tools such as Java, Ant, Maven, or something else? Thanks!

@grahamwhaley
Contributor

Thanks for the email - got it. Hopefully @chavafg can get the key loaded and we can test the slave/agent launch connection.
One requirement I do know you will need on the slave is for the Java runtime to be installed - the jenkins agent is a java program, so needs java installed to launch.
After that any requirements for installed components comes down to the Kata scripts. I'm not sure if we have a base list of items that must be installed - @chavafg , do you know off the top of your head (having just installed some new slaves etc.) - do the scripts handle everything (including go, docker etc.) - I have a feeling they do.
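To sketch the minimal slave preparation (assuming an Ubuntu slave; these are just the stock packages, nothing Kata-specific):

# The agent itself only needs a Java runtime; git is needed so the job can clone the tests repo.
# Everything else (go, docker, ...) should be handled by the Kata .ci setup scripts.
sudo apt-get update
sudo apt-get install -y openjdk-8-jre-headless git
java -version   # confirm the agent can be launched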

It would not surprise me if the distro-specific install script needs tweaking for the addition of a new architecture - we'll find that out the first time we launch a proper job (although it can probably be tested and debugged by hand on the slave, which might be more efficient).

For the job allocation/launching - yes, there are two 'parts' we need to work out.

  1. The CI jobs need to work on all arch's - that is what ci: add filter scheme for architecture-independent #514 is enabling I believe
  2. We need to get Jenkins to schedule a build of each PR on all architectures. How we configure Jenkins to do that we have not defined yet. If we add the ARM slave as-is today then I think each PR would be scheduled on either an ARM slave or x86 slave depending on the availability :-), which is not what we want...

I think I mentioned above maybe, I think there are two ways to configure Jenkins to do this:

  1. We can duplicate all the build jobs on the master, and use labels to have one set run on ARM and one on x86. This is probably the least upfront config/cost, but is maybe not the best in efficiency (we will be duplicating our links to github etc.) or complexity
  2. I believe we can get Jenkins set up in a 'matrix build' configuration, where it knows each job has to be built on multiple architectures. This is probably the right path, but I don't believe our Jenkins jobs are configured like that at present.

I would suggest we investigate the 'matrix' abilities of Jenkins, and set up a test Job for trialling this. Once we are happy that works then we could roll it out to all the other CI jobs.

I'm sort of hoping that the parallel functionality of Jenkins is not purely tied to Jenkins pipeline/DSL - as we currently run 'freestyle' jobs, and I'm not sure if freestyle projects will tie into that. More research required....

@kalyxin02
Author

Hi @grahamwhaley @chavafg, thanks for the connection. I can see "arm01_slave" on the Jenkins master page now. Great, that's the first step! We installed a JDK on the slave, and we will try the https://github.com/kata-containers/tests/blob/master/.ci/setup.sh you pointed out. We have tested the scripts before, and we needed to work around some of the environment setup. We don't have nested virtualization support, so we can't run jobs in a virtual machine - we have to run the jobs on bare metal, and we are trying to figure out whether we can run the jobs in containers. Let's see how we progress with setup.sh.

For the job allocation and launching,

1. The CI jobs need to work on all arch's - that is what #514 is enabling I believe

I guess not all the CI jobs need to work on all archs? It depends on the test cases themselves... What #514 wants to achieve is to make different sets of test cases run on different architectures - in the current case, it can make the ARM CI run only the jobs which we have already verified, so they won't block the current PRs from being merged. Then we can add more and more jobs to the successful list. The benefit is that we save time and can first figure out how to set up suitable Jenkins builds for the different platforms.

For the "matrix" build, I agree with you about the usage. Per my understanding, the hard part is to define the suitable and efficient "axis" to generate the test matrix, and sometimes add necessary filters here. Though the configuration itself is comprehensible.

For parallel, do you mean maintaining the order of execution of different jobs that are related to each other? Then it doesn't relate to the current problem?

@grahamwhaley
Contributor

And the slave is online! Which means the agent has been injected/launched and connected back to the master!
http://jenkins.katacontainers.io/computer/arm01_slave/log

[07/24/18 08:28:10] [SSH] Starting slave process: cd "/home/jenkins" && java  -jar slave.jar
<===[JENKINS REMOTING CAPACITY]===>channel started
Remoting version: 3.21
This is a Unix agent
Evacuated stdout
Agent successfully connected and online

As you don't have nested VM support, you will be able to run the static checks and, I'm hoping, also most of the unit tests ('go test'). I do have a feeling though that the virtcontainers unit tests may not work without nesting - if that is the case we will have to adapt the test scripts to take that into account. It would be great if you can run all the other unit tests on ARM :-)

Running on bare-metal/in containers - have a look at kata-containers/ci#39 if you have not seen it already - I tried to detail where all the pain points might be (that I ran into when trying to do the same for the metrics CI, where I ended up running nested VMs at present).

#514 should fix how we can run the same scripts (https://github.com/kata-containers/tests/blob/master/.ci/jenkins_job_build.sh) on multiple architectures. Yes, we can start by having that only run a minimal set of tests for ARM and then expand that later.

My points around setting up matrix and parallel jobs are about how we get Jenkins configured to launch ARM as well as x86 jobs when a PR lands or changes.

At present, when a PR lands/changes, Jenkins launches a number of parallel builds - Fedora, Ubuntu, CentOS etc. - all on x86. What we need to configure is for the Arm variations to be added to that list and launched as well.

The quick/dirty way is to just add another set of runtime/agent/proxy/shim jobs (to the list of jobs you can see on the CI homepage). That does not scale very well though. If we can set up Jenkins to have a 'matrix of parallel jobs to run', then this should scale better (and maybe ultimately be more maintainable for us) - but it is a larger up-front cost, as we do not have Jenkins configured that way at present.

I would suggest we add a single new ARM job that targets one of our low-bandwidth repos (the proxy maybe?), make that a 'non-required' CI item on GitHub, and then we can test it out with you firing a test PR at that repo. Alternatively, we can tie it to a user repo where you can test locally without injecting noise into the main repo until it is up and running.

Let us know how you are progressing with setup.sh, and then maybe try running static-checks.sh - which is what Travis runs at present - before moving on to jenkins_job_build.sh.
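For hand-testing on the slave, the sequence would be roughly as follows (a sketch only - the exact variables and arguments the scripts expect may differ slightly):

# Rough manual test sequence on the ARM slave, mirroring what the Jenkins job will do.
export GOPATH="${HOME}/go"
mkdir -p "${GOPATH}/src/github.com/kata-containers"
cd "${GOPATH}/src/github.com/kata-containers"
git clone https://github.com/kata-containers/tests.git
cd tests

.ci/setup.sh           # install dependencies (go, docker, runtime components, ...)
.ci/static-checks.sh   # the checks Travis runs today
# .ci/jenkins_job_build.sh "github.com/kata-containers/proxy"   # full CI run, once the above pass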

I do expect we may have to modify or add a new feature or two to the Jenkins script and infrastructure, perhaps to only run the unit tests and skip the virtcontainers tests if we find we are on a system that does not support nesting.
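A possible shape for that check (just a sketch, using the usual KVM module parameter on an x86 host - the equivalent probe on an ARM host would need to be different):

#!/bin/bash
# Detect whether nested virtualization is available, so the CI scripts can
# skip tests that need to boot a VM inside the build VM.
supports_nesting() {
    local param
    for param in /sys/module/kvm_intel/parameters/nested \
                 /sys/module/kvm_amd/parameters/nested; do
        [ -r "$param" ] && grep -qiE '^(y|1)$' "$param" && return 0
    done
    return 1
}

if supports_nesting; then
    echo "Nesting available - run the full test suite"
else
    echo "No nesting - skip the virtcontainers tests"
fi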

@chavafg
Contributor

chavafg commented Jul 25, 2018

Agree with @grahamwhaley about adding a test job for the proxy repo, which does not have too much activity. Just like we did when experimenting with the zuul job.

@kalyxin02 once you think setup.sh can be executed in the arm node, I can create the job and add that step to see how it goes.

@kalyxin02
Author

@grahamwhaley @chavafg Thanks for the connection. It's so great to see the Arm slave server online, even if it's idle for now. :-) I'm happy to hear that you all agree to run a minimal set of tests for ARM and then expand it later. Thanks!

We do pass the static checks and most of the unit tests, except for the issue we have raised for the runtime repository unit tests at kata-containers/runtime#403. Not sure about the current status of the issue, although it was marked as "closed". It needs modifications for x86 as well.

At present, when a PR lands/changes, Jenkins launches a number of parallel builds - Fedora, Ubuntu, CentOS etc. - all on x86. What we need to configure is for the Arm variations to be added to that list and launched as well.

Currently we only have one bare-metal server, which is Ubuntu based, so this is the only build we can have for now, unless we get multiple servers or find some way to run the builds in containers. For kata-containers/ci#39 which you pointed out, I don't quite understand why the problem happens, as it looks like something goes wrong when adding a new PR to a branch... we're looking into it to find out what happens there.

The static-checks.sh is no problem on Arm. For setup.sh, at this stage we have to skip the installation scripts for CNI, CRI-O, Kubernetes, and OpenShift, and comment out the nested-virt support; then it can run on the Arm server. These modifications are actually included in the 3 commits of PR #514. And according to the current upstream feedback, we will push a new version today or tomorrow.

What should we do to add a test job for the proxy repo? Please let us know if you need something from us on Arm. Thanks!

@jodh-intel
Contributor

Hi @kalyxin02 - thanks for the update! The PR associated with kata-containers/runtime#403 (kata-containers/runtime#414) has been merged.

It needs modifications for x86 as well.

I'm not clear what you mean here?

Currently we only have one bare-metal server, which is Ubuntu based,

Is this environment using an LTS release (16.04 or 18.04)?

And according to the current upstream feedback, we will push a new version today or tomorrow.

Great!

What should we do to add a test job for the proxy repo?

All you need to do is get your Jenkins to call the following scripts I think (@chavafg and @grahamwhaley might have more info though):

@grahamwhaley
Contributor

Hi @kalyxin02 For Jenkins, let me go set up a proxy build job for Arm and tie it to your server (using a Jenkins label I expect). Then we can fire a test job/PR at the proxy repo and see how the Arm build goes. I think that is going to be the quickest way for us to see what works/fails and to push this forwards.

I will set the job up to look exactly like the x86 jobs - that is, basically they call https://github.com/kata-containers/tests/blob/master/.ci/jenkins_job_build.sh.

If you know where to look ;-), then you can see the actual commands we run stored in the CI repo:
https://github.com/kata-containers/ci/blob/master/jenkins/jobs/kata-containers-proxy-ubuntu-16-04-PR/config.xml#L97-L108

#!/bin/bash

set -e

export ghprbPullId
export ghprbTargetBranch

cd $HOME
git clone https://github.com/kata-containers/tests.git
cd tests

.ci/jenkins_job_build.sh "github.com/kata-containers/proxy"

As you say, since you are running on a bare-metal machine without the builds contained inside a VM or another container type, you may run into the issues I detailed in kata-containers/ci#39. This normally only happens when you get a particularly 'bad build' - say a PR that leaves a QEMU lying around, or corrupts docker or the runtime somehow. It is fairly rare, but it does happen. So, we will start with your server as bare metal and see how it goes.
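For reference, a rough between-builds cleanup along these lines (purely a sketch, not something we run today) could limit the damage a 'bad build' does on a reused bare-metal slave:

# Hypothetical cleanup between builds on a reused bare-metal slave:
# kill any QEMU left behind by a failed test, and restart docker so the
# next build starts from a sane container-manager state.
sudo pkill -f qemu-system || true
sudo systemctl restart docker
sudo docker info > /dev/null   # sanity check that docker came back up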

I'll let you know here once I've set up the Jenkins job.

@grahamwhaley
Contributor

@kalyxin02 - I have set up a proxy ARM build job on the Jenkins master, and filed a test PR over on the proxy.
I can see it has started to try and build - you can see here: http://jenkins.katacontainers.io/job/kata-containers-proxy-ARM-16.04-PR/1/

Let's keep an eye on that (I think we are expecting it to fail right now) - see what the first failure point is, and then I think it will be best for you to open a test PR on the proxy repo and start chasing down the failures - OK?

@grahamwhaley
Contributor

OK, build failed as expected, but I suspect not in the way we expected. Here is the output in the console log:

GitHub pull request #92 of commit 35b95d21ff8157219271369e9e1c26943af93ab6, no merge conflicts.
Setting status of 35b95d21ff8157219271369e9e1c26943af93ab6 to PENDING with url http://jenkins.katacontainers.io/job/kata-containers-proxy-ARM-16.04-PR/1/ and message: 'Build running'
Using context: jenkins-ci-ARM-ubuntu-16-04
Building remotely on arm01_slave (arm_node ubuntu-1604) in workspace /home/jenkins/workspace/kata-containers-proxy-ARM-16.04-PR
Cloning the remote Git repository
Cloning repository https://github.com/kata-containers/proxy.git
 > git init /home/jenkins/workspace/kata-containers-proxy-ARM-16.04-PR # timeout=10
Fetching upstream changes from https://github.com/kata-containers/proxy.git
 > git --version # timeout=10
 > git fetch --tags --progress https://github.com/kata-containers/proxy.git +refs/heads/*:refs/remotes/origin/*
 > git config remote.origin.url https://github.com/kata-containers/proxy.git # timeout=10
 > git config --add remote.origin.fetch +refs/heads/*:refs/remotes/origin/* # timeout=10
 > git config remote.origin.url https://github.com/kata-containers/proxy.git # timeout=10
Fetching upstream changes from https://github.com/kata-containers/proxy.git
 > git fetch --tags --progress https://github.com/kata-containers/proxy.git +refs/pull/92/*:refs/remotes/origin/pr/92/*
 > git rev-parse refs/remotes/origin/pr/92/merge^{commit} # timeout=10
 > git rev-parse refs/remotes/origin/origin/pr/92/merge^{commit} # timeout=10
Checking out Revision 233430e60d94bc765b7bb7eea22dafaf352b4257 (refs/remotes/origin/pr/92/merge)
 > git config core.sparsecheckout # timeout=10
 > git checkout -f 233430e60d94bc765b7bb7eea22dafaf352b4257
Commit message: "Merge 35b95d21ff8157219271369e9e1c26943af93ab6 into c416c9fc2ff8389c201075f0f26bfabd08a3b140"
First time build. Skipping changelog.
[kata-containers-proxy-ARM-16.04-PR] $ /bin/bash /tmp/jenkins9000578420986577799.sh
fatal: destination path 'tests' already exists and is not an empty directory.
Build step 'Execute shell' marked build as failure

I think the crucial line:

fatal: destination path 'tests' already exists and is not an empty directory.

is going to relate to the fact that our Jenkins kickoff script (configured inside the Jenkins job) already did a git clone of the 'tests' repo before invoking the script - BUT - I thought the job itself would then run inside the Jenkins WORKDIR or somewhere.
This will be one of those niggles we only see when effectively 're-using' a build instance, whereas all current CIs run in 'clean instances' for each build.

This is probably not hard to fix. I'm not quite sure which line bombed out though...

Actually, @kalyxin02 - does your build machine happen to have a 'tests' directory in the jenkins user homedir that is not a git checkout of the kata tests repo? That might be making that initial git clone of the tests repo fail?
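A quick diagnostic sketch for checking that on the slave:

# Is there a leftover 'tests' directory in the jenkins home, and is it a git checkout?
ls -ld /home/jenkins/tests
git -C /home/jenkins/tests remote -v 2>/dev/null || echo "not a git checkout of the tests repo"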

@grahamwhaley
Contributor

A note for @chavafg - I had configured the ARM slave node with the labels 'arm_node' and 'ubuntu-1604' - but - then realised that all the x86 jobs only need the label 'ubuntu-1604', so Jenkins would start scheduling x86 jobs on the ARM node...
What I've done for now is give the ARM slave and build job the labels 'arm_node' and 'arm-ubuntu-1604'. That should pin the ARM builds to only the ARM slave for now. We can then discuss whether we should just stick with that 'arch prefix' on the distro labels, or whether we want to add an x86 arch label to all the x86 build slaves and jobs (and, yes, there are quite a few of them as you know ;-) )

@grahamwhaley
Contributor

And just one more thought. On the metrics CI (where we used to run on bare metal) my startup script is slightly different, which will probably avoid this 'tests dir exists' issue. I effectively have:

export GOPATH="${WORKSPACE}/go"
...
go get -d -u ${TESTS_REPO_NAME}
cd ${GOPATH}/src/${TESTS_REPO_NAME}
...
echo "Run the Jenkins build script"
$(pwd)/.ci/jenkins_job_build.sh "${REPO_NAME}"

So, run each new build within the 'clean' $WORKSPACE dir that is set by Jenkins on a per-job basis.
Let me go and tweak the ARM slave right now to effectively do that, and we'll see if it gets further. When @chavafg is available then we can discuss how we want to finally end up with this configured.

@grahamwhaley
Contributor

OK, I updated the job script to be:

#!/bin/bash

# As we have set the shell to /bin/bash, we need to reset -e (and let's set -x as well eh)
set -ex

# These vars are required by the .ci scripts in the repos
export ghprbPullId
export ghprbTargetBranch

export REPO_NAME="github.com/kata-containers/proxy"
export TESTS_REPO_NAME="github.com/kata-containers/tests"
export TESTS_GIT_NAME="https://github.com/kata-containers/tests.git"
export GOPATH="${WORKSPACE}/go"

#go get -d -u ${TESTS_REPO_NAME}
git clone ${TESTS_GIT_NAME} ${GOPATH}/src/${TESTS_REPO_NAME}
cd ${GOPATH}/src/${TESTS_REPO_NAME}

.ci/jenkins_job_build.sh "${REPO_NAME}"

kicked off a rebuild, and now in the console logs we have from http://jenkins.katacontainers.io/job/kata-containers-proxy-ARM-16.04-PR/2/console :

...
INFO: Go 1.10 already installed
sudo: no tty present and no askpass program specified
manage_ctr_mgr.sh - INFO: docker_info: version: 
manage_ctr_mgr.sh - INFO: docker_info: default runtime: 
manage_ctr_mgr.sh - INFO: docker_info: package name: docker-ce
<13>Jul 31 12:48:33 manage_ctr_mgr.sh: configure_docker: configuring runc as default docker runtime
ls: cannot access '/etc/systemd/system/docker.service.d/': No such file or directory
ls: cannot access '/etc/systemd/system/docker.service.d/': No such file or directory
Stopping the docker service
sudo: no tty present and no askpass program specified
Build step 'Execute shell' marked build as failure

That got a bit further. Seems we have some sudo and permissions things to sort out. Note @kalyxin02 - Jenkins does not provide a tty! You can replicate this locally by hand by running under nohup, iirc - that is what I used to sort out some similar issues a while back.

OK, let me know if/when you need anything more from myself or @chavafg

@kalyxin02
Author

Hi @jodh-intel,
for kata-containers/runtime#403, what we mean is the error that occurs when running make test locally - @grahamwhaley also noticed this error, and you can find his comments about it.

testlog.txt: file already closed

For your second question, the Arm slave machine is running Ubuntu 18.04 LTS.
And thanks for your suggestion about running the scripts under the proxy repository.

@grahamwhaley
Contributor

grahamwhaley commented Aug 1, 2018

Arm slave machine is running Ubuntu 18.04 LTS.

OK, at some point either myself or @chavafg will have to rename the '1604' arm slave on Jenkins QA CI. It won't affect functionality, but we should name it correctly....

@kalyxin02
Author

Hi, @grahamwhaley, thanks for your efforts in starting the CI job trials and for updating the scripts to make a clean build. I'm trying to figure out how to move forward a bit based on your steps and will update you once I have something. Thanks!

@kalyxin02
Author

Hi, @grahamwhaley, I'm a bit confused. kata-containers/proxy#94 triggered the CI job today, and I still see the errors below...

fatal: destination path '/home/jenkins/workspace/kata-containers-proxy-ARM-16.04-PR/go/src/github.com/kata-containers/tests' already exists and is not an empty directory.

@grahamwhaley
Contributor

Hi @kalyxin02 - indeed. OK, I see that the build does not happen in the tmpdir, but always happens in the same fixed workspace.
Something we can do is to tick the 'delete workspace before/after build' box in the Jenkins job - that will help ensure we have at least a clean directory tree.
Having done this before in the metrics CI ;-), I will also add a chmod/chgrp -R jenkins ${WORKSPACE} into the script - as iirc some of our tests run under sudo etc. and leave root-owned files in the build tree, which then make the 'delete workspace' step fail if we do not fix the perms first.
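Something like the following at the end of the job script should do it (a sketch - whether a chown or a chgrp is enough depends on what the tests leave behind):

# Give everything in the workspace back to the jenkins user, so the
# 'delete workspace before/after build' step can actually remove it.
sudo chown -R jenkins:jenkins "${WORKSPACE}"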

Let me go make those changes, and then I will nudge a rebuild and we'll see what happens.

@grahamwhaley
Contributor

OK, it looks like we are back to where we were before

INFO: Go 1.10 already installed
sudo: no tty present and no askpass program specified
manage_ctr_mgr.sh - INFO: docker_info: version: 
manage_ctr_mgr.sh - INFO: docker_info: default runtime: 
manage_ctr_mgr.sh - INFO: docker_info: package name: docker-ce
<13>Aug  2 10:57:13 manage_ctr_mgr.sh: configure_docker: configuring runc as default docker runtime
ls: cannot access '/etc/systemd/system/docker.service.d/': No such file or directory
ls: cannot access '/etc/systemd/system/docker.service.d/': No such file or directory
Stopping the docker service
sudo: no tty present and no askpass program specified

One suggestion maybe @kalyxin02 - is the jenkins user on that slave able to do passwordless sudo? It needs that in order to run some of the scripts/commands.
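For reference, a minimal sketch of enabling that on the slave (assuming the agent runs as the 'jenkins' user; you may want to scope the rule down rather than allow full passwordless sudo):

# Allow the jenkins user to sudo without a password (and therefore without a tty).
echo 'jenkins ALL=(ALL) NOPASSWD: ALL' | sudo tee /etc/sudoers.d/jenkins
sudo chmod 0440 /etc/sudoers.d/jenkins

# Verify it works non-interactively, which is how the Jenkins job will run it.
sudo -u jenkins sudo -n true && echo "passwordless sudo OK"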

@GabyCT
Contributor

GabyCT commented Nov 27, 2019

@kalyxin02 ARM CI is already running, can we close this issue? thanks

@GabyCT GabyCT closed this as completed May 11, 2020