
🚨🚨🚨🚨 CI IS DOWN!!!! 🚨🚨🚨🚨 #825

Closed
MylesBorins opened this issue Aug 9, 2017 · 16 comments

@MylesBorins
Contributor

Can anyone help?

@MylesBorins
Contributor Author

/cc @jbergstroem @rvagg @joaocgreis

@rvagg
Member

rvagg commented Aug 9, 2017

on it

@gibfahn
Member

gibfahn commented Aug 9, 2017

Seems back (presumably thanks to @rvagg).

@rvagg
Member

rvagg commented Aug 9, 2017

Fixed and it's up and running

@nodejs/build Disk space problem yet again, but this time I dug a bit deeper and found that it's the workspaces, not the job data, causing most of our disk grief. Every time a job runs, the master does the initial clone to manage the process, so we end up with a lot of clones of some big repos, and Jenkins is pretty messy about it, creating multiple workspaces, even ones with "tmp" in their names.

Unfortunately it's not obvious to me how we could clean these up automatically. We could schedule a cron job and delete them, but we don't want to delete workspaces that are in use. A "last modified" check might do the trick, I suppose. I believe Jenkins doesn't keep internal state about the workspaces; they're just files on disk to be touched whenever.
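
A minimal sketch of that cron + "last modified" idea, assuming the workspaces live under /var/lib/jenkins/jobs/ and that two days idle is a safe threshold (both assumptions, not agreed policy):

  # hypothetical nightly crontab entry: remove workspace dirs untouched for 2+ days
  0 3 * * * find /var/lib/jenkins/jobs/ -type d -name 'workspace*' -mtime +2 -prune -exec rm -rf '{}' +

The -prune keeps find from descending into directories it is about to delete; the obvious caveat is that a directory's mtime doesn't always reflect recent activity deeper inside it.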

Another option is to shunt this work off onto another host, a secondary, that the master uses for all of these workspaces. We have a rule in there that forces this work to be done on the master rather than some random node that's connected (that's the default). I think we could connect a secondary server with a really big disk as a slave node and have it do all of this workspace stuff, leaving the master to manage job coordination. That may have an additional side benefit of making the master more efficient and possibly faster (just a guess).

@rvagg
Member

rvagg commented Aug 9, 2017

100% down to 15% just by deleting workspaces FYI. find /var/lib/jenkins/jobs/ -type d -name 'workspace*' -exec rm -rf '{}' \; should do the trick for anyone who might need to do this in future.

@gibfahn
Member

gibfahn commented Aug 9, 2017

Moving git clones off master (second suggestion) sounds like a good idea; running Jenkins is more than enough work for one machine, in my experience.

@rvagg
Member

rvagg commented Aug 9, 2017

Also, @nodejs/build, it may take a bit of coaxing to get all of these nodes reconnected. Some of them may not be retrying, so any help in getting them back online would be appreciated.

@gibfahn
Member

gibfahn commented Aug 9, 2017

@rvagg a list of the ones you expect to be online would be useful. I'm trying to reconnect test-softlayer-centos6-x64-2 and test-softlayer-centos6-x64-1 and getting this:

Aug 9, 2017 4:55:20 PM hudson.remoting.jnlp.Main createEngine
INFO: Setting up slave: test-softlayer-centos6-x64-2
Aug 9, 2017 4:55:20 PM hudson.remoting.jnlp.Main$CuiListener <init>
INFO: Jenkins agent is running in headless mode.
Aug 9, 2017 4:55:20 PM hudson.remoting.jnlp.Main$CuiListener status
INFO: Locating server among [https://ci.nodejs.org/]
Aug 9, 2017 4:55:20 PM hudson.remoting.jnlp.Main$CuiListener status
INFO: Handshaking
Aug 9, 2017 4:55:20 PM hudson.remoting.jnlp.Main$CuiListener status
INFO: Connecting to ci.nodejs.org:41913
Aug 9, 2017 4:55:20 PM hudson.remoting.jnlp.Main$CuiListener status
INFO: Trying protocol: JNLP2-connect
Aug 9, 2017 4:55:21 PM hudson.remoting.jnlp.Main$CuiListener status
INFO: Connected
Aug 9, 2017 4:55:21 PM hudson.remoting.jnlp.Main$CuiListener status
INFO: Terminated

I tried downloading a new slave.jar from https://ci.nodejs.org/computer/test-softlayer-centos6-x64-2/, but it gives this error:

Exception in thread "main" java.lang.UnsupportedClassVersionError: hudson/remoting/Launcher : Unsupported major.minor version 51.0
	at java.lang.ClassLoader.defineClass1(Native Method)
	at java.lang.ClassLoader.defineClass(ClassLoader.java:648)
	at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
	at java.net.URLClassLoader.defineClass(URLClassLoader.java:277)
	at java.net.URLClassLoader.access$000(URLClassLoader.java:73)
	at java.net.URLClassLoader$1.run(URLClassLoader.java:212)
	at java.net.URLClassLoader$1.run(URLClassLoader.java:206)
	at java.security.AccessController.doPrivileged(Native Method)
	at java.net.URLClassLoader.findClass(URLClassLoader.java:205)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:325)
	at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:296)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:270)
	at sun.launcher.LauncherHelper.checkAndLoadMain(LauncherHelper.java:406)

Do we expect this machine to connect?
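
For reference, class-file version 51.0 corresponds to Java 7, so the error above usually means the freshly downloaded slave.jar needs Java 7+ while the agent is being launched with an older JVM (probably the stock Java 6 on this CentOS 6 box). A quick check on the host, assuming shell access:

  # confirm which JVM the agent launch script picks up; a "1.6.x" version here would explain the error
  java -version
  which java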

@gibfahn
Member

gibfahn commented Aug 9, 2017

Okay, everything is back except for the Pis and these machines:

Also one of the macs is running out of space: Fixed (for now)

@rvagg
Member

rvagg commented Aug 9, 2017

Yes, I'm pretty sure that machine was working last week; I was working on the other centos6-x64, which was offline, but this one was still fine.

What JVM is it using? That error sucks because there's no clear way to fix it. Just make sure you have the slave.jar from ci.nodejs.org and an updated JVM. I'll get on soon and see what I can do if you can't make headway.

Regarding which machines should be online: all of them, unless you can't SSH in. Most of the ones not in the ARM cluster are good; I have a bunch of Pis offline, though.

@rvagg
Member

rvagg commented Aug 9, 2017

test-mininodes-ubuntu1604-arm64_odroid_c2-1 needs a restart, it developed problems yesterday. I'm getting David @ miniNodes to deal with it.

test-digitalocean-freebsd10-x64-1 is an interesting one. I was trying to get it online over the weekend but failed; I've tried hard-rebooting it to no avail. I can open the web console via DigitalOcean and it even responds (I can't log in there, of course), so it looks like a network problem. @jbergstroem should we just reprovision this machine? Is Ansible OK with these in its current form? I've never done a FreeBSD provision before.

Working on the centos6 machines now.

@joaocgreis
Member

Both centos6-x64 machines are back; service jenkins restart did the trick. I don't see anything in place to restart Jenkins when it crashes (no monit or systemd), so if there really is nothing, we should add something.
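
A minimal monit sketch for that, assuming a SysV init script managed via service jenkins and a pidfile at /var/run/jenkins.pid (both assumptions that would need checking against the actual agent setup):

  # /etc/monit.d/jenkins (hypothetical): restart the service if the process dies
  check process jenkins with pidfile /var/run/jenkins.pid
    start program = "/sbin/service jenkins start"
    stop program  = "/sbin/service jenkins stop"
    if 3 restarts within 5 cycles then alert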

@rvagg
Member

rvagg commented Aug 9, 2017

Cleaned up a few more hosts too; looks like we're back on track now except for the freebsd10.

@rvagg rvagg closed this as completed Aug 9, 2017
@refack
Contributor

refack commented Aug 10, 2017

I think the aix failures in CitGM are related to the restart:
https://ci.nodejs.org/view/Node.js-citgm/job/citgm-smoker/947/nodes=aix61-ppc64/console
(all packages fail to either download or install)
ping @mhdawson @gibfahn

@mhdawson
Member

I cleaned up some old processes and restarted the Jenkins agent. A lot of the tests ran, but there are still a bunch of failures. What I can't tell is whether this is different from before, as CitGM has lots of red overall.

https://ci.nodejs.org/view/Node.js-citgm/job/citgm-smoker/nodes=aix61-ppc64/952/consoleFull
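
For future reference, the cleanup described above usually amounts to something like the following on the agent host (the grep patterns are illustrative assumptions, not taken from the actual AIX box):

  # spot leftover agent/build processes, then stop the stragglers before restarting the agent
  ps -ef | grep -E 'slave\.jar|citgm|node' | grep -v grep
  kill <stale-pid>        # <stale-pid> is a placeholder; use kill -9 only as a last resort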

@rvagg
Member

rvagg commented Aug 15, 2017

@nodejs/build - I've just made some major changes to the way CI executes:

  • There are two new hosts, test-packetnet-ubuntu1606-x64-1 and test-packetnet-ubuntu1606-x64-2. They have the label "jenkins-workspace" and can each handle 20 parallel executors. They are 4-core Atom servers (i.e. not huge, but they seem to perform quite nicely) with 1 TB of attached storage on /home/. They are set up on the server side like normal test hosts, but in CI they are configured as generic workers.
  • I've set the master to "run only when specifically selected" and reduced it to a single executor (I wouldn't mind setting it to 0).
  • I've reconfigured a bunch of jobs (as many of the active ones as I could) so that "Restrict the nodes this job can be executed on" is set to "jenkins-workspace" (see the config.xml sketch at the end of this comment). That means they do their management on those hosts: the git clone, main script execution and coordination, not the actual test running. Some jobs run specifically on other hosts, like benchmark and lint, so they aren't changed. But for the most part, jobs whose child nodes execute the tests/build should now do their main coordination on one of these new hosts.
  • Cleaned out workspace*/ directories from the master host again

It's possible we'll have some job configuration problems from this, so if things seem to be failing for wacky reasons, this could be the cause. There may be more ironing out to do. But this should take a big load off the master and relieve the disk pressure there too, so we should even be able to extend the number of days we retain data beyond the current 5 (or 7, I don't recall what it was when I last looked).
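
For anyone debugging a misbehaving job, the change is the standard node-restriction setting; in a job's config.xml it ends up looking roughly like this (a hypothetical excerpt for illustration, not copied from an actual node-test job):

  <project>
    <assignedNode>jenkins-workspace</assignedNode>
    <canRoam>false</canRoam>
    <!-- clone, main script and coordination run on the jenkins-workspace hosts;
         the build/test matrix still fans out to the usual test nodes -->
  </project>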
