
🚨🚨🚨🚨 CI IS DOWN!!!! 🚨🚨🚨🚨 #825

Closed
MylesBorins opened this issue Aug 9, 2017 · 16 comments

@MylesBorins
Contributor

Can anyone help?

@MylesBorins
Contributor Author

/cc @jbergstroem @rvagg @joaocgreis

@rvagg
Member

rvagg commented Aug 9, 2017

on it

@gibfahn
Member

gibfahn commented Aug 9, 2017

Seems back (presumably thanks to @rvagg).

@rvagg
Member

rvagg commented Aug 9, 2017

Fixed and it's up and running

@nodejs/build Disk space problem yet again, but this time I dug a bit deeper and found that it's the workspaces, not the job data, causing most of our disk grief. Every time a job runs, the master does the initial clone to manage the process, so we end up with a lot of clones of some big repos, and Jenkins is pretty messy about it, creating multiple workspaces, even ones with "tmp" in their names.

Unfortunately it's not obvious to me how we could clean these up automatically. We could schedule a cron job and delete them, but we don't want to delete workspaces that are in use. A "last modified" check might do the trick, I suppose. I believe Jenkins doesn't keep internal state about the workspaces; they're just files on disk to be touched whenever.
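
A minimal sketch of that cron + "last modified" idea, assuming the workspaces live under /var/lib/jenkins/jobs/ and that two days idle is a safe threshold (both assumptions, not agreed policy):

  # hypothetical nightly crontab entry: remove workspace dirs untouched for 2+ days
  0 3 * * * find /var/lib/jenkins/jobs/ -type d -name 'workspace*' -mtime +2 -prune -exec rm -rf '{}' +

The -prune keeps find from descending into directories it is about to delete; the obvious caveat is that a directory's mtime doesn't always reflect recent activity deeper inside it.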

Another option is to shunt this work off onto another host, a secondary, that the master uses for all of these workspaces. We have a rule in there that forces this work to be done on the master rather than some random node that's connected (that's the default). I think we could connect a secondary server with a really big disk as a slave node and have it do all of this workspace stuff, leaving the master to manage job coordination. That may have an additional side benefit of making the master more efficient and possibly faster (just a guess).

@rvagg
Member

rvagg commented Aug 9, 2017

100% down to 15% just by deleting workspaces FYI. find /var/lib/jenkins/jobs/ -type d -name 'workspace*' -exec rm -rf '{}' \; should do the trick for anyone who might need to do this in future.

@gibfahn
Member

gibfahn commented Aug 9, 2017

Moving git clones off master (second suggestion) sounds like a good idea; running Jenkins is more than enough work for one machine, in my experience.

@rvagg
Member

rvagg commented Aug 9, 2017

Also, @nodejs/build, it may take a bit of coaxing to get all of these nodes reconnected. Some of them may not be retrying, so any help in getting them back online would be appreciated.

@gibfahn
Member

gibfahn commented Aug 9, 2017

@rvagg a list of the ones you expect to be online would be useful. I'm trying to reconnect test-softlayer-centos6-x64-2 and test-softlayer-centos6-x64-1 and getting this:

Aug 9, 2017 4:55:20 PM hudson.remoting.jnlp.Main createEngine
INFO: Setting up slave: test-softlayer-centos6-x64-2
Aug 9, 2017 4:55:20 PM hudson.remoting.jnlp.Main$CuiListener <init>
INFO: Jenkins agent is running in headless mode.
Aug 9, 2017 4:55:20 PM hudson.remoting.jnlp.Main$CuiListener status
INFO: Locating server among [https://ci.nodejs.org/]
Aug 9, 2017 4:55:20 PM hudson.remoting.jnlp.Main$CuiListener status
INFO: Handshaking
Aug 9, 2017 4:55:20 PM hudson.remoting.jnlp.Main$CuiListener status
INFO: Connecting to ci.nodejs.org:41913
Aug 9, 2017 4:55:20 PM hudson.remoting.jnlp.Main$CuiListener status
INFO: Trying protocol: JNLP2-connect
Aug 9, 2017 4:55:21 PM hudson.remoting.jnlp.Main$CuiListener status
INFO: Connected
Aug 9, 2017 4:55:21 PM hudson.remoting.jnlp.Main$CuiListener status
INFO: Terminated

I tried downloading a new slave.jar from https://ci.nodejs.org/computer/test-softlayer-centos6-x64-2/, but it gives this error:

Exception in thread "main" java.lang.UnsupportedClassVersionError: hudson/remoting/Launcher : Unsupported major.minor version 51.0
	at java.lang.ClassLoader.defineClass1(Native Method)
	at java.lang.ClassLoader.defineClass(ClassLoader.java:648)
	at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
	at java.net.URLClassLoader.defineClass(URLClassLoader.java:277)
	at java.net.URLClassLoader.access$000(URLClassLoader.java:73)
	at java.net.URLClassLoader$1.run(URLClassLoader.java:212)
	at java.net.URLClassLoader$1.run(URLClassLoader.java:206)
	at java.security.AccessController.doPrivileged(Native Method)
	at java.net.URLClassLoader.findClass(URLClassLoader.java:205)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:325)
	at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:296)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:270)
	at sun.launcher.LauncherHelper.checkAndLoadMain(LauncherHelper.java:406)

Do we expect this machine to connect?
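
For reference, class-file version 51.0 corresponds to Java 7, so the error above usually means the freshly downloaded slave.jar needs Java 7+ while the agent is being launched with an older JVM (probably the stock Java 6 on this CentOS 6 box). A quick check on the host, assuming shell access:

  # confirm which JVM the agent launch script picks up; a "1.6.x" version here would explain the error
  java -version
  which java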

@gibfahn
Member

gibfahn commented Aug 9, 2017

Okay, everything is back except for the Pis and these machines:

Also one of the macs is running out of space: Fixed (for now)

@rvagg
Member

rvagg commented Aug 9, 2017

Yes, I'm pretty sure that machine was working last week; I was working on the other centos6-x64, which was offline, but this one was still fine.

What JVM is it using? That error sucks because there's no clear way to fix it. Just make sure you have the slave.jar from ci.nodejs.org and an updated JVM. I'll get on soon and see what I can do if you can't make headway.

Regarding which machines should be online: all of them, unless you can't SSH in. Most of the ones not in the ARM cluster are good; I have a bunch of Pis offline, though.

@rvagg
Member

rvagg commented Aug 9, 2017

test-mininodes-ubuntu1604-arm64_odroid_c2-1 needs a restart, it developed problems yesterday. I'm getting David @ miniNodes to deal with it.

test-digitalocean-freebsd10-x64-1 is an interesting one. I was trying to get it online over the weekend but failed; I've tried hard-rebooting it to no avail. I can open the web console via DigitalOcean and it even responds (I can't log in there, of course), so it looks like a network problem. @jbergstroem should we just reprovision this machine? Is Ansible OK with these in its current form? I've never done a FreeBSD provision before.

Working on the centos6 machines now.

@joaocgreis
Member

Both centos6-x64 machines are back; service jenkins restart did the trick. I don't see anything in place to restart Jenkins when it crashes (no monit or systemd), so if there really is nothing, we should add something.
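
A minimal monit sketch for that, assuming a SysV init script managed via service jenkins and a pidfile at /var/run/jenkins.pid (both assumptions that would need checking against the actual agent setup):

  # /etc/monit.d/jenkins (hypothetical): restart the service if the process dies
  check process jenkins with pidfile /var/run/jenkins.pid
    start program = "/sbin/service jenkins start"
    stop program  = "/sbin/service jenkins stop"
    if 3 restarts within 5 cycles then alert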

@rvagg
Member

rvagg commented Aug 9, 2017

Cleaned up a few more hosts too; looks like we're back on track now except for the freebsd10.

@rvagg rvagg closed this as completed Aug 9, 2017
@refack
Contributor

refack commented Aug 10, 2017

I think the aix failures in CitGM are related to the restart:
https://ci.nodejs.org/view/Node.js-citgm/job/citgm-smoker/947/nodes=aix61-ppc64/console
(all packages fail to either download or install)
ping @mhdawson @gibfahn

@mhdawson
Member

I cleaned up some old processes and restarted the Jenkins agent. A lot of the tests ran, but there are still a bunch of failures. What I can't tell is whether this is different from before, as CitGM has lots of red overall.

https://ci.nodejs.org/view/Node.js-citgm/job/citgm-smoker/nodes=aix61-ppc64/952/consoleFull
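
For future reference, the cleanup described above usually amounts to something like the following on the agent host (the grep patterns are illustrative assumptions, not taken from the actual AIX box):

  # spot leftover agent/build processes, then stop the stragglers before restarting the agent
  ps -ef | grep -E 'slave\.jar|citgm|node' | grep -v grep
  kill <stale-pid>        # <stale-pid> is a placeholder; use kill -9 only as a last resort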

@rvagg
Member

rvagg commented Aug 15, 2017

@nodejs/build - I've just made some major changes to the way CI executes:

  • There are two new hosts, test-packetnet-ubuntu1606-x64-1 and test-packetnet-ubuntu1606-x64-2. They have the label "jenkins-workspace" and can each handle 20 parallel executors. They are 4-core Atom servers (i.e. not huge, but they seem to perform quite nicely) with 1 TB of attached storage on /home/. They are set up on the server side like normal test hosts, but in CI they are configured as generic workers.
  • I've set the master to "run only when specifically selected" and reduced it to a single executor (I wouldn't mind setting it to 0).
  • I've reconfigured a bunch of jobs (as many of the active ones as I could) so that "Restrict the nodes this job can be executed on" is set to "jenkins-workspace" (see the config.xml sketch at the end of this comment). That means they do their management on those hosts: the git clone, main script execution and coordination, not the actual test running. Some jobs run specifically on other hosts, like benchmark and lint, so they aren't changed. But for the most part, jobs whose child nodes execute the tests/build should now do their main coordination on one of these new hosts.
  • Cleaned out workspace*/ directories from the master host again

It's possible we'll have some job configuration problems from this, so if things seem to be failing for wacky reasons, this could be the cause. There may be more ironing out to do. But this should take a big load off the master and relieve the disk pressure there too, so we should even be able to extend the number of days we retain data beyond the current 5 (or 7, I don't recall what it was when I last looked).
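
For anyone debugging a misbehaving job, the change is the standard node-restriction setting; in a job's config.xml it ends up looking roughly like this (a hypothetical excerpt for illustration, not copied from an actual node-test job):

  <project>
    <assignedNode>jenkins-workspace</assignedNode>
    <canRoam>false</canRoam>
    <!-- clone, main script and coordination run on the jenkins-workspace hosts;
         the build/test matrix still fans out to the usual test nodes -->
  </project>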
