Machines keep going offline during builds #232

evanlucas · 2015-10-29T18:00:54Z

https://ci.nodejs.org/job/node-test-commit-plinux/180/ is an example of it

jbergstroem · 2015-10-30T00:02:41Z

Thing is, they don't. I've been logged into the slaves while the timeout occurred. There's also:

..so I'm suspecting either jenkins (yay) or perhaps even the ci host.

jbergstroem · 2015-10-30T02:26:42Z

and more https://ci.nodejs.org/job/node-test-binary-windows/RUN_SUBSET=0,VS_VERSION=vs2013,label=win2008r2/170/console

jbergstroem · 2015-11-01T22:53:11Z

minor update: I've updated most of our slaves to 1.52 -- seems like it didn't help; I've seen windows slaves go offline since.

rvagg · 2015-11-01T23:15:56Z

fwiw @joaocgreis mentioned that the machines on azure had some weird networking problem that was causing the on/off behaviour. They are unique in this respect, judging by the CI status emails we're getting anyway.

jbergstroem · 2015-11-01T23:47:23Z

@rvagg let's assume that was the case then. I'll keep a close eye on failures over the next few days.

jbergstroem · 2015-11-02T22:58:04Z

Still around: https://ci.nodejs.org/job/node-test-commit-plinux/nodes=ppcbe-fedora20/198/console

jbergstroem · 2015-11-03T00:55:02Z

https://ci.nodejs.org/job/node-test-commit-plinux/nodes=ppcle-ubuntu1404/200/console

jbergstroem · 2015-11-03T00:55:43Z

https://ci.nodejs.org/job/node-test-commit-linux/nodes=ubuntu1504-64/1111/console

jbergstroem · 2015-11-03T00:56:03Z

I'm starting to think the host is to blame. Should we try updating jenkins? Can't find anything relevant in the changelog.

joaocgreis · 2015-11-03T03:35:54Z

I had connection problems very frequently while setting up the cross compiler machine (running Linux). I changed the connection to ssh from Jenkins and haven't seen it fail since.

I've installed Cygwin on node-msft-win10-5 to try ssh to Windows (it should work), but no luck connecting so far.

It's strange that the machines in Azure are constantly having this problem, but the ones on Rackspace are always fine.

jbergstroem · 2015-11-04T19:43:21Z

Here's another one from arm slaves: https://ci.nodejs.org/job/OLD-node-test-binary-arm/414/RUN_SUBSET=3,nodes=pi1-raspbian-wheezy/console

joaocgreis · 2015-11-13T18:25:45Z

I might have figured out the problem with the Azure machines. The Jenkins slave sends a keep-alive signal with a default interval of 5 minutes. That seems to be too long for Azure, and the connections were being dropped because of it. I added a JVM option to the Azure slaves to reduce the interval to 2 minutes (that's what Azure uses for SSH). Let's see if that's the correct fix for the correct problem, but I'm hopeful.
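To illustrate the reasoning, here is a minimal sketch of why a keep-alive interval longer than a network idle timeout lets connections die. The 4-minute idle timeout below is an assumed, illustrative value and not something measured in this thread, and the function name is hypothetical. The thread also does not name the JVM option that was set. On Jenkins of that era the ping interval was typically controlled by the `hudson.slaves.ChannelPinger.pingInterval` system property (in minutes), but that this was the option used here is an assumption.

```python
def idle_gap_exceeds_timeout(keepalive_interval_s: float, idle_timeout_s: float) -> bool:
    """Return True when the silence between two keep-alive pings is long
    enough for an idle timeout to drop an otherwise quiet connection."""
    return keepalive_interval_s >= idle_timeout_s


# Assumption: a 4-minute idle timeout somewhere on the network path.
ASSUMED_IDLE_TIMEOUT_S = 4 * 60

# Values taken from the comment above: 5-minute default, reduced to 2 minutes.
DEFAULT_KEEPALIVE_S = 5 * 60
REDUCED_KEEPALIVE_S = 2 * 60

default_at_risk = idle_gap_exceeds_timeout(DEFAULT_KEEPALIVE_S, ASSUMED_IDLE_TIMEOUT_S)
reduced_at_risk = idle_gap_exceeds_timeout(REDUCED_KEEPALIVE_S, ASSUMED_IDLE_TIMEOUT_S)

print(f"default interval at risk: {default_at_risk}")
print(f"reduced interval at risk: {reduced_at_risk}")
```

Under that assumed timeout, the default interval leaves a gap long enough for the connection to be dropped during a quiet stretch of a build, while the reduced interval does not.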

On the other hand, Jenkins has been completely broken since https://ci.nodejs.org/job/node-test-commit/1107/ . Apparently, sub jobs are being started only if it detects any change in git, even though that option is explicitly disabled everywhere.

Right now, my best guess is that some plugin update broke it. The multijob plugin was updated (does it have automatic updates?) to 1.19, which introduced the "Resume build" button, and that button appears for the first time in the first build with problems. This might be a coincidence. I'm still looking into it; this is just to share progress.

rmg · 2015-11-13T18:34:05Z

@joaocgreis I've noticed a similar problem on my own multi-jobs. I had to enable the "Only build when VCS changes are detected" option because it doesn't actually mean what it says.

See https://issues.jenkins-ci.org/browse/JENKINS-30952

joaocgreis · 2015-11-13T18:45:57Z

I downgraded the multijob plugin to 1.18 and it's building, looks good so far. I'd rather leave it at 1.18 instead of flipping all the "build only if VCS" checkboxes because we have quite a few. That issue is 8 hours old, perhaps the fix won't take too long.

joaocgreis · 2015-11-13T19:25:06Z

The test-binary jobs did not work after downgrading the multijob plugin, had to upgrade again and flip all the switches. We'll probably have to flip them again when this gets fixed.

~~But they still don't work: https://ci.nodejs.org/job/node-test-binary-arm/482/console and https://ci.nodejs.org/job/node-test-binary-windows/284/console~~ EDIT: I cloned the jobs to clear the history, they seem to be working now.

joaocgreis · 2015-11-17T23:38:30Z

I haven't seen Azure machines failing again, so I assume the keep alive interval change fixed it. As for jenkins, jobs seem to be running well now.

So what's keeping this issue alive is the (much rarer) random failures not tied to a specific set of slaves. Are those still happening?

jbergstroem · 2015-11-17T23:51:46Z

I think we're improving on all fronts 👍

jbergstroem · 2015-11-26T02:30:15Z

We haven't seen disconnects for a long while. Very good news! Let's close this and sleep better at night, hoping it won't be reopened. I guess the bad part is that we never really identified the cause of some of the disconnects, but it's pretty much established that lowering the ping interval between master and slaves did a lot.

jbergstroem changed the title ~~PPC machines keep going offline during builds~~ Machines keep going offline during builds Oct 30, 2015

This was referenced Nov 17, 2015

Should there be a Testing WG? nodejs/node#3872

Closed

create ansible script for jenkins host #243

Closed

orangemocha mentioned this issue Nov 24, 2015

Jenkins MultiJob plug-in issues #265

Closed

jbergstroem closed this as completed Nov 26, 2015
