Machines keep going offline during builds #232
Thing is, they don't. I've been logged into the slaves while the timeout occurred. There's also:
..so I'm suspecting either jenkins (yay) or perhaps even the ci host. |
Minor update: I've updated most of our slaves to 1.52 -- seems like it didn't help; I've seen Windows slaves go offline since. |
fwiw @joaocgreis mentioned that the machines on azure had some weird networking problem that was causing the on/off behaviour, they are unique in this respect, judging by the CI status emails we're getting anyway. |
@rvagg let's assume that was the case then. I'll try to keep a close eye on failures over the next few days. |
I'm starting to think the host is to blame. Should we try updating jenkins? Can't find anything relevant in the changelog. |
I've had connection problems very frequently when I was setting up the cross compiler machine (running Linux). I changed the connection to ssh from Jenkins and haven't seen it fail since. I've installed Cygwin on It's strange that the machines in Azure are constantly having this problem, but the ones on Rackspace are always fine. |
Here's another one from arm slaves: https://ci.nodejs.org/job/OLD-node-test-binary-arm/414/RUN_SUBSET=3,nodes=pi1-raspbian-wheezy/console |
I might have figured out the problem with the Azure machines. The Jenkins slave sends a keep-alive signal with a default interval of 5 minutes. That seems to be too long for Azure, and the connections were being broken because of it. I added a JVM option to the Azure slaves to reduce it to 2 minutes (that's what Azure uses for SSH). Let's see if that's the correct fix for the correct problem, but I'm hopeful. On the other hand, Jenkins has been completely broken since https://ci.nodejs.org/job/node-test-commit/1107/ . Apparently, sub jobs are only being started if it detects a change in git, even though that option is explicitly disabled everywhere. Right now, my best guess is that some plugin update broke it. The multijob plugin was updated (does it have automatic updates?) to 1.19, which introduced the "Resume build" button, and that button appears for the first time in the first build with problems. This might be a coincidence; I'm still looking into it and just wanted to share progress. |
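For anyone following along, here is a rough sketch of the idea behind the keep-alive change, not the actual configuration used on these machines. The comment above does not name the JVM option, so the property name `hudson.slaves.ChannelPinger.pingInterval` is an assumption, and so are the helper names, the `slave.jar` path, and the 4-minute idle timeout used in the example.

```python
from typing import List

# Assumption: Jenkins system property for the agent ping interval, in minutes.
# The thread only says "a JVM option"; it does not name it.
PING_PROPERTY = "hudson.slaves.ChannelPinger.pingInterval"


def keeps_connection_alive(ping_interval_min: float, idle_timeout_min: float) -> bool:
    """Return True if pings are sent more often than the network drops idle connections."""
    return ping_interval_min < idle_timeout_min


def agent_command(jar_path: str, ping_interval_min: int) -> List[str]:
    """Build a java command line that passes the ping interval as a JVM option."""
    # JVM options have to come before -jar, otherwise they are passed to the program.
    return ["java", f"-D{PING_PROPERTY}={ping_interval_min}", "-jar", jar_path]


if __name__ == "__main__":
    hypothetical_idle_timeout = 4  # minutes; made up for illustration only
    print(keeps_connection_alive(5, hypothetical_idle_timeout))  # default interval
    print(keeps_connection_alive(2, hypothetical_idle_timeout))  # reduced interval
    print(" ".join(agent_command("slave.jar", 2)))
```

The point is only that the ping has to fire before whatever sits between master and slave decides the connection is idle; the real timeout on Azure is not stated in this thread.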
@joaocgreis I've noticed a similar problem on my own multi-jobs. I had to enable the "Only build when VCS changes are detected" option because it doesn't actually mean what it says. |
I downgraded the multijob plugin to 1.18 and it's building, looks good so far. I'd rather leave it at 1.18 instead of flipping all the "build only if VCS" checkboxes because we have quite a few. That issue is 8 hours old, perhaps the fix won't take too long. |
The test-binary jobs did not work after downgrading the multijob plugin, so I had to upgrade again and flip all the switches. We'll probably have to flip them again when this gets fixed. |
I haven't seen Azure machines failing again, so I assume the keep-alive interval change fixed it. As for Jenkins, jobs seem to be running well now. So what keeps this issue alive is the (far fewer) random failures not tied to a specific set of slaves. Are those still happening? |
I think we're improving on all fronts 👍 |
We haven't seen disconnects for a long while. Very good news! Let's close this and sleep better at night, hoping it won't be reopened. I guess the bad part is that we never really identified why some of them disconnected, but it's pretty much established that lowering the ping interval between master and slaves did a lot. |
https://ci.nodejs.org/job/node-test-commit-plinux/180/ is an example of it