test-osuosl-aix72-ppc64_be-3 build failures #2718
Comments
I had to clean up a bunch of leftover gmake/gcc processes on that machine on Thursday while preparing the releases. Given that https://ci.nodejs.org/job/node-test-commit-aix/37790/nodes=aix72-ppc64/ is encountering I won't be able to log into the machine until late Sunday night/Monday -- supposedly we have a playbook runnable from our AWX server that is able to clean up processes (https://github.com/nodejs/build/blob/master/ansible/playbooks/jenkins/worker/restart-agent.yml) but I don't actually see any templates under resources as I did before as documented in #2714 (cc @AshCripps). |
Looks like |
hmm. I just logged into
|
I attempted to run the Jenkins worker playbook against
while https://github.com/AdoptOpenJDK/openjdk8-binaries/releases/tag/jdk8u292-b10_openj9-0.26.0 is the latest. Unfortunately I got an error running the playbook:
which is because Ansible has failed to find
We saw a similar issue before for |
#2720 to fix the
|
Current theory is that the instability is coming from
Where |
We've had a reasonable number of green builds on -3 but also some failures. For example on https://ci.nodejs.org/job/node-test-commit-aix/nodes=aix72-ppc64/37855/ and |
There was a suggestion to wipe out the agent jar and reinstall it -- I've done so (delete, rerun the Jenkins worker create playbook and the restart agent playbook) but we've already got the Java exceptions 😞. |
nio disconnection in https://ci.nodejs.org/job/node-test-commit-aix/nodes=aix72-ppc64/37887/console on -2 😞 . |
We appear to have had nio disconnects as far back as over a month ago, e.g. https://ci.nodejs.org/job/node-test-commit-aix/nodes=aix72-ppc64/37311/ (1 month 7 days ago on -3). Not sure about earlier than that as we don't have the Jenkins logs for the earlier failed builds. It does feel like the disconnects have become more frequent recently. |
Dear all,
The explanation is that the frame has dual power supplies, but both are
on the same feed. The UPS has battery issues (this was known and new
batteries are on order, but they won't be available before mid-September).
I have asked the DC management to move the system to a rack with dual
feeds. As soon as there is more info on this I'll update here and/or via
Slack (to Ash).
regards,
Michael
p.s. - please check your "lost+found" directories for any
files/directories. If present, they may be taking up space and/or give
hints on what needs further attention.
On 05/08/2021 18:00, Richard Lau wrote:
We appear to have had nio disconnects as far back as over a month ago,
e.g.
https://ci.nodejs.org/job/node-test-commit-aix/nodes=aix72-ppc64/37311/
(1 month 7 days ago on -3). Not sure about earlier than that as we
don't have the Jenkins logs for the earlier failed builds.
It does feel like the disconnects have become more frequent recently.
|
Another nio disconnect: https://ci.nodejs.org/job/node-test-commit-aix/nodes=aix72-ppc64/38061/ occurred two days ago on Ran the |
We've got this in the Jenkins console log on the machine corresponding to the NIO disconnect in https://ci.nodejs.org/job/node-test-commit-aix/nodes=aix72-ppc64/38159/console:
The first exception is the nio disconnect... the second |
Initial searches for |
We've had five disconnects in the last six days (since I last cleaned up). Captured logs before running restart-agent.
|
Six more disconnects since the previous restart on
|
I'm trying out running the Jenkins agent with Java 11 on |
The labels seem to have been renewed (the be-3 bit is new to me, so let me figure out which system this really is; I see it has had this since Aug. 1st). The system may still be suffering from the crash, so a refresh of the system may still be needed. |
@aixtools the ip should be: |
We had one disconnect and subsequent NoClassDefFoundError in the last week on
Trying Java 17. |
Unfortunately we still get the two problems with Java 17. |
Curiously the first disconnect on Oct 8 didn't leave any child processes behind but when I logged in just now there were child processes running but no (according to Jenkins) build in progress (note the 13:23 timestamps that correlate with the disconnect on Oct 10):
|
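Spotting leftover build processes like the ones described above could be scripted roughly as follows. This is only a sketch: it assumes a listing in the style of `ps -eo pid,etime,comm` (PID, elapsed time, command name), and the sample rows, the helper name `find_leftovers` and the default process names (taken from the gmake/gcc cleanup mentioned earlier in this thread) are illustrative, not the output captured on the machine.

```python
def find_leftovers(ps_output, names=("gmake", "gcc")):
    """Return (pid, elapsed, command) tuples for processes whose name is in `names`.

    `ps_output` is assumed to have three whitespace-separated columns:
    PID, elapsed time and command name, optionally preceded by a header line.
    """
    matches = []
    for line in ps_output.splitlines():
        fields = line.split(None, 2)
        # Skip the header line and anything that does not look like a process row.
        if len(fields) != 3 or not fields[0].isdigit():
            continue
        pid, elapsed, command = fields
        command = command.strip()
        # Compare on the basename so that /usr/bin/gmake also matches "gmake".
        if command.rsplit("/", 1)[-1] in names:
            matches.append((int(pid), elapsed, command))
    return matches


# Hypothetical listing, for illustration only.
sample = """\
    PID     ELAPSED COMMAND
   1234    02:10:05 gmake
   5678       00:42 sshd
   9012    02:09:58 /usr/bin/gcc
"""

for pid, elapsed, command in find_leftovers(sample):
    print(pid, elapsed, command)
```

Any process this reports while Jenkins shows no build in progress is a candidate for cleanup, although whether to kill it remains a judgment call for whoever is logged in.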
Well, if the details above are comparable with the disconnects on the other systems, it might be safe to say that the disconnects are not related to side-effects left over from the POWER outage. Is it possible to go back to an old version of Java (pre-August) and see if the agent is stable again (and/or check the CI server admin logs for changes made in the last week of July or the first week of August)? Maybe what you see with AIX is just a symptom of something created elsewhere. |
I've implemented a new Jenkins job (restricted currently to the Build WG) that runs |
We're now seeing the disconnects on every build #2872 😞. |
I'm getting some slowness on interactive shells to OSUOSL AIX machines. For now I've added a job that's doing a ping every 20 seconds to see if that shows any connectivity issues which is writing to |
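A periodic connectivity probe along these lines could be sketched as below. This is not the job that was actually set up: the function names, the host, port and log path arguments are all hypothetical, and the sketch times a TCP connection rather than an ICMP ping so that it can run without elevated privileges.

```python
import socket
import time


def probe(host, port, timeout_s=5.0):
    """Attempt one TCP connection; return (ok, seconds_elapsed)."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            ok = True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        ok = False
    return ok, time.monotonic() - start


def format_line(now, host, ok, elapsed):
    """Build one log line with a UTC timestamp, the host, the status and the latency."""
    stamp = time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime(now))
    status = "ok" if ok else "FAIL"
    return f"{stamp} {host} {status} {elapsed:.3f}s"


def run(host, port, log_path, interval_s=20, count=None):
    """Probe every `interval_s` seconds, appending results to `log_path`.

    With count=None the loop runs until interrupted.
    """
    done = 0
    while count is None or done < count:
        ok, elapsed = probe(host, port)
        with open(log_path, "a") as log:
            log.write(format_line(time.time(), host, ok, elapsed) + "\n")
        done += 1
        if count is None or done < count:
            time.sleep(interval_s)
```

Logging the latency of successful probes as well as failures makes it possible to see slowdowns that stop short of a full outage.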
@sxa It would appear that the current issue we have is with Jenkins' "Ping thread" (https://www.jenkins.io/doc/book/system-administration/monitoring/#ping-thread) which is causing the Jenkins server to close down the agent on the OSUOSL AIX instances as it doesn't get a response within the ping timeout (both with 120s that we were using before and the default 240s). I've temporarily disabled the ping thread and restarted the AIX agents and so far the builds look like they're progressing much better (I got a Node.js 16 build through at least 🎉 https://ci.nodejs.org/job/node-test-commit-aix/40040/). Possible issues with connectivity/network throughput could certainly explain the pings from the Jenkins server not being responded to. |
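The behaviour described above amounts to a liveness check with a deadline: send a ping, wait up to the timeout for a reply, and close the connection if none arrives. A minimal sketch of that idea follows. It is not Jenkins' implementation; `responds_within`, `check_agent` and the callbacks are hypothetical names used only for illustration.

```python
import threading
import time


def responds_within(ping, timeout_s):
    """Return True if calling ping() completes within timeout_s seconds."""
    done = threading.Event()

    def worker():
        ping()
        done.set()

    # Daemon thread so a ping that never returns cannot keep the process alive.
    threading.Thread(target=worker, daemon=True).start()
    return done.wait(timeout_s)


def check_agent(ping, timeout_s, on_timeout):
    """Run one liveness check and call on_timeout() if the deadline is missed."""
    if responds_within(ping, timeout_s):
        return True
    on_timeout()
    return False


# A reply delayed by a slow network (simulated with sleep) misses the deadline.
closed = []
check_agent(lambda: time.sleep(0.5), 0.05, lambda: closed.append("channel closed"))
print(closed)
```

The sketch shows why raising the timeout from 120s to 240s only helps if replies are delayed rather than lost: a reply that never arrives misses any deadline.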
Looking a lot more green now, which is good. Possibly one to alert @aixtools about when he's back next week, as it's looking like network connectivity outages. I've been having channel shutdowns on OSUOSL AIX boxes at Adoptium too: adoptium/infrastructure#2473 (slightly different messages, although those are connecting in from the master over ssh instead of starting over jnlp) |
test-osuosl-aix72-ppc64_be-3 is failing repeatedly so I marked it offline.
Refs: nodejs/node#39604 (comment)
@nodejs/platform-aix