Clean up jenkins nodes which are not contactable over ssh #3486
Comments
A number of these are the ones hosted on the equinix machines which are to be decommissioned as part of #3292:
I guess the docker images have been shut down on those hosts as well as being marked offline which is why jenkins is still trying to connect to them. |
It looks like many of the ones with just one entry in the log are ones that have been marked offline in the jenkins UI. The following are on dockerhost.dockerhost-equinix-ubuntu2004-x64-1 and can now be decommissioned. The machines on the ubuntu2204 host have all been removed already:
These have all now been removed from jenkins. |
Remaining test-docker machines that are not contactable:
@Haroon-Khel Do you know why the ones marked Altra (dockerhost-equinix Arm64 systems) and the Azure ones here are offline - is that expected? |
I've removed the alibaba machines from jenkins. They are still in the inventory file for now.
Jenkins agent node definitions have been backed up to |
Noting that these try to connect about once every 20 minutes in a failure case, and take a varying amount of time to fail the connection, up to 825s |
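To put the figures above in perspective, here is a rough back-of-the-envelope sketch. It assumes (my assumption, not something measured in this issue) that the retry interval is counted from the start of one attempt to the start of the next, and that every attempt takes the worst-case time to fail. The function name is made up for illustration.

```python
def worst_case_connect_seconds_per_day(retry_interval_s: float, max_failure_s: float) -> float:
    """Upper bound on time spent per day in failing connection attempts for one node.

    Assumes attempts start every retry_interval_s seconds and each one
    takes max_failure_s seconds before it gives up.
    """
    attempts_per_day = 86_400 / retry_interval_s  # seconds in a day / interval
    return attempts_per_day * max_failure_s


if __name__ == "__main__":
    seconds = worst_case_connect_seconds_per_day(20 * 60, 825)
    print(f"{seconds:.0f} s per node per day ({seconds / 3600:.1f} h)")
```

Under those assumptions, one dead node could tie up a connection attempt for as much as 59400 s (about 16.5 hours) per day, which is another reason to remove nodes that are never coming back.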
Of the offline machines in #3486 (comment), I'm seeing a lot of
It seems that on the dockerhosts the ports have been changed?
I wonder what caused this |
The one I've just looked at seemed to be trying to use
|
The ports were rearranged such that alpine nodes became debian/ubuntu/fedora nodes. Alpine uses jdk21 while the others use 17, hence the confusion in jenkins. We use 21 on x64 and arm64 alpine because there is no arm64 alpine jdk17 binary. |
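A minimal sketch of how that kind of mismatch could be spotted, assuming you have one mapping of what jenkins thinks each node is and another of what actually answers on that port. The node names, distro strings, and both mappings are hypothetical. The only rule taken from the comment above is "alpine uses 21, everything else uses 17".

```python
from typing import Dict, List, Tuple

ALPINE_JDK = 21   # per the comment above: alpine agents run on jdk21
DEFAULT_JDK = 17  # all other agents run on jdk17


def expected_jdk(distro: str) -> int:
    """JDK major version an agent of this distro is expected to use."""
    return ALPINE_JDK if distro.lower().startswith("alpine") else DEFAULT_JDK


def find_jdk_mismatches(
    configured: Dict[str, str], actual: Dict[str, str]
) -> List[Tuple[str, int, int]]:
    """Return (node, jdk jenkins expects, jdk the container has) where they differ.

    configured: node name -> distro according to the jenkins node definition
    actual:     node name -> distro of the container now answering on that port
    Nodes missing from `actual` are skipped, since nothing is known about them.
    """
    mismatches = []
    for node, distro in configured.items():
        found = actual.get(node)
        if found is None:
            continue
        want, have = expected_jdk(distro), expected_jdk(found)
        if want != have:
            mismatches.append((node, want, have))
    return mismatches


if __name__ == "__main__":
    # Hypothetical data: the port for "node-a" now points at an ubuntu container.
    configured = {"node-a": "alpine319", "node-b": "ubuntu2204"}
    actual = {"node-a": "ubuntu2204", "node-b": "ubuntu2204"}
    print(find_jdk_mismatches(configured, actual))
```

This only compares expectations. It does not talk to jenkins or to the containers, so the two mappings would have to be collected by hand or by some other script.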
Also noting that we're getting EDIT: Noting that the |
👍🏻 We should consider a migration of everything up to 21 where possible (arm32 and Solaris being the exceptions, although arm32 could have an ea-beta build but I'd rather leave those at 17) Ref #3442 (comment) |
Noting that as per #1843 (comment) the machine test-aws-ubuntu2004-x64-1 has been decommissioned so I'll remove that from jenkins too. |
Other than the RISE ones which are offline due to the administrator being away last week, we are left with just two systems showing recurring problems today:
|
test-docker-ubuntu2004-x64-4 has been rebuilt and now works. I'm seeing four in the log now, but these are the containers on the Skytap x64 dockerhost, which has expired its credits again despite the reduction in size of that system which was put in place for this month:
|
Since the skytap machine is down to 6 cores I'm deleting all of the above agents other than debian12 and UBI8 from the machine |
Closing on the basis that all of these have been resolved other than the Skytap x64 node which is a "known issue" |
On an ssh failure, jenkins is trying to reconnect to machines about once every half hour. We should analyse the list and ensure we know why each is not contactable, and determine whether to remove it, or remediate it, or whether it is a known temporary outage. There are quite a few, particularly in the test-docker set, so I'm going to tag @Haroon-Khel on this one. This was identified through other work to clear up the jenkins system logs.
Machines which have been non-contactable over ssh by jenkins today
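As a starting point for the triage described in this issue, a small probe along these lines could split the list into "nothing listening", "something listening but not ssh", and "ssh answers". This is only a sketch using the Python standard library. The function names, status strings, and the shape of the `nodes` mapping are my own assumptions, and real host names and ports would have to come from the inventory file.

```python
import socket
from typing import Dict, Tuple


def probe_ssh(host: str, port: int = 22, timeout: float = 5.0) -> str:
    """Classify one endpoint as 'ok', 'no-banner', or 'unreachable'.

    'ok'          - TCP connect worked and the peer sent an SSH banner
    'no-banner'   - TCP connect worked but no SSH banner arrived in time
    'unreachable' - TCP connect failed (refused, timed out, DNS failure, ...)
    """
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.settimeout(timeout)
            try:
                banner = sock.recv(64)  # SSH servers send "SSH-2.0-..." first
            except OSError:
                return "no-banner"
    except OSError:
        return "unreachable"
    return "ok" if banner.startswith(b"SSH-") else "no-banner"


def triage(nodes: Dict[str, Tuple[str, int]], timeout: float = 5.0) -> Dict[str, str]:
    """Probe every node; `nodes` maps a node name to its (host, port)."""
    return {
        name: probe_ssh(host, port, timeout)
        for name, (host, port) in nodes.items()
    }


if __name__ == "__main__":
    # Hypothetical entry; replace with real hosts and ports from the inventory.
    print(triage({"example-node": ("127.0.0.1", 22)}, timeout=2.0))
```

A short timeout keeps the whole run quick even when many hosts are dead. A 'no-banner' result on a dockerhost port would be consistent with a port that now maps to a different or stopped container, but that still needs checking by hand.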