Follow up: test-macos10.15-x64-2 offline #3218

Closed
richardlau opened this issue Mar 13, 2023 · 25 comments

Comments

@richardlau
Member

All of our macOS 10.14 VMs (all in Orka) are offline.

Initially it was because the agents needed to be updated after #3176 -- I've done that (via the upgrade-jar playbook), but now the agents are refusing to start as they need newer Java. It looks like these are still running Java 8.

@richardlau
Member Author

It also looks like test-orka-macos10.14-x64-2 is unreachable

test-orka-macos10.14-x64-1 : ok=3    changed=2    unreachable=0    failed=0    skipped=5    rescued=0    ignored=0
test-orka-macos10.14-x64-2 : ok=0    changed=0    unreachable=1    failed=0    skipped=0    rescued=0    ignored=0
test-orka-macos10.14-x64-3 : ok=3    changed=2    unreachable=0    failed=0    skipped=4    rescued=0    ignored=0
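For longer recaps, a small filter can pick out the hosts that need attention. This is only a minimal sketch, assuming the recap lines keep the `host : key=value ...` shape shown above; `recap_problem_hosts` is a hypothetical helper name and is not part of the playbooks.

```shell
# Print the hosts that an Ansible play recap reports as unreachable or failed.
# Reads recap lines on stdin; lines that do not look like
# "host : key=value ..." are skipped.
recap_problem_hosts() {
  awk '
    $2 == ":" {
      for (i = 3; i <= NF; i++) {
        split($i, kv, "=")
        if ((kv[1] == "unreachable" || kv[1] == "failed") && (kv[2] + 0) > 0) {
          print $1
          break
        }
      }
    }
  '
}
```

Piping the three recap lines above through `recap_problem_hosts` prints only test-orka-macos10.14-x64-2.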

@richardlau richardlau changed the title from "orka macOS 10.14 VMs offline" to "orka macOS 10.14 and 10.15 VMs offline" Mar 13, 2023
@richardlau
Member Author

Looks like the orka 10.15 VMs are also offline. I logged into test-orka-macos10.15-x64-1 and that appears to be running Java 8 as well.

% java -version
openjdk version "1.8.0_242"
OpenJDK Runtime Environment (AdoptOpenJDK)(build 1.8.0_242-b08)
OpenJDK 64-Bit Server VM (AdoptOpenJDK)(build 25.242-b08, mixed mode)

@richardlau
Member Author

These were updated to Java 17 previously #3030 (comment) / #3085.

@richardlau
Member Author

These were updated to Java 17 previously #3030 (comment) / #3085.

We rebuilt these VMs in January #3112.

@sxa
Member

sxa commented Mar 13, 2023

Nearform macs are running with Eclipse Temurin 17 with /usr/bin/java pointing there.

administrator@test-nearform-macos10 ~iojs % java -version
openjdk version "17.0.6" 2023-01-17
OpenJDK Runtime Environment Temurin-17.0.6+10 (build 17.0.6+10)
OpenJDK 64-Bit Server VM Temurin-17.0.6+10 (build 17.0.6+10, mixed mode, sharing)
administrator@test-nearform-macos10 ~iojs % ls -l /usr/bin/java
lrwxr-xr-x  1 root  wheel  74 18 Dec 13:41 /usr/bin/java -> /System/Library/Frameworks/JavaVM.framework/Versions/Current/Commands/java
administrator@test-nearform-macos10 ~iojs % 

@sxa
Member

sxa commented Mar 13, 2023

So steps required to fix the orka systems:

  1. Install Temurin 17 (the location used by the nearform macs still contains JDK 8 on the orka systems)
  2. Ensure that /usr/bin/java points there (or set the path accordingly in the start.sh script)
  3. Update the agent.jar (was 4.2, needs 4.7)
  4. Fix the start.sh so that SECRET is replaced accordingly for the machine (and update PATH, and name of agent.jar if required)

We should check what the correct upgrade path for Java should be, and whether it makes sense to have the playbooks handle it.

The above steps have allowed test-orka-macos10.15-x64-1 to come back online, although we will need to ensure that this does not conflict with anything in the playbooks. Note that I've left the old slave.jar on the machine, but it is now using the new agent.jar downloaded from the server.
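To confirm which Java a machine will actually run before starting the agent, something like the following could be used. It is a rough sketch rather than anything taken from the playbooks: the helper names are hypothetical, and the minimum version is passed in explicitly because the exact requirement of the new agent.jar is an assumption to verify.

```shell
# Extract the Java major version from a version string such as
# "1.8.0_242" or "17.0.6". Legacy "1.x" strings (Java 8 and earlier)
# carry the major version in the second field.
java_major() {
  version="$1"
  first="${version%%.*}"
  if [ "$first" = "1" ]; then
    rest="${version#1.}"
    echo "${rest%%.*}"
  else
    # Drop any non-numeric suffix such as "-ea".
    echo "${first%%[!0-9]*}"
  fi
}

# Read the quoted version string out of `java -version`,
# which prints to stderr.
installed_java_version() {
  java -version 2>&1 | awk -F'"' '/version/ { print $2; exit }'
}

# Succeed when version string $1 has a major version of at least $2.
# The minimum is supplied by the caller (assumption: check what the
# agent.jar in use really requires).
java_is_new_enough() {
  [ "$(java_major "$1")" -ge "$2" ]
}
```

For example, `java_is_new_enough "$(installed_java_version)" 17` would fail on the orka VMs reporting `1.8.0_242` and pass on the nearform macs reporting `17.0.6`.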

@richardlau
Member Author

@sxa Does the playbook not work?

@sxa
Member

sxa commented Mar 13, 2023

I haven't tried running it; I was just looking into the cause :-)
From your comment earlier, should the two online 10.14 machines be all "playbooked up" now? -1 and -3 don't seem to have come back yet.

@richardlau
Member Author

@sxa I only ran the playbook to update the agent, not the full playbook for Jenkins workers.

@richardlau
Member Author

@UlisesGascon is this something you could take a look at?

@UlisesGascon
Member

Yes, let me see 👍

@UlisesGascon UlisesGascon self-assigned this Mar 18, 2023
@UlisesGascon
Member

UlisesGascon commented Mar 19, 2023

Current situation

I managed to manually recover:

  • test-orka-macos10.14-x64-1 (current Java version: 17.0.6)
  • test-orka-macos10.14-x64-3 (current Java version: 19.0.2)

The recovery process I followed was very similar to #3218 (comment).

Before
[Screenshot: 2023-03-19 at 10:21:04]

After
[Screenshot: 2023-03-19 at 11:10:19]

Next Steps

  • Solve connectivity issues with test-orka-macos11-x64-2, test-orka-macos11-x64-1, test-orka-macos10.14-x64-2
    - [ ] Solve configuration issues with test-orka-macos10.15-x64-2: the Java version and Jenkins settings are fine, but the machine is not connecting to Jenkins
    - [ ] Free Disk space for test-orka-macos11-x64-1

@UlisesGascon
Member

Current status

I managed to manually recover:

  • test-orka-macos11-x64-2 (current Java version: 19.0.2)

Solved connectivity issues for:

  • test-orka-macos11-x64-2 (current Java version: 19.0.2)
  • test-orka-macos11-x64-1 (current Java version: 19.0.2)
  • test-orka-macos10.14-x64-2 (current Java version: 17.0.5)
  • test-orka-macos10.15-x64-2 (current Java version: 19.0.2)

Before
[Screenshot: 2023-03-19 at 11:10:19]

After
[Screenshot: 2023-03-19 at 13:34:21]

Current challenges

1. Release machine(s)

The machine connected to Jenkins as test-orka-macos11-x64-1 is not the actual test-orka-macos11-x64-1; it is probably release-macos1015-x64-1. If release-macos1015-x64-1 is not the machine that is kidnapping the agent, it is worth manually checking release-macos11-x64-1 from #3185, as it is not in the inventory.

I can't access these machines via SSH as they are release ones, but we will need to check their connectivity and Java version too. Maybe @targos can help here.

Important

The machine test-orka-macos11-x64-1 should be fine and fully configured, including the Jenkins agent, but I disabled that node in Jenkins until it is clear that another machine is not kidnapping the agent.

[Screenshot: 2023-03-19 at 13:51:04]

2. Jenkins configuration

The machines test-orka-macos10.15-x64-2 and test-orka-macos10.14-x64-2 are up and running but not connecting to Jenkins as nodes.

test-orka-macos10.15-x64-2 Errors:

Waiting 10 seconds before retry
Failing to obtain https://ci.nodejs.org/computer/test-macstadium-macos10.15-x64-1/slave-agent.jnlp?encrypt=true
java.io.IOException: Failed to load https://ci.nodejs.org/computer/test-macstadium-macos10.15-x64-1/slave-agent.jnlp?encrypt=true: 404 Not Found
	at hudson.remoting.Launcher.parseJnlpArguments(Launcher.java:521)
	at hudson.remoting.Launcher.run(Launcher.java:347)
	at hudson.remoting.Launcher.main(Launcher.java:298)
Waiting 10 seconds before retry
Failing to obtain https://ci.nodejs.org/computer/test-macstadium-macos10.15-x64-1/slave-agent.jnlp?encrypt=true
java.io.IOException: Failed to load https://ci.nodejs.org/computer/test-macstadium-macos10.15-x64-1/slave-agent.jnlp?encrypt=true: 404 Not Found
	at hudson.remoting.Launcher.parseJnlpArguments(Launcher.java:521)
	at huds^C

Not sure why it is pointing to the *-macstadium-* JNLP URL when start.sh is set up with the right address.

...
PATH="/usr/local/opt/ccache/libexec:/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin" java -Xmx128m \
    -jar /Users/iojs/slave.jar -secret REDACTED \
    -jnlpUrl https://ci.nodejs.org/computer/test-orka-macos10.15-x64-2/slave-agent.jnlp    

I tried both the new agent and the old one with different settings and had no luck. I also tried to mimic the settings from other machines, again with no luck.

test-orka-macos10.14-x64-2 Errors:

INFO: Protocol JNLP4-connect encountered an unexpected exception
java.util.concurrent.ExecutionException: org.jenkinsci.remoting.protocol.impl.ConnectionRefusalException: test-orka-macos10.14-x64-1 is already connected to this controller. Rejecting this connection.
	at org.jenkinsci.remoting.util.SettableFuture.get(SettableFuture.java:223)
	at hudson.remoting.Engine.innerRun(Engine.java:740)
	at hudson.remoting.Engine.run(Engine.java:518)
Caused by: org.jenkinsci.remoting.protocol.impl.ConnectionRefusalException: test-orka-macos10.14-x64-1 is already connected to this controller. Rejecting this connection.
	at org.jenkinsci.remoting.protocol.impl.ConnectionHeadersFilterLayer.newAbortCause(ConnectionHeadersFilterLayer.java:378)
	at org.jenkinsci.remoting.protocol.impl.ConnectionHeadersFilterLayer.onRecvClosed(ConnectionHeadersFilterLayer.java:433)
	at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.onRecvClosed(ProtocolStack.java:816)
	at org.jenkinsci.remoting.protocol.FilterLayer.onRecvClosed(FilterLayer.java:287)
	at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.onRecvClosed(SSLEngineFilterLayer.java:172)
	at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.onRecvClosed(ProtocolStack.java:816)
	at org.jenkinsci.remoting.protocol.NetworkLayer.onRecvClosed(NetworkLayer.java:154)
	at org.jenkinsci.remoting.protocol.impl.BIONetworkLayer.access$1500(BIONetworkLayer.java:48)
	at org.jenkinsci.remoting.protocol.impl.BIONetworkLayer$Reader.run(BIONetworkLayer.java:247)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at hudson.remoting.Engine$1.lambda$newThread$0(Engine.java:117)
	at java.lang.Thread.run(Thread.java:748)
	Suppressed: java.nio.channels.ClosedChannelException
		... 7 more

I checked the configuration and everything seems correct, including the secrets:

...

export PATH="/usr/local/opt/python3/Frameworks/Python.framework/Versions/Current/bin:/opt/homebrew/bin:/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin"
export PATH="$(brew --prefix)/opt/ccache/libexec:$PATH"
java -Xmx128m \
    -jar /Users/iojs/agent.jar -secret REDACTED \
    -jnlpUrl https://ci.nodejs.org/manage/computer/test-orka-macos10.14-x64-2/jenkins-agent.jnlp
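Since both errors name a different node than the one expected, it may help to compare the node name embedded in the configured `-jnlpUrl` with the name appearing in the agent log. The following is a minimal sketch under that assumption; `node_from_jnlp_url` and `jnlp_url_from_script` are hypothetical helpers, and the path passed to the second one is whatever start script the machine really uses.

```shell
# Print the Jenkins node name embedded in an agent JNLP URL, i.e. the
# path segment that follows "/computer/". Fails if there is none.
node_from_jnlp_url() {
  rest="${1#*/computer/}"
  # Unchanged string means no "/computer/" segment was found.
  [ "$rest" != "$1" ] || return 1
  echo "${rest%%/*}"
}

# Print the first -jnlpUrl value found in a start script.
jnlp_url_from_script() {
  sed -n 's/.*-jnlpUrl[[:space:]]\{1,\}\([^[:space:]]\{1,\}\).*/\1/p' "$1" | head -n 1
}
```

Applied to the URL in the 10.15 error log this yields `test-macstadium-macos10.15-x64-1`, while the URL in the start.sh excerpt yields `test-orka-macos10.15-x64-2`, which suggests the running agent was not launched with the settings shown in that script.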

Next steps

I am happy to provide support for the next steps but I will need help with Jenkins settings and Release access (cc: @nodejs/build ).

  • Check the Jenkins settings and Java Version for the Release machines in the Release Jenkins CI
  • Patch the Jenkins agent for test-orka-macos10.14-x64-2
  • Patch the Jenkins agent for test-orka-macos10.15-x64-2

@UlisesGascon UlisesGascon removed their assignment Mar 19, 2023
@targos
Member

targos commented Mar 20, 2023

I'm unable to SSH into release-macos1015-x64-1 (it doesn't accept the release ssh key and ends up asking for a password). Same for -macos11-x64-1

@UlisesGascon
Member

@targos if you want, I can try to redeploy the release-macos1015-x64-1 VM; maybe that solves the SSH key issue. But release-macos11-x64-1 hasn't yet been configured with Ansible or SSH key access (it is a fresh VM), if I remember correctly from the pending actions in #3185.

@richardlau
Member Author

Just a thought, but didn't we snapshot the VMs back in January #3112 (comment)?

@UlisesGascon
Member

Yes. Should I try to restore the release ones from the snapshots? Maybe this will restore the machines. Great point @richardlau

@richardlau
Member Author

If we have snapshots I think it would be worth trying.

FWIW test-orka-macos10.15-x64-1 (set up in #3218 (comment)) is failing v14.x builds, e.g. https://ci.nodejs.org/job/node-test-commit-osx/51258/nodes=osx1015/console

12:14:16 No receipt for 'com.apple.pkg.CLTools_Executables' found at '/'.
12:14:16 
12:14:16 No receipt for 'com.apple.pkg.DeveloperToolsCLILeo' found at '/'.
12:14:16 
12:14:16 No receipt for 'com.apple.pkg.DeveloperToolsCLI' found at '/'.
12:14:16 
12:14:16 gyp: No Xcode or CLT version detected!

which is https://github.com/nodejs/build/blob/main/ansible/MANUAL_STEPS.md#install-command-line-tools-for-xcode
Hopefully the snapshots contain properly configured command line tools for Xcode.

@UlisesGascon
Member

@richardlau I will try to restore test-orka-macos10.15-x64-2 from snapshot and then close this issue.

As of today, the failing v14.x builds and the macOS 10.14 related issues are no longer relevant 🤔

@UlisesGascon UlisesGascon self-assigned this May 7, 2023
@UlisesGascon
Member

I tried to use the images macos1015-x64-2_11012023 and nodejs-test-1015.img to restore the VM and it didn't work at all: the macOS VM didn't allow any SSH or VNC connection.

I will see whether generating a new image from the VM macos1015-x64-1 solves the problem, and then I will re-ansible the VM.

@UlisesGascon
Member

I created a new VM using nodejs-test-1015.img as the base image and the connectivity is still down, even using VNC inside the VPN. I will open a ticket with the support team.

@UlisesGascon
Member

Ticket opened with support: SERVICE-164961. I will keep track of the communications.

@UlisesGascon
Member

UlisesGascon commented Jul 7, 2023

The response:

Hi,

This is due to an issue regarding Docker images being removed from some of our source control code bases. We can fix this by upgrading your environment. We have moved our internal routes to a different repository.

Would you like me to conduct an upgrade today?

--

Anahit

I think this is a clear yes. Maybe this will solve #3415 as well 🤔. The last upgrade required us to re-deploy the VMs in the correct spots, and also to re-ansible them, if I remember correctly.

What do you think @nodejs/build ?

@UlisesGascon UlisesGascon changed the title from "orka macOS 10.14 and 10.15 VMs offline" to "Follow up: test-macos10.15-x64-2 offline" Jul 11, 2023
@UlisesGascon
Member

The VM didn't connect after the upgrade, so I asked for additional support.

@UlisesGascon
Member

**test-macos10.15-x64-2 is back 🥳**

I purged the previous VM and regenerated the new VM following the recommendation from the support team. I re-ansibled the machine.

I see that this VM is named with 1015 in its name, is it safe to assume this is running MacOS 10.15? I see that this VM is launched with Net-Boost & IO-Boost enabled. While IO-Boost is supported on MacOS 10.14.5 or later, Net-Boost requires MacOS 11.0 or later.

If Net-Boost is enabled on a VM running an older version of MacOS, networking will not function properly. When testing within the cluster’s network, I was unable to connect either. Which leads me to believe this is the root cause.

Try recreating this VM’s config with Net Boost disabled, that should resolve the issue for this VM. I’ve attached a snippet showing your launched VMs and the configurations. I attempted to SSH into the first in the list which seems to also be on 10.15, and was able to get a response.
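The version rule from the support reply can be turned into a quick pre-flight check before choosing a VM config. This is a minimal sketch that only encodes the "macOS 11.0 or later" requirement stated above for Net-Boost; the helper names are hypothetical, and it does not talk to Orka at all.

```shell
# Succeed when a macOS version string (e.g. "10.15" or "11.0") has a
# major component of at least $2. Only the major component is compared.
macos_major_at_least() {
  major="${1%%.*}"
  [ "$major" -ge "$2" ]
}

# Per the support reply quoted above, Net-Boost needs macOS 11.0 or later.
net_boost_supported() {
  macos_major_at_least "$1" 11
}
```

On a running VM the version can be read with `sw_vers -productVersion`, so `net_boost_supported "$(sw_vers -productVersion)"` fails on the 10.14 and 10.15 VMs and succeeds on the macOS 11 ones.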

Previous
[Screenshot: 2023-07-13 at 19:27:52]

Current
[Screenshot: 2023-07-13 at 19:53:12]
