
build dependency on single machine - node-ci-test-equinix-ubuntu2004_container-armv7l-1 #2835

Closed
mhdawson opened this issue Dec 21, 2021 · 10 comments

@mhdawson
Member

I had thought we added containers on our large machine to help with the arm builds, but it looks like we instead have an additional job that tests on those containers.

Since we only have one of those containers set up, the build backed up when it lost its Jenkins connection (I just figured that out and restarted the container that had lost the connection).

We should likely add other instances, and consider whether we can use them in the fanned testing as well.

@sxa what do you think?

@sxa
Member

sxa commented Dec 21, 2021

Yep, 100%: we need more of these now that the approach has proved to fundamentally work, so this is a good time to look at scaling it up in a repeatable way.

@sxa sxa self-assigned this Dec 21, 2021
@Trott
Member

Trott commented Dec 28, 2021

CI has been stuck now for a few days since test-equinix-ubuntu2004_container-armv7l-1 has been unavailable. I don't suppose anyone is around who knows how to fix it? I'm guessing not and either I'll figure it out somehow (if I even have the right permissions) or (more likely) I'll mess it up worse and/or we will otherwise have to wait until January to sort this out.

@sxa
Member

sxa commented Dec 29, 2021

Container came back properly after managing to get the host started again and is chewing through a backlog of jobs. Interested to know why the container is going down though. While adding extra redundancy into the system is obviously a good thing, I'm nervous about the fact this isn't the first time that container has disconnected itself...

@targos
Member

targos commented Jan 16, 2022

The host is offline again and blocking CI: https://ci.nodejs.org/job/node-test-commit-arm/nodes=ubuntu2004-armv7l/

@richardlau
Member

> Container came back properly after managing to get the host started again and is chewing through a backlog of jobs. Interested to know why the container is going down though. While adding extra redundancy into the system is obviously a good thing, I'm nervous about the fact this isn't the first time that container has disconnected itself...

Logged into the host machine:

root@test-equinix-ubuntu2004-docker-arm64-1:~# docker logs node-ci-test-equinix-ubuntu2004_container-armv7l-1
...
Jan 14, 2022 4:53:27 PM hudson.remoting.jnlp.Main$CuiListener status
INFO: Connected
Jan 15, 2022 5:47:26 PM org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer onRecv
WARNING: [JNLP4-connect connection to ci.nodejs.org/107.170.240.62:41913]
java.lang.NullPointerException
        at org.jenkinsci.remoting.util.DirectByteBufferPool.acquire(DirectByteBufferPool.java:78)
        at org.jenkinsci.remoting.protocol.IOHub.acquire(IOHub.java:165)
        at org.jenkinsci.remoting.protocol.ProtocolStack.acquire(ProtocolStack.java:439)
        at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.processRead(SSLEngineFilterLayer.java:331)
        at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.onRecv(SSLEngineFilterLayer.java:117)
        at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.onRecv(ProtocolStack.java:677)
        at org.jenkinsci.remoting.protocol.NetworkLayer.onRead(NetworkLayer.java:136)
        at org.jenkinsci.remoting.protocol.impl.BIONetworkLayer.access$2200(BIONetworkLayer.java:49)
        at org.jenkinsci.remoting.protocol.impl.BIONetworkLayer$Reader.run(BIONetworkLayer.java:291)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at hudson.remoting.Engine$1.lambda$newThread$0(Engine.java:122)
        at java.lang.Thread.run(Thread.java:748)

Jan 15, 2022 5:47:26 PM org.jenkinsci.remoting.protocol.impl.BIONetworkLayer$Reader run
SEVERE: [JNLP4-connect connection to ci.nodejs.org/107.170.240.62:41913] Reader thread killed by NullPointerException
java.lang.NullPointerException
        at org.jenkinsci.remoting.util.DirectByteBufferPool.acquire(DirectByteBufferPool.java:78)
        at org.jenkinsci.remoting.protocol.IOHub.acquire(IOHub.java:165)
        at org.jenkinsci.remoting.protocol.ProtocolStack.acquire(ProtocolStack.java:439)
        at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.processRead(SSLEngineFilterLayer.java:331)
        at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.onRecv(SSLEngineFilterLayer.java:117)
        at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.onRecv(ProtocolStack.java:677)
        at org.jenkinsci.remoting.protocol.NetworkLayer.onRead(NetworkLayer.java:136)
        at org.jenkinsci.remoting.protocol.impl.BIONetworkLayer.access$2200(BIONetworkLayer.java:49)
        at org.jenkinsci.remoting.protocol.impl.BIONetworkLayer$Reader.run(BIONetworkLayer.java:291)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at hudson.remoting.Engine$1.lambda$newThread$0(Engine.java:122)
        at java.lang.Thread.run(Thread.java:748)

Jan 15, 2022 5:49:29 PM hudson.slaves.ChannelPinger$1 onDead
INFO: Ping failed. Terminating the channel JNLP4-connect connection to ci.nodejs.org/107.170.240.62:41913.
java.util.concurrent.TimeoutException: Ping started at 1642268849904 hasn't completed by 1642268969905
        at hudson.remoting.PingThread.ping(PingThread.java:134)
        at hudson.remoting.PingThread.run(PingThread.java:90)

Jan 15, 2022 5:51:29 PM hudson.slaves.ChannelPinger$1 onDead
INFO: Ping failed. Terminating the channel JNLP4-connect connection to ci.nodejs.org/107.170.240.62:41913.
java.util.concurrent.TimeoutException: Ping started at 1642268969907 hasn't completed by 1642269089908
        at hudson.remoting.PingThread.ping(PingThread.java:134)
        at hudson.remoting.PingThread.run(PingThread.java:90)

Jan 15, 2022 5:53:28 PM hudson.Launcher$RemoteLaunchCallable$1 join
INFO: Failed to synchronize IO streams on the channel hudson.remoting.Channel@a60e45:JNLP4-connect connection to ci.nodejs.org/107.170.240.62:41913
hudson.remoting.ChannelClosedException: Channel "unknown": Protocol stack cannot write data anymore. It is not open for write
        at org.jenkinsci.remoting.protocol.impl.ChannelApplicationLayer$ByteBufferCommandTransport.write(ChannelApplicationLayer.java:331)
        at hudson.remoting.AbstractByteBufferCommandTransport.write(AbstractByteBufferCommandTransport.java:301)
        at hudson.remoting.Channel.send(Channel.java:766)
        at hudson.remoting.Request.call(Request.java:167)
        at hudson.remoting.Channel.call(Channel.java:1000)
        at hudson.remoting.Channel.syncIO(Channel.java:1739)
        at hudson.Launcher$RemoteLaunchCallable$1.join(Launcher.java:1406)
        at sun.reflect.GeneratedMethodAccessor41.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at hudson.remoting.RemoteInvocationHandler$RPCRequest.perform(RemoteInvocationHandler.java:929)
        at hudson.remoting.RemoteInvocationHandler$RPCRequest.call(RemoteInvocationHandler.java:902)
        at hudson.remoting.RemoteInvocationHandler$RPCRequest.call(RemoteInvocationHandler.java:854)
        at hudson.remoting.UserRequest.perform(UserRequest.java:211)
        at hudson.remoting.UserRequest.perform(UserRequest.java:54)
        at hudson.remoting.Request$2.run(Request.java:376)
        at hudson.remoting.InterceptingExecutorService.lambda$wrap$0(InterceptingExecutorService.java:78)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at hudson.remoting.Engine$1.lambda$newThread$0(Engine.java:122)
        at java.lang.Thread.run(Thread.java:748)

Jan 15, 2022 5:53:28 PM hudson.remoting.Request$2 run
INFO: Failed to send back a reply to the request UserRequest:UserRPCRequest:hudson.Launcher$RemoteProcess.join[](55): hudson.remoting.ChannelClosedException: Channel "unknown": Protocol stack cannot write data anymore. It is not open for write
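For what it's worth, the epoch-millisecond timestamps in the two "Ping failed" entries above pin down how long each ping was allowed to run before ChannelPinger terminated the channel. A quick sanity check of the arithmetic (values copied verbatim from the log; whether 120 s is the configured ping timeout on this channel is an assumption):

```shell
# Epoch-ms timestamps copied from the two "Ping failed" log entries above.
start1=1642268849904; end1=1642268969905
start2=1642268969907; end2=1642269089908

# Each failed ping ran for ~120 seconds before the channel was terminated.
echo "first ping window:  $(( (end1 - start1) / 1000 ))s"
echo "second ping window: $(( (end2 - start2) / 1000 ))s"
```

Both windows come out to 120 seconds, so the agent sat unreachable for at least two ping cycles before Jenkins gave up on the channel.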

I've restarted the container (systemctl restart jenkins-test-equinix-ubuntu2004_container-armv7l-1.service).

cc @sxa
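Since the failure mode is visible in `docker logs`, the manual "notice it's down, restart the service" step could in principle be automated. A hedged sketch, not the Build WG's actual tooling: the container and service names are taken from this thread, but the watchdog itself is an assumption.

```shell
#!/bin/sh
# Sketch of a watchdog: scan the agent container's recent logs for the fatal
# errors seen in this issue and restart the systemd unit if any are found.
CONTAINER=node-ci-test-equinix-ubuntu2004_container-armv7l-1
SERVICE=jenkins-test-equinix-ubuntu2004_container-armv7l-1.service

# Succeeds if the log excerpt contains one of the fatal patterns from the
# stack traces above (dead reader thread, failed pings, closed channel).
agent_is_dead() {
  echo "$1" | grep -Eq 'Reader thread killed|Ping failed|ChannelClosedException'
}

if command -v docker >/dev/null 2>&1; then
  logs=$(docker logs --tail 50 "$CONTAINER" 2>&1)
  if agent_is_dead "$logs"; then
    systemctl restart "$SERVICE"
  fi
fi
```

Run from cron or a systemd timer this would have shortened the multi-day outages above, though it would only paper over whatever is killing the connection in the first place.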

@richardlau
Member

This is offline again 😞.
Can't SSH into the host (test-equinix-ubuntu2004_docker-arm64-1); the connection is timing out.

@richardlau
Member

@sxa has got the host and its containers back online ❤️.

@richardlau
Member

Will be fixed by #2911 (thanks @sxa ).

@Trott
Member

Trott commented Apr 1, 2022

> Will be fixed by #2911 (thanks @sxa ).

#2911 has been merged. Should this be closed?

@richardlau
Member

I've just run the Ansible playbooks on the second Equinix-hosted Docker host, and we now have
test-equinix-ubuntu2004_container-armv7l-2 online.
