Read-only file system on release CI #2626

Closed
richardlau opened this issue Apr 19, 2021 · 39 comments

@richardlau
Member

After landing nodejs/nodejs-dist-indexer#15, I checked https://nodejs.org/download/nightly/ to see if I'd missed the nightly and noticed that we haven't had a nightly build since 17 April, even though the master branch of the core repo has been updated since then.

From https://ci-release.nodejs.org/log/all

Apr 19, 2021 1:00:08 AM WARNING jenkins.model.lazy.LazyBuildMixIn newBuild
A new build could not be created in job iojs+release
java.io.IOException: Read-only file system
	at java.io.UnixFileSystem.createFileExclusively(Native Method)
	at java.io.File.createTempFile(File.java:2024)
	at hudson.util.AtomicFileWriter.<init>(AtomicFileWriter.java:142)
Caused: java.io.IOException: Failed to create a temporary file in /var/lib/jenkins/jobs/iojs+release
	at hudson.util.AtomicFileWriter.<init>(AtomicFileWriter.java:144)
	at hudson.util.AtomicFileWriter.<init>(AtomicFileWriter.java:109)
	at hudson.util.AtomicFileWriter.<init>(AtomicFileWriter.java:84)
	at hudson.util.AtomicFileWriter.<init>(AtomicFileWriter.java:74)
	at hudson.util.TextFile.write(TextFile.java:116)
	at hudson.model.Job.saveNextBuildNumber(Job.java:283)
	at hudson.model.Job.assignBuildNumber(Job.java:342)
	at hudson.model.Run.<init>(Run.java:322)
	at hudson.model.AbstractBuild.<init>(AbstractBuild.java:166)
	at hudson.matrix.MatrixBuild.<init>(MatrixBuild.java:79)
Caused: java.lang.reflect.InvocationTargetException
	at sun.reflect.GeneratedConstructorAccessor193.newInstance(Unknown Source)
	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
	at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
	at jenkins.model.lazy.LazyBuildMixIn.newBuild(LazyBuildMixIn.java:181)
	at hudson.model.AbstractProject.newBuild(AbstractProject.java:963)
	at hudson.model.AbstractProject.createExecutable(AbstractProject.java:1139)
	at hudson.model.AbstractProject.createExecutable(AbstractProject.java:138)
	at hudson.model.Executor$1.call(Executor.java:365)
	at hudson.model.Executor$1.call(Executor.java:347)
	at hudson.model.Queue._withLock(Queue.java:1443)
	at hudson.model.Queue.withLock(Queue.java:1304)
	at hudson.model.Executor.run(Executor.java:347)
@richardlau
Member Author

cc @nodejs/build-infra

@rvagg
Member

rvagg commented Apr 19, 2021

on it

@rvagg
Member

rvagg commented Apr 19, 2021

hmm, this might be more complicated than just a dodgy filesystem, having to resort to some rescue operations 🤞

@rvagg
Member

rvagg commented Apr 19, 2021

@richardlau do you happen to have the IBM Cloud VPN set up? I think we need to look at the console of the machine; it's in a half-booted state and not accepting SSH connections. I can use the IBM Cloud rescue mode and everything looks fine from a superficial perspective, and I don't see any problems in the logs, so I think watching a boot might be the next step if we can.

Machine is https://cloud.ibm.com/gen1/infrastructure/virtual-server/16320983/details, in the Actions menu there's a "KVM Console" but it needs the VPN.

Alternatively we just take this chance to set up an entirely new 20.04 machine and transfer what settings we can from a rescue boot of this one. I just haven't figured out how to access the additional disk we use for /var/lib/jenkins in rescue mode, but maybe we just figure out how to transfer it to a new machine. I probably can't do this today but may be able to allocate a bit of time tomorrow to have a go (it doesn't have to be me if someone else with infra access is brave enough).

@richardlau
Member Author

I don't currently have IBM Cloud VPN set up. Should probably do so anyway so I'll look into doing that.

I probably can't do this today but may be able to allocate a bit of time tomorrow to have a go (it doesn't have to be me if someone else with infra is brave enough).

This is one of those things I'm scared to touch 😅.

@rvagg
Member

rvagg commented Apr 19, 2021

they don't offer 20.04 but I'll start an 18.04 from scratch and try to transfer the boot disk data from rescue mode to start with. I'll let you know here when I stop for the day (soon)

@richardlau
Member Author

FWIW I think this was the cause of the read-only filesystem:

At 17 April 2021 20:16 UTC, customer back-end traffic in the DAL09 data-center may have started experiencing intermittent network connectivity. At 17 April 2021 20:47 UTC, this intermittent back-end network connectivity cleared. During this period, customer back-end traffic in the DAL09 data-center may have experienced degraded network connectivity in the form of latency and/or packet loss. VSI's hosts may have found that their file-systems went read-only requiring a reboot to restore read/write access. If you are still experiencing any issues please reach out to our support department and reference this event ID.

@rvagg
Member

rvagg commented Apr 19, 2021

Sounds like it!

I've just made an 18.04 with basic setup, jenkins & java (I haven't bothered looking to see if we have any ansible scripts for this; it'll be way out of date and out of sync anyway!). I have the old machine in rescue mode and I've just done an rsync of / and /boot onto the new machine in /old-ci-release/. This is my inventory entry for the new machine: infra-ibm-ubuntu1804-x64-1: {ip: 169.45.166.50}. It's got nodejs_build_infra in it.
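
Roughly, the copy was along these lines (a sketch only; the hostname placeholder and the rescue-mode mount points are assumptions, not the exact commands):

# run on the new machine; <old-machine> and the /mnt/ paths are placeholders
$ rsync -aHAX --numeric-ids root@<old-machine>:/mnt/xvda2/ /old-ci-release/
$ rsync -aHAX --numeric-ids root@<old-machine>:/mnt/xvda1/ /old-ci-release/boot/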

I've also managed to get access to the /var/lib/jenkins disk but it's a complete mess. The superblock was borked and an fsck has moved everything to lost+found; we've lost most of the directory structure but retained most of the files. It's going to be impossible to rebuild this, I think, but we have access to the important pieces if we can find them (with lots of find and grep ..).

It looks like our backup is active though (thanks to whoever fixed that up when it overflowed last time!). /backup/periodic/daily.0/ci-release.nodejs.org has a most-recent file dated April 17th. I'm currently rsyncing that to the new server as /jenkins-backup/. Hopefully it contains enough of the key pieces to get this all back online properly.

I have to head off for the day and have a busy day ahead of me tomorrow but will try and find some time to hop back in here and continue.

Here's the next steps I think we'll need to take (feel free to try and tackle any/all of them without me!):

  • Allocate a new 300G disk in Dallas 9, mount it as /var/lib/jenkins on the new machine (see /old-ci-release/etc/fstab for how we did this previously; a sketch follows this list) and move /jenkins-backup/ contents into it.
  • Configure iptables .. we don't have this in backup, but we have it from the old machine, /old-ci-release/etc/iptables* has everything we should need 🤞.
  • Set up nginx and the ssl stuff, to front Jenkins. Again /etc/nginx/* should have everything I think 🤞.
  • Move ci-release DNS in Cloudflare.
  • Convince all of the release nodes to connect to this new server.
  • Make sure the backup server talks to this new one properly (probably requires resetting ssh host key -- although we could copy the old host key from /old-ci-release/etc/ssh/).
  • Put this new server in inventory.xml - ansible: update ci-release in inventory #2627
  • ... ?
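
For the first item, the fstab entry on the new machine will presumably look something like this (a minimal sketch; the device name is a guess until the new disk is actually attached):

# hypothetical /etc/fstab entry for the new /var/lib/jenkins disk
/dev/xvdc1  /var/lib/jenkins  ext4  defaults,noatime  0  2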

@rvagg
Member

rvagg commented Apr 19, 2021

btw, if you want to log into the old machine, it's in rescue mode and will require a password, which you can get from https://cloud.ibm.com/gen1/infrastructure/virtual-server/16320983/details#passwords. It's some weird IBM Cloud OS, but I have all the disks mounted under /mnt/ (see df).

@richardlau
Member Author

+1 Thanks for your time @rvagg. I'll see how far I can get through the list and keep this updated.
FWIW I've been trying to set up IBM Cloud VPN but it's proving challenging, especially with the not-an-admin restrictions I have on my corporate laptops.

@richardlau richardlau pinned this issue Apr 19, 2021
@rvagg
Member

rvagg commented Apr 19, 2021

rsync of /jenkins-backup/ is done now btw

@richardlau
Member Author

Allocate a new 300G disk in Dallas 9,

I've requested that; it's showing as jenkins-release under portable storage, but I haven't figured out how to attach it to infra-ibm-ubuntu1804-x64-1.nodejs.cloud.

@richardlau
Member Author

richardlau commented Apr 19, 2021

I've run iptables-restore /old-ci-release/root/richard-20210311 to apply what I believe is the most recent backed-up iptables edit we've made on the release server (it contains, for example, release-macstadium-macos11.0-arm64-1, which is the Apple Silicon release machine). I'm double-checking the Joyent IP addresses as that was the other recent change that I remember.

Update: The new Joyent IPs are in the firewall rules. I got slightly confused as we still have some of the old IP addresses in there, so there's potential for some tidying up later.
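
For the record, that was roughly the following (a sketch; the iptables-persistent step is an assumption about how we'd persist the rules across reboots, not a record of exactly what's configured on the server):

$ iptables-restore < /old-ci-release/root/richard-20210311
$ apt-get install iptables-persistent   # assumption: persist the rules via netfilter-persistent
$ iptables-save > /etc/iptables/rules.v4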

@richardlau
Member Author

richardlau commented Apr 19, 2021

I think I have nginx set up on the new server, copying across the config from /old-ci-release/etc/nginx (and updating the apparently deprecated spdy to http2 in /etc/nginx/sites-available/jenkins-iojs). At least nginx.service started with no obvious errors, but without Jenkins started there's nothing to forward to 😄.
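
The spdy-to-http2 change amounts to editing the listen directives in the copied site config; roughly (a sketch -- the exact directive in the old config is an assumption):

$ sed -i 's/listen 443 ssl spdy;/listen 443 ssl http2;/' /etc/nginx/sites-available/jenkins-iojs
$ nginx -t && systemctl restart nginx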

@mhdawson
Member

mhdawson commented Apr 19, 2021

Been on with support for the last hour or so; it seems like something still needs to be done so that the disk will show up.

Also got the VPN going on one of my machines, but the KVM console does not seem to give me anything, so that is the next thing to ask them about.

@mhdawson
Member

From the cloud UI, looks like the storage is now attached. Next we need to check it is accessible in the machine itself.

@richardlau
Member Author

Looks like it might be /dev/xvdc based on the timestamps:

root@infra-ibm-ubuntu1804-x64-1:~# ls -al /dev/xvd*
brw-rw---- 1 root disk 202,  0 Apr 19 11:08 /dev/xvda
brw-rw---- 1 root disk 202,  1 Apr 19 11:08 /dev/xvda1
brw-rw---- 1 root disk 202,  2 Apr 19 11:08 /dev/xvda2
brw-rw---- 1 root disk 202, 16 Apr 19 11:08 /dev/xvdb
brw-rw---- 1 root disk 202, 17 Apr 19 11:08 /dev/xvdb1
brw-rw---- 1 root disk 202, 32 Apr 19 18:19 /dev/xvdc
root@infra-ibm-ubuntu1804-x64-1:~#

I'll try mounting that via fstab and see what we get.

@mhdawson
Member

I do have KVM console access now. Looks like the rescue OS; not sure if we want to reboot yet.

@richardlau
Member Author

Partitioned /dev/xvdc and formatted /dev/xvdc1 as ext4. Mounted it and am now copying /jenkins-backup across to /var/lib/jenkins.
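
Roughly the steps, for the record (a sketch; the parted/rsync invocations are my paraphrase rather than the exact commands used):

$ parted --script /dev/xvdc mklabel gpt mkpart primary ext4 0% 100%
$ mkfs.ext4 /dev/xvdc1
$ mount /dev/xvdc1 /var/lib/jenkins   # plus a matching /etc/fstab entry so it survives reboots
$ rsync -a /jenkins-backup/ /var/lib/jenkins/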

@richardlau
Member Author

Copy is complete. Jenkins started up (I had to restart it as I'd changed the owner/group of the files/directories under /var/lib/jenkins but forgot the /var/lib/jenkins directory itself). I don't think I can test access to it without switching the DNS entry in Cloudflare (if I go to the IP address of the new server (169.45.166.50) it times out saying ci-release.nodejs.org is taking too long to respond). Will look at that next.
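
The ownership fix was essentially (a sketch, assuming the jenkins user/group created by the package):

$ chown -R jenkins:jenkins /var/lib/jenkins   # including the mount point itself this time
$ systemctl restart jenkins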

@richardlau
Member Author

DNS has been switched. https://ci-release.nodejs.org/ loads (🎉). Executors are mostly offline -- will go around and see if restarting the agent on a few of them is enough to get them connected 🤞 .

@richardlau
Member Author

The executors/nodes reconnected themselves after a few minutes 😀.
Started a test build 🤞: https://ci-release.nodejs.org/job/iojs+release/6834/

@richardlau
Member Author

richardlau commented Apr 19, 2021

Test build is green. I think we'll be able to release Node.js 16 tomorrow 🤞 😌 (cc @BethGriggs ). Thanks @rvagg and @mhdawson for your help.

I did spot that we're missing a cross-compiler for armv7l, but that's related to moving to GCC 8 and not caused by the server problem tracked in this issue. I've added the cross-compiler-ubuntu1804-armv7-gcc-8 label to iojs+release (https://github.com/nodejs/jenkins-config-release/commit/63b1a778614c15f4e1be29e051fffc011c788b59) and started a build to test that: https://ci-release.nodejs.org/job/iojs+release/6835/

Also the AIX build in https://ci-release.nodejs.org/job/iojs+release/6834/nodes=aix72-ppc64/ seemed to take 49 minutes to scp the binary over to the download server, which seems slow (albeit it did complete successfully) but shouldn't have anything to do with the Jenkins server. I've restarted the agent on the AIX release machine in any case and will look at the nightlies in the morning to see if it's still an issue.

There's a remaining task on the list about checking the backup server can connect to the new server but I'm kind of beat for the day and am going to log out.

@richardlau
Member Author

Updated /etc/crontab to add the backups to https://github.com/nodejs/jenkins-config-release
(This is separate from the backup item listed in #2626 (comment))

@richardlau
Member Author

richardlau commented Apr 19, 2021

I think we also need to update https://grafana.nodejs.org/ for the new ci-release server, but I have no idea how to do that.

@rvagg
Member

rvagg commented Apr 20, 2021

Wow, fantastic work @richardlau! And in the process we got an upgrade off 16.04 plus an OpenJDK JVM for Jenkins, so a value-added recovery.

ARMv7 is green and it's using GCC 8:

07:46:32 + ccache /opt/raspberrypi/rpi-newer-crosstools/x64-gcc-8.3.0/arm-rpi-linux-gnueabihf/bin/arm-rpi-linux-gnueabihf-gcc -march=armv7-a --version
07:46:32 arm-rpi-linux-gnueabihf-gcc (crosstool-NG 1.24.0-rc3) 8.3.0

and compiling with:

ccache /opt/raspberrypi/rpi-newer-crosstools/x64-gcc-8.3.0/arm-rpi-linux-gnueabihf/bin/arm-rpi-linux-gnueabihf-g++ -march=armv7-a -o 

The grafana setup was @jbergstroem's work; I saw a custom APT source in there for it, but there's probably also some config that needs to be in place as well.

Re Node 16, I don't know if there's time for another RC but it might be worth @BethGriggs running through the motions today if possible to test it all out.

@AshCripps
Member

I think we also need to update https://grafana.nodejs.org/ for the new ci-release server, but I have no idea how to do that.

I assume it just needs the telegraf agent redeployed/reconfigured?

@richardlau
Member Author

I've put the backup ssh key in the authorized_keys on the new server and checked I can ssh into ci-release.nodejs.org from the backup server. I reset the host key for ci-release.nodejs.org on the backup server.
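
In command form that was roughly (a sketch; the key filename and target user are placeholders, not the actual paths):

# on the new server: append the backup server's public key (placeholder filename)
$ cat backup-server-key.pub >> ~/.ssh/authorized_keys
# on the backup server: drop the old host key; the new one is accepted on the next connection
$ ssh-keygen -R ci-release.nodejs.org
$ ssh ci-release.nodejs.org true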

richardlau added a commit that referenced this issue Apr 20, 2021
Replace `infra-softlayer-ubuntu14-x64-1` as `ci-release` with
`infra-ibm-ubuntu1804-x64-1`.

Refs: #2626
@richardlau
Member Author

Backup looks to have worked. I'll look at the telegraf agent for grafana, which I think is the only thing left.

@richardlau
Member Author

Added the telegraf agent to the server and copied the config over from the old ci-release disk:

$ curl -s https://repos.influxdata.com/influxdb.key | sudo apt-key add -
$ source /etc/lsb-release
$ echo "deb https://repos.influxdata.com/${DISTRIB_ID,,} ${DISTRIB_CODENAME} stable" | sudo tee /etc/apt/sources.list.d/influxdb.list
$ apt-get update
$ apt-get install telegraf
$ cp -R /old-ci-release/etc/telegraf/* /etc/telegraf/
$ systemctl restart telegraf.service

(The steps to install the telegraf agent were taken from the history on the CI server and follow https://docs.influxdata.com/telegraf/v1.18/introduction/installation/.)

Grafana now shows stats for ci-release (🎉):

I believe everything's now done. I'll follow up separately with some more docs based on what we had to do to get all of this back up and running.

@rvagg rvagg reopened this Apr 22, 2021
@rvagg
Member

rvagg commented Apr 22, 2021

Let's leave this open until we've cleaned up old resources - we have the server plus the 300G disk that we need to spin down. Do we have confidence to do this now or should we wait?

@mhdawson
Member

Maybe we should wait until we've done a release for all of the release lines, in case there is some release-specific dependency?

@richardlau
Member Author

I don't believe there is any release-specific dependency for the release CI server outside of the Jenkins job configuration, as the individual release machines and the staging server were unaffected.

I'm reasonably confident that we can clear up the old resources but I can run test builds if desired (I don't think it's necessary).

FWIW in terms of storage I noticed we have an unattached 1000GB at Dallas 5 -- anyone have any idea what that is? I'm assuming if it's unattached it can be deleted?
jenkins is the old 300GB disk and the unattached jenkins-release-new was @mhdawson's attempt to create portable storage when we initially struggled to attach the new storage (jenkins-release) to the replacement server.

I've also started a "disaster recovery plan" document over in #2634 with pointers to where we have backups/alternative places to recover configuration.

@richardlau
Member Author

FWIW in terms of storage I noticed we have an unattached 1000GB at Dallas 5 -- anyone have any idea what that is? I'm assuming if it's unattached it can be deleted?

We appear to only have one server at Dallas 5:

but that isn't the current test-softlayer-centos6-x64-2, which is at Washington 7 (IP matches the one in the inventory):

It appears the one at Dallas 5 was replaced at some point in the past (#2480 / #1074), so we can probably get rid of the Dallas 5 server and the unattached storage?

FWIW I also don't think it's a great idea to have both of the test centos6 x64 hosts in the same datacenter with the same cloud provider (i.e. an outage could potentially take out all of our centos6 x64 test hosts), but maybe we can retire CentOS 6 entirely when Node.js 10 goes End-of-Life at the end of the month (only three days left!).

@rvagg
Member

rvagg commented Apr 28, 2021

I'm pretty confident the unattached 1000GB disk can go; nothing lost there. Mostly these cases of unattached disks are a result of a failure to clean up after some kind of migration (to a new DC or to resize the disk). We don't use unattached disks anywhere in our storage strategy, so if nobody claims it for very recent use (like the unattached 300GB old jenkins disk) then it can go. rm all the things.

@sxa
Member

sxa commented Jul 7, 2021

@richardlau Can this be closed now? It's still sitting as a pinned issue so it shows up and gives me a minor panic each time I go to the issue list in this repository :-)

@richardlau richardlau unpinned this issue Jul 7, 2021
@richardlau
Member Author

We still need to clear up the old resources, but this definitely doesn't need to remain pinned.

@github-actions

github-actions bot commented May 4, 2022

This issue is stale because it has been open many days with no activity. It will be closed soon unless the stale label is removed or a comment is made.
