Read-only file system on release CI #2626
cc @nodejs/build-infra
on it
hmm, this might be more complicated than just a dodgy filesystem, having to resort to some rescue operations 🤞
@richardlau do you happen to have the IBM Cloud VPN set up? I think we need to look at the console of the machine; it's in a half-booted state and not accepting SSH connections. I can use the IBM Cloud rescue mode and everything looks fine from a superficial perspective, and I don't see any problems in the logs, so I think watching a boot might be the next step if we can. The machine is https://cloud.ibm.com/gen1/infrastructure/virtual-server/16320983/details; in the Actions menu there's a "KVM Console" but it needs the VPN.

Alternatively we just take this chance to set up an entirely new 20.04 machine and transfer what settings we can from a rescue boot of this one. I just haven't figured out how to access the additional disk we use for /var/lib/jenkins in rescue mode, but maybe we just figure out how to transfer it to a new machine. I probably can't do this today but may be able to allocate a bit of time tomorrow to have a go (it doesn't have to be me if someone else with infra access is brave enough).
I don't currently have IBM Cloud VPN set up. Should probably do so anyway so I'll look into doing that.
This is one of those things I'm scared to touch 😅.
They don't offer 20.04, but I'll start an 18.04 from scratch and try to transfer the boot disk data from rescue mode to start with. I'll let you know here when I stop for the day (soon).
FWIW I think this was the cause of the read only filesystem:
Sounds like it! I've just made an 18.04 with basic setup, Jenkins & Java (I haven't bothered looking to see if we have any ansible scripts for this; it'll be way out of date & out of sync anyway!). I have the old machine in rescue mode and I've just done an rsync of / and /boot onto the new machine in /old-ci-release/. This is my inventory for the new machine

I've also managed to get access to the /var/lib/jenkins disk but it's a complete mess. The superblock was borked and an fsck has moved everything to lost+found; we've lost most of the directory structure but retained most of the files. It's going to be impossible to rebuild this I think, but we have access to important pieces if we can find them (with lots of

It looks like our backup is active though (thanks to whoever fixed that up when it overflowed last time!). /backup/periodic/daily.0/ci-release.nodejs.org has a most-recent file dated April 17th. I'm currently rsyncing that to the new server as /jenkins-backup/. Hopefully it contains enough of the key pieces to get this all back online properly.

I have to head off for the day and have a busy day ahead of me tomorrow, but will try to find some time to hop back in here and continue. Here are the next steps I think we'll need to take (feel free to try and tackle any/all of them without me!):
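For reference, a rescue-mode copy like the one described above is usually an rsync along these lines. This is a hedged sketch, not the command actually run: the hostname `old-ci-release` and the exact option set are assumptions, and the commands are echoed rather than executed so the sketch is safe to run anywhere.

```shell
#!/bin/sh
# Hypothetical sketch of copying / and /boot from the rescue-booted old
# host into /old-ci-release/ on the new machine. SRC_HOST is a placeholder.
SRC_HOST="root@old-ci-release"
DEST="/old-ci-release"
# -a  archive mode (permissions, ownership, symlinks, times)
# -H  preserve hard links
# -x  stay on one filesystem so /proc, /sys etc. aren't dragged along
# --numeric-ids  avoid uid/gid remapping between the two installs
RSYNC_OPTS="-aHx --numeric-ids"
# Echo rather than run, so this sketch has no side effects.
echo "rsync $RSYNC_OPTS ${SRC_HOST}:/ ${DEST}/"
echo "rsync $RSYNC_OPTS ${SRC_HOST}:/boot/ ${DEST}/boot/"
```

`--numeric-ids` matters here because the old and new installs were built separately, so usernames may map to different numeric uids.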
btw, if you want to log into the old machine, it's in rescue mode and will require a password, which you can get from https://cloud.ibm.com/gen1/infrastructure/virtual-server/16320983/details#passwords. It's some weird IBM Cloud OS, but I have all the disks mounted under /mnt/ (see
+1 Thanks for your time @rvagg. I'll see how far I can get through the list and keep this updated.
rsync of /jenkins-backup/ is done now btw
I've requested that, showing as
I've run

Update: The new Joyent IPs are in the firewall rules. I got slightly confused as we still have some of the old IP addresses in there, so there's potential for some tidying up later.
I think I have nginx set up on the new server, copying across config from
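For context, nginx in front of Jenkins is typically a TLS-terminating reverse proxy. A minimal hedged sketch follows; the certificate paths and the Jenkins port are assumptions (Jenkins defaults to 8080), not the actual config copied from the old disk:

```nginx
# Hypothetical sketch: TLS-terminating reverse proxy in front of Jenkins.
server {
    listen 443 ssl;
    server_name ci-release.nodejs.org;

    ssl_certificate     /etc/ssl/ci-release.crt;   # placeholder paths
    ssl_certificate_key /etc/ssl/ci-release.key;

    location / {
        proxy_pass http://127.0.0.1:8080;          # Jenkins default port
        proxy_set_header Host              $host;
        proxy_set_header X-Real-IP         $remote_addr;
        proxy_set_header X-Forwarded-For   $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
    }
}
```

The `X-Forwarded-*` headers let Jenkins generate correct external URLs even though it only sees the local proxy connection.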
Been on with support for the last hour or so; it seems like something still needs to be done so that the disk will show up. Also got the VPN going on one of my machines, but the KVM console does not seem to give me anything, so that is the next thing to ask them about.
From the cloud UI, it looks like the storage is now attached. Next we need to check it is accessible in the machine itself.
Looks like it might be
I'll try mounting that via
I do have KVM console access now. Looks like the rescue OS; not sure if we want to reboot yet.
Partitioned
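The partition/format/mount sequence for a freshly attached data disk generally looks like the following. This is a hedged sketch: the device name `/dev/xvdc` is an assumption (check `lsblk` for the real one), and the commands are echoed rather than executed because they are destructive.

```shell
#!/bin/sh
# Hypothetical sketch of preparing a newly attached disk to hold the
# Jenkins home. DEV is a placeholder; confirm with lsblk before running
# anything like this for real.
DEV="/dev/xvdc"
MNT="/var/lib/jenkins"
CMDS="parted -s $DEV mklabel gpt mkpart primary ext4 0% 100%
mkfs.ext4 ${DEV}1
mkdir -p $MNT
mount ${DEV}1 $MNT"
# Print the sequence instead of executing it.
echo "$CMDS"
```

On the real machine you'd also add an `/etc/fstab` entry so the mount survives a reboot.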
Copy is complete. Jenkins started up (I had to restart as I'd changed owner/group for the files/directories under
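The ownership fix mentioned above is the usual step after copying Jenkins data around as root: everything under the Jenkins home must belong to the jenkins user again or Jenkins can't read/write its own files. A safe-to-run sketch, using a scratch directory and the current user instead of the real `/var/lib/jenkins` and `jenkins:jenkins` (those names are the assumption here):

```shell
#!/bin/sh
# Hypothetical demo of the ownership fix. On the real server this would be:
#   chown -R jenkins:jenkins /var/lib/jenkins
# Here we use a temp directory and the current user so it runs anywhere.
TARGET=$(mktemp -d)
mkdir -p "$TARGET/jobs/example-job"
touch "$TARGET/jobs/example-job/config.xml"
chown -R "$(id -un):$(id -gn)" "$TARGET"
# Nothing should remain owned by another user (prints nothing on success):
find "$TARGET" ! -user "$(id -un)"
```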
DNS has been switched. https://ci-release.nodejs.org/ loads (🎉). Executors are mostly offline; I'll go around and see if restarting the agent on a few of them is enough to get them connected 🤞.
The executors/nodes reconnected themselves after a few minutes 😀.
Test build is green. I think we'll be able to release Node.js 16 tomorrow 🤞 😌 (cc @BethGriggs). Thanks @rvagg and @mhdawson for your help. I did spot that we're missing a cross compiler for armv7l, but that's related to moving to GCC 8 and not because of the server issue in this issue. I've added the

Also the AIX build in https://ci-release.nodejs.org/job/iojs+release/6834/nodes=aix72-ppc64/ seemed to take 49 minutes to scp the binary over to the download server, which seems slow (albeit it did complete successfully), but shouldn't have anything to do with the Jenkins server. I've restarted the agent on the AIX release machine in any case and will look at the nightlies in the morning to see if it's still an issue.

There's a remaining task on the list about checking the backup server can connect to the new server, but I'm kind of beat for the day and am going to log out.
Updated
I think we also need to update https://grafana.nodejs.org/ for the new ci-release server, but I have no idea how to do that. |
Wow, fantastic work @richardlau! And in this we got an upgrade off 16.04 plus an OpenJDK JVM for Jenkins. So a value-added recovery. ARMv7 is green and it's using GCC 8:
and compiling with:
The grafana setup was done by @jbergstroem; I saw a custom APT source in there for that, but there's probably also some config that needs to be in place as well. Re Node 16, I don't know if there's time for another RC, but it might be worth @BethGriggs running through the motions today if possible to test it all out.
Another 16 rc was started in nodejs/node#37678 (comment): https://ci-release.nodejs.org/job/iojs+release/6836/ |
I assume it just needs the telegraf agent redeployed/reconfigured?
I've put the backup ssh key in the
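For context, a backup key in an `authorized_keys` file is commonly restricted so it can only run the backup transfer and nothing else. A hedged sketch of what such an entry can look like; the forced command, key material, and comment are all placeholders, not what's actually deployed on ci-release:

```
# Hypothetical restricted authorized_keys entry for the backup server's key.
# The forced rsync command, key, and comment below are placeholders.
command="rsync --server --sender -logDtpre. . /var/lib/jenkins/",no-pty,no-agent-forwarding,no-port-forwarding,no-X11-forwarding ssh-ed25519 AAAA... backup@backup-host
```

The `command="..."` option makes sshd ignore whatever the client asks to run and execute the forced command instead, so a compromised backup host can only pull the backup path.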
Replace `infra-softlayer-ubuntu14-x64-1` as `ci-release` with `infra-ibm-ubuntu1804-x64-1`. Refs: #2626
Backup looks to have worked. I'll look at the telegraf agent for grafana, which I think is the only thing remaining.
Added the telegraf agent to the server and copied the config over from the old ci-release disk:
(the steps to install the telegraf agent taken from

Grafana now shows stats for ci-release (🎉):

I believe everything's now done. I'll follow up separately with some more docs based on what we had to do to get all of this back up and running.
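For reference, the telegraf side of a setup like this typically boils down to an agent section plus an output pointing at the metrics backend. A minimal hedged sketch of a `/etc/telegraf/telegraf.conf`; the endpoint URL and database name are placeholders, not the project's actual values:

```toml
# Hypothetical sketch of a minimal telegraf config for a Jenkins host.
[agent]
  interval = "10s"
  hostname = "ci-release.nodejs.org"

[[outputs.influxdb]]
  urls = ["https://metrics.example.org:8086"]  # placeholder endpoint
  database = "telegraf"                        # placeholder database

# Basic host metrics that a CI-server dashboard usually graphs.
[[inputs.cpu]]
[[inputs.mem]]
[[inputs.disk]]
```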
Let's leave this open until we've cleaned up old resources - we have the server plus the 300G disk that we need to spin down. Do we have confidence to do this now or should we wait?
Maybe we should wait until we've done a release for all of the release lines, in case there is some release-specific dependency?
I don't believe there is any release-specific dependency for the release CI server outside of the Jenkins job configuration, as the individual

I'm reasonably confident that we can clear up the old resources, but I can run test builds if desired (I don't think it's necessary).

FWIW in terms of storage, I noticed we have an unattached 1000GB disk at Dallas 5 -- anyone have any idea what that is? I'm assuming if it's unattached it can be deleted?

I've also started a "disaster recovery plan" document over in #2634 with pointers to where we have backups/alternative places to recover configuration.
We appear to only have one server at Dallas 5: but that isn't the current

It appears the one at Dallas 5 was replaced at some point in the past (#2480 / #1074), so we can probably get rid of the Dallas 5 server and the unattached storage?

FWIW I also don't think it's a great idea to have both of the test centos6 x64 hosts at the same datacenter with the same cloud provider (i.e. an outage could potentially take out all of our centos6 x64 test hosts), but maybe we can retire CentOS 6 entirely when Node.js 10 goes End-of-Life at the end of the month (only three days left!).
I'm pretty confident the unattached 1000GB disk can go; nothing lost there. Mostly these cases of unattached disks are a result of a failure to clean up after some kind of migration (to a new DC or to resize the disk). We don't use unattached disks anywhere in our storage strategy, so if nobody claims it for very recent use (like the unattached 300GB old Jenkins disk) then it can go.
@richardlau Can this be closed now? It's still sitting as a pinned issue so shows up and gives me a minor panic each time I go to the issue list in this repository :-)
We still need to clear up the old resources, but this definitely doesn't need to remain pinned.
This issue is stale because it has been open many days with no activity. It will be closed soon unless the stale label is removed or a comment is made. |
After landing nodejs/nodejs-dist-indexer#15, I checked https://nodejs.org/download/nightly/ to see if I'd missed the nightly, and noticed that we haven't had a nightly build since 17 April, but I know the master branch of the core repo has been updated since then.
From https://ci-release.nodejs.org/log/all