Automated (?) process periodically wipes skyhook home directory, requiring rebuild (was "Key / ssh error in pipeline...") #350

Closed
kltm opened this issue Jan 2, 2024 · 10 comments

Comments

@kltm
Member

kltm commented Jan 2, 2024

As of Jan 1st, summary emails are no longer sent on an error like:

18:29:52  + sshfs -o StrictHostKeyChecking=no -o IdentitiesOnly=true -o IdentityFile=**** -o idmap=user skyhook@skyhook.berkeleybop.org:/home/skyhook /var/lib/jenkins/workspace/ssue-go-site-1530-summary-emails/mnt/
18:29:52  read: Connection reset by peer

Given the timing, my gut guess is that the key "expired" or something, as this has run like clockwork until now. That said, before digging in, I don't think we need to mount, right? What is that section?

@kltm
Member Author

kltm commented Jan 2, 2024

Some kind of "key problem"; now failing with:

21:25:19  + scp -o StrictHostKeyChecking=no -o IdentitiesOnly=true -o IdentityFile=**** ont-title.txt skyhook@skyhook.berkeleybop.org:/home/skyhook/issue-go-site-1530-summary-emails/reports/
21:25:19  /var/lib/jenkins/.ssh/config line 3: Unsupported option "rsaauthentication"
21:25:19  Permission denied, please try again.
21:25:19  Permission denied, please try again.
21:25:19  skyhook@skyhook.berkeleybop.org: Permission denied (publickey,password).
21:25:19  lost connection

...or maybe there was a quiet ssh update?

@kltm
Member Author

kltm commented Jan 2, 2024

Technically, emails can be sent again (by removing anything that was having trouble); that said, I'm keeping this open until I can track down what changed and revert the reporting saves.

@kltm
Member Author

kltm commented Jan 2, 2024

Okay, affecting all pipelines.

@kltm kltm transferred this issue from geneontology/go-site Jan 2, 2024
@kltm kltm changed the title Summary emails are no longer sent ("pipeline" error) Key / ssh error in pipeline ( was "Summary emails are no longer sent ("pipeline" error)") Jan 2, 2024
@kltm kltm changed the title Key / ssh error in pipeline ( was "Summary emails are no longer sent ("pipeline" error)") (was "Key / ssh error in pipeline...") Jan 3, 2024
@kltm kltm changed the title (was "Key / ssh error in pipeline...") Automated (?) process periodically wipes skyhook home directory, requiring rebuild (was "Key / ssh error in pipeline...") Jan 3, 2024
@kltm
Member Author

kltm commented Jan 3, 2024

Okay, I've tracked the issue and it is not what I was expecting. Basically, some process has /wiped/ skyhook's home directory. This is either a manual error or one of the pipelines is set up incorrectly and is taking a swing at everything.

I think we reported this somewhere before, but I can't find the ticket. I think at the time I assumed a "manual" error; this time, given the timing, I'm fairly sure it's an issue in a Jenkinsfile.

Okay, my notes have it at 6 months ago on June 1st. That is sus. I'm going to rebuild skyhook and then start tracking files by their crontab.

Rebuilding skyhook.

@kltm
Member Author

kltm commented Jan 4, 2024

I now have SOP notes for recovering the skyhook user/directory. For various TMI reasons, I'm going to keep those private for the moment. The machine has all recovery mechanisms chugging along; hopefully no more manual steps needed while resetting.
Next: find the cause.

@kltm
Member Author

kltm commented Jan 4, 2024

Nothing found in crontabs.
Pipelines that have run or tried to run recently:
go-ontology-dev
issue-35-neo-test
full-issue-325-gopreprocess
goa-copy-to-mirror
snapshot
issue-go-site-1530-summary-emails
release
...that's irritating as these run regularly with no issue.

@kltm
Member Author

kltm commented Jan 4, 2024

Timing-wise, that leaves some questions.
Looking at go-ontology-dev, it succeeded at Dec 31, 2023, 4:00 PM and failed with the "wiped" errors at Jan 1, 2024, 12:00 AM (technically speaking, 00:01:06).
Just before that, we have an insta-fail on release with

ERROR: Failed to clean the workspace
jenkins.util.io.CompositeIOException: Unable to delete '/var/lib/jenkins/workspace/neontology_pipeline_release-L3OLSRDNGI3ZIUODKFYUI4AO45X5C6RUGMOQAC5WV2Q6ZQOIFHMA'. Tried 3 times (of a maximum of 3) waiting 0.1 sec between attempts.

Note that this is before any stage. It's failing on the checkout attempt. Hm.

@kltm
Member Author

kltm commented Jan 4, 2024

Okay, I have a theory.

Looking at the function

// Reset and initialize skyhook base.
void initialize() {
    // Get a mount point ready
[..]
    sh 'rm -r -f $WORKSPACE/mnt/$BRANCH_NAME || true'
[...]

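What would happen if, somehow, $BRANCH_NAME was not defined? With an empty $BRANCH_NAME, the command collapses to rm -r -f $WORKSPACE/mnt/, and mnt/ is where the pipelines mount skyhook's home directory via sshfs (see the first log in this thread). If that mount were live, this would have the effect of scouring skyhook. That should not be possible...but it is the only place where an "unprotected" delete occurs like that.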
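To make the failure mode concrete, here is a dry-run sketch of how the delete target expands with and without $BRANCH_NAME. Nothing is deleted; the temporary directory is a hypothetical stand-in for the Jenkins workspace, and "snapshot" is used only because it is one of the pipeline names listed above. Because the Groovy string in the Jenkinsfile is single-quoted, the expansion happens in the shell, which is what this mimics.

```shell
#!/bin/sh
# Dry run only: print the command that WOULD run; nothing is removed.
WORKSPACE=$(mktemp -d)   # hypothetical stand-in for the Jenkins workspace

# Case 1: BRANCH_NAME is missing, so the path ends at the mount point itself.
unset BRANCH_NAME
target_unset="$WORKSPACE/mnt/$BRANCH_NAME"

# Case 2: BRANCH_NAME is defined, so only that branch's subdirectory is named.
BRANCH_NAME=snapshot
target_set="$WORKSPACE/mnt/$BRANCH_NAME"

echo "unset: rm -r -f $target_unset"
echo "set:   rm -r -f $target_set"

rmdir "$WORKSPACE"       # clean up the empty stand-in directory
```

In the unset case the target is the whole mnt/ directory rather than one branch's subdirectory.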

My theory is that the pipeline still managed to "run" enough to fail (unknown mechanism): not enough to define $BRANCH_NAME (let's posit that magic), but just enough to have code in place so that an alternate thread (more magic) managed to get to initialize(). If that happened, skyhook would get toasted.
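One way to "protect" a delete like this, whatever the actual mechanism turns out to be, is to make the shell refuse to expand an unset or empty variable. The sketch below is not the fix that was committed for this issue (the thread does not show it); safe_clean_mount is a hypothetical helper name, and the temporary directory again stands in for the workspace.

```shell
#!/bin/sh
# Hypothetical helper, not taken from the actual Jenkinsfile.
safe_clean_mount() {
    # ${VAR:?msg} aborts the expansion when VAR is unset or empty.
    # The subshell contains that abort so the caller gets a non-zero status
    # instead of the whole script exiting.
    (
        target="${WORKSPACE:?WORKSPACE is not set}/mnt/${BRANCH_NAME:?BRANCH_NAME is not set}"
        rm -r -f -- "$target"
    )
}

# Demo in a throwaway directory (stand-in for the Jenkins workspace).
WORKSPACE=$(mktemp -d)
mkdir -p "$WORKSPACE/mnt/snapshot" "$WORKSPACE/mnt/release"

# With BRANCH_NAME missing, the guard refuses and nothing is deleted.
unset BRANCH_NAME
if safe_clean_mount 2>/dev/null; then guard_result=deleted; else guard_result=refused; fi

# With BRANCH_NAME set, only that branch's directory is removed.
BRANCH_NAME=snapshot
safe_clean_mount

release_kept=no;   [ -d "$WORKSPACE/mnt/release" ] && release_kept=yes
snapshot_gone=no;  [ ! -e "$WORKSPACE/mnt/snapshot" ] && snapshot_gone=yes
echo "guard=$guard_result release_kept=$release_kept snapshot_gone=$snapshot_gone"

rm -r -f -- "$WORKSPACE"   # clean up the demo directory
```

Note that the original command's trailing `|| true` would hide such a refusal, so a guarded version probably should not swallow the exit status.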

@kltm
Member Author

kltm commented Jan 5, 2024

Testing on master now.

@kltm
Member Author

kltm commented Jan 5, 2024

Passed. Now propagating.

kltm added a commit that referenced this issue Jan 5, 2024
kltm added a commit that referenced this issue Jan 5, 2024
@kltm kltm closed this as completed Jan 5, 2024