Automated (?) process periodically wipes skyhook home directory, requiring rebuild (was "Key / ssh error in pipeline...") #350

kltm · 2024-01-02T04:54:27Z

As of Jan 1st, summary emails are no longer being sent; the pipeline fails with an error like:

18:29:52  + sshfs -o StrictHostKeyChecking=no -o IdentitiesOnly=true -o IdentityFile=**** -o idmap=user skyhook@skyhook.berkeleybop.org:/home/skyhook /var/lib/jenkins/workspace/ssue-go-site-1530-summary-emails/mnt/
18:29:52  read: Connection reset by peer

Given the timing, my gut guess is that the key "expired" or something, as this has run like clockwork until now. That said, before digging in, I don't think we need to mount, right? What is that section?

kltm · 2024-01-02T05:27:15Z

Some kind of "key problem"; now failing with:

21:25:19  + scp -o StrictHostKeyChecking=no -o IdentitiesOnly=true -o IdentityFile=**** ont-title.txt skyhook@skyhook.berkeleybop.org:/home/skyhook/issue-go-site-1530-summary-emails/reports/
21:25:19  /var/lib/jenkins/.ssh/config line 3: Unsupported option "rsaauthentication"
21:25:19  Permission denied, please try again.
21:25:19  Permission denied, please try again.
21:25:19  skyhook@skyhook.berkeleybop.org: Permission denied (publickey,password).
21:25:19  lost connection

...or maybe there was a quiet ssh update?

kltm · 2024-01-02T06:34:54Z

Technically, emails can be sent again (by removing anything that was having trouble); that said, I'm keeping this open until I can track down what changed and revert the reporting saves.

kltm · 2024-01-02T22:52:03Z

Okay, affecting all pipelines.

kltm · 2024-01-03T22:20:17Z

Okay, I've tracked the issue and it is not what I was expecting. Basically, some process has /wiped/ skyhook's home directory. This is either a manual error or one of the pipelines is set up incorrectly and is taking a swing at everything.

I think we reported this somewhere before, but I can't find the ticket. I think at the time I assumed a "manual" error; this time, given the timing, I'm fairly sure it's an issue in a Jenkinsfile.

Okay, my notes have it at 6 months ago on June 1st. That is sus. I'm going to rebuild skyhook and then start tracking files by their crontab.

Rebuilding skyhook.

kltm · 2024-01-04T01:20:54Z

I now have SOP notes for recovering the skyhook user/directory. For various TMI reasons, I'm going to keep those private for the moment. The machine has all recovery mechanisms chugging along; hopefully no more manual steps needed while resetting.
Next: find the cause.

kltm · 2024-01-04T01:34:52Z

Nothing found in crontabs.
Pipelines that have run or tried to run recently:
go-ontology-dev
issue-35-neo-test
full-issue-325-gopreprocess
goa-copy-to-mirror
snapshot
issue-go-site-1530-summary-emails
release
...that's irritating as these run regularly with no issue.

kltm · 2024-01-04T02:11:51Z

Timing-wise, that leaves some questions.
Looking at go-ontology-dev, it was successful at (Dec 31, 2023, 4:00 PM) and failed with the "wiped" errors at (Jan 1, 2024, 12:00 AM); technically speaking, 00:01:06.
Just before that, we have an insta-fail on release with

ERROR: Failed to clean the workspace
jenkins.util.io.CompositeIOException: Unable to delete '/var/lib/jenkins/workspace/neontology_pipeline_release-L3OLSRDNGI3ZIUODKFYUI4AO45X5C6RUGMOQAC5WV2Q6ZQOIFHMA'. Tried 3 times (of a maximum of 3) waiting 0.1 sec between attempts.

Note that this is before any stage. It's failing on the checkout attempt. Hm.

kltm · 2024-01-04T02:16:25Z

Okay, I have a theory.

Looking at the function

// Reset and initialize skyhook base.
void initialize() {
    // Get a mount point ready
[..]
    sh 'rm -r -f $WORKSPACE/mnt/$BRANCH_NAME || true'
[...]

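What would happen if, somehow, $BRANCH_NAME were not defined? The path would collapse to $WORKSPACE/mnt/, which is the mount point the sshfs step uses for skyhook's home directory, so the delete would have the effect of scouring skyhook. That should not be possible...but it is the only place where an "unprotected" delete occurs like that.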

My theory is that the pipeline still managed to "run" enough to fail (unknown mechanism): it had not run far enough to define $BRANCH_NAME (let's posit that magic), but far enough to have code in place so that an alternate thread (more magic) managed to get to initialize(). If that happened, skyhook would get toasted.
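
To make the theory concrete, here is a minimal shell sketch of the failure mode and one possible guard. It assumes (unconfirmed in this thread) that $WORKSPACE/mnt is the live sshfs mount of /home/skyhook when initialize() runs. The function names `cleanup_target` and `guarded_cleanup_target` are hypothetical, and the sketch only prints paths and never deletes anything. It is not necessarily what the commits for this issue do.

```shell
#!/bin/sh
# Hypothetical helper: print the path that the cleanup line in initialize()
# would hand to `rm -r -f`. An unset or empty variable expands to "".
cleanup_target() {
    printf '%s\n' "${WORKSPACE}/mnt/${BRANCH_NAME}"
}

# Hypothetical guarded variant. ${VAR:?message} makes the shell stop with an
# error when VAR is unset or empty; the function body is a subshell, so only
# that subshell exits and the caller sees a non-zero status.
guarded_cleanup_target() (
    : "${WORKSPACE:?WORKSPACE is not set}"
    : "${BRANCH_NAME:?BRANCH_NAME is not set}"
    printf '%s\n' "${WORKSPACE}/mnt/${BRANCH_NAME}"
)

# Demo with an illustrative workspace path and an empty branch name.
WORKSPACE=/tmp/demo-workspace
BRANCH_NAME=""
cleanup_target    # prints /tmp/demo-workspace/mnt/ (the whole mount point)
guarded_cleanup_target || echo "refused: BRANCH_NAME is empty"
```

With an empty $BRANCH_NAME the unguarded path is the mount point itself, which is exactly the "scour everything" case; the guarded variant refuses before any path is built. In a Jenkinsfile `sh` step the same `${BRANCH_NAME:?}` form could sit directly inside the `rm` argument, again as an assumption rather than the confirmed fix.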

kltm · 2024-01-05T01:23:06Z

Testing on master now.

kltm · 2024-01-05T19:53:05Z

Passed. Now propagating.

kltm added the bug (B: affects usability) label Jan 2, 2024

kltm referenced this issue Jan 2, 2024

try and remove 'problematic' remote mount code; for geneontology/go-site#2215

7e2b71a

kltm referenced this issue Jan 2, 2024

okay, want to wait less between tries; for geneontology/go-site#2215

3292fcc

kltm referenced this issue Jan 2, 2024

no longer care for any mounts--I just want the email; for geneontology/go-site#2215

5791295

kltm added bug (A: showstopper) and removed bug (B: affects usability) labels Jan 2, 2024

kltm transferred this issue from geneontology/go-site Jan 2, 2024

kltm changed the title ~~Summary emails are no longer sent ("pipeline" error)~~ Key / ssh error in pipeline ( was "Summary emails are no longer sent ("pipeline" error)") Jan 2, 2024

sierra-moxon mentioned this issue Jan 3, 2024

Update main pipeline output to produce usable GPAD/GPI 2.0 geneontology/go-site#2043

Closed

kltm changed the title ~~Key / ssh error in pipeline ( was "Summary emails are no longer sent ("pipeline" error)")~~ (was "Key / ssh error in pipeline...") Jan 3, 2024

kltm changed the title ~~(was "Key / ssh error in pipeline...")~~ Automated (?) process periodically wipes skyhook home directory, requiring rebuild (was "Key / ssh error in pipeline...") Jan 3, 2024

kltm mentioned this issue Jan 4, 2024

WormBase upstream having issues; use bypass and revert when ready #338

Open

kltm added a commit that referenced this issue Jan 5, 2024

attempt initialize() protection for #350

8fce541

kltm added a commit that referenced this issue Jan 5, 2024

update for #350

a217e58

kltm added a commit that referenced this issue Jan 5, 2024

update for #350

bcd135c

kltm closed this as completed Jan 5, 2024

kltm added this to Software essential and proactive maintenance Aug 22, 2024

kltm moved this to Done in Software essential and proactive maintenance Aug 22, 2024
