Automated (?) process periodically wipes skyhook home directory, requiring rebuild (was "Key / ssh error in pipeline...") #350
Comments
Some kind of "key problem"; now failing with:
...or maybe there was a quiet ssh update?
Technically, emails can be sent again (by removing anything that was having trouble); that said, I'm keeping this open until I can track down what changed and revert the reporting saves.
Okay, affecting all pipelines.
Okay, I've tracked the issue and it is not what I was expecting. Basically, some process has /wiped/ skyhook's home directory. This is either a manual error or one of the pipelines is set up incorrectly and is taking a swing at everything. I think we reported this somewhere before, but I can't find the ticket. I think at the time I assumed a "manual" error; this time, given the timing, I'm fairly sure it's an issue in a Jenkinsfile. Okay, my notes put the previous occurrence six months ago, on June 1st. That is sus. I'm going to rebuild skyhook and then start tracking files by their crontab. Rebuilding skyhook.
I now have SOP notes for recovering the skyhook user/directory. For various TMI reasons, I'm going to keep those private for the moment. The machine has all recovery mechanisms chugging along; hopefully no more manual steps needed while resetting.
Nothing found in crontabs.
Timing-wise, that leaves some questions.
Note that this is before any stage. It's failing on the checkout attempt. Hm.
Okay, I have a theory. Looking at the function
What would happen if, somehow, $BRANCH_NAME was not defined? Somehow. This would have the effect of scouring skyhook. That should not be possible...but it is the only place where an "unprotected" delete like that occurs. My theory is that the pipeline still managed to "run" enough to fail (unknown mechanism): not enough to define $BRANCH_NAME (let's posit that magic), but just enough to have code in place that an alternate thread (magic) managed to get to
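To make the theory concrete, here is a minimal sketch of why an undefined branch name turns a per-branch delete into a delete of the whole parent directory, and the kind of guard that would stop it. It is written in Python for illustration, not in the pipeline's own Groovy/shell; `naive_target`, `branch_workspace`, `guarded_delete`, and the directory layout are all hypothetical and not taken from the actual Jenkinsfile.

```python
import shutil
from pathlib import Path
from typing import Optional


def naive_target(base: str, branch_name: str) -> str:
    """Unprotected path construction: an empty branch name yields base itself."""
    return f"{base}/{branch_name}"


def branch_workspace(base: Path, branch_name: Optional[str]) -> Path:
    """Return the per-branch directory under base, refusing unsafe names."""
    if branch_name is None or not branch_name.strip():
        raise ValueError("branch name is empty or undefined; refusing to delete")
    base_resolved = base.resolve()
    target = (base / branch_name).resolve()
    # The target must sit strictly inside base: never base itself,
    # a parent of base, or an unrelated absolute path.
    if target == base_resolved or base_resolved not in target.parents:
        raise ValueError(f"{target} is not strictly inside {base_resolved}")
    return target


def guarded_delete(base: Path, branch_name: Optional[str]) -> None:
    """Delete only the per-branch directory, and only if the name is sane."""
    target = branch_workspace(base, branch_name)
    if target.exists():
        shutil.rmtree(target)
```

With an empty name, `naive_target` hands back the base directory plus a trailing slash, which is exactly the "scouring" case; `guarded_delete` raises instead of deleting anything.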
Testing on
Passed. Now propagating.
As of Jan 1st, summary emails are no longer sent on an error like:
Given the timing, my gut guess is that the key "expired" or something, as this has run like clockwork until now. That said, before digging in, I don't think we need to mount, right? What is that section?