
feat(ssh): enable OS Login for GCP test instances #5602

Merged: 21 commits from use-fixed-ssh-key into main on Nov 16, 2022
Conversation

@gustavovalverde (Member) commented Nov 9, 2022

Motivation

We've been dealing with several different issues related to SSH connections. So far, one of the solutions has been a fixed SSH key for connecting from GitHub Actions, but that reduces flexibility and maintainability, and goes against our security decision of not keeping long-lived artifacts (JSON tokens, SSH keys) without expiration, which could be used at any point in time.

This PR implements OS Login as a workaround (and a possible root-cause fix).

Note: Not all the changes needed to make this work are in GitHub Actions; some tasks are tied to changes in the infrastructure.

Fixes: #5494
Fixes: #5069

Specifications

https://cloud.google.com/compute/docs/oslogin#benefits_of_os_login

Designs

  • Activate OS Login on specific VMs
  • Give the GitHub Actions service account the required permissions to connect as a user (see the sketch below)
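
For reference, a minimal sketch of those two steps with the gcloud CLI; the instance name and zone are placeholders, the project and service account names are the ones used elsewhere in this thread, and these are illustrative commands rather than the exact infrastructure changes:

```bash
# 1. Turn on OS Login for a specific VM via instance metadata.
gcloud compute instances add-metadata zebra-test-instance \
  --zone=us-east1-b \
  --metadata=enable-oslogin=TRUE

# 2. Allow the GitHub Actions service account to log in to OS Login-enabled VMs.
#    (If the VM itself runs as a service account, roles/iam.serviceAccountUser
#    on that account is also required.)
gcloud projects add-iam-policy-binding zealous-zebra \
  --member="serviceAccount:github-service-account@zealous-zebra.iam.gserviceaccount.com" \
  --role="roles/compute.osLogin"
```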

Solution

Review

Anyone from DevOps after running this PR successfully ~5+ times

Reviewer Checklist

  • Will the PR name make sense to users?
    • Does it need extra CHANGELOG info? (new features, breaking changes, large changes)
  • Are the PR labels correct?
  • Does the code do what the ticket and PR says?
  • How do you know it works? Does it have tests?

Follow Up Work

Configure OS-login project-wide after this PR has been merged.
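
For reference, a sketch of what the project-wide setting could look like (illustrative; the project ID is inferred from the service account used elsewhere in this PR):

```bash
# Enable OS Login for all VMs in the project; individual instances can still
# override this with their own enable-oslogin instance metadata.
gcloud compute project-info add-metadata \
  --project=zealous-zebra \
  --metadata=enable-oslogin=TRUE
```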

@gustavovalverde added the A-infrastructure, A-devops, C-enhancement, P-High 🔥 and I-integration-fail labels on Nov 9, 2022
@gustavovalverde self-assigned this Nov 9, 2022
@github-actions bot added the C-bug, C-feature and C-trivial labels on Nov 9, 2022
@teor2345 (Contributor) commented Nov 9, 2022

Sounds interesting, hope it works!

We've been dealing with several different issues related to SSH connections. ...
This PR implements OS Login as a workaround (and a possible root-cause fix).

How will we test that this works before merging?
Do we need to run this PR multiple times to make sure SSH is more reliable?

Note: Not all the changes needed to make this work are in GitHub Actions; some tasks are tied to changes in the infrastructure.

Can you document the infrastructure changes somewhere?

@gustavovalverde (Member, Author) commented:

How will we test that this works before merging?
Do we need to run this PR multiple times to make sure SSH is more reliable?

Yes, running this PR multiple times is the only thing I can think of that we have at our disposal right now.

Can you document the infrastructure changes somewhere?

I'll be translating the manual changes into a Terraform config.

@gustavovalverde (Member, Author) commented:

Seems like the authentication issue I've been having is related to google-github-actions/setup-gcloud#586

Which was fixed 12 hours ago.

Previous behavior:
`gcloud` commands have been running without appropriate authentication: the `auth` action was successfully executed, but the actual gcloud CLI used in later jobs was not picking up the correct configuration or credentials.

Expected behavior:
All `gcloud` commands should be properly configured and authenticated.

Solution:
Add the `google-github-actions/setup-gcloud` action after each `google-github-actions/auth` invocation, and before running any `gcloud` command.
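
As an illustrative sanity check (not necessarily part of the actual workflow changes), a step placed after the `auth`/`setup-gcloud` pair can confirm that the CLI really picked up the expected credentials and project:

```bash
# Print the account the gcloud CLI is authenticated as; it should be the
# GitHub Actions service account, not an empty string.
gcloud auth list --filter=status:ACTIVE --format="value(account)"

# Print the project the CLI is configured to run commands against.
gcloud config get-value project
```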

Remove the need for an OAuth access token when it is not required by the following steps.
@gustavovalverde requested a review from a team as a code owner November 10, 2022 21:50
@gustavovalverde requested review from arya2 and removed the request for a team November 10, 2022 21:50
@teor2345 (Contributor) commented:

I've just realized that other running workflows will affect this implementation, as those will be continuously creating SSH key pairs and uploading the public key to GCP's metadata, and then failing with the error `Permission denied (publickey)`, as those keys can conflict with the ones used by this implementation.

That's really unfortunate, it makes it tricky to test and merge.

How can we make sure that SSH is more reliable with this change?

Did you want to delay this change to the start of next week, so you and @dconnolly can fix any issues quickly?
(I won't be available, unfortunately.)

@gustavovalverde (Member, Author) commented:

I started testing this with OS Login enabled project-wide.

Attempt #1: ✅ Failed, but wasn't an SSH error (https://github.com/ZcashFoundation/zebra/actions/runs/3440425995/jobs/5754296381#step:6:65)

Attempt #1.5: ✅ Success

Co-authored-by: Deirdre Connolly <durumcrustulum@gmail.com>
@gustavovalverde gustavovalverde marked this pull request as draft November 15, 2022 18:10
@gustavovalverde (Member, Author) commented:

I removed several SSH keys with the following command:

```bash
# Delete every OS Login SSH key registered for the GitHub Actions service account.
for i in $(gcloud compute os-login ssh-keys list \
    --format="table[no-heading](value.fingerprint)" \
    --impersonate-service-account=github-service-account@zealous-zebra.iam.gserviceaccount.com); do
  echo "$i"
  gcloud compute os-login ssh-keys remove --key "$i" \
    --impersonate-service-account=github-service-account@zealous-zebra.iam.gserviceaccount.com || true
done
```

And re-running. I'll update this command later once this run finishes.

I might also increase `--ssh-key-expire-after` to 5m, as there's an edge case which can cause specific jobs to fail: https://cloud.google.com/compute/docs/troubleshooting/troubleshooting-ssh-errors#:~:text=Your%20key%20expired%20and%20Compute%20Engine%20deleted%20your%20~/.ssh/authorized_keys%20file
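
For context, `--ssh-key-expire-after` is a flag of `gcloud compute ssh`, which generates a temporary key for the connection; a sketch of the kind of invocation affected, with a placeholder instance, zone and command:

```bash
# Connect with a temporary SSH key that Compute Engine removes after 5 minutes,
# reducing the chance of the key expiring while a job is still using it.
gcloud compute ssh zebra-test-instance \
  --zone=us-east1-b \
  --ssh-key-expire-after=5m \
  --command="echo 'connected'"
```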

@dconnolly (Contributor) commented:

I removed several SSH keys with the following command […] I might also increase --ssh-key-expire-after to 5m […]

Once added, should this PR move out of draft and be prioritized to merge? We're seeing the 32KB limit being hit on these jobs in multiple places as the switch got flipped on the gcloud side.

@dconnolly previously approved these changes Nov 15, 2022
@gustavovalverde (Member, Author) commented:

We're seeing the 32KB limit being hit on these jobs in multiple places as the switch got flipped on the gcloud side

Now that you say this, maybe those other jobs are the ones generating all the SSH-key "garbage" in GCP, even though I applied the policy on the GCP side.

I think we can merge this, delete all keys and update all branches. The worst-case scenario is reverting this, but it's hard to test with other jobs running and colliding with it. I'll move it out of draft and let you approve if you think this is a sane approach.

@gustavovalverde gustavovalverde marked this pull request as ready for review November 16, 2022 00:42
@arya2 previously approved these changes Nov 16, 2022
@arya2 dismissed their stale review November 16, 2022 01:25

Let's wait for Deirdre

@arya2 (Contributor) commented Nov 16, 2022

maybe those other jobs are the ones generating all the SSH-key "garbage" in GCP

This was my first thought when seeing it happen in the one other PR that has the "Delete temporal SSH keys" job.

mergify bot added a commit that referenced this pull request Nov 16, 2022
@gustavovalverde (Member, Author) commented:

@Mergifyio refresh

@mergify bot (Contributor) commented Nov 16, 2022

refresh

✅ Pull request refreshed

mergify bot added a commit that referenced this pull request Nov 16, 2022
@mergify bot merged commit 844ebf0 into main Nov 16, 2022
@mergify bot deleted the use-fixed-ssh-key branch November 16, 2022 14:27
Labels

  • A-devops (Area: Pipelines, CI/CD and Dockerfiles)
  • A-infrastructure (Area: Infrastructure changes)
  • C-bug (Category: This is a bug)
  • C-enhancement (Category: This is an improvement)
  • C-feature (Category: New features)
  • C-trivial (Category: A trivial change that is not worth mentioning in the CHANGELOG)
  • I-integration-fail (Continuous integration fails, including build and test failures)
Development

Successfully merging this pull request may close these issues:

  • gcloud ssh: Connection reset by peer
  • Fix "Connection closed by remote host" error on Google Cloud ssh
4 participants