Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

User specified SSH keys to inject at instance create time #4764

Merged
merged 48 commits into from
Jan 31, 2024

Conversation

benjaminleonard
Copy link
Contributor

@benjaminleonard benjaminleonard commented Jan 5, 2024

Fixes #3056

This PR adds an optional ssh_keys parameter to the instance create payload. If it's empty the same behavior as we have today is retained: all ssh keys from a user's profile are transferred over. If a vector of ssh key name or IDs are passed then only transfers over those given keys. If an empty vector is passed no ssh keys will be transferred.

There's also a /v1/instances/<instance>/ssh-public-keys API that lists all public keys that were sent via cloud-init to the instance at instance create. I talked to @iliana a bit about the proper response to this API. Currently it returns the full SshKey view. We may want to alter that to transmit fewer details.

Other changes:

  • Pulled SSH keys app code into its own file

Remaining Work

Preview Give feedback

Followup Work

Tasks

Preview Give feedback
  1. zephraph
  2. api
    david-crespo
  3. api

@zephraph zephraph self-assigned this Jan 12, 2024
Copy link
Contributor

@karencfv karencfv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm really looking forward to this feature! Thanks a bunch for taking the time to flesh it out so thoroughly

nexus/db-model/src/schema.rs Show resolved Hide resolved
nexus/db-queries/src/db/datastore/ssh_key.rs Outdated Show resolved Hide resolved
Comment on lines +130 to +131
// If the keys are None, use the fallback behavior of assigning all the users keys
None => {
Copy link
Contributor

@karencfv karencfv Jan 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would this not confuse users thinking that None means add no keys? Perhaps we may want to revisit why the default behaviour is to add all keys.

There are a few reasons we may want to reconsider this default behaviour:

  • From a security standpoint, people may have forgotten to remove keys from their user account that have been compromised.
  • Additionally, users may not be aware that None means all, it doesn't feel particularly intuitive (often people don't read the docs!). They may end up adding keys they did not intend to add, and be unaware of it if the instance was not created through the console.
  • Do we expect our customers to particularly want to have all keys automatically added to the instances they create?

I think it'd also be good to have some input from @ahl and/or @askfongjojo on this

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, if users have more than 10 keys saved in their user account, every attempt at creating an instance with the default behaviour of adding all keys would result in an error.

This means that someone with more than 10 keys would have to set which keys they want added every single time to avoid a failure.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Heh, yeah, I've had a lot of conversations about this. @iliana and I discussed this in chat. This implementation was xer original suggestion.

There's a few things to consider here.

  1. Should this be a required field? If it's required then it's one extra thing folks have to specify when creating an instance. Originally I was worried about the repercussions of this decision (and others like it) on the CLI. That's mostly moot given that most args have to be provided via a json file. If it's not required, I think the only sane default is to send them all (as we do today). Having the default mean none get sent would lead to workflow situations where users forget and then lock themselves out (which is worse in my mind than having to go and remove some keys manually after the fact)
  2. Should we even have the option for users to send all keys to the instance? I feel like yes is probably the answer here because I suspect most users will have few keys. That's also another reason I've leaned towards this behavior as the default. If we didn't preserve this option then users would always have to specify keys by name/id in the API (lest they can't SSH into the instance) and that feels like a pretty big DX regression. That said, I would appreciate more feedback from folks on this.

So, yeah, an explicit enum having AllFromProfile | Only[keys] | NoKeys is certainly an option I can implement in this PR if that's the consensus. I do see the value of it, especially around forcing users to make a choice. If the choice that they're forced to make is often the same as the default of this PR though, that feels not great.

Copy link
Contributor

@iliana iliana Jan 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really don't like breaking old clients. Requiring an explicit enum will be a flag day for any existing code. This is, I think, an extremely good candidate for being our first version bump on an API. We did give ourselves a number of backstops to be able to declare a new version of the API, but we haven't used any of them yet (as far as I'm aware?), so that sort of explodes the work needed to be done here.

If we want to land this in the immediate term, we should keep the implicit behavior to use all of the user's keys if the field is not present (i.e. None). But ideally we have a clearer v2 either way.

Copy link
Contributor

@karencfv karencfv Jan 28, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the comments both! These are my thoughts:

Should this be a required field? If it's required then it's one extra thing folks have to specify when creating an instance. Originally I was worried about the repercussions of this decision (and others like it) on the CLI. That's mostly moot given that most args have to be provided via a json file. If it's not required, I think the only sane default is to send them all (as we do today). Having the default mean none get sent would lead to workflow situations where users forget and then lock themselves out (which is worse in my mind than having to go and remove some keys manually after the fact)

To be perfectly honest, I don't see this being a required field as a problem. IMO it's better to be explicit about what you're doing than get surprises down the line. Alternatively, if we don't want this to be a required field, we could also have a concept of a "default" SSH key which gets added to any instance where SSH keys haven't been defined. WDYT?

We also need to think of the use case where new users are migrating their workloads from their previous vendor. For example: a user that didn't have support for adding SSH keys at create time (cough VMware cough) will have had to either have a user data init script that creates authentication, or have an image with a predefined user, and add additional SSH keys later. In this scenario, users may want to initially conserve their current workflows as part of a big lift and shift operation. These migrations are no small thing. and we really need to think about how to make it as seamless as possible for our customers to migrate over without unintended consequences.

Automatically adding every SSH key the user has saved feels quite nuclear, and can lead to easily giving unintended access. Especially if you're in the type or role where your job is to set up infrastructure for others to use and have saved other people's public keys in your account.

FWIW, I don't see being locked out of your instance as a necessarily terrible thing. This is something the user would immediately notice and would only need to delete and recreate their instance.

Should we even have the option for users to send all keys to the instance? I feel like yes is probably the answer here because I suspect most users will have few keys. That's also another reason I've leaned towards this behavior as the default. If we didn't preserve this option then users would always have to specify keys by name/id in the API (lest they can't SSH into the instance) and that feels like a pretty big DX regression. That said, I would appreciate more feedback from folks on this.

I wouldn't necessarily say most users would have a small amount of keys, it really depends. Also, users that manage infrastructure will have ssh keys of others. So it's not just how many keys are saved, but also to whom those keys are giving access to. Frankly, at this point I don't think we have enough information about what most users will be doing. To me the big questions for now are:

  • Does anyone really need to have every key they have saved added to their instance on create?
  • What is more annoying? Adding a flag to define ssh keys on instance create, or having to go into an instance and remove SSH keys that you realised you really shouldn't have in there later down the track.
  • Which repercussions are worse? Unintentionally giving access to keys that shouldn't have this access (could be compromised keys the user forgot to remove or people who shouldn't have access), or having to recreate an instance because you forgot to define a key and got locked out (this could be mitigated by having a "default" ssh key).

Additionally, by automatically adding all keys we'd be making it harder for our users to follow the principle of least privilege

I really don't like breaking old clients. Requiring an explicit enum will be a flag day for any existing code. This is, I think, an extremely good candidate for being our first version bump on an API. We did give ourselves a number of backstops to be able to declare a new version of the API, but we haven't used any of them yet (as far as I'm aware?), so that sort of explodes the work needed to be done here.

As for the concern about breaking APIs, I think the consensus is that now is the moment to be breaking the API to make sure we get it right. Posting a recent comment here -> #4694 (comment)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree on breaking the API and including no keys if none are defined.

However making it required might be a poor user experience in the CLI. If the user doesn't want to include keys, they shouldn't need that argument.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I appreciate all the discussion here. @karencfv you bring up a lot of great points and I feel like there's just ultimately some more product discovery we need to do here.

I have higher priorities than this PR and I've already sank too much time into it. I'm just going to create an issue to encapsulate the context here and we can solve this in a follow up. The default behavior as implemented is the same as what we have today so it shouldn't be disruptive to include in this release while also giving us the time we need to figure out how we want this to be.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's the follow up issue to tackle this work: #4919

nexus/src/app/mod.rs Outdated Show resolved Hide resolved
nexus/src/app/sagas/instance_create.rs Outdated Show resolved Hide resolved

// Gather the SSH public keys of the actor make the request so
// that they may be injected into the new image via cloud-init.
// TODO-security: this should be replaced with a lookup based on
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As above. Can you open an issue to track this and reference it here so this isn't forgotten about?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't add this TODO, just copied it over from where it was originally. I don't actually know what the original intent is here and I somewhat assume it's out of date. @davepacheco, I think you originally wrote this. Can you give any more context? I'm happy to file an issue, but I'd just like to understand more what the replacement should be (if it's even still relevant)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure. Looking through the history I believe this came in with #974. Maybe @plotnick remembers?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Zowie, that was a while ago; thanks, @davepacheco for digging up that PR (#974). That TODO is mentioned in its description:

Right now, we collect only the keys belonging to the actor making the request. Once role assignments for Silo users are in place, we can ask for, e.g., all the keys belonging to all the users with an appropriate role (vms.instance.create, probably) on the instance.

I don't remember exactly where that desire came from, other than generally wanting to allow customers to assign permissions at the role/group level rather than for individual users (as the current lookup would require); but we didn't have any kind of roles at the time, so couldn't do it then. If it's still desirable and we now have something like Silo/IdP roles (I've been away from that layer of Omicron for a while, so don't really know), it could certainly be spun out into an issue.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @plotnick! Do you mind opening up an issue for this?

nexus/src/external_api/http_entrypoints.rs Outdated Show resolved Hide resolved
Comment on lines +2672 to +2673
/// Note that this list is a snapshot in time and will not reflect updates made after
/// the instance is created.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! I like this clarification

* Entries are added here when an instance is created (with configured SSH keys)
* and removed when the instance is destroyed.
*
* TODO: Should this have time created / time deleted
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just commenting here since the PR is set as ready to review. This is intended to be added in this PR yes?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Potentially, but not necessarily. Currently I have this relationship configured to be hard deleted when the instance is (soft) deleted. This was a weird case being that it's both a join and not something a user can directly change. It's created at instance create and deleted at instance delete.

If I do need to add the deleted timestamp, I certainly will.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah gotcha! nw, I just wanted to make sure this wasn't something that was intended to be included in this PR

Co-authored-by: Karen Cárcamo <karencfv@users.noreply.github.com>
@zephraph zephraph requested a review from iliana January 26, 2024 16:54
Copy link
Contributor

@karencfv karencfv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for creating follow up issues for product discovery for instance SSH keys. I'm OK with revising all of that at a later date, and merging this PR (once the tests pass) for now so we can introduce new functionality in this release.

Thanks again for taking the time to look into this!


// Gather the SSH public keys of the actor make the request so
// that they may be injected into the new image via cloud-init.
// TODO-security: this should be replaced with a lookup based on
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @plotnick! Do you mind opening up an issue for this?

Comment on lines 135 to 138
lines
.trim()
.lines()
.find(|line| line.contains(public_key_str.clone().as_str()))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't seem correct. This section of code is trying to extract the host's SSH key from the console output, so we can authenticate the host when we try to connect to it. That is not the same key as the one we created and sent to the instance to authenticate our user.

If we are failing to extract a host SSH key from the console output, either we aren't waiting long enough or something else is wrong.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was extracting a key, but the key it was extracting was wrong. I thought that for some reason there were multiple.

@iliana
Copy link
Contributor

iliana commented Jan 30, 2024

Looking at the deploy test run for 88ddaa6, the failure does not implicate the host key parsing. The relevant snippet of stdout:

generate SSH key
create SSH key: ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIBx78Gk6UIlPOJT7JC888SVHUOKwFJ29yJp2cCOv1TVZ
create disk
create instance
instance external IP: 10.151.1.176
waiting for serial console
host key: ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIHeStG8cgDDVntEMVlMHWkDlwZhMQFBlkja61JQk9gCi
connecting ssh
authenticating ssh
Error: authentication failed

This implicates this code:

eprintln!("connecting ssh");
let mut session = russh::client::connect(
Default::default(),
(ip_addr, 22),
SshClient { host_key },
)
.await?;
eprintln!("authenticating ssh");
ensure!(
session.authenticate_publickey("debian", key).await?,
"authentication failed"
);

Which implies the connection was established, the host key was authenticated, and what failed to occur was authenticating with our private key. If we extract the cloud_init_bytes from the sled-agent log, the key is present in meta-data:

debian@meow:~$ sudo mount ./wat.img /mnt
debian@meow:~$ cat /mnt/meta-data
{"instance-id":"5d3b43e8-7186-4356-8bb3-45731e23a1c2","local-hostname":"localshark","ssh-keys":["ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIBx78Gk6UIlPOJT7JC888SVHUOKwFJ29yJp2cCOv1TVZ"]}

However, that is the wrong JSON key. cloud-init expects public-keys:

debian@meow:~$ sudo mount /dev/vda /mnt
mount: /mnt: WARNING: source write-protected, mounted read-only.
debian@meow:~$ cat /mnt/meta-data
{"instance-id":"302d68c3-c7d1-4b1c-a154-b96db67fbb6b","local-hostname":"meow","public-keys":["ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIOGmLom09EvwjDo6hSwzBf3kNdf/sMw/lPiswy3u4HR9 iliana@horizon","ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIECO1ghEFVs0WIFJ5mXvMq0GqIaBb4CTbexL5IYLohZ1 iliana@keysmash","ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIDDC/oZFROia+ElMQ0cp3GD2g3/06YoZhA5EsrlKxT2N iliana@redwood"]}

The changes to nexus/src/cidata.rs — namely, renaming any of the members of the MetaData struct — are causing the failure. I'm looking for a specific reference on what this file needs to contain from cloud-init's perspective so we can add an instructive comment.

@iliana
Copy link
Contributor

iliana commented Jan 30, 2024

I'm looking for a specific reference on what this file needs to contain from cloud-init's perspective so we can add an instructive comment.

Well, I'm not finding any, so I might just have to blog about it...

@zephraph zephraph merged commit 5780ff6 into main Jan 31, 2024
21 checks passed
@zephraph zephraph deleted the instance-create-ssh-keys branch January 31, 2024 03:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Let user specify SSH keys to inject at instance create time
8 participants