Initial support for migrating VM to a new propolis instance. #447
Force-pushed from 74da061 to 07e5922.
This is a pretty awesome PR - from HTTP endpoints, to sagas, to sled calls into the Propolis server, this is an impressively cross-cutting PR, which was ready way faster than I would have expected. Nice job getting this together!
```rust
// TODO correctness TODO robustness TODO design
// Should we lookup the instance again here?
// See comment in project_create_instance.
let instance = self.db_datastore.instance_fetch(&instance.id()).await?;
```
(not actionable)
So unlike `project_create_instance`, we do already have an instance object we fetched back on line 869.
We could plausibly save a copy of it in-memory, and then return it "as we expect it to be transformed" back to the user.
I dunno if that presents a different race condition compared to the one described in `project_create` -- there, it seemed like we wanted to return a version of the object "exactly as it was created" (which might race with concurrent ops immediately after creation, but before query), and here, we'd want to return a copy of the instance "precisely after it was moved" (which might also race with concurrent ops before or after the migration itself).
Dunno. Seems like we'll either need to lock the DB row somehow, or define the bounds of "what could be seen as a return value from this endpoint" more explicitly.
Yea, I kinda went back and forth over whether I should even bother returning the Instance back here. I think it'll become clearer once we have the actual system (i.e. not oxapi_demo) in place that will be using this endpoint.
This could plausibly have a similar solution to #447 (comment) - basically, identify that the instance is "migrating", prevent concurrent operations if that flag is set, and atomically "fetch + update" to complete the migration and grab the state of the instance afterwards.
I would expect this API would (eventually) be async (and this question would be moot) -- is that right?
@davepacheco as in, "the caller would have to issue a second call to fetch the instance, and we'd just return whatever currently exists in the DB to them", right?
```rust
let dst_sa = osagactx
    .sled_client(&dst_sled_uuid)
    .await
    .map_err(ActionError::action_failed)?;
```
To be clear, the order of operations in this saga step is:
1. Fetch the instance from the DB.
2. Send an "ensure + migrate" request to the destination sled, using the metadata for the instance we got in (1).
3. Update the "runtime state" in the DB.

Is it possible we run into a TOCTTOU issue here? What if someone modifies the sled (perhaps sends a different `ensure` request, gets a response) in between our "fetch" and "ensure" requests? Wouldn't we send an out-of-date sled request?
I think this problem could be avoided with some combination of:
- Durable state changes (explicitly modifying the DB, indicating that the instance is in the progress of migration, and preventing concurrent modification until this flag is removed).
- Using a generation number to only perform an operation if we explicitly ensure we haven't raced with anyone else.
https://rfd.shared.oxide.computer/rfd/0192#_conditional_updates seems related
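The generation-number approach can be sketched in miniature (illustrative Rust over an in-memory struct; `InstanceRow` and `conditional_update` are hypothetical names standing in for a DB row and a conditional UPDATE, not Omicron's actual datastore types): every runtime-state write carries the generation the writer last read, and the write only lands if that generation is still current.

```rust
// Illustrative sketch of a generation-guarded conditional update. The types
// here are hypothetical, standing in for a DB row and a conditional UPDATE.
#[derive(Clone, Debug, PartialEq)]
struct InstanceRow {
    state: String,
    gen: u64, // bumped on every successful runtime-state change
}

/// Move the instance to `new_state` only if `expected_gen` still matches
/// the row; a concurrent writer will have bumped `gen`, so a stale update
/// is rejected instead of clobbering theirs.
fn conditional_update(
    row: &mut InstanceRow,
    expected_gen: u64,
    new_state: &str,
) -> Result<(), String> {
    if row.gen != expected_gen {
        return Err(format!(
            "concurrent update: expected gen {}, found {}",
            expected_gen, row.gen
        ));
    }
    row.state = new_state.to_string();
    row.gen += 1;
    Ok(())
}
```

A saga step that fetched the instance at generation N would pass N here; if the ensure raced with some other modification, the final state write fails and the saga can re-fetch rather than silently overwrite.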
I think this problem still exists, right?
Seems like this was resolved by moving the "migration prep" out to an earlier saga step, to set the "migrating" state and block other operations which depend on the state.
Yup, we ask sled agent to idempotently update the state to migrating (via Nexus) as part of an earlier action. Only if that action succeeds will we move on to the subsequent actions which do the actual migration.
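That retry-safe "mark as migrating" gate can be sketched as follows (hypothetical names, not the actual Nexus/sled-agent API): re-running the action after it has already succeeded is a no-op, so the saga may execute it more than once without harm.

```rust
// Hypothetical sketch of an idempotent "mark as migrating" saga action;
// `InstanceState` and `begin_migration` are illustrative names only.
#[derive(Clone, Copy, Debug, PartialEq)]
enum InstanceState {
    Running,
    Migrating,
    Stopped,
}

/// Returns Ok if the instance is (now) migrating. A retry after the action
/// already succeeded hits the `Migrating` arm and is a no-op, which is what
/// makes this saga step safe to run repeatedly.
fn begin_migration(state: &mut InstanceState) -> Result<(), &'static str> {
    match *state {
        InstanceState::Migrating => Ok(()), // already done; idempotent retry
        InstanceState::Running => {
            *state = InstanceState::Migrating;
            Ok(())
        }
        InstanceState::Stopped => Err("instance must be running to migrate"),
    }
}
```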
Force-pushed from 64963a6 to 7f61111.
```rust
// the same ID, that would inadvertently remove our newly created
// instance instead.
// TODO: cleanup source instance properly
std::mem::forget(old_instance);
```
Sorry for backpedaling, but I think I'm seeing your perspective from #447 (comment) here... maybe the "null ticket" wouldn't be so bad.
I have a minor beef with this call to `forget` - we intend to drop the ticket, but forgetting the entire instance may also ignore other `drop` functions that could need to execute. I don't think those exist today, but there isn't anything preventing them from existing.
I think that justifies moving the ticket out of the old instance, rather than creating a new ticket.
Something like:
```rust
// Ensure that the instance inserted into the `instances` mapping has a tracking
// membership ticket, so it can be safely removed when it goes out-of-scope.
let ticket = if let Some(mut old_instance) = old_instance {
    // If we migrate an instance within the sled, we should keep it alive in the
    // `instances` mapping. Although the Propolis ID changes, the user-visible
    // instance ID remains the same, and uses the same key.
    assert!(migrate.is_some());
    std::mem::replace(&mut old_instance.running_state.ticket, InstanceTicket::null())
} else {
    InstanceTicket::new(instance_id, self.inner.clone())
};
```
> I have a minor beef with this call to `forget` - we intend to drop the ticket, but forgetting the entire instance may also ignore other `drop` functions that could need to execute. I don't think those exist today, but there isn't anything preventing them from existing.
Well, long-term I don't plan to just forget here :P Probably need to keep it around in case we need to rollback from a recoverable failure or just for cleaning up once the actual migration is done.
Bumping this for visibility - it looks like we're still calling `forget` here; wasn't sure if we wanted to patch prior to merging.
Yea, this still needs to be addressed but I planned on doing it in a followup PR. It will change slightly as I plumb in more of the propolis-side migration work, which relies on both instances being available.
Force-pushed from 368821c to 472ccf1.
> if you either
It's really cool to see this come together!
Force-pushed from c103b07 to c64b209.
So my comments are mostly focused on:
- Concurrency safety
- Idempotency safety
I don't wanna drag the lifetime of this PR out too far, but I do think we should try to track these issues before submitting. If you wanna push back on any of my comments, I'd be more okay with a reduced set of functionality (e.g., "no attaching / detaching devices during a migration!") than features which are potentially buggy (e.g., "You can attach/detach devices during a migration, but it may result in race conditions").
nexus/src/sagas.rs (outdated)

```rust
// Update the Instance's runtime state to indicate it's currently migrating.
// This also acts as a lock somewhat to prevent any further state changes
```
How is this so? E.g., we aren't prevented from adding/removing disks/nics during this time, are we?
Nexus currently has a check based on the current instance state to decide whether certain operations are allowed (like rebooting it). I extended that check to also disallow such changes during migration. But you're right, those checks don't currently exist for modifying disks or nics. But, last I checked, those operations didn't actually modify the instance yet (e.g. `SledAgent::disk_ensure` is just a `todo!`).
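A sketch of what extending that state check to disks/NICs might look like (illustrative only; `InstanceState`, `Operation`, and `operation_allowed` are hypothetical names, not Omicron's): every state-changing operation, not just reboot, consults the instance's runtime state and is refused while a migration is in flight.

```rust
// Illustrative only: these names sketch the kind of check discussed above,
// they are not Omicron's actual types.
#[derive(Clone, Copy, Debug, PartialEq)]
enum InstanceState {
    Running,
    Migrating,
    Stopped,
}

#[derive(Clone, Copy, Debug, PartialEq)]
enum Operation {
    Reboot,
    AttachDisk,
    DetachDisk,
}

/// Gate every state-changing operation on the runtime state, so that a
/// migration in flight refuses reboots *and* device attach/detach alike.
fn operation_allowed(state: InstanceState, op: Operation) -> bool {
    match state {
        InstanceState::Migrating => false, // nothing may mutate mid-migration
        InstanceState::Running => true,
        InstanceState::Stopped => {
            matches!(op, Operation::AttachDisk | Operation::DetachDisk)
        }
    }
}
```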
```rust
// Instance does not exist or one does but we're performing
// an intra-sled migration. Either way - create an instance
```
What if "ensure" has been called repeatedly for this migration? I'm trying to eyeball how this function can still be idempotent if the following behavior occurs:
- "Ensure" called, requesting a migration. Instance now exists.
- "Ensure" called again, while the migration is in-progress, because a saga step is being called repeatedly.
It seems like this results in creating a brand new propolis instance, and terminating the old instance, no? Couldn't this potentially interrupt migration in progress, by basically "throwing away the destination" instance and creating it again?
(Maybe this is intended behavior, but I figured I'd check - kinda seems like we'd want to leave the destination alone, or potentially just alter attached devices, if it is already communicating with the source instance)
I guess I'd probably also need to pass along both the src & dst UUIDs in the migrate request here so that we can determine if the migration already happened? Admittedly, the idempotency story is a little ill-defined so far and is definitely something I need to audit more carefully.
Yeah, this seems possible to infer a handful of ways:
- If the "source_propolis" UUID == "destination_propolis" UUID
- If the migration UUID was seen before
I do still think it's important to handle though, otherwise, the behavior of tossing the old instance would disrupt the migration.
Ok, updated it such that if the current instance's propolis UUID matches the target propolis UUID, we treat it as an ongoing migration rather than spinning up a new instance.
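That check can be sketched as a small decision function (names hypothetical, not sled agent's actual code): a retried ensure whose target Propolis ID matches the instance the sled already tracks is treated as an in-progress migration, rather than a request to tear down and recreate the destination.

```rust
// Sketch of the idempotency check described above; `EnsureAction` and
// `ensure_action` are illustrative names only.
#[derive(Debug, PartialEq)]
enum EnsureAction {
    UseExisting, // retry of an in-progress migration; leave it running
    CreateNew,   // first time we've seen this target Propolis ID
}

fn ensure_action(
    existing_propolis_id: Option<&str>,
    target_propolis_id: &str,
) -> EnsureAction {
    match existing_propolis_id {
        // The sled already tracks an instance backed by the requested
        // Propolis: this is a retried ensure, not a fresh migration target.
        Some(id) if id == target_propolis_id => EnsureAction::UseExisting,
        _ => EnsureAction::CreateNew,
    }
}
```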
Looks good, modulo one last minor idempotency thing
- …mon code to InstanceInner::ensure.
- …evisit whether we want to serialize the Instance model as a whole separately.
- …rialization errors.
- … of doing it in nexus saga.
- … to try to be idempotent.
Thanks for considering all these cases!
Building on the live migration work in propolis (oxidecomputer/propolis#69), this adds the nexus endpoint to trigger a migration of an instance.