Initial support for migrating VM to a new propolis instance. #447

Merged
merged 29 commits into main from live-migrate on Jan 25, 2022

Conversation

luqmana
Contributor

@luqmana luqmana commented Nov 29, 2021

Building on the live migration work in propolis (oxidecomputer/propolis#69), this adds the nexus endpoint to trigger a migration of an instance.

Collaborator

@smklein smklein left a comment

This is a pretty awesome PR - from HTTP endpoints, to sagas, to sled calls into the Propolis server, it's an impressively cross-cutting change, and it was ready way faster than I would have expected. Nice job getting this together!

// TODO correctness TODO robustness TODO design
// Should we lookup the instance again here?
// See comment in project_create_instance.
let instance = self.db_datastore.instance_fetch(&instance.id()).await?;
Collaborator

(not actionable)

So unlike project_create_instance, we do already have an instance object we fetched back on line 869.
We could plausibly save a copy of it in-memory, and then return it "as we expect it to be transformed" back to the user.

I dunno if that presents a different race condition compared to the one described in project_create -- there, it seemed like we wanted to return a version of the object "exactly as it was created" (which might race with concurrent ops immediately after creation, but before query), and here, we'd want to return a copy of the instance "precisely after it was moved" (which might also race with concurrent ops before or after the migration itself).

Dunno. Seems like we'll either need to lock the DB row somehow, or define the bounds of "what could be seen as a return value from this endpoint" more explicitly.

Contributor Author

Yea, I kinda went back and forth over whether I should even bother returning the Instance here. I think it'll become clearer once we have the actual system (i.e. not oxapi_demo) in place that will be using this endpoint.

Collaborator

This could plausibly have a similar solution to #447 (comment) - basically, identify that the instance is "migrating", prevent concurrent operations if that flag is set, and atomically "fetch + update" to complete the migration and grab the state of the instance afterwards.

Collaborator

I would expect this API would (eventually) be async (and this question would be moot) -- is that right?

Collaborator

@davepacheco as in, "the caller would have to issue a second call to fetch the instance, and we'd just return whatever currently exists in the DB to them", right?

nexus/src/sagas.rs (outdated, resolved)
nexus/src/sagas.rs (outdated, resolved)
nexus/src/sagas.rs (outdated, resolved)
sled-agent/src/instance.rs (outdated, resolved)
sled-agent/src/instance.rs (outdated, resolved)
sled-agent/src/instance.rs (outdated, resolved)
sled-agent/src/instance.rs (outdated, resolved)
sled-agent/src/instance_manager.rs (outdated, resolved)
Comment on lines +399 to +467
let dst_sa = osagactx
.sled_client(&dst_sled_uuid)
.await
.map_err(ActionError::action_failed)?;
Collaborator

To be clear, the order of operations in this saga step is:

  1. Fetch the instance from DB
  2. Send "ensure + migrate" request to the destination instance, using the metadata for the instance we got in (1).
  3. Update the "runtime state" in the DB

Is it possible we run into a TOCTTOU issue here?

What if someone modifies the sled (perhaps sends a different ensure request, gets a response) in between our "fetch" and "ensure" requests? Wouldn't we send an out-of-date sled request?

I think this problem could be avoided with some combination of:

  1. Durable state changes (explicitly modifying the DB, indicating that the instance is in the process of migrating, and preventing concurrent modification until this flag is removed).
  2. Using a generation number to only perform an operation if we explicitly ensure we haven't raced with anyone else.

https://rfd.shared.oxide.computer/rfd/0192#_conditional_updates seems related
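
Not part of the PR, but for illustration: a minimal, self-contained Rust sketch of the generation-number idea. All names here (InstanceRow, conditional_update, the generation field) are hypothetical, not Nexus's actual datastore types; the point is just that a write only takes effect if the caller's observed generation still matches, so a racing writer forces a re-fetch instead of being silently clobbered.

#[derive(Debug, PartialEq)]
enum InstanceState {
    Running,
    Migrating,
}

#[derive(Debug)]
struct InstanceRow {
    state: InstanceState,
    // The DB analogue would be a generation column checked in the UPDATE's
    // WHERE clause and bumped on every successful write.
    generation: u64,
}

/// Apply `new_state` only if the row's generation still matches what the
/// caller observed when it fetched the instance. Returns true on success.
fn conditional_update(
    row: &mut InstanceRow,
    observed_generation: u64,
    new_state: InstanceState,
) -> bool {
    if row.generation != observed_generation {
        // Someone else wrote the row between our fetch and this update; the
        // saga step should re-fetch and decide whether to retry or fail.
        return false;
    }
    row.state = new_state;
    row.generation += 1;
    true
}

fn main() {
    let mut row = InstanceRow { state: InstanceState::Running, generation: 4 };

    // This saga step fetches the instance and remembers the generation it saw.
    let observed = row.generation;

    // A concurrent actor who fetched the same generation wins the race...
    assert!(conditional_update(&mut row, observed, InstanceState::Migrating));

    // ...so our now-stale write is rejected instead of clobbering theirs.
    assert!(!conditional_update(&mut row, observed, InstanceState::Running));
    assert_eq!(row.state, InstanceState::Migrating);
}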

Collaborator

I think this problem still exists, right?

Collaborator

Seems like this was resolved by moving "migration prep" out to an earlier saga step, to set the "migration" state and block other operations which depend on the state.

Contributor Author

Yup, we ask sled agent to idempotently update the state to migrating (via Nexus) as part of an earlier action. Only if that action succeeds will we move on to the subsequent actions which do the actual migration.

@luqmana luqmana force-pushed the live-migrate branch 2 times, most recently from 64963a6 to 7f61111 on December 1, 2021 06:01
// the same ID, that would inadvertently remove our newly created
// instance instead.
// TODO: cleanup source instance properly
std::mem::forget(old_instance);
Collaborator

Sorry for backpedaling, but I think I'm seeing your perspective from #447 (comment) here... maybe the "null ticket" wouldn't be so bad.

I have a minor beef with this call to forget - we intend to drop the ticket, but forgetting the entire instance may also ignore other drop functions that could need to execute. I don't think those exist today, but there isn't anything preventing them from existing.

I think that justifies moving it out of the old instance, rather than creating a new ticket.

Something like:

// Ensure that the instance inserted into the `instances` mapping has a tracking
// membership ticket, so it can be safely removed when it goes out-of-scope.
let ticket = {
  if let Some(mut old_instance) = old_instance {
    // If we migrate an instance within the sled, we should keep it alive in the
    // `instances` mapping. Although the Propolis ID changes, the user-visible
    // instance ID remains the same, and uses the same key.
    assert!(migrate.is_some());
    // Move the ticket out of the old instance rather than forgetting it wholesale.
    std::mem::replace(&mut old_instance.running_state.ticket, InstanceTicket::null())
  } else {
    InstanceTicket::new(instance_id, self.inner.clone())
  }
};

Contributor Author

> I have a minor beef with this call to forget - we intend to drop the ticket, but forgetting the entire instance may also ignore other drop functions that could need to execute. I don't think those exist today, but there isn't anything preventing them from existing.

Well, long-term I don't plan to just forget here :P Probably need to keep it around in case we need to roll back from a recoverable failure or just for cleaning up once the actual migration is done.

Collaborator

Bumping this for visibility - it looks like we're still calling forget here; wasn't sure if we wanted to patch prior to merging.

Contributor Author

Yea, this still needs to be addressed but I planned on doing it in a followup PR. It will change slightly as I plumb through more of the propolis-side migration work, which relies on both instances being available.

nexus/src/sagas.rs (outdated, resolved)
@luqmana luqmana force-pushed the live-migrate branch 2 times, most recently from 368821c to 472ccf1 on December 1, 2021 21:46
@ahl
Contributor

ahl commented Dec 2, 2021

if you either git checkout Cargo.lock or cargo update -p dropshot you should be all good... at least vis-a-vis this openapi issue (sorry!)

Collaborator

@davepacheco davepacheco left a comment

It's really cool to see this come together!

// TODO correctness TODO robustness TODO design
// Should we lookup the instance again here?
// See comment in project_create_instance.
let instance = self.db_datastore.instance_fetch(&instance.id()).await?;
Collaborator

I would expect this API would (eventually) be async (and this question would be moot) -- is that right?

nexus/src/sagas.rs (outdated, resolved)
@luqmana luqmana force-pushed the live-migrate branch 2 times, most recently from c103b07 to c64b209 on January 6, 2022 23:02
Collaborator

@smklein smklein left a comment

So my comments are mostly focused on:

  • Concurrency safety
  • Idempotency safety

I don't wanna drag the lifetime of this PR out too far, but I do think we should try to track these issues before submitting. If you wanna push back on any of my comments, I'd be more okay with a reduced set of functionality (e.g., "no attaching / detaching devices during a migration!") than features which are potentially buggy (e.g., "You can attach/detach devices during a migration, but it may result in race conditions").

nexus/src/external_api/http_entrypoints.rs (outdated, resolved)
// TODO correctness TODO robustness TODO design
// Should we lookup the instance again here?
// See comment in project_create_instance.
let instance = self.db_datastore.instance_fetch(&instance.id()).await?;
Collaborator

@davepacheco as in, "the caller would have to issue a second call to fetch the instance, and we'd just return whatever currently exists in the DB to them", right?

nexus/src/sagas.rs (outdated, resolved)
nexus/src/sagas.rs (outdated, resolved)
nexus/src/sagas.rs (outdated, resolved)
}

// Update the Instance's runtime state to indicate it's currently migrating.
// This also acts as a lock somewhat to prevent any further state changes
Collaborator

How is this so? E.g., we aren't prevented from adding/removing disks/nics during this time, are we?

Contributor Author

Nexus currently has a check based on the current instance state to decide whether certain operations are allowed (like rebooting it). I extended that check to also disallow such changes during migration. But you're right, those checks don't currently exist for modifying disks or NICs. That said, last I checked those operations don't actually modify the instance yet (e.g. SledAgent::disk_ensure is just a todo!).
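
For what it's worth, a rough sketch of that kind of gate (hypothetical names, not the actual Nexus check): while the instance reports a migrating state, any state-changing request is refused up front.

#[derive(Debug)]
enum InstanceState {
    Stopped,
    Running,
    Migrating,
}

#[derive(Debug)]
enum InstanceAction {
    Reboot,
    Stop,
    Migrate,
}

/// Refuse `action` if the instance's current `state` doesn't allow it.
fn check_action_allowed(
    state: &InstanceState,
    action: &InstanceAction,
) -> Result<(), String> {
    match (state, action) {
        // While a migration is in flight, reject anything that would change
        // the instance's runtime state out from under it.
        (InstanceState::Migrating, _) => {
            Err(format!("instance is migrating; {:?} not allowed", action))
        }
        // A running instance can be rebooted, stopped, or migrated.
        (InstanceState::Running, _) => Ok(()),
        // Everything else (e.g. rebooting a stopped instance) is refused.
        (s, a) => Err(format!("cannot {:?} an instance in state {:?}", a, s)),
    }
}

fn main() {
    assert!(check_action_allowed(&InstanceState::Running, &InstanceAction::Reboot).is_ok());
    assert!(check_action_allowed(&InstanceState::Migrating, &InstanceAction::Stop).is_err());
    assert!(check_action_allowed(&InstanceState::Stopped, &InstanceAction::Migrate).is_err());
}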

Comment on lines +399 to +467
let dst_sa = osagactx
.sled_client(&dst_sled_uuid)
.await
.map_err(ActionError::action_failed)?;
Collaborator

I think this problem still exists, right?

Comment on lines +102 to +119
// Instance does not exist or one does but we're performing
// an intra-sled migration. Either way - create an instance
Collaborator

What if "ensure" has been called repeatedly for this migration? I'm trying to eyeball how this function can still be idempotent if the following behavior occurs:

  1. "Ensure" called, requesting a migration. Instance now exists.
  2. "Ensure" called again, while the migration is in-progress, because a saga step is being called repeatedly.

It seems like this results in creating a brand new propolis instance, and terminating the old instance, no? Couldn't this potentially interrupt migration in progress, by basically "throwing away the destination" instance and creating it again?

(Maybe this is intended behavior, but I figured I'd check - kinda seems like we'd want to leave the destination alone, or potentially just alter attached devices, if it is already communicating with the source instance)

Contributor Author

I guess I'd probably also need to pass along both the src & dst UUIDs in the migrate request here so that we can determine whether the migration already happened? Admittedly the idempotency story is a little ill-defined so far and is definitely something I need to audit more carefully.

Collaborator

Yeah, this seems possible to infer a handful of ways:

  • If the "source_propolis" UUID == "destination_propolis" UUID
  • If the migration UUID was seen before

I do still think it's important to handle, though; otherwise, tossing the old instance would disrupt the migration.

Contributor Author

Ok, updated it such that if the current instance's propolis UUID matches the target propolis UUID, we treat it as an ongoing migration and don't spin up a new instance.
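
For posterity, a tiny sketch of that check (illustrative names only, not the actual sled-agent code): the ensure handler compares the Propolis ID it's already running against the migration target before deciding to tear anything down.

#[derive(Clone, Copy, PartialEq, Eq, Debug)]
struct PropolisUuid(u128);

#[derive(Debug)]
enum EnsureOutcome {
    /// The requested target Propolis already exists; this is a retried ensure
    /// for an in-progress migration, so leave the destination alone.
    AlreadyMigrating,
    /// No matching Propolis; create a fresh one and kick off the migration.
    StartMigration,
}

fn ensure_for_migration(
    current_propolis: Option<PropolisUuid>,
    target_propolis: PropolisUuid,
) -> EnsureOutcome {
    match current_propolis {
        // The destination we were asked to ensure is already running: a
        // previous (possibly retried) saga step got here before us.
        Some(current) if current == target_propolis => EnsureOutcome::AlreadyMigrating,
        // Either no instance exists yet, or an older Propolis is running
        // (e.g. the source of an intra-sled migration): set up the new one.
        _ => EnsureOutcome::StartMigration,
    }
}

fn main() {
    let target = PropolisUuid(42);
    assert!(matches!(ensure_for_migration(None, target), EnsureOutcome::StartMigration));
    assert!(matches!(
        ensure_for_migration(Some(target), target),
        EnsureOutcome::AlreadyMigrating
    ));
}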

nexus/src/sagas.rs (outdated, resolved)
Collaborator

@smklein smklein left a comment

Looks good, modulo one last minor idempotency thing

nexus/src/nexus.rs (outdated, resolved)
nexus/src/nexus.rs (outdated, resolved)
nexus/src/nexus.rs (resolved)
sled-agent/src/common/instance.rs (outdated, resolved)
Collaborator

@smklein smklein left a comment

Thanks for considering all these cases!

@luqmana luqmana merged commit 4d12a4d into main Jan 25, 2022
@luqmana luqmana deleted the live-migrate branch January 25, 2022 14:42