Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

server: return better HTTP errors when not ensured #649

Merged
merged 4 commits into from
Apr 10, 2024

Conversation

hawkw
Copy link
Member

@hawkw hawkw commented Feb 21, 2024

Currently, if the propolis-server APIs that require a running instance are called when the server has not received an InstanceEnsure request, we return a 404. Sled-agent and Nexus handle this status code by assuming they have done something wrong, such as attempting to call an API route that the server doesn't support. This results in issues where sled-agent assumes that it's the one that made a mistake when it tries to call APIs on a propolis-server that has lost its VM instance due to being restarted after a panic.

To make it easier for callers distinguish between errors that indicate "this server doesn't know about a running instance" and errors that indicate "you attempted to call an API route that I don't know about", I've changed the status code returned by propolis-servers that haven't created an instance to HTTP 424 Failed Dependency. The best thing about returning this status code is that it's "not 404", but it does arguably have the most correct semantics for this specific error:

The HTTP 424 Failed Dependency client error response code indicates
that the method could not be performed on the resource because the
requested action depended on another action, and that action failed.

On the other hand, this is kind of a weird status code that's mostly used by WebDAV servers, so I'm open to using something else. We could alternatively return 410 Gone in these cases, but that status is already returned by servers whose VM instance has been deleted. I thought it might be nice to be able to distinguish between "not ensured" and "ensured, and then deleted"...but we could switch to 410 Gone here, too.

@hawkw hawkw requested a review from gjcolombo February 21, 2024 19:34
@@ -1065,6 +1052,16 @@ pub fn api() -> ApiDescription<Arc<DropshotEndpointContext>> {
api
}

const NO_INSTANCE: &str = "NO_INSTANCE";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we intend for sled agent to start looking at the error_code field of the HTTP errors it gets back from Propolis calls? I'm all in favor of doing this (even if the conventions of them require them to be--ugh--strings and not, y'know, numbers), but if that's the intent I think this definition should live in a separate crate that can be an ancestor to both propolis-server and propolis-client (which can re-export the definition to someplace where sled agent can get it). propolis_types seems like about the right place if we want to go that way.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm working on addressing this suggestion before we merge this PR, thanks!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed this in fca00f9, although AFAICT there's not a nice way to get the dropshot client to re-hydrate them to the Rust enum type yet (see oxidecomputer/dropshot#644 and oxidecomputer/dropshot#41, etc)...

fn not_created_error() -> HttpError {
HttpError::for_client_error(
Some(NO_INSTANCE.to_string()),
http::StatusCode::FAILED_DEPENDENCY,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Soapbox rant, not necessarily a problem with the change: I get what we're going for here, but I've had bad luck in the past with using slightly off-meaning status codes in this way--they have a tendency, IME, to get bubbled up to a user somehow, and then the user reads how the status was originally described in whatever document or header defined it, and then folks get confused about what's really happening. In this case, RFC 4918 defines 424 Failed Dependency to mean that

The 424 (Failed Dependency) status code means that the method could
not be performed on the resource because the requested action
depended on another action and that action failed. For example, if a
command in a PROPPATCH method fails, then, at minimum, the rest of
the commands will also fail with 424 (Failed Dependency).

That's definitely close, but in the scenario that led us here (Propolis panic) no earlier requested actions actually failed--the problem is that the server reset.

Anyway, this is a soapbox rant; I don't think 424's semantics are far enough off from what we intend for this to be a blocker. But if we are leaning toward using descriptive strings in the error_code field (at least for internal control plane communication) I might be inclined to lean on those and stick to the HTTP error codes in RFC 9110.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's definitely close, but in the scenario that led us here (Propolis panic) no earlier requested actions actually failed--the problem is that the server reset.

My rationale was that in the context of this particular propolis-server process, we depended on an earlier action (InstanceEnsure) which has not been performed at all. So, I guess indicating that it failed is not quite correct, but, hmm...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We discussed this offline and decided to stick with 424 Failed Dependency for now, to distinguish this case from the 404 Not Found you get from requesting a nonexistent route and the 410 Gone you get from trying to query a VM's state after it has already shut down and been destroyed.

Currently, if the `propolis-server` APIs that require a running instance
are called when the server has not received an `InstanceEnsure` request,
we return a 404. Sled-agent and Nexus handle this status code by
assuming *they* have done something wrong, such as attempting to call an
API route that the server doesn't support. This results in issues where
sled-agent assumes that it's the one that made a mistake when it tries
to call APIs on a `propolis-server` that has lost its VM instance due to
being restarted after a panic.

To make it easier for callers distinguish between errors that indicate
"this server doesn't know about a running instance" and errors that
indicate "you attempted to call an API route that I don't know about",
I've changed the status code returned by `propolis-server`s that haven't
created an instance to HTTP [424 Failed Dependency]. The best thing
about returning this status code is that it's "not 404", but it does
arguably have the most correct semantics for this specific error:

> The HTTP 424 Failed Dependency client error response code indicates
> that the method could not be performed on the resource because the
> requested action depended on another action, and that action failed.

On the other hand, this is kind of a weird status code that's mostly
used by WebDAV servers, so I'm open to using something else. We could
alternatively return 410 Gone in these cases, but that status is already
returned by servers whose VM instance has been deleted. I thought it
might be nice to be able to distinguish between "not ensured" and
"ensured, and then deleted"...but we could switch to 410 Gone here, too.

[424 Failed Dependency]: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/424
@hawkw hawkw force-pushed the eliza/dude-wheres-my-instance branch from fca00f9 to 3ad59e9 Compare April 10, 2024 17:46
@hawkw hawkw merged commit 14b0892 into master Apr 10, 2024
10 checks passed
@hawkw hawkw deleted the eliza/dude-wheres-my-instance branch April 10, 2024 18:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants