server: return better HTTP errors when not ensured #649

hawkw · 2024-02-21T19:34:13Z

Currently, if the propolis-server APIs that require a running instance are called when the server has not received an InstanceEnsure request, we return a 404. Sled-agent and Nexus handle this status code by assuming they have done something wrong, such as attempting to call an API route that the server doesn't support. This results in issues where sled-agent assumes that it's the one that made a mistake when it tries to call APIs on a propolis-server that has lost its VM instance due to being restarted after a panic.

To make it easier for callers distinguish between errors that indicate "this server doesn't know about a running instance" and errors that indicate "you attempted to call an API route that I don't know about", I've changed the status code returned by propolis-servers that haven't created an instance to HTTP 424 Failed Dependency. The best thing about returning this status code is that it's "not 404", but it does arguably have the most correct semantics for this specific error:

The HTTP 424 Failed Dependency client error response code indicates
that the method could not be performed on the resource because the
requested action depended on another action, and that action failed.

On the other hand, this is kind of a weird status code that's mostly used by WebDAV servers, so I'm open to using something else. We could alternatively return 410 Gone in these cases, but that status is already returned by servers whose VM instance has been deleted. I thought it might be nice to be able to distinguish between "not ensured" and "ensured, and then deleted"...but we could switch to 410 Gone here, too.

gjcolombo · 2024-02-22T20:44:36Z

bin/propolis-server/src/lib/server.rs

@@ -1065,6 +1052,16 @@ pub fn api() -> ApiDescription<Arc<DropshotEndpointContext>> {
    api
 }

+const NO_INSTANCE: &str = "NO_INSTANCE";


Do we intend for sled agent to start looking at the error_code field of the HTTP errors it gets back from Propolis calls? I'm all in favor of doing this (even if the conventions of them require them to be--ugh--strings and not, y'know, numbers), but if that's the intent I think this definition should live in a separate crate that can be an ancestor to both propolis-server and propolis-client (which can re-export the definition to someplace where sled agent can get it). propolis_types seems like about the right place if we want to go that way.

I'm working on addressing this suggestion before we merge this PR, thanks!

Addressed this in fca00f9, although AFAICT there's not a nice way to get the dropshot client to re-hydrate them to the Rust enum type yet (see oxidecomputer/dropshot#644 and oxidecomputer/dropshot#41, etc)...

gjcolombo · 2024-02-22T21:34:55Z

bin/propolis-server/src/lib/server.rs

+fn not_created_error() -> HttpError {
+    HttpError::for_client_error(
+        Some(NO_INSTANCE.to_string()),
+        http::StatusCode::FAILED_DEPENDENCY,


Soapbox rant, not necessarily a problem with the change: I get what we're going for here, but I've had bad luck in the past with using slightly off-meaning status codes in this way--they have a tendency, IME, to get bubbled up to a user somehow, and then the user reads how the status was originally described in whatever document or header defined it, and then folks get confused about what's really happening. In this case, RFC 4918 defines 424 Failed Dependency to mean that

The 424 (Failed Dependency) status code means that the method could
not be performed on the resource because the requested action
depended on another action and that action failed. For example, if a
command in a PROPPATCH method fails, then, at minimum, the rest of
the commands will also fail with 424 (Failed Dependency).

That's definitely close, but in the scenario that led us here (Propolis panic) no earlier requested actions actually failed--the problem is that the server reset.

Anyway, this is a soapbox rant; I don't think 424's semantics are far enough off from what we intend for this to be a blocker. But if we are leaning toward using descriptive strings in the error_code field (at least for internal control plane communication) I might be inclined to lean on those and stick to the HTTP error codes in RFC 9110.

That's definitely close, but in the scenario that led us here (Propolis panic) no earlier requested actions actually failed--the problem is that the server reset.

My rationale was that in the context of this particular propolis-server process, we depended on an earlier action (InstanceEnsure) which has not been performed at all. So, I guess indicating that it failed is not quite correct, but, hmm...

We discussed this offline and decided to stick with 424 Failed Dependency for now, to distinguish this case from the 404 Not Found you get from requesting a nonexistent route and the 410 Gone you get from trying to query a VM's state after it has already shut down and been destroyed.

Currently, if the `propolis-server` APIs that require a running instance are called when the server has not received an `InstanceEnsure` request, we return a 404. Sled-agent and Nexus handle this status code by assuming *they* have done something wrong, such as attempting to call an API route that the server doesn't support. This results in issues where sled-agent assumes that it's the one that made a mistake when it tries to call APIs on a `propolis-server` that has lost its VM instance due to being restarted after a panic. To make it easier for callers distinguish between errors that indicate "this server doesn't know about a running instance" and errors that indicate "you attempted to call an API route that I don't know about", I've changed the status code returned by `propolis-server`s that haven't created an instance to HTTP [424 Failed Dependency]. The best thing about returning this status code is that it's "not 404", but it does arguably have the most correct semantics for this specific error: > The HTTP 424 Failed Dependency client error response code indicates > that the method could not be performed on the resource because the > requested action depended on another action, and that action failed. On the other hand, this is kind of a weird status code that's mostly used by WebDAV servers, so I'm open to using something else. We could alternatively return 410 Gone in these cases, but that status is already returned by servers whose VM instance has been deleted. I thought it might be nice to be able to distinguish between "not ensured" and "ensured, and then deleted"...but we could switch to 410 Gone here, too. [424 Failed Dependency]: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/424

hawkw requested a review from gjcolombo February 21, 2024 19:34

gjcolombo reviewed Feb 22, 2024

View reviewed changes

gjcolombo approved these changes Feb 22, 2024

View reviewed changes

smklein approved these changes Apr 10, 2024

View reviewed changes

hawkw added 4 commits April 10, 2024 10:46

more descriptive names

6a97df7

update expected statuses in test

686d48b

move error codes to an enum propolis-api-types

3ad59e9

hawkw force-pushed the eliza/dude-wheres-my-instance branch from fca00f9 to 3ad59e9 Compare April 10, 2024 17:46

hawkw merged commit 14b0892 into master Apr 10, 2024
10 checks passed

hawkw deleted the eliza/dude-wheres-my-instance branch April 10, 2024 18:18

hawkw mentioned this pull request Oct 1, 2024

[sled-agent] Unbreak handling of Propolis error codes oxidecomputer/omicron#6726

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

server: return better HTTP errors when not ensured #649

server: return better HTTP errors when not ensured #649

hawkw commented Feb 21, 2024

gjcolombo Feb 22, 2024

hawkw Mar 7, 2024

hawkw Apr 10, 2024

gjcolombo Feb 22, 2024

hawkw Feb 22, 2024

gjcolombo Feb 22, 2024

server: return better HTTP errors when not ensured #649

server: return better HTTP errors when not ensured #649

Conversation

hawkw commented Feb 21, 2024

gjcolombo Feb 22, 2024

Choose a reason for hiding this comment

hawkw Mar 7, 2024

Choose a reason for hiding this comment

hawkw Apr 10, 2024

Choose a reason for hiding this comment

gjcolombo Feb 22, 2024

Choose a reason for hiding this comment

hawkw Feb 22, 2024

Choose a reason for hiding this comment

gjcolombo Feb 22, 2024

Choose a reason for hiding this comment