-
Notifications
You must be signed in to change notification settings - Fork 22
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
server: return better HTTP errors when not ensured #649
Conversation
@@ -1065,6 +1052,16 @@ pub fn api() -> ApiDescription<Arc<DropshotEndpointContext>> { | |||
api | |||
} | |||
|
|||
const NO_INSTANCE: &str = "NO_INSTANCE"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we intend for sled agent to start looking at the error_code
field of the HTTP errors it gets back from Propolis calls? I'm all in favor of doing this (even if the conventions of them require them to be--ugh--strings and not, y'know, numbers), but if that's the intent I think this definition should live in a separate crate that can be an ancestor to both propolis-server
and propolis-client
(which can re-export the definition to someplace where sled agent can get it). propolis_types
seems like about the right place if we want to go that way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm working on addressing this suggestion before we merge this PR, thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed this in fca00f9, although AFAICT there's not a nice way to get the dropshot client to re-hydrate them to the Rust enum type yet (see oxidecomputer/dropshot#644 and oxidecomputer/dropshot#41, etc)...
fn not_created_error() -> HttpError { | ||
HttpError::for_client_error( | ||
Some(NO_INSTANCE.to_string()), | ||
http::StatusCode::FAILED_DEPENDENCY, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Soapbox rant, not necessarily a problem with the change: I get what we're going for here, but I've had bad luck in the past with using slightly off-meaning status codes in this way--they have a tendency, IME, to get bubbled up to a user somehow, and then the user reads how the status was originally described in whatever document or header defined it, and then folks get confused about what's really happening. In this case, RFC 4918 defines 424 Failed Dependency to mean that
The 424 (Failed Dependency) status code means that the method could
not be performed on the resource because the requested action
depended on another action and that action failed. For example, if a
command in a PROPPATCH method fails, then, at minimum, the rest of
the commands will also fail with 424 (Failed Dependency).
That's definitely close, but in the scenario that led us here (Propolis panic) no earlier requested actions actually failed--the problem is that the server reset.
Anyway, this is a soapbox rant; I don't think 424's semantics are far enough off from what we intend for this to be a blocker. But if we are leaning toward using descriptive strings in the error_code
field (at least for internal control plane communication) I might be inclined to lean on those and stick to the HTTP error codes in RFC 9110.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's definitely close, but in the scenario that led us here (Propolis panic) no earlier requested actions actually failed--the problem is that the server reset.
My rationale was that in the context of this particular propolis-server
process, we depended on an earlier action (InstanceEnsure
) which has not been performed at all. So, I guess indicating that it failed is not quite correct, but, hmm...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We discussed this offline and decided to stick with 424 Failed Dependency for now, to distinguish this case from the 404 Not Found you get from requesting a nonexistent route and the 410 Gone you get from trying to query a VM's state after it has already shut down and been destroyed.
Currently, if the `propolis-server` APIs that require a running instance are called when the server has not received an `InstanceEnsure` request, we return a 404. Sled-agent and Nexus handle this status code by assuming *they* have done something wrong, such as attempting to call an API route that the server doesn't support. This results in issues where sled-agent assumes that it's the one that made a mistake when it tries to call APIs on a `propolis-server` that has lost its VM instance due to being restarted after a panic. To make it easier for callers distinguish between errors that indicate "this server doesn't know about a running instance" and errors that indicate "you attempted to call an API route that I don't know about", I've changed the status code returned by `propolis-server`s that haven't created an instance to HTTP [424 Failed Dependency]. The best thing about returning this status code is that it's "not 404", but it does arguably have the most correct semantics for this specific error: > The HTTP 424 Failed Dependency client error response code indicates > that the method could not be performed on the resource because the > requested action depended on another action, and that action failed. On the other hand, this is kind of a weird status code that's mostly used by WebDAV servers, so I'm open to using something else. We could alternatively return 410 Gone in these cases, but that status is already returned by servers whose VM instance has been deleted. I thought it might be nice to be able to distinguish between "not ensured" and "ensured, and then deleted"...but we could switch to 410 Gone here, too. [424 Failed Dependency]: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/424
fca00f9
to
3ad59e9
Compare
Currently, if the
propolis-server
APIs that require a running instance are called when the server has not received anInstanceEnsure
request, we return a 404. Sled-agent and Nexus handle this status code by assuming they have done something wrong, such as attempting to call an API route that the server doesn't support. This results in issues where sled-agent assumes that it's the one that made a mistake when it tries to call APIs on apropolis-server
that has lost its VM instance due to being restarted after a panic.To make it easier for callers distinguish between errors that indicate "this server doesn't know about a running instance" and errors that indicate "you attempted to call an API route that I don't know about", I've changed the status code returned by
propolis-server
s that haven't created an instance to HTTP 424 Failed Dependency. The best thing about returning this status code is that it's "not 404", but it does arguably have the most correct semantics for this specific error:On the other hand, this is kind of a weird status code that's mostly used by WebDAV servers, so I'm open to using something else. We could alternatively return 410 Gone in these cases, but that status is already returned by servers whose VM instance has been deleted. I thought it might be nice to be able to distinguish between "not ensured" and "ensured, and then deleted"...but we could switch to 410 Gone here, too.