Restrict status codes raised by the library #595

emcfarlane · 2023-09-20T19:33:57Z

Connect-go raises error codes that fall outside what the gRPC spec recommends for a framework: https://github.com/grpc/grpc/blob/master/doc/statuscodes.md

The following status codes are never generated by the library:

INVALID_ARGUMENT

NOT_FOUND

ALREADY_EXISTS

FAILED_PRECONDITION

ABORTED

OUT_OF_RANGE

DATA_LOSS

INVALID_ARGUMENT and FAILED_PRECONDITION from a quick grep. I think most of these cases should be changed to Internal errors.

The text was updated successfully, but these errors were encountered:

akshayjshah · 2023-09-25T04:50:39Z

Connect returns INVALID_ARGUMENT in some cases. FAILED_PRECONDITION is only used in a test.

I'm very reluctantly okay with changing our use of INVALID_ARGUMENT to INTERNAL and UNAVAILABLE, matching gRPC. It does seem deeply odd to raise the equivalent of a 5xx when the client sends the server garbage data, though. (And this makes the Connect protocol return 404s if the client uses a compression mechanism that the server doesn't support, which is also quite odd.)

emcfarlane · 2023-10-03T16:25:27Z

Clarifying the server errors codes should help users debug errors. Any INVALID_ARGUMENT will come from user code, ie. an invalid field. Mangled payloads I think should be server errors as this is a bug in the library or some other internal issue.

akshayjshah · 2023-10-03T21:24:04Z

Mangled payloads can just as easily come from the client's RPC runtime. I've never seen a REST server treat a mangled JSON request payload as a 5xx - it seems pretty standard to me to assume that mangled payloads are the client's fault (or some middlebox's fault).

timostamm · 2023-10-04T13:32:53Z

I've never seen a REST server treat a mangled JSON request payload as a 5xx

Anything but a 400 is unexpected in my book. The gRPC protocol does not use HTTP status codes and is rarely used with JSON. Are we sure we want to forego semantic HTTP status codes for the Connect implementation because of gRPC?

It would be really nice if there was a set of error codes reserved for users. It would help with debugging and it would let a user-facing application decide which error messages to show to the user verbatim. I don't think that restricting codes raised by the library gets us there though: It leaves just 7 codes that are guaranteed to originate from the server application, and it doesn't give any guarantees about the inverse (since a server application can raise any code).

So I wonder if what gRPC does is actually very helpful in practice.

emcfarlane · 2023-10-04T17:54:39Z

Ideally it would be a 400 HTTP status but not INVALID_ARGUMENT error code as I do think that implies a user issue not a marshalling error.

If both client and server generate the request a 500 seems reasonable to me as this is a issue with the client/server lib, but yep this isn't RESTful.

akshayjshah · 2023-10-12T18:08:35Z

@jhump and @mattrobenolt Curious about your thoughts on this one.

mattrobenolt · 2023-10-12T18:15:48Z

To be explicit, if this is referring specifically to the codes we spit back for the gRPC protocol, and not the connect protocol, I think we should mirror the behavior of grpc-go and what the spec says.

I haven't gone through and audited each use of them and compared to what the spec says, but I think it'd make sense to try and be as compatible to expectations.

Granted, the only real downside I see is this might be considered an API breaking change? But I don't feel that strongly about it.

As a quick run through our/my uses, we don't do any checks that particularly compare the status codes. Anything not-OK is usually treated the same, and typically it's either pass/fail, with the code and status as just effectively metadata.

jhump · 2023-10-13T12:09:59Z

I agree with Timo, that anything other than a 400 is bad for the Connect protocol.

But, TBH, we should perhaps make an appeal to the gRPC team to change the spec, because using "internal" for gRPC error codes is also bad for the same reason: it makes observability and alerting a nightmare because you can no longer correctly distinguish between "errors caused by a bad client" and "errors caused by my server".

I've never worked at a place where dev ops didn't distinguish these cases for alerting and exception-reporting config. So having a client-induced error reported as "internal" is bad. With HTTP/REST, you'd generally distinguish by classifying status codes as 4xx vs 5xx. With gRPC, the best you can do is to statically classify each gRPC error code as either "client" or "server", and this breaks down when a code like "internal" can mean either.

emcfarlane · 2023-10-26T18:34:37Z

Interestingly gRPC clients that receive a 400 response from a service map to INTERNAL as per: https://github.com/grpc/grpc/blob/master/doc/http-grpc-status-mapping.md
I'd argue UNKNOWN codes are more severe then INTERNAL in most cases.

OTEL recently adjusted the gRPC status codes to map closer to HTTP semantics: open-telemetry/opentelemetry-specification#3333
From the PR:

on invalid grpc_timeout returns 400 and a codes.Internal error: https://github.com/grpc/grpc-go/blob/a02aae6168aa683a1f106e285ec10eb6bc1f6ded/internal/transport/handler_server.go#L90-L92

emcfarlane · 2024-08-07T15:02:24Z

This is fixed with the conformance tests and the spec update for error codes.

emcfarlane added the bug Something isn't working label Sep 20, 2023

emcfarlane closed this as completed Aug 7, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Restrict status codes raised by the library #595

Restrict status codes raised by the library #595

emcfarlane commented Sep 20, 2023

akshayjshah commented Sep 25, 2023

emcfarlane commented Oct 3, 2023

akshayjshah commented Oct 3, 2023 •

edited

Loading

timostamm commented Oct 4, 2023

emcfarlane commented Oct 4, 2023

akshayjshah commented Oct 12, 2023

mattrobenolt commented Oct 12, 2023

jhump commented Oct 13, 2023

emcfarlane commented Oct 26, 2023

emcfarlane commented Aug 7, 2024

Restrict status codes raised by the library #595

Restrict status codes raised by the library #595

Comments

emcfarlane commented Sep 20, 2023

akshayjshah commented Sep 25, 2023

emcfarlane commented Oct 3, 2023

akshayjshah commented Oct 3, 2023 • edited Loading

timostamm commented Oct 4, 2023

emcfarlane commented Oct 4, 2023

akshayjshah commented Oct 12, 2023

mattrobenolt commented Oct 12, 2023

jhump commented Oct 13, 2023

emcfarlane commented Oct 26, 2023

emcfarlane commented Aug 7, 2024

akshayjshah commented Oct 3, 2023 •

edited

Loading