distsqlpb: whitelist node unavailability errors #37367
Conversation
@RaduBerinde this is the change I was telling you about. I need a way to ensure that new distsql execution nodes (with this patch in) are not being used by old gateways (without this patch) so we don't get a crash upon non-decodable errors. What's the right combination of expected versions that will give me that guarantee?
Sounds like you need to bump both distsqlrun.Version and MinAcceptedVersion to 23. Is this something you want to backport to 19.1.x? (bumping the version could be too disruptive for a point release)
I wasn't directly considering a backport, but if we don't we'll want to decide what to do about #37215 in that version. |
@andreimatei do you have further thoughts? |
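For reference, a minimal sketch of the version bump described above. The package name and exact declarations are assumptions for illustration, not a quote of the actual cockroachdb source:

```go
// Sketch only; package name and constant homes are assumptions.
package distsqlrun

// Version is the DistSQL protocol version advertised by this binary.
// MinAcceptedVersion is the oldest version a gateway may advertise and still
// have its flows accepted. Bumping both to 23 would guarantee that old
// gateways (which advertise a lower version) never schedule flows on nodes
// carrying the new error payload.
const (
	Version            = 23
	MinAcceptedVersion = 23
)
```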
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @knz)
pkg/sql/distsqlpb/data.proto, line 43 at r2 (raw file):
pgerror.Error pg_error = 1 [(gogoproto.customname) = "PGError"];
roachpb.UnhandledRetryableError retryableTxnError = 2;
roachpb.NodeUnavailableError nodeUnavailableError = 3;
rather than starting to add random errors here, wouldn't it be better to turn NodeUnavailableErr into a pgerror? Even one with the internal error code. Nobody in DistSQL needs to recognize this error, I believe.
Prior to this patch, a distsql gateway would crash if it received an error payload of a type it didn't know about. This is unfair to the user, as an error (regardless of payload) is just an error. This patch removes the panic and produces a valid error (with a Sentry report, so we can investigate further). Release note: None
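As an illustration of the behavior that commit message describes, a hedged sketch of the decode path. The function name is invented here, Error and Error_PGError stand in for the generated protobuf types, and the pgerror helper is assumed from how it is referenced later in this thread:

```go
// Sketch only: names are assumptions, not copied from the repository.
func errorFromDetail(e *Error) error {
	switch d := e.GetDetail().(type) {
	case *Error_PGError:
		return d.PGError
	default:
		// Instead of panicking on an unrecognized payload, return a valid
		// error. Using an assertion error also files a Sentry report so the
		// unknown payload can be investigated.
		return pgerror.NewAssertionErrorf("unknown error detail: %+v", d)
	}
}
```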
Thanks Andrei for the simplification. This way there's no need for client/server restrictions. RFAL.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei)
pkg/sql/distsqlpb/data.proto, line 43 at r2 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
rather than starting to add random errors here, wouldn't it be better to turn NodeUnavailableErr into a pgerror? Even one with the internal error code. Nobody in DistSQL needs to recognize this error, I believe.
"internal error" is exactly what NewAssertionError (the current code) does and that triggers a sentry report. The issue #37215 was filed because we don't want a sentry report every time a node is down.
However, I do like the idea of using a pgerror here, just because it doesn't raise version-compatibility questions. Thanks for the hint.
but consider the comment about using a different code
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @andreimatei and @knz)
pkg/sql/distsqlpb/data.go, line 160 at r4 (raw file):
return &Error{
  Detail: &Error_PGError{
    PGError: pgerror.Newf(pgerror.CodeConnectionExceptionError, "%v", e),
This error code doesn't seem right to me. I can't find any docs on it, but the other codes in its class refer to errors about a client connection, not something internal to the cluster.
I can't see another code that would apply, which is not surprising given that Postgres doesn't have such issues. I'd introduce another code akin to the one that we've already introduced for something similar (CodeRangeUnavailable), or just use that one.
A node being down during distsql query processing is a legitimate (and expected) error. It need not be reported to telemetry. Release note: None
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @andreimatei)
pkg/sql/distsqlpb/data.go, line 160 at r4 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
This error code doesn't seem right to me. I can't find any docs on it, but the other codes in its class refer to errors about a client connection, not something internal to the cluster.
I can't see another code that would apply, which is not surprising given that Postgres doesn't have such issues. I'd introduce another code akin to the one that we've already introduced for something similar (CodeRangeUnavailable), or just use that one.
Good idea. Done.
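For context, a hedged sketch of the resulting shape after that change. Whether CodeRangeUnavailable itself or a newly introduced sibling code was used, the structure is the same; the wrapping function and surrounding code are assumptions based on the snippet quoted above:

```go
// Sketch only: construction mirrors the snippet quoted in this thread.
func newNodeUnavailableError(e error) *Error {
	return &Error{
		Detail: &Error_PGError{
			// A dedicated "unavailable" code rather than an internal/assertion
			// code: the condition is expected and must not trigger a Sentry
			// report or telemetry.
			PGError: pgerror.Newf(pgerror.CodeRangeUnavailable, "%v", e),
		},
	}
}
```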
TFYR! bors r+
Build failed
Build timeout. Retrying: bors r+
Build succeeded |
Prior to cockroachdb#37367, a node unavailable error was reported in distsql as a pgerror with code "internal" (assertion). Change cockroachdb#37367 changes this to report node unavailability using a different code. Meanwhile, the schema change logic wants to be able to retry if a schema change appears to fail due to a node going down. Since this is not exercised in CI (only in a nightly test), cockroachdb#37367 forgot about that. This commit completes the fix. (Note that this dance with error codes is a band-aid; a more robust fix is upcoming in cockroachdb#37765 and following.) Release note: None
37800: sql: fix schema change auto-retry upon node failures r=knz a=knz Prior to #37367, a node unavailable error was reported in distsql as a pgerror with code "internal" (assertion). Change #37367 changes this to report node unavailability using a different code. Meanwhile, the schema change logic wants to be able to retry if a schema change appears to fail due to a node going down. Since this is not exercised in CI (only in a nightly test), #37367 forgot about that. This commit completes the fix. (Note that this dance with error codes is a band-aid; a more robust fix is upcoming in #37765 and following.) Release note: None Co-authored-by: Raphael 'kena' Poss <knz@cockroachlabs.com>
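A hedged sketch of the kind of classification that follow-up needs. The function and helper names are assumptions, not copied from pkg/sql/schema_changer.go:

```go
// Sketch only: names are assumptions for illustration.
func isPermanentSchemaChangeError(err error) bool {
	if pgErr, ok := pgerror.GetPGCause(err); ok {
		// Node unavailability now arrives with a dedicated pgerror code
		// instead of the "internal" (assertion) code, so the retry check
		// must match on that code to keep auto-retrying the schema change.
		if pgErr.Code == pgerror.CodeRangeUnavailable {
			return false
		}
	}
	return true
}
```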
Fixes #37215.
A node being down during distsql query processing is a legitimate (and expected) error. It need not be reported to telemetry.