
distsqlpb: whitelist node unavailability errors #37367

Merged
merged 2 commits into cockroachdb:master from 20190507-distsql-err on May 24, 2019

Conversation

knz
Contributor

@knz knz commented May 7, 2019

Fixes #37215.

A node being down during distsql query processing is a legitimate (and
expected) error. It need not be reported to telemetry.

@knz knz requested a review from a team May 7, 2019 21:38
@cockroach-teamcity
Member

This change is Reviewable

@knz
Contributor Author

knz commented May 7, 2019

@RaduBerinde this is the change I was telling you about. I need a way to ensure that new distsql execution nodes (with this patch in) are not being used by old gateways (without this patch) so we don't get a crash upon non-decodable errors. What's the right combination of expected versions that will give me that guarantee?

@RaduBerinde
Member

Sounds like you need to bump both distsqlrun.Version and MinAcceptedVersion to 23.

Is this something you want to backport to 19.1.x? (bumping the version could be too disruptive for a point release)
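The version gate Radu describes can be sketched as follows. This is a simplified stand-in, not the real `distsqlrun` handshake: `accepts` and the constant values are hypothetical, assuming the server rejects any flow request whose version falls outside the inclusive range [MinAcceptedVersion, Version].

```go
package main

import "fmt"

// Hypothetical stand-ins for distsqlrun.Version and
// distsqlrun.MinAcceptedVersion; the real constants live in
// pkg/sql/distsqlrun and gate flow setup between nodes.
const (
	serverVersion      = 23 // version this node speaks
	minAcceptedVersion = 23 // oldest gateway version this node accepts
)

// accepts reports whether a server with the constants above would
// accept a flow request from a gateway running gatewayVersion.
// Bumping MinAcceptedVersion to 23 is what forces old gateways
// (which cannot decode the new error payload) to be rejected.
func accepts(gatewayVersion int) bool {
	return gatewayVersion >= minAcceptedVersion && gatewayVersion <= serverVersion
}

func main() {
	fmt.Println(accepts(22)) // old gateway: rejected
	fmt.Println(accepts(23)) // patched gateway: accepted
}
```

This is also why bumping both constants is disruptive for a point release: it cuts off every unpatched gateway during a rolling upgrade.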

@knz
Contributor Author

knz commented May 7, 2019

I wasn't directly considering a backport, but if we don't, we'll want to decide what to do about #37215 in that version.

@knz
Contributor Author

knz commented May 7, 2019

@andreimatei do you have further thoughts?

Contributor

@andreimatei andreimatei left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @knz)


pkg/sql/distsqlpb/data.proto, line 43 at r2 (raw file):

    pgerror.Error pg_error = 1 [(gogoproto.customname) = "PGError"];
    roachpb.UnhandledRetryableError retryableTxnError = 2;
    roachpb.NodeUnavailableError nodeUnavailableError = 3;

Rather than starting to add random errors here, wouldn't it be better to turn NodeUnavailableErr into a pgerror? Even one with the internal error code. Nobody in DistSQL needs to recognize this error, I believe.

Prior to this patch, a distsql gateway would crash if it received an
error payload of a type it didn't know about. This is unfair to the
user, as an error (regardless of payload) is just an error.

This patch removes the panic and produces a valid error (with a Sentry
report, so we can investigate further).

Release note: None
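The panic-removal described in the commit message above can be illustrated with a minimal sketch. The type and function names here are illustrative, not the real proto-generated `distsqlpb` ones; the point is the fallback branch that degrades gracefully instead of crashing the gateway.

```go
package main

import (
	"errors"
	"fmt"
)

// payload mimics the distsqlpb.Error detail union; names here are
// illustrative, not the real proto-generated types.
type payload struct {
	kind string // e.g. "pg_error", or a type this binary doesn't know
	msg  string
}

// errorFromPayload converts a wire payload into a Go error. Before the
// patch, an unknown payload type caused a panic on the gateway; the fix
// is to fall back to a generic (still valid) error instead, filing an
// internal report (elided here) so the case can be investigated.
func errorFromPayload(p payload) error {
	switch p.kind {
	case "pg_error":
		return errors.New(p.msg)
	default:
		// Unknown payload: report internally and degrade gracefully
		// rather than crashing the gateway.
		return fmt.Errorf("unknown error payload type %q: %s", p.kind, p.msg)
	}
}

func main() {
	fmt.Println(errorFromPayload(payload{kind: "mystery", msg: "node n3 is down"}))
}
```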
Contributor Author

@knz knz left a comment

Thanks Andrei for the simplification. This way there's no need for client/server version restrictions. RFAL.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei)


pkg/sql/distsqlpb/data.proto, line 43 at r2 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

Rather than starting to add random errors here, wouldn't it be better to turn NodeUnavailableErr into a pgerror? Even one with the internal error code. Nobody in DistSQL needs to recognize this error, I believe.

"internal error" is exactly what NewAssertionError (the current code) does, and that triggers a Sentry report. Issue #37215 was filed because we don't want a Sentry report every time a node is down.

However, I do like the idea of using a pgerror here, just because it doesn't raise version compatibility questions. Thanks for the hint.

Contributor

@andreimatei andreimatei left a comment

:lgtm:

but consider the comment about using a different code

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @andreimatei and @knz)


pkg/sql/distsqlpb/data.go, line 160 at r4 (raw file):

		return &Error{
			Detail: &Error_PGError{
				PGError: pgerror.Newf(pgerror.CodeConnectionExceptionError, "%v", e),

This error code doesn't seem right to me. I can't find any docs on it, but the other codes in its class refer to errors about a client connection, not something internal to the cluster.
I can't see another code that would apply, which is not surprising given that Postgres doesn't have such issues. I'd introduce another code akin to the one that we've already introduced for something similar (CodeRangeUnavailable), or just use that one.
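The suggestion above can be sketched as follows. The SQLSTATE values and the `codeNodeUnavailable` constant are hypothetical stand-ins (the real constants live in CockroachDB's pgerror package); the sketch only shows why a dedicated code in a CockroachDB-specific slot beats reusing the 08xxx connection-exception class, which describes client-connection problems rather than internal cluster ones.

```go
package main

import "fmt"

// Illustrative SQLSTATE-style codes; the exact values are assumptions,
// not CockroachDB's real constants.
const (
	codeConnectionException = "08000" // wrong class: describes client connections
	codeRangeUnavailable    = "58C00" // existing CRDB-specific code for a similar case
	codeNodeUnavailable     = "58C01" // hypothetical dedicated sibling code
)

// pgError is a minimal stand-in for a pgerror-style error carrying a
// SQLSTATE code alongside its message.
type pgError struct {
	code string
	msg  string
}

func (e *pgError) Error() string { return e.code + ": " + e.msg }

// newNodeUnavailableError wraps a node outage as a pgError with a
// dedicated code, so gateways treat it as an ordinary (whitelisted)
// error and telemetry/Sentry reporting can recognize and skip it.
func newNodeUnavailableError(node int) *pgError {
	return &pgError{code: codeNodeUnavailable, msg: fmt.Sprintf("node %d unavailable", node)}
}

func main() {
	fmt.Println(newNodeUnavailableError(3))
}
```

A dedicated code also lets callers (such as the schema-change retry logic discussed later in this thread) match on the code string rather than on an error type that may not survive the wire.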

A node being down during distsql query processing is a legitimate (and
expected) error. It need not be reported to telemetry.

Release note: None
Contributor Author

@knz knz left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @andreimatei)


pkg/sql/distsqlpb/data.go, line 160 at r4 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

This error code doesn't seem right to me. I can't find any docs on it, but the other codes in its class refer to errors about a client connection, not something internal to the cluster.
I can't see another code that would apply, which is not surprising given that Postgres doesn't have such issues. I'd introduce another code akin to the one that we've already introduced for something similar (CodeRangeUnavailable), or just use that one.

Good idea. Done.

@knz
Contributor Author

knz commented May 24, 2019

TFYR!

bors r+

@craig
Contributor

craig bot commented May 24, 2019

Build failed

@knz
Contributor Author

knz commented May 24, 2019

Build timeout; retrying.

bors r+

craig bot pushed a commit that referenced this pull request May 24, 2019
37367: distsqlpb: whitelist node unavailability errors r=knz a=knz

Fixes #37215.

A node being down during distsql query processing is a legitimate (and
expected) error. It need not be reported to telemetry.

Co-authored-by: Raphael 'kena' Poss <knz@cockroachlabs.com>
@craig
Contributor

craig bot commented May 24, 2019

Build succeeded

@craig craig bot merged commit 2e606ed into cockroachdb:master May 24, 2019
@knz knz deleted the 20190507-distsql-err branch May 24, 2019 12:45
knz added a commit to knz/cockroach that referenced this pull request May 24, 2019
Prior to cockroachdb#37367, a node unavailable error was reported in distsql as
a pgerror with code "internal" (assertion). PR cockroachdb#37367 changed this
to report node unavailability using a different code.

Meanwhile, the schema change logic wants to be able to retry if a
schema change appears to fail due to a node going down. Since this is not
exercised in CI (only in nightly tests), cockroachdb#37367 overlooked that. This
commit completes the fix.

(Note that this dance with error codes is a band-aid; a more robust
fix is upcoming in cockroachdb#37765 and following.)

Release note: None
craig bot pushed a commit that referenced this pull request May 24, 2019
37800: sql: fix schema change auto-retry upon node failures r=knz a=knz

Prior to #37367, a node unavailable error was reported in distsql as
a pgerror with code "internal" (assertion). PR #37367 changed this
to report node unavailability using a different code.

Meanwhile, the schema change logic wants to be able to retry if a
schema change appears to fail due to a node going down. Since this is not
exercised in CI (only in nightly tests), #37367 overlooked that. This
commit completes the fix.

(Note that this dance with error codes is a band-aid; a more robust
fix is upcoming in #37765 and following.)

Release note: None

Co-authored-by: Raphael 'kena' Poss <knz@cockroachlabs.com>