
distsqlpb: whitelist node unavailability errors #37367

Merged
merged 2 commits into cockroachdb:master from 20190507-distsql-err on May 24, 2019

Conversation

knz
Contributor

@knz knz commented May 7, 2019

Fixes #37215.

A node being down during distsql query processing is a legitimate (and
expected) error. It need not be reported to telemetry.

@knz knz requested a review from a team May 7, 2019 21:38
@cockroach-teamcity
Member

This change is Reviewable

@knz
Contributor Author

knz commented May 7, 2019

@RaduBerinde this is the change I was telling you about. I need a way to ensure that new distsql execution nodes (with this patch in) are not being used by old gateways (without this patch) so we don't get a crash upon non-decodable errors. What's the right combination of expected versions that will give me that guarantee?

@RaduBerinde
Member

Sounds like you need to bump both distsqlrun.Version and MinAcceptedVersion to 23.

Is this something you want to backport to 19.1.x? (bumping the version could be too disruptive for a point release)
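The version gate Radu describes can be sketched as follows. This is a simplified stand-in, not the real `distsqlrun` handshake: `accepts` and the constant values are hypothetical, assuming the server rejects any flow request whose version falls outside the inclusive range [MinAcceptedVersion, Version].

```go
package main

import "fmt"

// Hypothetical stand-ins for distsqlrun.Version and
// distsqlrun.MinAcceptedVersion; the real constants live in
// pkg/sql/distsqlrun and gate flow setup between nodes.
const (
	serverVersion      = 23 // version this node speaks
	minAcceptedVersion = 23 // oldest gateway version this node accepts
)

// accepts reports whether a server with the constants above would
// accept a flow request from a gateway running gatewayVersion.
// Bumping MinAcceptedVersion to 23 is what forces old gateways
// (which cannot decode the new error payload) to be rejected.
func accepts(gatewayVersion int) bool {
	return gatewayVersion >= minAcceptedVersion && gatewayVersion <= serverVersion
}

func main() {
	fmt.Println(accepts(22)) // old gateway: rejected
	fmt.Println(accepts(23)) // patched gateway: accepted
}
```

This is also why bumping both constants is disruptive for a point release: it cuts off every unpatched gateway during a rolling upgrade.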

@knz
Contributor Author

knz commented May 7, 2019

I wasn't directly considering a backport, but if we don't, we'll want to decide what to do about #37215 in that version.

@knz
Contributor Author

knz commented May 7, 2019

@andreimatei do you have further thoughts?

Contributor

@andreimatei andreimatei left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @knz)


pkg/sql/distsqlpb/data.proto, line 43 at r2 (raw file):

    pgerror.Error pg_error = 1 [(gogoproto.customname) = "PGError"];
    roachpb.UnhandledRetryableError retryableTxnError = 2;
    roachpb.NodeUnavailableError nodeUnavailableError = 3;

Rather than starting to add random errors here, wouldn't it be better to turn NodeUnavailableErr into a pgerror? Even one with the internal error code. Nobody in DistSQL needs to recognize this error, I believe.

Prior to this patch, a distsql gateway would crash if it received an
error payload of a type it didn't know about. This is unfair to the
user, as an error (regardless of payload) is just an error.

This patch removes the panic and produces a valid error (with a Sentry
report, so we can investigate further).

Release note: None
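The panic-removal described in the commit message above can be illustrated with a minimal sketch. The type and function names here are illustrative, not the real proto-generated `distsqlpb` ones; the point is the fallback branch that degrades gracefully instead of crashing the gateway.

```go
package main

import (
	"errors"
	"fmt"
)

// payload mimics the distsqlpb.Error detail union; names here are
// illustrative, not the real proto-generated types.
type payload struct {
	kind string // e.g. "pg_error", or a type this binary doesn't know
	msg  string
}

// errorFromPayload converts a wire payload into a Go error. Before the
// patch, an unknown payload type caused a panic on the gateway; the fix
// is to fall back to a generic (still valid) error instead, filing an
// internal report (elided here) so the case can be investigated.
func errorFromPayload(p payload) error {
	switch p.kind {
	case "pg_error":
		return errors.New(p.msg)
	default:
		// Unknown payload: report internally and degrade gracefully
		// rather than crashing the gateway.
		return fmt.Errorf("unknown error payload type %q: %s", p.kind, p.msg)
	}
}

func main() {
	fmt.Println(errorFromPayload(payload{kind: "mystery", msg: "node n3 is down"}))
}
```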
Contributor Author

@knz knz left a comment

Thanks Andrei for the simplification. This way there's no need for client/server version restrictions. RFAL.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei)


pkg/sql/distsqlpb/data.proto, line 43 at r2 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

Rather than starting to add random errors here, wouldn't it be better to turn NodeUnavailableErr into a pgerror? Even one with the internal error code. Nobody in DistSQL needs to recognize this error, I believe.

"internal error" is exactly what NewAssertionError (the current code) does, and that triggers a Sentry report. Issue #37215 was filed because we don't want a Sentry report every time a node is down.

However, I do like the idea of using a pgerror here, just because it doesn't raise version compatibility questions. Thanks for the hint.

Contributor

@andreimatei andreimatei left a comment

:lgtm:

but consider the comment about using a different code

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @andreimatei and @knz)


pkg/sql/distsqlpb/data.go, line 160 at r4 (raw file):

		return &Error{
			Detail: &Error_PGError{
				PGError: pgerror.Newf(pgerror.CodeConnectionExceptionError, "%v", e),

This error code doesn't seem right to me. I can't find any docs on it, but the other codes in its class refer to errors about a client connection, not something internal to the cluster.
I can't see another code that would apply, which is not surprising given that Postgres doesn't have such issues. I'd introduce another code akin to the one that we've already introduced for something similar (CodeRangeUnavailable), or just use that one.
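The suggestion above can be sketched as follows. The SQLSTATE values and the `codeNodeUnavailable` constant are hypothetical stand-ins (the real constants live in CockroachDB's pgerror package); the sketch only shows why a dedicated code in a CockroachDB-specific slot beats reusing the 08xxx connection-exception class, which describes client-connection problems rather than internal cluster ones.

```go
package main

import "fmt"

// Illustrative SQLSTATE-style codes; the exact values are assumptions,
// not CockroachDB's real constants.
const (
	codeConnectionException = "08000" // wrong class: describes client connections
	codeRangeUnavailable    = "58C00" // existing CRDB-specific code for a similar case
	codeNodeUnavailable     = "58C01" // hypothetical dedicated sibling code
)

// pgError is a minimal stand-in for a pgerror-style error carrying a
// SQLSTATE code alongside its message.
type pgError struct {
	code string
	msg  string
}

func (e *pgError) Error() string { return e.code + ": " + e.msg }

// newNodeUnavailableError wraps a node outage as a pgError with a
// dedicated code, so gateways treat it as an ordinary (whitelisted)
// error and telemetry/Sentry reporting can recognize and skip it.
func newNodeUnavailableError(node int) *pgError {
	return &pgError{code: codeNodeUnavailable, msg: fmt.Sprintf("node %d unavailable", node)}
}

func main() {
	fmt.Println(newNodeUnavailableError(3))
}
```

A dedicated code also lets callers (such as the schema-change retry logic discussed later in this thread) match on the code string rather than on an error type that may not survive the wire.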

A node being down during distsql query processing is a legitimate (and
expected) error. It need not be reported to telemetry.

Release note: None
Contributor Author

@knz knz left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @andreimatei)


pkg/sql/distsqlpb/data.go, line 160 at r4 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

This error code doesn't seem right to me. I can't find any docs on it, but the other codes in its class refer to errors about a client connection, not something internal to the cluster.
I can't see another code that would apply, which is not surprising given that Postgres doesn't have such issues. I'd introduce another code akin to the one that we've already introduced for something similar (CodeRangeUnavailable), or just use that one.

Good idea. Done.

@knz
Contributor Author

knz commented May 24, 2019

TFYR!

bors r+

@craig
Contributor

craig bot commented May 24, 2019

Build failed

@knz
Contributor Author

knz commented May 24, 2019

Build timeout; retrying.

bors r+

craig bot pushed a commit that referenced this pull request May 24, 2019
37367: distsqlpb: whitelist node unavailability errors r=knz a=knz

Fixes #37215.

A node being down during distsql query processing is a legitimate (and
expected) error. It need not be reported to telemetry.

Co-authored-by: Raphael 'kena' Poss <knz@cockroachlabs.com>
@craig
Contributor

craig bot commented May 24, 2019

Build succeeded

@craig craig bot merged commit 2e606ed into cockroachdb:master May 24, 2019
@knz knz deleted the 20190507-distsql-err branch May 24, 2019 12:45
knz added a commit to knz/cockroach that referenced this pull request May 24, 2019
Prior to cockroachdb#37367, a node unavailable error was reported in distsql as
a pgerror with code "internal" (assertion). PR cockroachdb#37367 changed this
to report node unavailability using a different code.

Meanwhile, the schema change logic wants to be able to retry if a
schema change appears to fail due to a node going down. Since this is not
exercised in CI (only in nightly tests), cockroachdb#37367 overlooked that. This
commit completes the fix.

(Note that this dance with error codes is a band-aid; a more robust
fix is upcoming in cockroachdb#37765 and following.)

Release note: None
craig bot pushed a commit that referenced this pull request May 24, 2019
37800: sql: fix schema change auto-retry upon node failures r=knz a=knz

Prior to #37367, a node unavailable error was reported in distsql as
a pgerror with code "internal" (assertion). PR #37367 changed this
to report node unavailability using a different code.

Meanwhile, the schema change logic wants to be able to retry if a
schema change appears to fail due to a node going down. Since this is not
exercised in CI (only in nightly tests), #37367 overlooked that. This
commit completes the fix.

(Note that this dance with error codes is a band-aid; a more robust
fix is upcoming in #37765 and following.)

Release note: None

Co-authored-by: Raphael 'kena' Poss <knz@cockroachlabs.com>