sql: retry more distributed errors as local #108336

Merged 3 commits on Aug 10, 2023

Member

@yuzefovich yuzefovich commented Aug 7, 2023

This PR contains a couple of commits that extend the allow-list of errors that are retried locally. In particular, this lets us hide some issues we have around using DistSQL and shutting down SQL pods.

Fixes: #106537.
Fixes: #108152.
Fixes: #108271.

Release note: None

@yuzefovich yuzefovich requested review from a team as code owners August 7, 2023 22:29

Member Author

@yuzefovich yuzefovich left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @jeffswenson, @knz, and @mgartner)


pkg/sql/distsql_running.go line 107 at r1 (raw file):

}

func isDialErr(err error) bool {

I briefly looked into using errors.Is or errors.As here but wasn't sure what's the canonical way that would definitely work. Probably Raphael has a suggestion here. Reimplemented with errors.HasType.

@yuzefovich yuzefovich force-pushed the sql-retryable branch 3 times, most recently from b605ba8 to dbaeb23 Compare August 8, 2023 02:06
@yuzefovich yuzefovich requested review from a team as code owners August 8, 2023 03:07
@yuzefovich yuzefovich requested review from rhu713 and removed request for a team August 8, 2023 03:07
@yuzefovich yuzefovich changed the title sql: retry all DistSQL runner dial errors sql: retry more distributed errors as local Aug 8, 2023
@yuzefovich yuzefovich removed request for a team and rhu713 August 8, 2023 03:09
Member Author

@yuzefovich yuzefovich left a comment

I added another commit to include more errors in the allow-list for retrying. Done pushing for now, PTAL.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @jeffswenson, @knz, and @mgartner)

Collaborator

@mgartner mgartner left a comment

:lgtm:

Reviewed 2 of 3 files at r1, 1 of 1 files at r2, 1 of 1 files at r3, 4 of 4 files at r4, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @jeffswenson, @knz, and @yuzefovich)


pkg/ccl/serverccl/server_sql_test.go line 441 at r2 (raw file):

	}()

	listener, err := net.Listen("tcp", rpcAddr)

Why is this listener required?

Collaborator

@jeffswenson jeffswenson left a comment

LGTM

@@ -100,6 +101,22 @@ type runnerResult struct {
err error
}

type runnerDialErr struct {

I like the approach of tagging errors that occur during the dial flow. This is a robust way to handle issues that occur during start up, which is the most common case we have seen for Serverless.

Member Author

@yuzefovich yuzefovich left a comment

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @jeffswenson and @knz)


pkg/ccl/serverccl/server_sql_test.go line 441 at r2 (raw file):

Previously, mgartner (Marcus Gartner) wrote…

Why is this listener required?

I'm not sure, I just copied this test from Jeff's #106538. @jeffswenson can you share some context about this listener?

@jeffswenson
Collaborator

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @jeffswenson and @knz)

pkg/ccl/serverccl/server_sql_test.go line 441 at r2 (raw file):

Previously, mgartner (Marcus Gartner) wrote…
I'm not sure, I just copied this test from Jeff #106538. @jeffswenson can you share some context about this listener?

The test is covering the following scenario:

  1. A sql server starts up and is assigned port a
  2. The sql server shuts down and releases port a
  3. Something else starts up and claims port a. In the test that is the listener. This is important because the listener causes connections to a to hang instead of responding with a RESET packet.
  4. A different server with stale instance information schedules a distsql flow and attempts to dial port a.

Member Author

@yuzefovich yuzefovich left a comment

Using string matching for errors in IsDistSQLRetryable is not great, but it's not a new problem, and we have #82847 as a tracking issue to improve the situation.

TFTRs!

bors r+

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @jeffswenson, @knz, and @mgartner)


pkg/ccl/serverccl/server_sql_test.go line 441 at r2 (raw file):

Previously, JeffSwenson (Jeff Swenson) wrote…

The test is covering the following scenario:

  1. A sql server starts up and is assigned port a
  2. The sql server shuts down and releases port a
  3. Something else starts up and claims port a. In the test that is the listener. This is important because the listener causes connections to a to hang instead of responding with a RESET packet.
  4. A different server with stale instance information schedules a distsql flow and attempts to dial port a.

Thanks for this context. I incorporated it as a comment on the test.

@craig
Contributor

craig bot commented Aug 8, 2023

Build failed:

@yuzefovich
Member Author

Hm, I'm confused

[23:12:22][Run unit tests] ERROR: /go/src/github.com/cockroachdb/cockroach/pkg/jobs/BUILD.bazel:88:8: GoCompilePkg pkg/jobs/autoconfig.recompile1376.a failed: (Exit 1): builder failed: error executing command (from target //pkg/jobs:jobs_test) bazel-out/k8-opt-exec-2B5CBBC6/bin/external/go_sdk/builder_reset/builder compilepkg -sdk external/go_sdk -installsuffix linux_amd64 -tags bazel,gss,bazel,gss -src pkg/server/autoconfig/auto_config.go ... (remaining 62 arguments skipped)
[23:12:22][Run unit tests] 
[23:12:22][Run unit tests] Use --sandbox_debug to see verbose messages from the sandbox and retain the sandbox build root for debugging
[23:12:22][Run unit tests] compilepkg: panic: runtime error: invalid memory address or nil pointer dereference
[23:12:22][Run unit tests] [signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x5e89bc]

Let's try one more time.

bors r+

@craig
Contributor

craig bot commented Aug 8, 2023

Build failed:

@yuzefovich
Member Author

Known failure #108340.

bors r+

@craig
Contributor

craig bot commented Aug 9, 2023

Build failed:

@yuzefovich
Member Author

The latest failure is at least legitimate:

=== RUN   TestStartTenantWithStaleInstance
    test_log_scope.go:167: test logs captured to: /artifacts/tmp/_tmp/c101b7a464a1afc1f5af0cd85792187e/logTestStartTenantWithStaleInstance3465982568
    test_log_scope.go:81: use -show-logs to present logs inline
    server_sql_test.go:450: 
        	Error Trace:	github.com/cockroachdb/cockroach/pkg/ccl/serverccl/server_sql_test.go:450
        	Error:      	Received unexpected error:
        	            	listen tcp 127.0.0.1:45845: bind: address already in use
        	Test:       	TestStartTenantWithStaleInstance

I'll take a look tomorrow, but I did stress this new test on the gceworker with no failures.

@mgartner
Collaborator

mgartner commented Aug 9, 2023

pkg/ccl/serverccl/server_sql_test.go line 441 at r2 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

Thanks for this context. I incorporated it as a comment on the test.

Thanks for the context. The only part I'm missing is how the "different server" is dialing rpcAddr - I don't see how that's forced.

Member Author

@yuzefovich yuzefovich left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @jeffswenson, @knz, and @mgartner)


pkg/ccl/serverccl/server_sql_test.go line 450 at r7 (raw file):

	listener, err := net.Listen("tcp", rpcAddr)
	require.NoError(t, err)

For the failure that I got under stress on CI on this line:

=== RUN   TestStartTenantWithStaleInstance
    test_log_scope.go:167: test logs captured to: /artifacts/tmp/_tmp/c101b7a464a1afc1f5af0cd85792187e/logTestStartTenantWithStaleInstance3465982568
    test_log_scope.go:81: use -show-logs to present logs inline
    server_sql_test.go:450: 
        	Error Trace:	github.com/cockroachdb/cockroach/pkg/ccl/serverccl/server_sql_test.go:450
        	Error:      	Received unexpected error:
        	            	listen tcp 127.0.0.1:45845: bind: address already in use
        	Test:       	TestStartTenantWithStaleInstance

it seems like the stopped tenant hasn't released the socket, and to me it seems like a benign error with the test setup. I'm inclined to introduce SucceedsSoon until net.Listen doesn't return an error. @jeffswenson WDYT?

This commit marks the error that DistSQL runners produce when dialing
remote nodes in a special way that is now always retried-as-local. In
particular, this allows us to fix two problematic scenarios that could
occur when using secondary tenants:
- when attempting to start a pod with stale instance information
- when the port is in use by an RPC server for the same tenant, but with
  a new instance id.

This commit includes the test from Jeff that exposed the gap in the
retry-as-local mechanism.

Release note: None

This commit moves the "drain succeeded" logging message to the very end
of the drain process. It also removes a now-stale comment.

Release note: None

This commit makes all errors that contain the `rpc error` substring
retried-as-local. In particular, this allows us to avoid problems with
DistSQL using a no-longer-live SQL pod after that pod is shut down. (This
usage of the downed pod is currently expected given that the cache of
live instances isn't updated when the pod is shut down.)

Release note: None

Member Author

@yuzefovich yuzefovich left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @jeffswenson, @knz, and @mgartner)


pkg/ccl/serverccl/server_sql_test.go line 441 at r2 (raw file):

Previously, mgartner (Marcus Gartner) wrote…

Thanks for the context. The only part I'm missing is how the "different server" is dialing rpcAddr - I don't see how that's forced.

I think I see this. In step 4, the "different server" is the SQL pod that is being started right now (which itself will listen on a different port). When starting up, it sees another SQL pod for the same tenant in the sql_instances system table (see #106537 (comment)). Thus, it attempts to dial that SQL pod to perform the startup migration, which fails because the listener does not respond to that dial attempt.

@jeffswenson
Collaborator

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @jeffswenson, @knz, and @mgartner)

pkg/ccl/serverccl/server_sql_test.go line 450 at r7 (raw file):

	listener, err := net.Listen("tcp", rpcAddr)
	require.NoError(t, err)

For the failure that I got under stress on CI on this line:

=== RUN   TestStartTenantWithStaleInstance
    test_log_scope.go:167: test logs captured to: /artifacts/tmp/_tmp/c101b7a464a1afc1f5af0cd85792187e/logTestStartTenantWithStaleInstance3465982568
    test_log_scope.go:81: use -show-logs to present logs inline
    server_sql_test.go:450: 
        	Error Trace:	github.com/cockroachdb/cockroach/pkg/ccl/serverccl/server_sql_test.go:450
        	Error:      	Received unexpected error:
        	            	listen tcp 127.0.0.1:45845: bind: address already in use
        	Test:       	TestStartTenantWithStaleInstance

it seems like the stopped tenant hasn't released the socket, and to me it seems like a benign error with the test setup. I'm inclined to introduce SucceedsSoon until net.Listen doesn't return an error. @jeffswenson WDYT?

Adding a SucceedsSoon LGTM. Another option is we could start the listener, then inject a synthetic sql_instance row. But I like the current implementation of the test because it is very generic and has minimal coupling to the values in the sql_instance table.

Member Author

@yuzefovich yuzefovich left a comment

bors r+

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @jeffswenson, @knz, and @mgartner)


pkg/ccl/serverccl/server_sql_test.go line 450 at r7 (raw file):

Previously, JeffSwenson (Jeff Swenson) wrote…

Adding a SucceedsSoon LGTM. Another option is we could start the listener, then inject a synthetic sql_instance row. But I like the current implementation of the test because it is very generic and has minimal coupling to the values in the sql_instance table.

Thanks, let's keep SucceedsSoon then.

@craig craig bot merged commit 1f8fa96 into cockroachdb:master Aug 10, 2023
2 checks passed
@craig
Contributor

craig bot commented Aug 10, 2023

Build succeeded:
