-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
sql: retry more distributed errors as local #108336
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @jeffswenson, @knz, and @mgartner)
pkg/sql/distsql_running.go
line 107 at r1 (raw file):
} func isDialErr(err error) bool {
I briefly looked into using Reimplemented with errors.Is
or errors.As
here but wasn't sure what's the canonical way that would definitely work. Probably Raphael has a suggestion here.errors.HasType
.
8c2c676
to
3c82a67
Compare
b605ba8
to
dbaeb23
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added another commit to include more errors in the allow-list for retrying. Done pushing for now, PTAL.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @jeffswenson, @knz, and @mgartner)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 2 of 3 files at r1, 1 of 1 files at r2, 1 of 1 files at r3, 4 of 4 files at r4, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @jeffswenson, @knz, and @yuzefovich)
pkg/ccl/serverccl/server_sql_test.go
line 441 at r2 (raw file):
}() listener, err := net.Listen("tcp", rpcAddr)
Why is this listener required?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
@@ -100,6 +101,22 @@ type runnerResult struct { | |||
err error | |||
} | |||
|
|||
type runnerDialErr struct { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the approach of tagging errors that occur during the dial flow. This is a robust way to handle issues that occur during start up, which is the most common case we have seen for Serverless.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @jeffswenson and @knz)
pkg/ccl/serverccl/server_sql_test.go
line 441 at r2 (raw file):
Previously, mgartner (Marcus Gartner) wrote…
Why is this listener required?
I'm not sure, I just copied this test from Jeff #106538. @jeffswenson can you share some context about this listener?
The test is covering the following scenario:
|
0028b76
to
50a9308
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Usage of the string matching for errors in IsDistSQLRetryable
is not great, but it's not a new problem and we do have #82847 as a tracking issue to improve the situation.
TFTRs!
bors r+
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @jeffswenson, @knz, and @mgartner)
pkg/ccl/serverccl/server_sql_test.go
line 441 at r2 (raw file):
Previously, JeffSwenson (Jeff Swenson) wrote…
The test is covering the following scenario:
- A sql server starts up and is assigned port
a
- The sql server shuts down and releases port
a
- Something else starts up and claims port
a
. In the test that is the listener. This is important because the listener causes connections toa
to hang instead of responding with a RESET packet.- A different server with stale instance information schedules a distsql flow and attempts to dial port
a
.
Thanks for this context. I incorporated it as a comment on the test.
Build failed: |
Hm, I'm confused
Let's try one more time. bors r+ |
Build failed: |
Known failure #108340. bors r+ |
Build failed: |
The latest failure is at least legitimate:
I'll take a look tomorrow, but I did stress this new test on the gceworker with no failures. |
Previously, yuzefovich (Yahor Yuzefovich) wrote…
Thanks for the context. The only part I'm missing is how the "different server" is dialing |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @jeffswenson, @knz, and @mgartner)
pkg/ccl/serverccl/server_sql_test.go
line 450 at r7 (raw file):
listener, err := net.Listen("tcp", rpcAddr) require.NoError(t, err)
For the failure that I got under stress on CI on this line:
=== RUN TestStartTenantWithStaleInstance
test_log_scope.go:167: test logs captured to: /artifacts/tmp/_tmp/c101b7a464a1afc1f5af0cd85792187e/logTestStartTenantWithStaleInstance3465982568
test_log_scope.go:81: use -show-logs to present logs inline
server_sql_test.go:450:
Error Trace: github.com/cockroachdb/cockroach/pkg/ccl/serverccl/server_sql_test.go:450
Error: Received unexpected error:
listen tcp 127.0.0.1:45845: bind: address already in use
Test: TestStartTenantWithStaleInstance
it seems like the stopped tenant hasn't released the socket, and to me it seems like a benign error with the test setup. I'm inclined to introduce SucceedsSoon until net.Listen
doesn't return an error. @jeffswenson WDYT?
50a9308
to
a8d6fd4
Compare
This commit marks the error that DistSQL runners produce when dialing remote nodes in a special way that is now always retried-as-local. In particular, this allows us to fix two problematic scenarios that could occur when using secondary tenants: - when attempting to start a pod with stale instance information - the port is in use by an RPC server for the same tenant, but with a new instance id. This commit includes the test from Jeff that exposed the gap in the retry-as-local mechanism. Release note: None
This commit moves the "drain succeeded" logging message to be at the very end of the drain process. Also, it removes now stale comment. Release note: None
This commit includes all errors that contain `rpc error` substring to be retried-as-local. In particular, this allows us to avoid problems with DistSQL using no-longer-live SQL pod after that pod is shutdown. (This usage of the downed pod is currently expected given that the cache of live instances isn't updated when the pod is shutdown.) Release note: None
a8d6fd4
to
5c09eb1
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @jeffswenson, @knz, and @mgartner)
pkg/ccl/serverccl/server_sql_test.go
line 441 at r2 (raw file):
Previously, mgartner (Marcus Gartner) wrote…
Thanks for the context. The only part I'm missing is how the "different server" is dialing
rpcAddr
- I don't see how that's forced.
I think I see this. In step 4. the "different server" is the new being-started-right-now SQL pod (which itself will listen on a different port), and when starting up, it sees that there is another SQL pod for the same tenant in sql_instances
system table (see #106537 (comment)). Thus, it attempts to dial that SQL pod to perform the startup migration, which fails because we have listener
not responding to that dial attempt.
Adding a |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
bors r+
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @jeffswenson, @knz, and @mgartner)
pkg/ccl/serverccl/server_sql_test.go
line 450 at r7 (raw file):
Previously, JeffSwenson (Jeff Swenson) wrote…
Adding a
SucceedsSoon
LGTM. Another option is we could start the listener, then inject a synthetic sql_instance row. But I like the current implementation of the test because it is very generic and has minimal coupling to the values in the sql_instance table.
Thanks, let's keep SucceedsSoon
then.
Build succeeded: |
This PR contains a couple of commits that increase the allow-list of errors that are retried locally. In particular, it allows us to hide some issues we have around using DistSQL and shutting down SQL pods.
Fixes: #106537.
Fixes: #108152.
Fixes: #108271.
Release note: None