server: v1.45 of grpc now correctly returns context cancelled error instead of unknown #78197
Labels
A-server-start-drain
Pertains to server startup and shutdown sequences
C-bug
Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
T-server-and-security
DB Server & Security
Comments
DarrylWong added the C-bug and A-server-start-drain labels on Mar 21, 2022
stevendanna added a commit to stevendanna/cockroach that referenced this issue on Mar 25, 2022
Previously, we used grpcutil.IsContextCanceled to detect when a returned gRPC error was the result of a context cancellation.

I believe that the intent of this code was to detect when the _local_ context was cancelled, indicating that we are shutting down and thus the watch-pods-client goroutine should exit. This works because the gRPC library converts a local context.Canceled error into a gRPC error. And, in gRPC before 1.45, if a server handler returned context.Canceled, the returned gRPC error would have status.Unknown, and thus not trigger this exit behavior.

As of gRPC 1.45, however, a context.Canceled error returned by a server handler will also result in a gRPC error with status.Canceled [0], meaning that the previous code will force the goroutine to exit in response to a server-side error. From my reading of this code, it appears we want to retry all server-side errors.

To account for this, we now only break out of the retry loop if our local context is done.

Further, I've changed the test directory server implementation to return an arguably more appropriate error when it is shutting down.

Fixes cockroachdb#78197
Release note: None
craig bot pushed a commit that referenced this issue on Mar 28, 2022
78241: kvserver: de-flake TestReplicaCircuitBreaker_RangeFeed r=erikgrinaker a=tbg
Fixes #76856.
Release note: None

78312: roachtest: improve debugging in transfer-leases r=erikgrinaker a=tbg
This test failed once and we weren't able to figure out why; having the range status used by the test would've been useful. Now this is saved and so the next time it fails we'll have more to look at.
Closes #75438.
Release note: None

78422: roachtest: bump max wh for weekly tpccbench/nodes=12/cpu=16 r=srosenberg a=tbg
[It was maxing out, reliably.](https://roachperf.crdb.dev/?filter=&view=tpccbench%2Fnodes%3D12%2Fcpu%3D16&tab=gce)
Release note: None

78490: sqlproxyccl: exit pod-watcher-client on local context cancellation r=jaylim-crl,darinpp a=stevendanna
Previously, we used grpcutil.IsContextCanceled to detect when a returned gRPC error was the result of a context cancellation.

I believe that the intent of this code was to detect when the _local_ context was cancelled, indicating that we are shutting down and thus the watch-pods-client goroutine should exit. This works because the gRPC library converts a local context.Canceled error into a gRPC error. And, in gRPC before 1.45, if a server handler returned context.Canceled, the returned gRPC error would have status.Unknown, and thus not trigger this exit behavior.

As of gRPC 1.45, however, a context.Canceled error returned by a server handler will also result in a gRPC error with status.Canceled [0], meaning that the previous code will force the goroutine to exit in response to a server-side error. From my reading of this code, it appears we want to retry all server-side errors.

To account for this, we now only break out of the retry loop if our local context is done.

Further, I've changed the test directory server implementation to return an arguably more appropriate error when it is shutting down.

Fixes #78197
Release note: None

Co-authored-by: Tobias Grieger <tobias.b.grieger@gmail.com>
Co-authored-by: Steven Danna <danna@cockroachlabs.com>
Upgrading google.golang.org/grpc to v1.45.0 causes the test TestDirectoryConnect/drain_connection to fail because of a change in how context.Canceled errors returned by server request handlers are returned to clients.

The test starts a sqlproxy server and a directory server. It then shuts down the test directory server, starts a second test directory server, and calls Drain() on the second test directory server, expecting that SQL connections will eventually be drained.

This test fails because Drain() on the second TestDirectoryServer does nothing: no event listener is ever added to that server. No event listener is added because the goroutine in the sql proxy server responsible for calling WatchPods on the TestDirectoryServer
cockroach/pkg/ccl/sqlproxyccl/tenant/directory.go
Line 310 in 4e7ba29
exits in response to the first TestDirectoryServer shutting down, rather than attempting to connect to the newly started TestDirectoryServer and registering a listener.
The goroutine exits because of the following lines in watchPods():
cockroach/pkg/ccl/sqlproxyccl/tenant/directory.go
Lines 343 to 345 in 4e7ba29
where IsContextCanceled() is:
cockroach/pkg/util/grpcutil/grpc_util.go
Lines 62 to 67 in 4e7ba29
In v1.44 of grpc and before, when a server-side handler returned a context.Canceled error, gRPC would return a gRPC error with status Unknown. As of v1.45 (grpc/grpc-go#5156), it now returns an error with the status Canceled. As a result, IsContextCanceled() now returns true when the server-side request handler returns a context.Canceled error, whereas it previously returned false.

In the TestDirectoryServer, we currently return context.Canceled in response to a quiescing stopper:
cockroach/pkg/ccl/sqlproxyccl/tenantdirsvr/test_directory_svr.go
Line 213 in 4e7ba29
It appears that the IsContextCanceled check in the proxy server was likely intended to only catch cancellations of the local context, since all other errors experienced at that point result in starting up the watchPods() handler again.

This functionality is now broken: stopping the server in line 707 of TestDirectoryConnect/drain_connection incorrectly stops the watchPods() handler, because stopping the server returns a context.Canceled error.
Jira issue: CRDB-14012