-
Notifications
You must be signed in to change notification settings - Fork 3.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
sql: make ctxWithStacktrace compatible with gRPC #96483
Conversation
It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR? 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf. |
if err == nil { | ||
return nil | ||
} | ||
ctx.mu.Lock() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do you think this would be a good place to use syncutil.Once https://pkg.go.dev/go4.org/syncutil#Once.Do
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Once
would work if errWithStacktrace
were only being accessed from inside Do
, but it is also accessed from errWithStacktrace
so a mutex is needed that guards both.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah ok, makes sense
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @ecwall, @otan, and @rafiss)
-- commits
line 16 at r1:
I know I've suggested this, but I believe it doesn't work when the context expiration is detected by calling Err
on a child ctx created by WithCancel
(and maybe also by WithDeadline/WithTimeout
, I'm not sure) - which I think is the common case when the deadline comes from an rpc call or a sql query or most things, really.
If that's true, I'd reconsider whether having this hack is worth it. Is it ever going to help?
Or, to phrase it as a question - have you verified that this actually shows an interesting stack in at least some cases for the particular deadline that prompted this patch?
pkg/util/contextutil/context.go
line 91 at r1 (raw file):
mu struct { syncutil.Mutex errWithStacktrace error
maybe capturedErr
pkg/util/contextutil/context.go
line 95 at r1 (raw file):
} func (ctx *contextWithStacktrace) ErrWithStacktrace() error {
Let's find a more suggestive name for this. Maybe CapturedErr
?
And put a comment saying something about how it indicates the first detector of the expiration (kinda, since it doesn't deal with Done
).
And also say something about how it doesn't work if there's a context.WithCancel
under it.
Also, don't export it if the type is not exported.
pkg/util/contextutil/context.go
line 95 at r1 (raw file):
} func (ctx *contextWithStacktrace) ErrWithStacktrace() error {
As written, this can return nil even for an expired ctx. End even for a context that was detected to have expired (through calling Err
on a child produced by context.WithCancel
. Are you sure that's what you want?
The one caller you have does use Combine()
, which deals with a nil, but still.
Consider calling Err()
.
pkg/util/tracing/grpcinterceptor/grpc_interceptor_test.go
line 134 at r1 (raw file):
}, { name: "UnaryUnary_ContextTimeout_DeadlineExceeded",
I don't think the new tests are worth it. I think all we need to test is that our ctx implementation respects the contract. I think these "integration tests" will no t make much sense to any reader.
oh oops, missed Andrei's comments. I'll leave the review to him. could you share an example of what the get-user log looks like now? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @andreimatei and @otan)
Previously, andreimatei (Andrei Matei) wrote…
I know I've suggested this, but I believe it doesn't work when the context expiration is detected by calling
Err
on a child ctx created byWithCancel
(and maybe also byWithDeadline/WithTimeout
, I'm not sure) - which I think is the common case when the deadline comes from an rpc call or a sql query or most things, really.
If that's true, I'd reconsider whether having this hack is worth it. Is it ever going to help?
Or, to phrase it as a question - have you verified that this actually shows an interesting stack in at least some cases for the particular deadline that prompted this patch?
I set the auth timeout to 1ns
and the returned error generates a stacktrace like this which is what we want:
operation "get-user-timeout" timed out after 0s (given timeout 1ns): internal error while retrieving user account: context deadline exceeded
(1) operation "get-user-timeout" timed out after 0s (given timeout 1ns)
Wraps: (2) secondary error attachment
| captured context error: context deadline exceeded
| (1) attached stack trace
| -- stack trace:
| | github.com/cockroachdb/cockroach/pkg/util/contextutil.(*contextWithStacktrace).Err
| | /Users/ewall/go/src/github.com/cockroachdb/cockroach/pkg/util/contextutil/context.go:116
| | github.com/cockroachdb/cockroach/pkg/kv.(*Txn).exec
| | /Users/ewall/go/src/github.com/cockroachdb/cockroach/pkg/kv/txn.go:924
| | github.com/cockroachdb/cockroach/pkg/kv.runTxn
| | /Users/ewall/go/src/github.com/cockroachdb/cockroach/pkg/kv/db.go:959
| | github.com/cockroachdb/cockroach/pkg/kv.(*DB).TxnWithAdmissionControl
| | /Users/ewall/go/src/github.com/cockroachdb/cockroach/pkg/kv/db.go:922
| | github.com/cockroachdb/cockroach/pkg/kv.(*DB).Txn
| | /Users/ewall/go/src/github.com/cockroachdb/cockroach/pkg/kv/db.go:897
| | github.com/cockroachdb/cockroach/pkg/sql.(*InternalDB).txn
| | /Users/ewall/go/src/github.com/cockroachdb/cockroach/pkg/sql/internal.go:1471
| | github.com/cockroachdb/cockroach/pkg/sql.(*InternalDB).DescsTxn
| | /Users/ewall/go/src/github.com/cockroachdb/cockroach/pkg/sql/internal.go:1393
| | github.com/cockroachdb/cockroach/pkg/sql/sessioninit.(*Cache).GetAuthInfo
| | /Users/ewall/go/src/github.com/cockroachdb/cockroach/pkg/sql/sessioninit/cache.go:124
| | github.com/cockroachdb/cockroach/pkg/sql.retrieveSessionInitInfoWithCache.func1
| | /Users/ewall/go/src/github.com/cockroachdb/cockroach/pkg/sql/user.go:210
| | github.com/cockroachdb/cockroach/pkg/sql.retrieveSessionInitInfoWithCache
| | /Users/ewall/go/src/github.com/cockroachdb/cockroach/pkg/sql/user.go:234
| | github.com/cockroachdb/cockroach/pkg/sql.GetUserSessionInitInfo.func1.1
| | /Users/ewall/go/src/github.com/cockroachdb/cockroach/pkg/sql/user.go:98
| | github.com/cockroachdb/cockroach/pkg/util/contextutil.RunWithTimeout
| | /Users/ewall/go/src/github.com/cockroachdb/cockroach/pkg/util/contextutil/context.go:131
| | github.com/cockroachdb/cockroach/pkg/sql.getUserInfoRunFn.func2
| | /Users/ewall/go/src/github.com/cockroachdb/cockroach/pkg/sql/user.go:190
| | github.com/cockroachdb/cockroach/pkg/sql.GetUserSessionInitInfo.func1
| | /Users/ewall/go/src/github.com/cockroachdb/cockroach/pkg/sql/user.go:97
| | github.com/cockroachdb/cockroach/pkg/sql/pgwire.authAutoSelectPasswordProtocol.func1
| | /Users/ewall/go/src/github.com/cockroachdb/cockroach/pkg/sql/pgwire/auth_methods.go:509
| | github.com/cockroachdb/cockroach/pkg/sql/pgwire.(*AuthBehaviors).Authenticate
| | /Users/ewall/go/src/github.com/cockroachdb/cockroach/pkg/sql/pgwire/auth_behaviors.go:56
| | github.com/cockroachdb/cockroach/pkg/sql/pgwire.(*conn).handleAuthentication
| | /Users/ewall/go/src/github.com/cockroachdb/cockroach/pkg/sql/pgwire/auth.go:178
| | github.com/cockroachdb/cockroach/pkg/sql/pgwire.(*conn).processCommandsAsync.func1
| | /Users/ewall/go/src/github.com/cockroachdb/cockroach/pkg/sql/pgwire/conn.go:693
| | runtime.goexit
| | /usr/local/opt/go/libexec/src/runtime/asm_amd64.s:1594
| Wraps: (2) captured context error
| Wraps: (3) context deadline exceeded
| Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) context.deadlineExceededError
Wraps: (3) attached stack trace
-- stack trace:
| github.com/cockroachdb/cockroach/pkg/sql.retrieveSessionInitInfoWithCache
| /Users/ewall/go/src/github.com/cockroachdb/cockroach/pkg/sql/user.go:237
| github.com/cockroachdb/cockroach/pkg/sql.GetUserSessionInitInfo.func1.1
| /Users/ewall/go/src/github.com/cockroachdb/cockroach/pkg/sql/user.go:98
| github.com/cockroachdb/cockroach/pkg/util/contextutil.RunWithTimeout
| /Users/ewall/go/src/github.com/cockroachdb/cockroach/pkg/util/contextutil/context.go:131
| github.com/cockroachdb/cockroach/pkg/sql.getUserInfoRunFn.func2
| /Users/ewall/go/src/github.com/cockroachdb/cockroach/pkg/sql/user.go:190
| github.com/cockroachdb/cockroach/pkg/sql.GetUserSessionInitInfo.func1
| /Users/ewall/go/src/github.com/cockroachdb/cockroach/pkg/sql/user.go:97
| github.com/cockroachdb/cockroach/pkg/sql/pgwire.authAutoSelectPasswordProtocol.func1
| /Users/ewall/go/src/github.com/cockroachdb/cockroach/pkg/sql/pgwire/auth_methods.go:509
| github.com/cockroachdb/cockroach/pkg/sql/pgwire.(*AuthBehaviors).Authenticate
| /Users/ewall/go/src/github.com/cockroachdb/cockroach/pkg/sql/pgwire/auth_behaviors.go:56
| github.com/cockroachdb/cockroach/pkg/sql/pgwire.(*conn).handleAuthentication
| /Users/ewall/go/src/github.com/cockroachdb/cockroach/pkg/sql/pgwire/auth.go:178
| github.com/cockroachdb/cockroach/pkg/sql/pgwire.(*conn).processCommandsAsync.func1
| /Users/ewall/go/src/github.com/cockroachdb/cockroach/pkg/sql/pgwire/conn.go:693
| runtime.goexit
| /usr/local/opt/go/libexec/src/runtime/asm_amd64.s:1594
Wraps: (4) internal error while retrieving user account
Wraps: (5) context deadline exceeded
| -- cause hidden behind barrier
| context deadline exceeded
| (1) context deadline exceeded
| Error types: (1) context.deadlineExceededError
Error types: (1) *contextutil.TimeoutError (2) *secondary.withSecondaryError (3) *withstack.withStack (4) *errutil.withPrefix (5) *barriers.barrierErr
But you are right, it is possible to wrap contextWithStacktrace
in another Context
and call Err()
on it instead which will not generate a useful stacktrace (I added a test case proving this).
pkg/util/contextutil/context.go
line 91 at r1 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
maybe
capturedErr
Renamed.
pkg/util/contextutil/context.go
line 95 at r1 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
Let's find a more suggestive name for this. Maybe
CapturedErr
?
And put a comment saying something about how it indicates the first detector of the expiration (kinda, since it doesn't deal withDone
).
And also say something about how it doesn't work if there's acontext.WithCancel
under it.Also, don't export it if the type is not exported.
Updated.
pkg/util/contextutil/context.go
line 95 at r1 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
As written, this can return nil even for an expired ctx. End even for a context that was detected to have expired (through calling
Err
on a child produced bycontext.WithCancel
. Are you sure that's what you want?
The one caller you have does useCombine()
, which deals with a nil, but still.
Consider callingErr()
.
Added Err()
call.
pkg/util/tracing/grpcinterceptor/grpc_interceptor_test.go
line 134 at r1 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
I don't think the new tests are worth it. I think all we need to test is that our ctx implementation respects the contract. I think these "integration tests" will no t make much sense to any reader.
Removed these and added an assertion on the error returned by Err()
to the unit test.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @andreimatei and @otan)
Previously, ecwall (Evan Wall) wrote…
I set the auth timeout to
1ns
and the returned error generates a stacktrace like this which is what we want:operation "get-user-timeout" timed out after 0s (given timeout 1ns): internal error while retrieving user account: context deadline exceeded (1) operation "get-user-timeout" timed out after 0s (given timeout 1ns) Wraps: (2) secondary error attachment | captured context error: context deadline exceeded | (1) attached stack trace | -- stack trace: | | github.com/cockroachdb/cockroach/pkg/util/contextutil.(*contextWithStacktrace).Err | | /Users/ewall/go/src/github.com/cockroachdb/cockroach/pkg/util/contextutil/context.go:116 | | github.com/cockroachdb/cockroach/pkg/kv.(*Txn).exec | | /Users/ewall/go/src/github.com/cockroachdb/cockroach/pkg/kv/txn.go:924 | | github.com/cockroachdb/cockroach/pkg/kv.runTxn | | /Users/ewall/go/src/github.com/cockroachdb/cockroach/pkg/kv/db.go:959 | | github.com/cockroachdb/cockroach/pkg/kv.(*DB).TxnWithAdmissionControl | | /Users/ewall/go/src/github.com/cockroachdb/cockroach/pkg/kv/db.go:922 | | github.com/cockroachdb/cockroach/pkg/kv.(*DB).Txn | | /Users/ewall/go/src/github.com/cockroachdb/cockroach/pkg/kv/db.go:897 | | github.com/cockroachdb/cockroach/pkg/sql.(*InternalDB).txn | | /Users/ewall/go/src/github.com/cockroachdb/cockroach/pkg/sql/internal.go:1471 | | github.com/cockroachdb/cockroach/pkg/sql.(*InternalDB).DescsTxn | | /Users/ewall/go/src/github.com/cockroachdb/cockroach/pkg/sql/internal.go:1393 | | github.com/cockroachdb/cockroach/pkg/sql/sessioninit.(*Cache).GetAuthInfo | | /Users/ewall/go/src/github.com/cockroachdb/cockroach/pkg/sql/sessioninit/cache.go:124 | | github.com/cockroachdb/cockroach/pkg/sql.retrieveSessionInitInfoWithCache.func1 | | /Users/ewall/go/src/github.com/cockroachdb/cockroach/pkg/sql/user.go:210 | | github.com/cockroachdb/cockroach/pkg/sql.retrieveSessionInitInfoWithCache | | /Users/ewall/go/src/github.com/cockroachdb/cockroach/pkg/sql/user.go:234 | | github.com/cockroachdb/cockroach/pkg/sql.GetUserSessionInitInfo.func1.1 | | /Users/ewall/go/src/github.com/cockroachdb/cockroach/pkg/sql/user.go:98 | | github.com/cockroachdb/cockroach/pkg/util/contextutil.RunWithTimeout | | /Users/ewall/go/src/github.com/cockroachdb/cockroach/pkg/util/contextutil/context.go:131 | | github.com/cockroachdb/cockroach/pkg/sql.getUserInfoRunFn.func2 | | /Users/ewall/go/src/github.com/cockroachdb/cockroach/pkg/sql/user.go:190 | | github.com/cockroachdb/cockroach/pkg/sql.GetUserSessionInitInfo.func1 | | /Users/ewall/go/src/github.com/cockroachdb/cockroach/pkg/sql/user.go:97 | | github.com/cockroachdb/cockroach/pkg/sql/pgwire.authAutoSelectPasswordProtocol.func1 | | /Users/ewall/go/src/github.com/cockroachdb/cockroach/pkg/sql/pgwire/auth_methods.go:509 | | github.com/cockroachdb/cockroach/pkg/sql/pgwire.(*AuthBehaviors).Authenticate | | /Users/ewall/go/src/github.com/cockroachdb/cockroach/pkg/sql/pgwire/auth_behaviors.go:56 | | github.com/cockroachdb/cockroach/pkg/sql/pgwire.(*conn).handleAuthentication | | /Users/ewall/go/src/github.com/cockroachdb/cockroach/pkg/sql/pgwire/auth.go:178 | | github.com/cockroachdb/cockroach/pkg/sql/pgwire.(*conn).processCommandsAsync.func1 | | /Users/ewall/go/src/github.com/cockroachdb/cockroach/pkg/sql/pgwire/conn.go:693 | | runtime.goexit | | /usr/local/opt/go/libexec/src/runtime/asm_amd64.s:1594 | Wraps: (2) captured context error | Wraps: (3) context deadline exceeded | Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) context.deadlineExceededError Wraps: (3) attached stack trace -- stack trace: | github.com/cockroachdb/cockroach/pkg/sql.retrieveSessionInitInfoWithCache | /Users/ewall/go/src/github.com/cockroachdb/cockroach/pkg/sql/user.go:237 | github.com/cockroachdb/cockroach/pkg/sql.GetUserSessionInitInfo.func1.1 | /Users/ewall/go/src/github.com/cockroachdb/cockroach/pkg/sql/user.go:98 | github.com/cockroachdb/cockroach/pkg/util/contextutil.RunWithTimeout | /Users/ewall/go/src/github.com/cockroachdb/cockroach/pkg/util/contextutil/context.go:131 | github.com/cockroachdb/cockroach/pkg/sql.getUserInfoRunFn.func2 | /Users/ewall/go/src/github.com/cockroachdb/cockroach/pkg/sql/user.go:190 | github.com/cockroachdb/cockroach/pkg/sql.GetUserSessionInitInfo.func1 | /Users/ewall/go/src/github.com/cockroachdb/cockroach/pkg/sql/user.go:97 | github.com/cockroachdb/cockroach/pkg/sql/pgwire.authAutoSelectPasswordProtocol.func1 | /Users/ewall/go/src/github.com/cockroachdb/cockroach/pkg/sql/pgwire/auth_methods.go:509 | github.com/cockroachdb/cockroach/pkg/sql/pgwire.(*AuthBehaviors).Authenticate | /Users/ewall/go/src/github.com/cockroachdb/cockroach/pkg/sql/pgwire/auth_behaviors.go:56 | github.com/cockroachdb/cockroach/pkg/sql/pgwire.(*conn).handleAuthentication | /Users/ewall/go/src/github.com/cockroachdb/cockroach/pkg/sql/pgwire/auth.go:178 | github.com/cockroachdb/cockroach/pkg/sql/pgwire.(*conn).processCommandsAsync.func1 | /Users/ewall/go/src/github.com/cockroachdb/cockroach/pkg/sql/pgwire/conn.go:693 | runtime.goexit | /usr/local/opt/go/libexec/src/runtime/asm_amd64.s:1594 Wraps: (4) internal error while retrieving user account Wraps: (5) context deadline exceeded | -- cause hidden behind barrier | context deadline exceeded | (1) context deadline exceeded | Error types: (1) context.deadlineExceededError Error types: (1) *contextutil.TimeoutError (2) *secondary.withSecondaryError (3) *withstack.withStack (4) *errutil.withPrefix (5) *barriers.barrierErr
But you are right, it is possible to wrap
contextWithStacktrace
in anotherContext
and callErr()
on it instead which will not generate a useful stacktrace (I added a test case proving this).
I think the alternative to this would take more time: we need to find each place we get an error from a 3rd party lib and make sure we wrap with errors.WithStack
. This includes anytime we call Context.Err()
and return the result.
Fixes #96457 `type ctxWithStacktrace struct` was added here #95797 to add a stacktrace (via `errors.WithStack(...)` when `Context.Err()` is called inside `contextutil.RunWithTimeout(...)`. This ended up causing an issue if gRPC calls `ctxWithStacktrace.Err()` because it is expecting `context.DeadlineExceeded` to not be wrapped with `errors.WithStack` - the wrong status code will be returned in the error returned here https://github.com/grpc/grpc-go/blob/v1.46.0/rpc_util.go#L833. To resolve this, store the wrapped exception separately inside `ctxWithStacktrace` when `ctxWithStacktrace.Err()` is called and set this to `TimeoutError.cause` inside `contextutil.RunWithTimeout(...)`. Release note: None
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @ecwall and @otan)
I set the auth timeout to 1ns and the returned error generates a stacktrace like this which is what we want:
I think there's two things that make me question whether your example is realistic:
- By setting such a low timeout, you've made the deadline expiration be detected before the code has a chance to actually do anything - i.e. run an actual RPC in that
InternalDB.DescsTxn()
function. This is not realistic; if the timeout expires while we're inDescsTxn
, it will expire while we're inside an RPC, and that's when I question whether we're going to get a useful stack trace (or any stack trace at all). In particular, the example that would convince me is when the RPC goes to a remote node - because in the local node case gRPC is bypassed. - By setting such a low timeout, even if there's a child
context.WithCancel
on the way (which I claim is common),Err()
would still be called on our parentcontextWithStack
. This is because of a special case here. But, in the common case when theWithCancel
is called before the deadline expires, it wouldn't. This code shows that:
func TestXXX(t *testing.T) {
ctx, cancel := context.WithTimeout(context.Background(), 10*time.Millisecond)
defer cancel()
ctx = &contextWithStacktrace{
Context: ctx,
}
ctx2, c := context.WithCancel(ctx)
defer c()
time.Sleep(100 * time.Millisecond)
require.NotNil(t, ctx2.Err())
require.NotNil(t, ctx.(*contextWithStacktrace).ErrWithStacktrace())
}
The last line fails; our context has not captured a stack trace.
So I for one would like to see a more realistic example - one where the timeout expires when the task is blocked on something that actually takes some time.
pkg/util/contextutil/context_test.go
line 97 at r5 (raw file):
} func TestRunWithTimeoutContextWithStacktraceChildContext(t *testing.T) {
I think we need commentary about what this test shows.
I also question if there's much point in asserting the limitations of our code.
Making this change instead #96659 |
Fixes #96457
type ctxWithStacktrace struct
was added here #95797 to add a stacktrace(via
errors.WithStack(...)
whenContext.Err()
is called insidecontextutil.RunWithTimeout(...)
.This ended up causing an issue if gRPC calls
ctxWithStacktrace.Err()
because it is expecting
context.DeadlineExceeded
to not be wrapped witherrors.WithStack
- the wrong status code will be returned in the errorreturned here https://github.com/grpc/grpc-go/blob/v1.46.0/rpc_util.go#L833.
To resolve this, store the wrapped exception separately inside
ctxWithStacktrace
whenctxWithStacktrace.Err()
is called and set this toTimeoutError.cause
insidecontextutil.RunWithTimeout(...)
.Release note: None