-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
util/assertion: add runtime assertion API #111991
Conversation
Epic: none Release note: None
1251366
to
890faa2
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is extremely similar to errors.AssertionFailedf
in terms of usage.. How will I know when to use assertion.Failed
vs errors.AssertionFailedf
? The latter is already in heavy use..
Why can't we improve AssertionFailedf
to have this behavior? (it would require adding some hooks in the errors
library, but I don't think it's too bad). Or, if we can't, we should replace it with the new one - i.e. change existing call sites and add a linter.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker and @knz)
-- commits
line 59 at r2:
I don't get this point, given that you're using exactly this as the value for ExpensiveEnabled
One thing to keep in mind also - AssertionFailedf is used in Pebble too, if we made that do what we want we'd cover Pebble autonatically (whereas Pebble won't be able to use something from the cockroach repo). |
This patch adds a canonical API for runtime assertions. It is intended to encourage liberal use of runtime assertions in a safe and performant manner. It does not attempt to reinvent the wheel, but instead builds on existing infrastructure. This PR follows the `must` API initially introduced in 3250477, which was later abandoned for performance reasons (when passing format args, these unconditionally incurred interface boxing allocations, even in the happy case). The new `assertion` API is significantly simpler to avoid this cost: ```go if foo != bar { return assertion.Failed(ctx, "oh no: %v != %v", foo, bar) } ``` Assertion failures are fatal in all non-release builds, including roachprod clusters and roachtests, to ensure they will be noticed. In release builds, they instead log the failure and report it to Sentry (if enabled), and return an assertion error to the caller for propagation. This avoids excessive disruption in production environments, where an assertion failure is often scoped to an individual RPC request, transaction, or range, and crashing the node can turn a minor problem into a full-blown outage. It is still possible to kill the node when appropriate, but this should be the exception rather than the norm. It also supports expensive assertions that must be compiled out of normal dev/test/release builds for performance reasons. These are instead enabled in special test builds. This is intended to be used instead of other existing assertion mechanisms, which have various shortcomings: * `log.Fatalf`: kills the node even in release builds, which can cause severe disruption over often minor issues. * `errors.AssertionFailedf`: only suitable when we have an error return path, does not fatal in non-release builds, and are not always notified in release builds. * `logcrash.ReportOrPanic`: panics rather than fatals, which can leave the node limping along. Requires the caller to implement separate assertion handling in release builds, which is easy to forget. Also requires propagating cluster settings, which aren't always available. * `buildutil.CrdbTestBuild`: only enabled in Go tests, not roachtests, roachprod clusters, or production clusters. * `util.RaceEnabled`: only enabled in race builds. Expensive assertions should be possible to run without the additional overhead of the race detector. For more details and examples, see the `assertion` package documentation. Epic: none Release note: None
890faa2
to
c7ff5ee
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why can't we improve
AssertionFailedf
to have this behavior?
It seems a bit unfortunate to me to couple the construction of an assertion error to the handling of assertion failures. It also seemed unfortunate to push this logic into a separate library. Not strongly opposed to it though. Would like to hear @knz's thoughts.
I'm also not sure it reads as well in cases where the failure is ignored/recovered in release builds, but maybe that's just me. Consider e.g.:
func (p *Processor) Stop(ctx context.Context) {
if p.stopped {
//_ = assertion.Failed(ctx, "already stopped")
_ = errors.AssertionFailedf(ctx, "already stopped")
return
}
p.stopped = true
}
AssertionFailedf is used in Pebble too, if we made that do what we want we'd cover Pebble autonatically
This is an interesting idea. Wdyt @jbowens?
we should replace it with the new one
Yeah, this is the intent, as well as other mechanisms such as logcrash.ReportOrPanic
etc. For now, I'd just like to land the basic API and get the initial bikeshedding out of the way, then we can migrate code and address other follow-up work later.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @knz and @RaduBerinde)
Previously, RaduBerinde wrote…
I don't get this point, given that you're using exactly this as the value for
ExpensiveEnabled
This is temporary, to not have to have the whole build tag discussion right now. See #107425 for the follow-up issue (will reopen it when/if this lands). Added a TODO to clarify this.
If the intention is to indeed to replace AssertionFailedf, that is fine by me. But "it looks a bit better to me" is a pretty weak argument to change all existing uses (which has other downsides, eg conflicts in backports). The new API also couples the assertion with the creation of the error, its usage is exactly the same. |
For sure. But regardless of which option we pick, we'll have to change a bunch of call sites --
Yeah, I meant that it would still be possible to construct I think it's mostly a matter of deciding whether we feel like this is something the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Flushing some comments.
// if br.Error != nil { | ||
// _ = assertion.Failed(ctx, "returned both br.Err=%v and err=%v", br.Err, err) | ||
// } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not important, but will this whole check be compiled out in release builds?
Upd: no, because we can enable the fatal behaviour via an env var.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, but this will also log the assertion failure and notify Sentry -- even if we don't kill the process, we still want to know that it happened.
// | ||
// func (p *Processor) Stop(ctx context.Context) { | ||
// if p.stopped { | ||
// _ = assertion.Failed(ctx, "already stopped") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In many cases the double-close thing isn't skippable. For example, imagine Processor
has some pointer which it unconditionally uses, and sets to nil
on Close
. Skipping such a check can result in a nil dereference panic elsewhere, and we will have to guess why that happened. Whereas if we panic right here, we have the stack etc.
So I'm not sure that no panicking is a good default for an assertion. But we have Fatal
and Panic
funcs too, which is good.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In many cases the double-close thing isn't skippable.
Sure, the engineer will have to make a judgement call on how to safely handle the assertion failure -- the same is true for all error handling.
I'm not sure that no panicking is a good default for an assertion.
Defaulting to not fataling in release builds is exactly the point here though. I was on an escalation this Sunday where a fatal assertion caused a three-hour outage. I've seen many of these in the past, and in most cases the assertion did not have to be fatal and we would have avoided a long outage.
shouldFatal = envutil.EnvOrDefaultBool("COCKROACH_FATAL_ASSERTIONS", !build.IsRelease()) | ||
|
||
// MaybeSendReport is injected by package logcrash, for Sentry reporting. | ||
MaybeSendReport func(ctx context.Context, err error) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider setting this to a no-op func by default, so that we don't have to have the nil check. OTOH, nil check is safer, so up to you.
// Panic unconditionally panics with an assertion error, even in release builds. | ||
// This is primarily for use in SQL code where errors are sometimes propagated | ||
// as panics. Prefer Failed() when possible. | ||
func Panic(ctx context.Context, format string, args ...interface{}) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
func Panic(ctx context.Context, format string, args ...interface{}) { | |
func Panicf(ctx context.Context, format string, args ...interface{}) { |
// Fatal unconditionally fatals the process, even in release builds or when | ||
// fatal assertions are disabled. This should only be used when it is unsafe to | ||
// keep the process running. Prefer Failed() when possible. | ||
func Fatal(ctx context.Context, format string, args ...interface{}) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
func Fatal(ctx context.Context, format string, args ...interface{}) { | |
func Fatalf(ctx context.Context, format string, args ...interface{}) { |
shouldFatal = envutil.EnvOrDefaultBool("COCKROACH_FATAL_ASSERTIONS", !build.IsRelease()) | ||
|
||
// MaybeSendReport is injected by package logcrash, for Sentry reporting. | ||
MaybeSendReport func(ctx context.Context, err error) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, in all builds that import util/log/logcrash
package one way or another (but not in builds that don't), this will report to sentry, unless explicitly overridden? Even from local / test builds?
Or, is the reporting guarded by some other condition (like, this is a release build or something)? By looking at the code briefly, it seems like we're not always reporting.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, in all builds that import util/log/logcrash package one way or another (but not in builds that don't), this will report to sentry, unless explicitly overridden?
It's enabled in binaries. The function is injected when importing logcrash
, but the reporting is gated on a default-true setting, and it relies on injecting a settings object that's done by the cli
package. This is the same logic log.Fatalf
already uses. See this code:
cockroach/pkg/util/log/logcrash/crash_reporting.go
Lines 453 to 458 in c7ff5ee
func maybeSendCrashReport(ctx context.Context, err error, reportType ReportType) { | |
// We load the ReportingSettings from global singleton in this call path. | |
if sv := getGlobalSettings(); sv != nil { | |
sendCrashReport(ctx, sv, err, reportType) | |
} | |
} |
} else { | ||
log.ErrorfDepth(ctx, depth, "%+v", err) | ||
if MaybeSendReport != nil { // can be nil in tests | ||
MaybeSendReport(ctx, err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this blocking on the actual send? Up to 10s? Maybe call this out (to some level of detail) in higher level func comments.
A 10s delay is not a big deal if what follows is a panic/fatal, but if we keep running as normal, ideally we want this report be sent asynchronously.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it blocks. We can't send it asynchronously, because if we fatal the process then the asynchronous code doesn't run. We could do this asynchronously for non-fatal assertions though, if we feel like it's necessary -- in that case, we'll also need to add synchronization to make sure we don't terminate the process before it's sent, and also bound the number of goroutines we spawn if we e.g. hit an assertion failure in a hot path or tight loop. For the same reason, we probably want to rate limit this.
I'll write up a follow-up issue to deal with this if/when we merge this.
I'm not sure about "unfortunate" but there's another perspective: the cockroachdb/errors library has quite a few other downstream users besides CockroachDB now so we cannot nilly-willy redefine However, I would be supportive of a new package |
(Then we could push for replacing existing uses inside CockroachDB by the new library, where relevant, perhaps through a linter.) |
I think the idea would be to add a hook that's triggered when
That works too, if Pebble wants in. |
Closing following internal discussion. |
This patch adds a canonical API for runtime assertions. It is intended to encourage liberal use of runtime assertions in a safe and performant manner. It does not attempt to reinvent the wheel, but instead builds on existing infrastructure.
This PR follows the
must
API initially introduced in 3250477, which was later abandoned for performance reasons (when passing format args, these unconditionally incurred interface boxing allocations, even in the happy case). The newassertion
API is significantly simpler to avoid this cost:Assertion failures are fatal in all non-release builds, including roachprod clusters and roachtests, to ensure they will be noticed. In release builds, they instead log the failure and report it to Sentry (if enabled), and return an assertion error to the caller for propagation. This avoids excessive disruption in production environments, where an assertion failure is often scoped to an individual RPC request, transaction, or range, and crashing the node can turn a minor problem into a full-blown outage. It is still possible to kill the node when appropriate, but this should be the exception rather than the norm.
It also supports expensive assertions that must be compiled out of normal dev/test/release builds for performance reasons. These are instead enabled in special test builds.
This is intended to be used instead of other existing assertion mechanisms, which have various shortcomings:
log.Fatalf
: kills the node even in release builds, which can cause severe disruption over often minor issues.errors.AssertionFailedf
: only suitable when we have an error return path, does not fatal in non-release builds, and are not always notified in release builds.logcrash.ReportOrPanic
: panics rather than fatals, which can leave the node limping along. Requires the caller to implement separate assertion handling in release builds, which is easy to forget. Also requires propagating cluster settings, which aren't always available.buildutil.CrdbTestBuild
: only enabled in Go tests, not roachtests, roachprod clusters, or production clusters.util.RaceEnabled
: only enabled in race builds. Expensive assertions should be possible to run without the additional overhead of the race detector.For more details and examples, see the
assertion
package documentation.Resolves #94986.
Touches #106508.
Touches #108272.
Epic: none
Release note: None