Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

util/assertion: add runtime assertion API #111991

Closed
wants to merge 2 commits into from

Conversation

erikgrinaker
Copy link
Contributor

@erikgrinaker erikgrinaker commented Oct 7, 2023

This patch adds a canonical API for runtime assertions. It is intended to encourage liberal use of runtime assertions in a safe and performant manner. It does not attempt to reinvent the wheel, but instead builds on existing infrastructure.

This PR follows the must API initially introduced in 3250477, which was later abandoned for performance reasons (when passing format args, these unconditionally incurred interface boxing allocations, even in the happy case). The new assertion API is significantly simpler to avoid this cost:

if foo != bar {
  return assertion.Failed(ctx, "oh no: %v != %v", foo, bar)
}

Assertion failures are fatal in all non-release builds, including roachprod clusters and roachtests, to ensure they will be noticed. In release builds, they instead log the failure and report it to Sentry (if enabled), and return an assertion error to the caller for propagation. This avoids excessive disruption in production environments, where an assertion failure is often scoped to an individual RPC request, transaction, or range, and crashing the node can turn a minor problem into a full-blown outage. It is still possible to kill the node when appropriate, but this should be the exception rather than the norm.

It also supports expensive assertions that must be compiled out of normal dev/test/release builds for performance reasons. These are instead enabled in special test builds.

This is intended to be used instead of other existing assertion mechanisms, which have various shortcomings:

  • log.Fatalf: kills the node even in release builds, which can cause severe disruption over often minor issues.

  • errors.AssertionFailedf: only suitable when we have an error return path, does not fatal in non-release builds, and are not always notified in release builds.

  • logcrash.ReportOrPanic: panics rather than fatals, which can leave the node limping along. Requires the caller to implement separate assertion handling in release builds, which is easy to forget. Also requires propagating cluster settings, which aren't always available.

  • buildutil.CrdbTestBuild: only enabled in Go tests, not roachtests, roachprod clusters, or production clusters.

  • util.RaceEnabled: only enabled in race builds. Expensive assertions should be possible to run without the additional overhead of the race detector.

For more details and examples, see the assertion package documentation.

Resolves #94986.
Touches #106508.
Touches #108272.

Epic: none
Release note: None

Epic: none
Release note: None
@erikgrinaker erikgrinaker requested review from knz, RaduBerinde and a team October 7, 2023 14:22
@erikgrinaker erikgrinaker self-assigned this Oct 7, 2023
@erikgrinaker erikgrinaker requested a review from a team October 7, 2023 14:22
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@erikgrinaker erikgrinaker force-pushed the assertion branch 4 times, most recently from 1251366 to 890faa2 Compare October 7, 2023 14:42
Copy link
Member

@RaduBerinde RaduBerinde left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is extremely similar to errors.AssertionFailedf in terms of usage.. How will I know when to use assertion.Failed vs errors.AssertionFailedf? The latter is already in heavy use..

Why can't we improve AssertionFailedf to have this behavior? (it would require adding some hooks in the errors library, but I don't think it's too bad). Or, if we can't, we should replace it with the new one - i.e. change existing call sites and add a linter.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker and @knz)


-- commits line 59 at r2:
I don't get this point, given that you're using exactly this as the value for ExpensiveEnabled

@RaduBerinde
Copy link
Member

One thing to keep in mind also - AssertionFailedf is used in Pebble too, if we made that do what we want we'd cover Pebble autonatically (whereas Pebble won't be able to use something from the cockroach repo).

This patch adds a canonical API for runtime assertions. It is intended
to encourage liberal use of runtime assertions in a safe and performant
manner. It does not attempt to reinvent the wheel, but instead builds on
existing infrastructure.

This PR follows the `must` API initially introduced in 3250477,
which was later abandoned for performance reasons (when passing format
args, these unconditionally incurred interface boxing allocations, even
in the happy case). The new `assertion` API is significantly simpler to
avoid this cost:

```go
if foo != bar {
  return assertion.Failed(ctx, "oh no: %v != %v", foo, bar)
}
```

Assertion failures are fatal in all non-release builds, including
roachprod clusters and roachtests, to ensure they will be noticed. In
release builds, they instead log the failure and report it to Sentry (if
enabled), and return an assertion error to the caller for propagation.
This avoids excessive disruption in production environments, where an
assertion failure is often scoped to an individual RPC request,
transaction, or range, and crashing the node can turn a minor problem
into a full-blown outage. It is still possible to kill the node when
appropriate, but this should be the exception rather than the norm.

It also supports expensive assertions that must be compiled out of
normal dev/test/release builds for performance reasons. These are
instead enabled in special test builds.

This is intended to be used instead of other existing assertion
mechanisms, which have various shortcomings:

* `log.Fatalf`: kills the node even in release builds, which can cause
  severe disruption over often minor issues.

* `errors.AssertionFailedf`: only suitable when we have an error return
  path, does not fatal in non-release builds, and are not always
  notified in release builds.

* `logcrash.ReportOrPanic`: panics rather than fatals, which can leave
  the node limping along. Requires the caller to implement separate
  assertion handling in release builds, which is easy to forget. Also
  requires propagating cluster settings, which aren't always available.

* `buildutil.CrdbTestBuild`: only enabled in Go tests, not roachtests,
  roachprod clusters, or production clusters.

* `util.RaceEnabled`: only enabled in race builds. Expensive assertions
  should be possible to run without the additional overhead of the race
  detector.

For more details and examples, see the `assertion` package documentation.

Epic: none
Release note: None
Copy link
Contributor Author

@erikgrinaker erikgrinaker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why can't we improve AssertionFailedf to have this behavior?

It seems a bit unfortunate to me to couple the construction of an assertion error to the handling of assertion failures. It also seemed unfortunate to push this logic into a separate library. Not strongly opposed to it though. Would like to hear @knz's thoughts.

I'm also not sure it reads as well in cases where the failure is ignored/recovered in release builds, but maybe that's just me. Consider e.g.:

func (p *Processor) Stop(ctx context.Context) {
  if p.stopped {
    //_ = assertion.Failed(ctx, "already stopped")
    _ = errors.AssertionFailedf(ctx, "already stopped")
    return
  }
  p.stopped = true
}

AssertionFailedf is used in Pebble too, if we made that do what we want we'd cover Pebble autonatically

This is an interesting idea. Wdyt @jbowens?

we should replace it with the new one

Yeah, this is the intent, as well as other mechanisms such as logcrash.ReportOrPanic etc. For now, I'd just like to land the basic API and get the initial bikeshedding out of the way, then we can migrate code and address other follow-up work later.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @knz and @RaduBerinde)


-- commits line 59 at r2:

Previously, RaduBerinde wrote…

I don't get this point, given that you're using exactly this as the value for ExpensiveEnabled

This is temporary, to not have to have the whole build tag discussion right now. See #107425 for the follow-up issue (will reopen it when/if this lands). Added a TODO to clarify this.

@RaduBerinde
Copy link
Member

If the intention is to indeed to replace AssertionFailedf, that is fine by me. But "it looks a bit better to me" is a pretty weak argument to change all existing uses (which has other downsides, eg conflicts in backports). The new API also couples the assertion with the creation of the error, its usage is exactly the same.

@erikgrinaker
Copy link
Contributor Author

"it looks a bit better to me" is a pretty weak argument to change all existing uses (which has other downsides, eg conflicts in backports)

For sure. But regardless of which option we pick, we'll have to change a bunch of call sites -- logcrash.ReportOrPanic, log.Fatalf and panic are all widely used for runtime assertions, and should be migrated. The primary goal here is to come up with a straightforward canonical mechanism that generally does the right thing, so that people don't have to think about it too much.

The new API also couples the assertion with the creation of the error, its usage is exactly the same.

Yeah, I meant that it would still be possible to construct errors.AssertionFailedf separately without it triggering the assertion handling, which might occasionally be useful. Not a big deal though.

I think it's mostly a matter of deciding whether we feel like this is something the errors library should be responsible for or not, so I'll defer to @knz.

Copy link
Collaborator

@pav-kv pav-kv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Flushing some comments.

Comment on lines +73 to +75
// if br.Error != nil {
// _ = assertion.Failed(ctx, "returned both br.Err=%v and err=%v", br.Err, err)
// }
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not important, but will this whole check be compiled out in release builds?

Upd: no, because we can enable the fatal behaviour via an env var.

Copy link
Contributor Author

@erikgrinaker erikgrinaker Oct 10, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, but this will also log the assertion failure and notify Sentry -- even if we don't kill the process, we still want to know that it happened.

//
// func (p *Processor) Stop(ctx context.Context) {
// if p.stopped {
// _ = assertion.Failed(ctx, "already stopped")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In many cases the double-close thing isn't skippable. For example, imagine Processor has some pointer which it unconditionally uses, and sets to nil on Close. Skipping such a check can result in a nil dereference panic elsewhere, and we will have to guess why that happened. Whereas if we panic right here, we have the stack etc.

So I'm not sure that no panicking is a good default for an assertion. But we have Fatal and Panic funcs too, which is good.

Copy link
Contributor Author

@erikgrinaker erikgrinaker Oct 10, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In many cases the double-close thing isn't skippable.

Sure, the engineer will have to make a judgement call on how to safely handle the assertion failure -- the same is true for all error handling.

I'm not sure that no panicking is a good default for an assertion.

Defaulting to not fataling in release builds is exactly the point here though. I was on an escalation this Sunday where a fatal assertion caused a three-hour outage. I've seen many of these in the past, and in most cases the assertion did not have to be fatal and we would have avoided a long outage.

shouldFatal = envutil.EnvOrDefaultBool("COCKROACH_FATAL_ASSERTIONS", !build.IsRelease())

// MaybeSendReport is injected by package logcrash, for Sentry reporting.
MaybeSendReport func(ctx context.Context, err error)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider setting this to a no-op func by default, so that we don't have to have the nil check. OTOH, nil check is safer, so up to you.

// Panic unconditionally panics with an assertion error, even in release builds.
// This is primarily for use in SQL code where errors are sometimes propagated
// as panics. Prefer Failed() when possible.
func Panic(ctx context.Context, format string, args ...interface{}) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
func Panic(ctx context.Context, format string, args ...interface{}) {
func Panicf(ctx context.Context, format string, args ...interface{}) {

// Fatal unconditionally fatals the process, even in release builds or when
// fatal assertions are disabled. This should only be used when it is unsafe to
// keep the process running. Prefer Failed() when possible.
func Fatal(ctx context.Context, format string, args ...interface{}) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
func Fatal(ctx context.Context, format string, args ...interface{}) {
func Fatalf(ctx context.Context, format string, args ...interface{}) {

shouldFatal = envutil.EnvOrDefaultBool("COCKROACH_FATAL_ASSERTIONS", !build.IsRelease())

// MaybeSendReport is injected by package logcrash, for Sentry reporting.
MaybeSendReport func(ctx context.Context, err error)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, in all builds that import util/log/logcrash package one way or another (but not in builds that don't), this will report to sentry, unless explicitly overridden? Even from local / test builds?

Or, is the reporting guarded by some other condition (like, this is a release build or something)? By looking at the code briefly, it seems like we're not always reporting.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, in all builds that import util/log/logcrash package one way or another (but not in builds that don't), this will report to sentry, unless explicitly overridden?

It's enabled in binaries. The function is injected when importing logcrash, but the reporting is gated on a default-true setting, and it relies on injecting a settings object that's done by the cli package. This is the same logic log.Fatalf already uses. See this code:

func maybeSendCrashReport(ctx context.Context, err error, reportType ReportType) {
// We load the ReportingSettings from global singleton in this call path.
if sv := getGlobalSettings(); sv != nil {
sendCrashReport(ctx, sv, err, reportType)
}
}

} else {
log.ErrorfDepth(ctx, depth, "%+v", err)
if MaybeSendReport != nil { // can be nil in tests
MaybeSendReport(ctx, err)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this blocking on the actual send? Up to 10s? Maybe call this out (to some level of detail) in higher level func comments.

A 10s delay is not a big deal if what follows is a panic/fatal, but if we keep running as normal, ideally we want this report be sent asynchronously.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it blocks. We can't send it asynchronously, because if we fatal the process then the asynchronous code doesn't run. We could do this asynchronously for non-fatal assertions though, if we feel like it's necessary -- in that case, we'll also need to add synchronization to make sure we don't terminate the process before it's sent, and also bound the number of goroutines we spawn if we e.g. hit an assertion failure in a hot path or tight loop. For the same reason, we probably want to rate limit this.

I'll write up a follow-up issue to deal with this if/when we merge this.

@knz
Copy link
Contributor

knz commented Oct 10, 2023

It seems a bit unfortunate to me to couple the construction of an assertion error to the handling of assertion failures.

I'm not sure about "unfortunate" but there's another perspective: the cockroachdb/errors library has quite a few other downstream users besides CockroachDB now so we cannot nilly-willy redefine AssertionFailedf to interrupt the control flow just because that's what we want.

However, I would be supportive of a new package cockroachdb/assertion which both CockroachDB and Pebble can import, that would contain this functionality.

@knz
Copy link
Contributor

knz commented Oct 10, 2023

(Then we could push for replacing existing uses inside CockroachDB by the new library, where relevant, perhaps through a linter.)

@erikgrinaker
Copy link
Contributor Author

It seems a bit unfortunate to me to couple the construction of an assertion error to the handling of assertion failures.

I'm not sure about "unfortunate" but there's another perspective: the cockroachdb/errors library has quite a few other downstream users besides CockroachDB now so we cannot nilly-willy redefine AssertionFailedf to interrupt the control flow just because that's what we want.

I think the idea would be to add a hook that's triggered when AssertionFailedf is called, and we'd have to explicitly add a hook that does this.

However, I would be supportive of a new package cockroachdb/assertion which both CockroachDB and Pebble can import, that would contain this functionality.

That works too, if Pebble wants in.

@erikgrinaker
Copy link
Contributor Author

Closing following internal discussion.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

testing: runtime assertions
5 participants