util/assertion: add runtime assertion API #111991

erikgrinaker · 2023-10-07T14:22:29Z

This patch adds a canonical API for runtime assertions. It is intended to encourage liberal use of runtime assertions in a safe and performant manner. It does not attempt to reinvent the wheel, but instead builds on existing infrastructure.

This PR follows the must API initially introduced in 3250477, which was later abandoned for performance reasons (when passing format args, these unconditionally incurred interface boxing allocations, even in the happy case). The new assertion API is significantly simpler to avoid this cost:

if foo != bar {
  return assertion.Failed(ctx, "oh no: %v != %v", foo, bar)
}
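
The cost mentioned above comes from how Go evaluates variadic `...interface{}` arguments: they are converted to interface values at the call site, before the callee can check the condition. A minimal sketch of the two call shapes follows. `mustTrue` is a hypothetical helper written for illustration (not the actual signature of the abandoned `must` API), and `failed` is a simplified stand-in for `assertion.Failed` without the context parameter. Whether the boxing actually reaches the heap depends on escape analysis and inlining, so treat this as an illustration of the shape rather than a benchmark.

```go
package main

import "fmt"

// mustTrue is a hypothetical helper-style assertion. The format arguments are
// packed into a []interface{} by the caller even when cond is true.
func mustTrue(cond bool, format string, args ...interface{}) error {
	if cond {
		return nil
	}
	return fmt.Errorf(format, args...)
}

// failed is a simplified stand-in for assertion.Failed (assumption: the real
// function also takes a context and does more than build an error).
func failed(format string, args ...interface{}) error {
	return fmt.Errorf("assertion failed: %s", fmt.Sprintf(format, args...))
}

// checkHelper boxes foo and bar on every call, including the happy path.
func checkHelper(foo, bar string) error {
	return mustTrue(foo == bar, "oh no: %v != %v", foo, bar)
}

// checkGuarded only evaluates the format arguments once the check has failed.
func checkGuarded(foo, bar string) error {
	if foo != bar {
		return failed("oh no: %v != %v", foo, bar)
	}
	return nil
}

func main() {
	fmt.Println(checkHelper("a", "a"), checkGuarded("a", "a"))
	fmt.Println(checkHelper("a", "b"))
	fmt.Println(checkGuarded("a", "b"))
}
```

Both functions return the same kind of result; the difference is only in where the argument conversion happens relative to the condition.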

Assertion failures are fatal in all non-release builds, including roachprod clusters and roachtests, to ensure they will be noticed. In release builds, they instead log the failure and report it to Sentry (if enabled), and return an assertion error to the caller for propagation. This avoids excessive disruption in production environments, where an assertion failure is often scoped to an individual RPC request, transaction, or range, and crashing the node can turn a minor problem into a full-blown outage. It is still possible to kill the node when appropriate, but this should be the exception rather than the norm.
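
The build-dependent behavior described above could be sketched roughly as follows. This is not the PR's implementation: `releaseBuild`, `fatalf`, `report`, `errAssertion`, and the lowercase `failed` are all hypothetical names chosen for illustration, and the Sentry reporting is replaced by a log line.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"log"
)

// releaseBuild is a hypothetical stand-in for a build-time switch.
var releaseBuild = true

// fatalf is a hypothetical hook; it defaults to log.Fatalf, which exits.
var fatalf = log.Fatalf

// report is a hypothetical stand-in for an error reporter such as Sentry.
var report = func(msg string) { log.Printf("reported: %s", msg) }

// errAssertion marks errors produced by failed assertions.
var errAssertion = errors.New("assertion failed")

// failed sketches the described behavior: fatal outside release builds,
// log + report + return an error in release builds. ctx is unused here; a
// real implementation would presumably use it for logging.
func failed(ctx context.Context, format string, args ...interface{}) error {
	msg := fmt.Sprintf(format, args...)
	if !releaseBuild {
		fatalf("assertion failed: %s", msg) // does not return with log.Fatalf
	}
	log.Printf("assertion failed: %s", msg)
	report(msg)
	return fmt.Errorf("%w: %s", errAssertion, msg)
}

// checkEqual mirrors the usage pattern from the PR description.
func checkEqual(ctx context.Context, foo, bar int) error {
	if foo != bar {
		return failed(ctx, "oh no: %v != %v", foo, bar)
	}
	return nil
}

// runDemo exercises the passing and failing paths in release mode.
func runDemo() (okErr, failErr error) {
	ctx := context.Background()
	return checkEqual(ctx, 1, 1), checkEqual(ctx, 1, 2)
}

func main() {
	okErr, failErr := runDemo()
	fmt.Println(okErr)
	fmt.Println(failErr)
	fmt.Println(errors.Is(failErr, errAssertion))
}
```

Returning a wrapped sentinel lets callers propagate the failure like any other error while still being able to recognize it with `errors.Is`.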

It also supports expensive assertions that must be compiled out of normal dev/test/release builds for performance reasons. These are instead enabled in special test builds.
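
One way to picture this is a boolean constant guarding the costly check. The name `ExpensiveEnabled` appears in the review discussion; how it gets its value is an assumption here (a plain constant, where a real build would presumably set it via build tags), and `sortedInts` is a hypothetical type invented for the example. When the constant is false, the Go compiler can drop the guarded branch entirely.

```go
package main

import (
	"fmt"
	"sort"
)

// ExpensiveEnabled gates expensive assertions. Assumption: hardcoded here,
// where a real build would select the value at compile time.
const ExpensiveEnabled = false

// sortedInts is a hypothetical structure that keeps its values sorted.
type sortedInts struct{ vals []int }

// insert adds v at its sorted position. The O(n) invariant check only runs
// when ExpensiveEnabled is true.
func (s *sortedInts) insert(v int) error {
	i := sort.SearchInts(s.vals, v)
	s.vals = append(s.vals, 0)
	copy(s.vals[i+1:], s.vals[i:])
	s.vals[i] = v

	if ExpensiveEnabled {
		if !sort.IntsAreSorted(s.vals) {
			return fmt.Errorf("assertion failed: values not sorted: %v", s.vals)
		}
	}
	return nil
}

func main() {
	s := &sortedInts{}
	for _, v := range []int{3, 1, 2} {
		if err := s.insert(v); err != nil {
			fmt.Println(err)
			return
		}
	}
	fmt.Println(s.vals)
}
```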

This is intended to be used instead of other existing assertion mechanisms, which have various shortcomings:

log.Fatalf: kills the node even in release builds, which can cause severe disruption over often minor issues.
errors.AssertionFailedf: only suitable when we have an error return path, does not fatal in non-release builds, and we are not always notified of failures in release builds.
logcrash.ReportOrPanic: panics rather than fatals, which can leave the node limping along. Requires the caller to implement separate assertion handling in release builds, which is easy to forget. Also requires propagating cluster settings, which aren't always available.
buildutil.CrdbTestBuild: only enabled in Go tests, not roachtests, roachprod clusters, or production clusters.
util.RaceEnabled: only enabled in race builds. Expensive assertions should be possible to run without the additional overhead of the race detector.

For more details and examples, see the assertion package documentation.

Resolves #94986.
Touches #106508.
Touches #108272.

Epic: none
Release note: None

cockroach-teamcity · 2023-10-07T14:22:41Z

RaduBerinde

This is extremely similar to errors.AssertionFailedf in terms of usage. How will I know when to use assertion.Failed vs errors.AssertionFailedf? The latter is already in heavy use.

Why can't we improve AssertionFailedf to have this behavior? (it would require adding some hooks in the errors library, but I don't think it's too bad). Or, if we can't, we should replace it with the new one - i.e. change existing call sites and add a linter.

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker and @knz)

-- commits line 59 at r2:
I don't get this point, given that you're using exactly this as the value for ExpensiveEnabled

RaduBerinde · 2023-10-07T16:22:13Z

One more thing to keep in mind: AssertionFailedf is used in Pebble too. If we made that do what we want, we'd cover Pebble automatically (whereas Pebble won't be able to use something from the cockroach repo).

erikgrinaker

Why can't we improve AssertionFailedf to have this behavior?

It seems a bit unfortunate to me to couple the construction of an assertion error to the handling of assertion failures. It also seemed unfortunate to push this logic into a separate library. Not strongly opposed to it though. Would like to hear @knz's thoughts.

I'm also not sure it reads as well in cases where the failure is ignored/recovered in release builds, but maybe that's just me. Consider e.g.:

func (p *Processor) Stop(ctx context.Context) {
  if p.stopped {
    //_ = assertion.Failed(ctx, "already stopped")
    _ = errors.AssertionFailedf(ctx, "already stopped")
    return
  }
  p.stopped = true
}

AssertionFailedf is used in Pebble too. If we made that do what we want, we'd cover Pebble automatically

This is an interesting idea. Wdyt @jbowens?

we should replace it with the new one

Yeah, this is the intent, as well as other mechanisms such as logcrash.ReportOrPanic etc. For now, I'd just like to land the basic API and get the initial bikeshedding out of the way, then we can migrate code and address other follow-up work later.

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @knz and @RaduBerinde)

-- commits line 59 at r2:

Previously, RaduBerinde wrote…

I don't get this point, given that you're using exactly this as the value for ExpensiveEnabled

This is temporary, to avoid having the whole build tag discussion right now. See #107425 for the follow-up issue (will reopen it when/if this lands). Added a TODO to clarify this.

RaduBerinde · 2023-10-07T16:39:07Z

If the intention is indeed to replace AssertionFailedf, that is fine by me. But "it looks a bit better to me" is a pretty weak argument to change all existing uses (which has other downsides, e.g. conflicts in backports). The new API also couples the assertion with the creation of the error; its usage is exactly the same.

erikgrinaker · 2023-10-07T16:50:50Z

"it looks a bit better to me" is a pretty weak argument to change all existing uses (which has other downsides, e.g. conflicts in backports)

For sure. But regardless of which option we pick, we'll have to change a bunch of call sites -- logcrash.ReportOrPanic, log.Fatalf and panic are all widely used for runtime assertions, and should be migrated. The primary goal here is to come up with a straightforward canonical mechanism that generally does the right thing, so that people don't have to think about it too much.

The new API also couples the assertion with the creation of the error; its usage is exactly the same.

Yeah, I meant that it would still be possible to construct errors.AssertionFailedf separately without it triggering the assertion handling, which might occasionally be useful. Not a big deal though.

I think it's mostly a matter of deciding whether we feel like this is something the errors library should be responsible for or not, so I'll defer to @knz.

pav-kv

Flushing some comments.

pav-kv · 2023-10-09T16:44:46Z