Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

util/must: add runtime assertion API #106508

Merged
merged 4 commits into from
Jul 28, 2023
Merged

Conversation

erikgrinaker
Copy link
Contributor

For details and usage examples, see the package documentation.


This patch adds a convenient and canonical API for runtime assertions, inspired by the Testify package used for Go test assertions. It is intended to encourage liberal use of runtime assertions throughout the code base, by making it as easy as possible to write assertions that follow best practices. It does not attempt to reinvent the wheel, but instead builds on existing infrastructure.

Assertion failures are fatal in all non-release builds, including roachprod clusters and roachtests, to ensure they will be noticed. In release builds, they instead log the failure and report it to Sentry (if enabled), and return an assertion error to the caller for propagation. This avoids excessive disruption in production environments, where an assertion failure is often scoped to an individual RPC request, transaction, or range, and crashing the node can turn a minor problem into a full-blown outage. It is still possible to kill the node when appropriate via log.Fatalf, but this should be the exception rather than the norm.

It also supports expensive assertions that must be compiled out of normal dev/test/release builds for performance reasons. These are instead enabled in special test builds.

This is intended to be used instead of other existing assertion mechanisms, which have various shortcomings:

  • log.Fatalf: kills the node even in release builds, which can cause severe disruption over often minor issues.

  • errors.AssertionFailedf: only suitable when we have an error return path, does not fatal in non-release builds, and are not always notified in release builds.

  • logcrash.ReportOrPanic: panics rather than fatals, which can leave the node limping along. Requires the caller to implement separate assertion handling in release builds, which is easy to forget. Also requires propagating cluster settings, which aren't always available.

  • buildutil.CrdbTestBuild: only enabled in Go tests, not roachtests, roachprod clusters, or production clusters.

  • util.RaceEnabled: only enabled in race builds. Expensive assertions should be possible to run without the additional overhead of the race detector.

For more details and examples, see the must package documentation.

Resolves #94986.
Epic: none
Release note: None

@erikgrinaker erikgrinaker requested a review from a team July 10, 2023 12:07
@erikgrinaker erikgrinaker self-assigned this Jul 10, 2023
@erikgrinaker erikgrinaker requested a review from a team July 10, 2023 12:07
@erikgrinaker erikgrinaker requested review from a team as code owners July 10, 2023 12:07
@erikgrinaker erikgrinaker requested review from nkodali and removed request for a team July 10, 2023 12:07
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@tbg tbg requested review from a team, smg260 and renatolabs and removed request for nkodali and a team July 10, 2023 12:30
Copy link
Member

@tbg tbg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just looked at the high level design and UX and everything checks out as far as I'm concerned. I would be happy to see this reviewed by TE (or if TE are too swamped I can review on their behalf, just let me know) and get merged soon.

pkg/util/must/must.go Outdated Show resolved Hide resolved
if ctx.Done() != nil {
return handleRaftReadyStats{}, errors.AssertionFailedf(
"handleRaftReadyRaftMuLocked cannot be called with a cancellable context")
if err := must.Zero(ctx, ctx.Done(),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does must.Nil work here, to match the previous ctx.Done() != nil check?

Copy link
Contributor Author

@erikgrinaker erikgrinaker Jul 10, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately no. must.Nil only works with pointers, since it's generic over *T. I haven't found a way to also include reference types (slices, maps, channels, funcs, etc) using generics, nor interfaces. We could use reflection, but that would be too expensive here.

However, for reference types and interfaces the zero value is nil, so must.Zero() here is exactly equivalent to == nil.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like we need a "nilable" constraint in the upstream package :)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we could enumerate these things manually, to have a constraint that matches all the nilable things. Haven't researched whether the syntax allows for that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I went down that path when implementing Len(), trying to get it to work both with slices, maps, and strings. I sort of got there, but it confused type inference, so we'd have to manually instantiate the generic function at all call sites which gets really annoying.

Might still be possible, but I didn't immediately find a way to do it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do think must.Zero is a surprising developer UX to test nil interface references.

Would it be possible to introduce an alias must.NilInterface or NilReference and recommend its use for that case?

Copy link
Contributor Author

@erikgrinaker erikgrinaker Jul 11, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but we'd need one for each reference type, so about 6 in total (with interfaces), plus the NotNil variants, so 12. I've been hoping to find a way to finagle generics into doing it instead.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

about 6 in total (with interfaces)

I was thinking about focusing on just interfaces, which is the most common.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interfaces are also tricky, because we'd have to handle typed nils -- I'm not sure we can without reflection. But there's always the must.True(ctx, foo == nil) escape hatch too. I agree though, we should try to do something better here.

pkg/kv/kvserver/replica_raft.go Outdated Show resolved Hide resolved
@erikgrinaker erikgrinaker force-pushed the must branch 2 times, most recently from f302de7 to bf16d46 Compare July 10, 2023 13:55
Copy link
Collaborator

@mgartner mgartner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great idea! Overall LGTM - I left some context into types of assertions we use in the optimizer in case that's helpful in steering the API, but I'm not sure if the patterns are widespread, so feel free to ignore.

Reviewed 4 of 4 files at r1, 1 of 1 files at r2, 1 of 1 files at r3.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker, @pavelkalinnikov, @renatolabs, @smg260, and @tbg)


pkg/util/log/logcrash/crash_reporting.go line 441 at r8 (raw file):

	}
	must.MaybeSendReport = func(ctx context.Context, err error) {
		maybeSendCrashReport(ctx, err, ReportTypeAssertionFailure)

maybeSendCrashReport always using global settings, which is discouraged when we have access to non-global settings:

// Should be used only when strictly necessary; use ReportPanic whenever we have
// access to the settings.

Should the package provide versions of each function that have settings arguments?


pkg/util/must/must.go line 44 at r8 (raw file):

// different node (thus stack traces can be insufficient by themselves).
//
// Some assertions may be too expensive to always run. For example, we may want

We have a set of assertions in the optimizer that run for all test builds, but not in release builds. Their cost is somewhere in between "cheap enough to run in release builds" and "too expensive to run for every test". I'm not sure if this type of assertion is used elsewhere, but if it is, it might be worth providing a pattern for.

Another pattern used in the optimizer (and a few other places IIRC) is to panic to propagate an assertion failure upward until it gets turned into an internal error by a panic-catcher.

Translating one of these panics to use must would look something like:

-if col == nil {
-    panic(errors.AssertionFailedf("expected to find column %d in scope", colID))
-}
+if err := must.NotNil(ctx, col, "expected to find column %d in scope", colID); err != nil {
+    panic(err)
+}

They seem ergonomically similar to me, but the latter has added complexity of either fatalling in must.NotNil or returning and error to panic. Should we provide versions of these assertions that always panic? Again, I'm not convinced this is a use case that must should cater to, but thought I'd point it out.

Copy link
Contributor Author

@erikgrinaker erikgrinaker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @mgartner, @pavelkalinnikov, @renatolabs, @smg260, and @tbg)


pkg/util/log/logcrash/crash_reporting.go line 441 at r8 (raw file):

Previously, mgartner (Marcus Gartner) wrote…

maybeSendCrashReport always using global settings, which is discouraged when we have access to non-global settings:

// Should be used only when strictly necessary; use ReportPanic whenever we have
// access to the settings.

Should the package provide versions of each function that have settings arguments?

I generally agree, but I think it's important for an API like this to reduce friction as much as possible, and having to plumb through settings is exactly the kind of friction that might discourage people from using it.

I'm feeling kind of okay about this because log.Fatalf does exactly the same thing, for exactly the same reason. I think the main argument for plumbing through settings here is testability, and I think that's an ok tradeoff to make? Does plumbing through settings give us any other benefit?


pkg/util/must/must.go line 44 at r8 (raw file):

Their cost is somewhere in between "cheap enough to run in release builds" and "too expensive to run for every test". I'm not sure if this type of assertion is used elsewhere, but if it is, it might be worth providing a pattern for.

How about something like must.TestOnly(func() { ... }), which is similar to must.Expensive() but gated on CrdbTestBuild?

Another pattern used in the optimizer (and a few other places IIRC) is to panic to propagate an assertion failure upward until it gets turned into an internal error by a panic-catcher.

Right, we could add a helper for this, something like:

must.PanicIf(must.NotNil(ctx, col, "expected to find column %d in scope", colID))

I believe this is marginally better, because in non-release builds the log.Fatalf is guaranteed to not be caught by any panic handlers anywhere, so it'll always fail loudly.

Copy link
Collaborator

@mgartner mgartner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker, @pavelkalinnikov, @renatolabs, @smg260, and @tbg)


pkg/util/log/logcrash/crash_reporting.go line 441 at r8 (raw file):

Does plumbing through settings give us any other benefit?

I'm trying to figure that out, but now confused. It looks like only the cli package sets global settings. If that was the case, then crash reports would never be sent. I must be missing something.

Did you test that Sentry reports are created?


pkg/util/must/must.go line 44 at r8 (raw file):

How about something like must.TestOnly(func() { ... }), which is similar to must.Expensive() but gated on CrdbTestBuild?

SGTM!

I believe this is marginally better, because in non-release builds the log.Fatalf is guaranteed to not be caught by any panic handlers anywhere, so it'll always fail loudly.

👍 We actually rely a bit on the distinction between an internal error that is caught (looks like ERROR: internal error ...) versus a node-crashing panic. The latter causes much more problems for customers so we urgently prioritize fixing those. But we can figure out a workaround for this as we translate our assertions.

Copy link
Contributor

@renatolabs renatolabs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great, thanks for setting this up, Erik!

I'd be happy for us to adopt an API like this throughout the codebase. Should we also mark CrdbTestBuild as deprecated in the code? It'd be nice if we eventually removed that build tag and replaced its current uses with calls to must and must.Expensive.

pkg/util/must/must.go Outdated Show resolved Hide resolved
pkg/kv/kvserver/replica_raft.go Outdated Show resolved Hide resolved
Copy link
Contributor

@renatolabs renatolabs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker, @mgartner, @pavelkalinnikov, @smg260, and @tbg)


pkg/util/must/must.go line 44 at r8 (raw file):

How about something like must.TestOnly(func() { ... }), which is similar to must.Expensive() but gated on CrdbTestBuild?

Maybe I misunderstand, but how is TestOnly different from Expensive? AFAICT, both CrdbTestBuild and Invariants are set to true based on the same crdb_test condition.

@rickystewart
Copy link
Collaborator

Maybe I misunderstand, but how is TestOnly different from Expensive? AFAICT, both CrdbTestBuild and Invariants are set to true based on the same crdb_test condition.

Looks like Invariants is set on the build tags invariants and race, NOT crdb_test.

@renatolabs
Copy link
Contributor

Ah right, I misread the build tags, thank you.

I'm still wondering to what extent we can unify TestOnly (if we decide to add it) and Expensive. The former would have the same limitation mentioned in the PR description ("only enabled in Go tests, not roachtests, roachprod clusters, or production clusters"), and might provide a false sense of coverage since Test is ambiguous (e.g., it's common for people to think that their assertions are running in roachtests).

Copy link
Member

@srosenberg srosenberg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good on a first pass... Several things I am not a fan of,

  • allowing assertion failure to be non-fatal is a can of worms (see my earlier comment [1])
  • Expensive is an unnecessary addition--compiling assertions into a release build is by design an expensive choice
  • syntax sugar like Equal, NotEqual, Less, etc. detracts from the parsimonious design; whatever value it adds, it negates it by making the API more error-prone, imo

In my mind, an ideal design would expose nothing more than must.True and must.False under the semantics that if you compiled with crdb_test, and the invoked predicate evaluates to false, then you crash. The reason for such (extreme) minimalism is I shouldn't have to think how an assertion is evaluated, much less which predicate to use to assert it. That often leads to errors, in my experience. An assertion mechanism should be as simple as the meaning of the logical predicate, P(x).

[1] #102041 (comment)

Reviewed 1 of 5 files at r12.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker, @mgartner, @pavelkalinnikov, @smg260, and @tbg)


pkg/util/must/must.go line 61 at r13 (raw file):

//	}
//
// Example: double-stopping a component. In release builds, we can simply ignore

Fundamentally, it seems wrong to ignore an invariant violation in release builds. Remember, software testing doesn't guarantee an absence of errors. We've seen bugs that are hard to nearly impossible to reproduce in our test environments.

Copy link
Contributor Author

@erikgrinaker erikgrinaker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm still wondering to what extent we can unify TestOnly (if we decide to add it) and Expensive.

Yeah, I sort of feel the same way. The idea would be to set up an additional nightly run of expensive/invariant tests, which would run the same set of tests that are covered by CrdbTestBuild, so it isn't clear to me what value we get from gating them on CrdbTestBuild instead of Expensive. They'll still exercise exactly the same tests, with the same frequency. But I don't feel very strongly about this, if SQL have a strong preference for continuing to use CrdbTestBuild then that's ok by me.

allowing assertion failure to be non-fatal is a can of worms

We regularly see fatal assertions cause hour-long outages, which may require shipping a custom binary to disable the assertion (which again takes hours). Fatal assertions in production are too big of a risk, and makes people scared of using assertions. They are also largely unnecessary, because most failures are scoped to a single request/transaction/range rather than an entire node, so killing the node causes a ton of collateral damage. We do want to know when they happen in the wild though, because they indicate uncaught bugs and gaps in test coverage.

Expensive is an unnecessary addition--compiling assertions into a release build is by design an expensive choice

There is a big difference between a boolean conditional and something like MVCC stats assertions or Pebble's invariant checks -- the former is negligible, the latter will reduce end-to-end performance by at least one order of magnitude, and is entirely unsuitable to enable in non-release builds, but provides valuable coverage.

syntax sugar like Equal, NotEqual, Less, etc. detracts from the parsimonious design; whatever value it adds, it negates it by making the API more error-prone, imo

The same could be said for Testify -- it's equivalent to t.Fatal, and yet people clearly prefer it. If it makes people write more tests, then I'm all for it.

I think one of the main values for these helpers is in always including argument values in assertion failures, so that people don't have to do this by hand over and over -- or worse yet, forget it and we get a useless failure report. But one could also imagine other conveniences, like printing diffs when reporting struct inequality, or less trivial assertion logic -- I've kept it pretty bare-bones initially.

None of this is essential, but the point of a convenience API is to be, well, convenient, which encourages wider use.

In my mind, an ideal design would expose nothing more than must.True and must.False under the semantics that if you compiled with crdb_test, and the invoked predicate evaluates to false, then you crash.

What would we do in a release build when encountering the condition that would have caused the compiled-out assertion to fire? Ignore it? With most existing assertions this would currently result in correctness violations. I think there's a valuable middle ground between "always fatal" and "always ignore it". Much of the motivation here is precisely to get coverage in real production environments, with minimal disruption, because we continue to hit issues that aren't caught by our tests.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @mgartner, @pavelkalinnikov, @renatolabs, @smg260, @srosenberg, and @tbg)


pkg/util/log/logcrash/crash_reporting.go line 441 at r8 (raw file):

Previously, mgartner (Marcus Gartner) wrote…

Does plumbing through settings give us any other benefit?

I'm trying to figure that out, but now confused. It looks like only the cli package sets global settings. If that was the case, then crash reports would never be sent. I must be missing something.

Did you test that Sentry reports are created?

The cli package is the entry point of the cockroach binary, so it's hooked up in the binary, which is what we care about here.


pkg/util/must/must.go line 61 at r13 (raw file):

Previously, srosenberg (Stan Rosenberg) wrote…

Fundamentally, it seems wrong to ignore an invariant violation in release builds. Remember, software testing doesn't guarantee an absence of errors. We've seen bugs that are hard to nearly impossible to reproduce in our test environments.

We can ignore it when it's safe to do so. The caller is responsible for judging what the appropriate action is in release builds, just like they're responsible for judging what to do with any other error condition.

Copy link
Contributor Author

@erikgrinaker erikgrinaker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Raft assertions will need some more scrutiny, so I've pulled them out of this PR. Instead, I added a handful of simpler but representative examples.

I'll write up a bunch of follow-up issues for the remaining work tomorrow.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aliher1911, @knz, @mgartner, @pavelkalinnikov, @renatolabs, @smg260, @srosenberg, and @tbg)

@erikgrinaker
Copy link
Contributor Author

erikgrinaker commented Jul 23, 2023

@erikgrinaker
Copy link
Contributor Author

I'm going to merge this, for any further concerns I suggest opening an issue and we can discuss there.

bors r+

Needed to pull in `constraints.Ordered`.

Epic: none
Release note: None
Epic: none
Release note: None
This patch adds a convenient and canonical API for runtime assertions,
inspired by the Testify package used for Go test assertions. It is
intended to encourage liberal use of runtime assertions throughout the
code base, by making it as easy as possible to write assertions that
follow best practices. It does not attempt to reinvent the wheel, but
instead builds on existing infrastructure.

Assertion failures are fatal in all non-release builds, including
roachprod clusters and roachtests, to ensure they will be noticed. In
release builds, they instead log the failure and report it to Sentry (if
enabled), and return an assertion error to the caller for propagation.
This avoids excessive disruption in production environments, where an
assertion failure is often scoped to an individual RPC request,
transaction, or range, and crashing the node can turn a minor problem
into a full-blown outage. It is still possible to kill the node when
appropriate via `log.Fatalf`, but this should be the exception rather
than the norm.

It also supports expensive assertions that must be compiled out of
normal dev/test/release builds for performance reasons. These are
instead enabled in special test builds.

This is intended to be used instead of other existing assertion
mechanisms, which have various shortcomings:

* `log.Fatalf`: kills the node even in release builds, which can cause
  severe disruption over often minor issues.

* `errors.AssertionFailedf`: only suitable when we have an error return
  path, does not fatal in non-release builds, and are not always
  notified in release builds.

* `logcrash.ReportOrPanic`: panics rather than fatals, which can leave
  the node limping along. Requires the caller to implement separate
  assertion handling in release builds, which is easy to forget. Also
  requires propagating cluster settings, which aren't always available.

* `buildutil.CrdbTestBuild`: only enabled in Go tests, not roachtests,
  roachprod clusters, or production clusters.

* `util.RaceEnabled`: only enabled in race builds. Expensive assertions
  should be possible to run without the additional overhead of the race
  detector.

For more details and examples, see the `must` package documentation.

Epic: none
Release note: None
Epic: none
Release note: None
@craig
Copy link
Contributor

craig bot commented Jul 28, 2023

Canceled.

@erikgrinaker
Copy link
Contributor Author

bors r+

@craig
Copy link
Contributor

craig bot commented Jul 28, 2023

This PR was included in a batch that was canceled, it will be automatically retried

@craig craig bot merged commit 04c91a5 into cockroachdb:master Jul 28, 2023
@craig
Copy link
Contributor

craig bot commented Jul 28, 2023

Build succeeded:

craig bot pushed a commit that referenced this pull request Aug 11, 2023
108272: util/must: revert API r=erikgrinaker a=erikgrinaker

This patch reverts #106508, since `@RaduBerinde` [pointed out](#107790 (review)) a performance flaw where it will often incur an allocation on the happy path due to interface boxing of the format args. Other options are considered in #108169.

We'll revisit runtime assertions with a different API that avoids this cost on the happy path.

Co-authored-by: Erik Grinaker <grinaker@cockroachlabs.com>
@erikgrinaker erikgrinaker deleted the must branch November 14, 2023 10:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

testing: runtime assertions
10 participants