log: add consistent tagging of failed assertions that are alertworthy #102041

dhartunian · 2023-04-21T21:18:07Z

Sometimes in production we have situations that don't rise to the level of panic (programmer error) or Fatalf (node should crash) but still want to alert the operator that something serious is going on that requires immediate attention. This cannot be simply codified into an ERROR severity log since not all ERROR level lines rise to the level of requiring operator involvement.

This ticket tracks consistent messaging for these types of errors so that they can be alerted upon easily. This can be done either through a log tag, or through some pre-set formatting structure in the error message.

Jira issue: CRDB-27233

The text was updated successfully, but these errors were encountered:

joshimhoff · 2023-05-15T14:02:33Z

I think the specific concept missing is "assertion has failed that indicates a code bug somewhere in CRDB period, but no need to crash the node given details of the assertion failure". So node liveness failing does not indicate this, even though it is bad. OTOH, see the discussion in this internal postmortem (https://cockroachlabs.atlassian.net/wiki/spaces/ENG/pages/2985951249/2023-04-14+Postmortem+on+replicated+corruption+in+CC+clusters+following+use+of+crdb+internal.probe+ranges), culminating in:

[josh] is the warning in this case “there’s an intent in the liveness range!? Clearly something is fubar”. Missing concept in alert for a “very unexpected scenario”.

Intent on liveness range may not require restarting CRDB, but it does definitely indicate a code bug.

joshimhoff · 2023-05-15T14:03:21Z

CC @nvanbenschoten as this relates to an idea from Nathan / KV re: building a better assertion framework for roachprod-based tests.

nvanbenschoten · 2023-05-15T14:20:15Z

There's a thread about better, easier runtime assertion in #94986. As that thread discusses, the three considerations are 1) ergonomics, 2) cost in production, and 3) effect in production.

I'm bullish that if we get the framework right, we'll see quick adoption of assertions throughout the codebase. This is what we saw with the adoption of the testify/require library. The library was easy enough to use that developers began reaching for it in place of older approaches. In fact, it was so low enough that developers began writing better tests with more assertions.

srosenberg · 2023-05-21T23:50:48Z

I think the specific concept missing is "assertion has failed that indicates a code bug somewhere in CRDB period, but no need to crash the node given details of the assertion failure".

So, are we merely hinting at a new log-level? I can't say I've come across other systems that sandwich a new log-level between WARN, ERROR, and FATAL. If I understand the ask, we want to easily surface some assertion failures without crashing a node?
It's worth noting that non-fatal assertions can be rather counter-productive [1], which is one of the reasons you don't see them in the wild.

[1] https://blog.mozilla.org/nnethercote/2011/06/07/what-is-the-point-of-non-fatal-assertions/

srosenberg · 2023-05-21T23:55:22Z

This is what we saw with the adoption of the testify/require library. The library was easy enough to use that developers began reaching for it in place of older approaches. In fact, it was so low enough that developers began writing better tests with more assertions.

That's great to hear! However, we should also be wary of the error-prone API which comes with the testify/require library, e.g. [1], [2].

[1] #101028
[2] #100796

dhartunian · 2023-09-05T13:56:45Z

There is a bunch of working going on in #94986 to support better runtime assertions. Closing this as duplicate.

dhartunian added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) O-postmortem Originated from a Postmortem action item. T-observability-inf labels Apr 21, 2023

dhartunian self-assigned this Apr 21, 2023

blathers-crl bot added the A-observability-inf label Apr 21, 2023

srosenberg mentioned this issue Jul 11, 2023

util/must: add runtime assertion API #106508

Merged

dhartunian added the db-cy-23 label Jul 20, 2023

dhartunian closed this as completed Sep 5, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

log: add consistent tagging of failed assertions that are alertworthy #102041

log: add consistent tagging of failed assertions that are alertworthy #102041

dhartunian commented Apr 21, 2023 •

edited by cockroach-jira-scripts

Loading

joshimhoff commented May 15, 2023

joshimhoff commented May 15, 2023

nvanbenschoten commented May 15, 2023

srosenberg commented May 21, 2023 •

edited

Loading

srosenberg commented May 21, 2023

dhartunian commented Sep 5, 2023

log: add consistent tagging of failed assertions that are alertworthy #102041

log: add consistent tagging of failed assertions that are alertworthy #102041

Comments

dhartunian commented Apr 21, 2023 • edited by cockroach-jira-scripts Loading

joshimhoff commented May 15, 2023

joshimhoff commented May 15, 2023

nvanbenschoten commented May 15, 2023

srosenberg commented May 21, 2023 • edited Loading

srosenberg commented May 21, 2023

dhartunian commented Sep 5, 2023

dhartunian commented Apr 21, 2023 •

edited by cockroach-jira-scripts

Loading

srosenberg commented May 21, 2023 •

edited

Loading