log: add consistent tagging of failed assertions that are alertworthy #102041
Comments
I think the specific concept missing is "an assertion has failed that indicates a code bug somewhere in CRDB, period, but there is no need to crash the node given the details of the assertion failure". So node liveness failing does not indicate this, even though it is bad. OTOH, see the discussion in this internal postmortem (https://cockroachlabs.atlassian.net/wiki/spaces/ENG/pages/2985951249/2023-04-14+Postmortem+on+replicated+corruption+in+CC+clusters+following+use+of+crdb+internal.probe+ranges), culminating in:
Intent on liveness range may not require restarting CRDB, but it does definitely indicate a code bug.
CC @nvanbenschoten as this relates to an idea from Nathan / KV re: building a better assertion framework for roachprod-based tests.
There's a thread about better, easier runtime assertion in #94986. As that thread discusses, the three considerations are 1) ergonomics, 2) cost in production, and 3) effect in production. I'm bullish that if we get the framework right, we'll see quick adoption of assertions throughout the codebase. This is what we saw with the adoption of the
So, are we merely hinting at a new log-level? I can't say I've come across other systems that sandwich a new log-level between [1] https://blog.mozilla.org/nnethercote/2011/06/07/what-is-the-point-of-non-fatal-assertions/
That's great to hear! However, we should also be wary of the error-prone API which comes with the testify/require library, e.g. [1], [2].
There is a bunch of work going on in #94986 to support better runtime assertions. Closing this as a duplicate.
Sometimes in production we have situations that don't rise to the level of `panic` (programmer error) or `Fatalf` (node should crash) but still want to alert the operator that something serious is going on that requires immediate attention. This cannot be simply codified into an `ERROR` severity log since not all `ERROR` level lines rise to the level of requiring operator involvement.

This ticket tracks consistent messaging for these types of errors so that they can be alerted upon easily. This can be done either through a log tag, or through some pre-set formatting structure in the error message.
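
For illustration only, here is a rough sketch of what the "log tag" option could look like. It uses only the Go standard library and is not CockroachDB's logging API. The tag string `[ALERTWORTHY-ASSERTION]`, the leading `E` severity marker, and the helper names `formatAssertionFailure` and `isAlertworthy` are all hypothetical, chosen just to show the shape of the idea: the node keeps running, but the line carries a fixed marker that an alerting rule can match on.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// alertTag is a hypothetical marker; the issue does not prescribe a
// specific tag string or format.
const alertTag = "[ALERTWORTHY-ASSERTION]"

// formatAssertionFailure builds an error-severity style line that carries
// the tag. It neither panics nor exits, so the process keeps running.
func formatAssertionFailure(format string, args ...any) string {
	return fmt.Sprintf("E %s %s", alertTag, fmt.Sprintf(format, args...))
}

// isAlertworthy reports whether a log line carries the tag. An alerting
// pipeline could apply the same substring match to collected logs.
func isAlertworthy(line string) bool {
	return strings.Contains(line, alertTag)
}

func main() {
	// Hypothetical failed assertion: a code bug, but not worth a crash.
	line := formatAssertionFailure("unexpected intent on range r%d", 2)
	fmt.Fprintln(os.Stderr, line)

	// An ordinary error line would not match, so it would not page anyone.
	fmt.Println(isAlertworthy(line), isAlertworthy("E disk stall detected"))
}
```

The point of a fixed tag over free-form message text is that the alerting rule stays a simple, stable match even as individual assertion messages change.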
Jira issue: CRDB-27233