-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
sql: adjust recently added logging events for txn row count guardrails #69945
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @knz and @michae2)
pkg/sql/conn_executor_exec.go, line 1168 at r1 (raw file):
// handleTxnRowsGuardrails handles either "written" or "read" rows guardrails. func (ex *connExecutor) handleTxnRowsGuardrails(
I've stared at this method for quite some time, but I don't see how to make it less complex without duplicating logic (i.e. extracting separate methods for "external" vs "internal" executors, etc), so I'm looking forward to suggestions on how to improve here. Any comments that I could think of seemed quite obvious to me (maybe because I stared for too long).
pkg/util/log/eventpb/sql_audit_events.proto, line 133 at r1 (raw file):
// LimitValue indicates the value of the transaction row count limit that was // reached. int64 limit_value = 3 [(gogoproto.jsontag) = ",omitempty"];
I removed two of the fields Raphael mentioned in the issue, but I think it is worth keeping the "limit" field, so I renamed it, but I'm open to suggestions on the naming and/or comments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that IsRead
was redundant with the event types, but I think I might have actually done the opposite 😆 (that is, I would have unified the read/written event types into a single type, and kept a boolean field indicating whether this was rows read or written). That would simplify handleTxnRowsGuardrails
a bit and also simplify any readers of the structured events. (But that's pretty subjective, I think what you've done here is also fine. Maybe @knz has a stronger opinion?)
Another idea for simplifying handleTxnRowsGuardrails
: replace LimitValue
with NumRows
which then wouldn't require all the logLimit =
and might be easier for a customer to understand.
@knz If you took issue with these structured events, then I wonder what you think of CommonLargeRowDetails
used by LargeRow
and LargeRowInternal
? 🤔 Maybe we should also get rid of ViolatesMaxRowSizeErr
in those events?
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @knz and @yuzefovich)
pkg/sql/conn_executor_exec.go, line 1205 at r1 (raw file):
Quoted 9 lines of code…
if shouldErr { // But this time we also reached the error limit, so we want to log // an event again (it will have ViolatesTxnRowsLimitErr set to // true). Note that we couldn't have reached the error limit when we // logged the event the previous time because that would have // aborted the execution of the transaction. shouldLog = true logLimit = errLimit } else {
We could consider deleting this logic and moving the rest of this alreadyLogged
check up above the executorTypeInternal
check. I don't really think we have to log if shouldErr
is true. And don't we already have a way to log all query errors?
ae562cb
to
33e1340
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the suggestions. I changed LimitValue
to NumRows
, but I prefer to keep the distinction between different error types so that the errors were more useful.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @knz and @michae2)
pkg/sql/conn_executor_exec.go, line 1168 at r1 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
I've stared at this method for quite some time, but I don't see how to make it less complex without duplicating logic (i.e. extracting separate methods for "external" vs "internal" executors, etc), so I'm looking forward to suggestions on how to improve here. Any comments that I could think of seemed quite obvious to me (maybe because I stared for too long).
Ok, I think the logic here is now pretty simple.
pkg/sql/conn_executor_exec.go, line 1205 at r1 (raw file):
Previously, michae2 (Michael Erickson) wrote…
if shouldErr { // But this time we also reached the error limit, so we want to log // an event again (it will have ViolatesTxnRowsLimitErr set to // true). Note that we couldn't have reached the error limit when we // logged the event the previous time because that would have // aborted the execution of the transaction. shouldLog = true logLimit = errLimit } else {
We could consider deleting this logic and moving the rest of this
alreadyLogged
check up above theexecutorTypeInternal
check. I don't really think we have to log ifshouldErr
is true. And don't we already have a way to log all query errors?
Thanks, done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @knz and @yuzefovich)
pkg/util/log/eventpb/sql_audit_events.go, line 63 at r2 (raw file):
func (d *CommonTxnRowsLimitDetails) Error(kind string) string { return fmt.Sprintf( "txn reached the number of rows %s (%d): TxnID %v SessionID %v",
nit: Now that we're using NumRows
instead of LimitValue
we should change this message to something more like "txn %s %s rows, which is above the limit: TxnID %v SessionID %v"
.
pkg/util/log/eventpb/sql_audit_events.go, line 79 at r2 (raw file):
if p.Detail() { p.Printf( "txn reached the number of rows %s (%d): TxnID %v SessionID %v",
Same nit here (message should be changed now that we're using NumRows
instead of LimitValue
).
pkg/util/log/eventpb/sql_audit_events.proto, line 131 at r2 (raw file):
// SessionID is the ID of the session that initiated the transaction. string session_id = 2 [(gogoproto.customname) = "SessionID", (gogoproto.jsontag) = ",omitempty", (gogoproto.moretags) = "redact:\"nonsensitive\""]; // NumRows is the number of rows written/read (depending on the context) by
nit: Instead of "context" could we say "event type"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @knz and @yuzefovich)
pkg/sql/conn_executor_exec.go, line 1102 at r2 (raw file):
} const writtenKind = "written"
nit: Is this used anywhere? I would be fine with using string literals directly in both SafeFormatError
methods. Also if you take my message-changing suggestion then you should also change this to "wrote" 🙂
Release note (ops change): the meaning of the recently introduced `transaction_rows_written_err` and `transaction_rows_read_err` (as well as the corresponding `_log` variables) have been adjusted a bit to indicate the largest number of rows that is still allowed. In other words, originally reaching the limit would result in an error, and now only exceeding the limit would. Release justification: low-risk adjustment to new functionality.
33e1340
to
363b3ab
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @knz and @michae2)
pkg/sql/conn_executor_exec.go, line 1102 at r2 (raw file):
Previously, michae2 (Michael Erickson) wrote…
nit: Is this used anywhere? I would be fine with using string literals directly in both
SafeFormatError
methods. Also if you take my message-changing suggestion then you should also change this to "wrote" 🙂
Inlined the literals but kept "written" by rephrasing your suggestion.
pkg/util/log/eventpb/sql_audit_events.go, line 63 at r2 (raw file):
Previously, michae2 (Michael Erickson) wrote…
nit: Now that we're using
NumRows
instead ofLimitValue
we should change this message to something more like"txn %s %s rows, which is above the limit: TxnID %v SessionID %v"
.
Good point, done. I added another commit to adjust the meaning of the rows written/read guardrails to be more intuitive (which allow for being precise while using your suggestion).
This commit refactors the recently introduced `CommonTxnRowsLimitDetails` protobuf struct to make it more user-friendly (since it is used for logging and errors). Namely, the following changes are made: - `Limit` is changed to `NumRows` making it easier to understand - `ViolatesTxnRowsLimitErr` is removed since it is confusing - `IsRead` is removed too since it is redundant (the types of logged events contain the necessary "kind" information and this commit extends the logic of errors to contain the "kind" info too) and is confusing. Additionally, we will no longer log twice if `_err` guardrail is reached after `_log` was reached previously. Release note: None (no release since this was introduced) Release justification: an update to the new functionality.
363b3ab
to
3460a19
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @knz)
I think the concerns raised by Raphael have been addressed, so I'm merging this. TFTR! bors r+ |
Build failed (retrying...): |
Build failed (retrying...): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM on the logging event, but I just noticed something
- incorrect/superfluous use of
p.Detail
in aSafeFormat
method - naming for the cluster settings
Reviewed 11 of 11 files at r3, 8 of 8 files at r4, 2 of 2 files at r5, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @yuzefovich)
pkg/sql/conn_executor_exec.go, line 1110 at r5 (raw file):
// implements. func (e *txnRowsWrittenLimitErr) Error() string { return e.CommonTxnRowsLimitDetails.Error("written")
return errors.Sprintf("%v", e)
pkg/sql/exec_util.go, line 602 at r3 (raw file):
var txnRowsWrittenLog = settings.RegisterIntSetting( "sql.defaults.transaction_rows_written_log",
sql.defaults.transaction_max_rows.writes.logging_threshold
pkg/sql/exec_util.go, line 611 at r3 (raw file):
var txnRowsWrittenErr = settings.RegisterIntSetting( "sql.defaults.transaction_rows_written_err",
sql.defaults.transaction_max_rows.writes.force_abort_threshold
pkg/sql/exec_util.go, line 620 at r3 (raw file):
var txnRowsReadLog = settings.RegisterIntSetting( "sql.defaults.transaction_rows_read_log",
sql.defaults.transaction_max_rows.reads.logging_threshold
pkg/sql/exec_util.go, line 629 at r3 (raw file):
var txnRowsReadErr = settings.RegisterIntSetting( "sql.defaults.transaction_rows_read_err",
sql.defaults.transaction_max_rows.reads.force_abort_threshold
pkg/util/log/eventpb/sql_audit_events.go, line 73 at r3 (raw file):
// Error is part of the error interface, which CommonTxnRowsLimitDetails // implements. func (d *CommonTxnRowsLimitDetails) Error() string {
simpler: return fmt.Sprintf("%v", d)
which will automatically cascade to Format
and SafeFormatError
below.
pkg/util/log/eventpb/sql_audit_events.go, line 95 at r3 (raw file):
// CommonTxnRowsLimitDetails implements. func (d *CommonTxnRowsLimitDetails) SafeFormatError(p errors.Printer) (next error) { if p.Detail() {
I'd remove if p.Detail()
since the details are meant to be included in the base case.
pkg/util/log/eventpb/sql_audit_events.go, line 98 at r3 (raw file):
p.Printf( "txn exceeded the number of rows %s (%d): TxnID %v SessionID %v", d.kind(), d.Limit, redact.SafeString(d.TxnID), redact.SafeString(d.SessionID),
I'm pretty sure you do not need redact.SafeString
here? I thought the transaction and session IDs are already safe types?
Build succeeded: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 1 of 0 LGTMs obtained
pkg/sql/exec_util.go, line 602 at r3 (raw file):
Previously, knz (kena) wrote…
sql.defaults.transaction_max_rows.writes.logging_threshold
This would make the corresponding session setting transaction_max_rows.writes.logging_threshold
which would be the first session setting with periods in its name.
Some background: the names originally came from this document and were then discussed here and here and here. We decided to keep these names to avoid adding session settings with periods.
These cluster settings were not named in isolation, they match:
- the corresponding session settings (e.g.
transaction_rows_written_log
) - the corresponding metrics (e.g.
sql.guardrails.transaction_rows_written_log.count
) - the large row cluster settings (e.g.
sql.guardrails.max_row_size_log
) which were renamed to match these names - the large row metrics (e.g.
sql.guardrails.max_row_size_log.count
)
We've already had sign-offs from @vy-ton and @rytaft on the existing names, and the customer who requested the feature has already seen them. And I've already started backporting the large row guardrails. I'm reluctant for us to do additional bikeshedding at this point.
pkg/util/log/eventpb/sql_audit_events.go, line 98 at r3 (raw file):
Previously, knz (kena) wrote…
I'm pretty sure you do not need
redact.SafeString
here? I thought the transaction and session IDs are already safe types?
TxnID
and SessionID
are plain old strings in CommonTxnRowsLimitDetails
.
ok thanks for explaining |
I'm still interested in a followup to the |
I'm creating a PR to remove |
Addresses: cockroachdb#67400, cockroachdb#69477 Remove ViolatesMaxRowSizeErr from CommonLargeRowDetails, as was done for CommonTxnRowsLimitDetails in cockroachdb#69945. Also remove the SafeDetails methods from CommonLargeRowDetails, txnRowsReadLimitErr, and txnRowsWrittenLimitErr, as I don't think we need them. Release note: None (there was no release between the introduction of the LargeRow and LargeRowInternal events and this commit).
69300: jobs: retry non-cancelable running and all reverting jobs r=ajwerner a=sajjadrizvi Previously, non-cancelable jobs were retried in running state only if their errors were marked as retryable. Moreover, only non-cancelable reverting jobs were retried by default. This commit makes non-cancelable jobs always retry in running state unless their error is marked as a permanent error. In addition, this commit makes all reverting jobs to retry when they fail. As a result, non-cancelable running jobs and all reverting jobs do not fail due to transient errors. Release justification: low-risk updates to new functionality. Release note (general change): Non-cancelable jobs, such as schema-change GC jobs, now do not fail unless they fail with a permanent error. They retry with exponential-backoff if they fail due to a transient error. Furthermore, Jobs that perform reverting tasks do not fail. Instead, they are retried with exponential-backoff if an error is encountered while reverting. As a result, transient errors do not impact the jobs that are reverting. Fixes: #66685 69982: docs/tech-notes: admission control overview r=sumeerbhola a=sumeerbhola Release justification: Non-production code change. Release note: None 70094: tenantcostserver: fix erroneous panic in tests r=RaduBerinde a=RaduBerinde The test-only code that checks the invariants of the `tenant_usage` table inadvertently panics if the query hits an error (such as one that would be expected if the server is shutting down). We now just log the error instead. Fixes #70089. Release note: None Release justification: non-production code change to fix test failure. 70095: tenantcostclient: restrict allowed configuration from the tenant side r=RaduBerinde a=RaduBerinde This change restricts the configuration of tenant cost control from the tenant side. In the future, we will want to have settings where the values come from the host cluster but we don't have that infrastructure today. With tenants being able to set their own settings, they could easily sabotage the cost control mechanisms. This change restricts the allowed values for the target period and the CPU usage allowance, and fixes the cost model configuration to the default. Release note: None Release justification: Necessary fix for the distributed rate limiting functionality, which is vital for the upcoming Serverless MVP release. It allows CRDB to throttle clusters that have run out of free or paid request units (which measure CPU and I/O usage). This functionality is only enabled in multi-tenant scenarios and should have no impact on our dedicated customers. 70102: sql: clean up large row errors and events r=knz,yuzefovich a=michae2 Addresses: #67400, #69477 Remove ViolatesMaxRowSizeErr from CommonLargeRowDetails, as was done for CommonTxnRowsLimitDetails in #69945. Also remove the SafeDetails methods from CommonLargeRowDetails, txnRowsReadLimitErr, and txnRowsWrittenLimitErr, as I don't think we need them. Release note: None (there was no release between the introduction of the LargeRow and LargeRowInternal events and this commit). 70118: kv: lock mutexes for `TxnCoordSender.Epoch()` and `Txn.status()` r=ajwerner a=erikgrinaker ### kvcoord: lock mutex in `TxnCoordSender.Epoch()` Methods that access `TxnCoordSender.mu` fields must lock the mutex first. `Epoch()` didn't. Resolves #70071. Release note: None ### kv: fix mutex locking for `Txn.status` `Txn.status()` fetches the transaction status from the mutex-protected `Txn.mu.sender` field, but callers did not take out the mutex lock when calling it. This patch renames the method to `Txn.statusLocked()`, and updates all callers to take out the lock before calling it. Release note: None Co-authored-by: Sajjad Rizvi <sajjad@cockroachlabs.com> Co-authored-by: sumeerbhola <sumeer@cockroachlabs.com> Co-authored-by: Radu Berinde <radu@cockroachlabs.com> Co-authored-by: Michael Erickson <michae2@cockroachlabs.com> Co-authored-by: Erik Grinaker <grinaker@cockroachlabs.com>
Addresses: cockroachdb#67400, cockroachdb#69477 Remove ViolatesMaxRowSizeErr from CommonLargeRowDetails, as was done for CommonTxnRowsLimitDetails in cockroachdb#69945. Also remove the SafeDetails methods from CommonLargeRowDetails, txnRowsReadLimitErr, and txnRowsWrittenLimitErr, as I don't think we need them. Release justification: low-risk update to new functionality. Release note: None (there was no release between the introduction of the LargeRow and LargeRowInternal events and this commit).
Addresses: cockroachdb#67400, cockroachdb#69477 Remove ViolatesMaxRowSizeErr from CommonLargeRowDetails, as was done for CommonTxnRowsLimitDetails in cockroachdb#69945. Also remove the SafeDetails methods from CommonLargeRowDetails, txnRowsReadLimitErr, and txnRowsWrittenLimitErr, as I don't think we need them. Release note: None (there was no release between the introduction of the LargeRow and LargeRowInternal events and this commit).
sql: adjust txn rows written/read guardrails a bit
Release note (ops change): the meaning of the recently introduced
transaction_rows_written_err
andtransaction_rows_read_err
(as wellas the corresponding
_log
variables) have been adjusted a bit toindicate the largest number of rows that is still allowed. In other
words, originally reaching the limit would result in an error, and now
only exceeding the limit would.
Release justification: low-risk adjustment to new functionality.
sql: adjust recently added logging events for txn row count guardrails
This commit refactors the recently introduced
CommonTxnRowsLimitDetails
protobuf struct to make it moreuser-friendly (since it is used for logging and errors). Namely, the
following changes are made:
Limit
is changed toNumRows
making it easier to understandViolatesTxnRowsLimitErr
is removed since it is confusingIsRead
is removed too since it is redundant (the types of loggedevents contain the necessary "kind" information and this commit
extends the logic of errors to contain the "kind" info too) and is
confusing.
Additionally, we will no longer log twice if
_err
guardrail is reachedafter
_log
was reached previously.Fixes: #69477.
Release note: None (no release since this was introduced)
Release justification: an update to the new functionality.