sql: adjust recently added logging events for txn row count guardrails #69945

yuzefovich · 2021-09-08T22:58:39Z

sql: adjust txn rows written/read guardrails a bit

Release note (ops change): the meaning of the recently introduced
transaction_rows_written_err and transaction_rows_read_err (as well
as the corresponding _log variables) have been adjusted a bit to
indicate the largest number of rows that is still allowed. In other
words, originally reaching the limit would result in an error, and now
only exceeding the limit would.

Release justification: low-risk adjustment to new functionality.

sql: adjust recently added logging events for txn row count guardrails

This commit refactors the recently introduced
CommonTxnRowsLimitDetails protobuf struct to make it more
user-friendly (since it is used for logging and errors). Namely, the
following changes are made:

Limit is changed to NumRows making it easier to understand
ViolatesTxnRowsLimitErr is removed since it is confusing
IsRead is removed too since it is redundant (the types of logged
events contain the necessary "kind" information and this commit
extends the logic of errors to contain the "kind" info too) and is
confusing.

Additionally, we will no longer log twice if _err guardrail is reached
after _log was reached previously.

Fixes: #69477.

Release note: None (no release since this was introduced)

Release justification: an update to the new functionality.

cockroach-teamcity · 2021-09-08T22:58:46Z

This change is

yuzefovich

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @knz and @michae2)

pkg/sql/conn_executor_exec.go, line 1168 at r1 (raw file):

// handleTxnRowsGuardrails handles either "written" or "read" rows guardrails.
func (ex *connExecutor) handleTxnRowsGuardrails(

I've stared at this method for quite some time, but I don't see how to make it less complex without duplicating logic (i.e. extracting separate methods for "external" vs "internal" executors, etc), so I'm looking forward to suggestions on how to improve here. Any comments that I could think of seemed quite obvious to me (maybe because I stared for too long).

pkg/util/log/eventpb/sql_audit_events.proto, line 133 at r1 (raw file):

  // LimitValue indicates the value of the transaction row count limit that was
  // reached.
  int64 limit_value = 3 [(gogoproto.jsontag) = ",omitempty"];

I removed two of the fields Raphael mentioned in the issue, but I think it is worth keeping the "limit" field, so I renamed it, but I'm open to suggestions on the naming and/or comments.

michae2

I agree that IsRead was redundant with the event types, but I think I might have actually done the opposite 😆 (that is, I would have unified the read/written event types into a single type, and kept a boolean field indicating whether this was rows read or written). That would simplify handleTxnRowsGuardrails a bit and also simplify any readers of the structured events. (But that's pretty subjective, I think what you've done here is also fine. Maybe @knz has a stronger opinion?)

Another idea for simplifying handleTxnRowsGuardrails: replace LimitValue with NumRows which then wouldn't require all the logLimit = and might be easier for a customer to understand.

@knz If you took issue with these structured events, then I wonder what you think of CommonLargeRowDetails used by LargeRow and LargeRowInternal? 🤔 Maybe we should also get rid of ViolatesMaxRowSizeErr in those events?

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @knz and @yuzefovich)

pkg/sql/conn_executor_exec.go, line 1205 at r1 (raw file):

Quoted 9 lines of code…

		if shouldErr {
			// But this time we also reached the error limit, so we want to log
			// an event again (it will have ViolatesTxnRowsLimitErr set to
			// true). Note that we couldn't have reached the error limit when we
			// logged the event the previous time because that would have
			// aborted the execution of the transaction.
			shouldLog = true
			logLimit = errLimit
		} else {

We could consider deleting this logic and moving the rest of this alreadyLogged check up above the executorTypeInternal check. I don't really think we have to log if shouldErr is true. And don't we already have a way to log all query errors?

yuzefovich

Thanks for the suggestions. I changed LimitValue to NumRows, but I prefer to keep the distinction between different error types so that the errors were more useful.

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @knz and @michae2)

pkg/sql/conn_executor_exec.go, line 1168 at r1 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

I've stared at this method for quite some time, but I don't see how to make it less complex without duplicating logic (i.e. extracting separate methods for "external" vs "internal" executors, etc), so I'm looking forward to suggestions on how to improve here. Any comments that I could think of seemed quite obvious to me (maybe because I stared for too long).

Ok, I think the logic here is now pretty simple.

pkg/sql/conn_executor_exec.go, line 1205 at r1 (raw file):

Previously, michae2 (Michael Erickson) wrote…

		if shouldErr {
			// But this time we also reached the error limit, so we want to log
			// an event again (it will have ViolatesTxnRowsLimitErr set to
			// true). Note that we couldn't have reached the error limit when we
			// logged the event the previous time because that would have
			// aborted the execution of the transaction.
			shouldLog = true
			logLimit = errLimit
		} else {
We could consider deleting this logic and moving the rest of this alreadyLogged check up above the executorTypeInternal check. I don't really think we have to log if shouldErr is true. And don't we already have a way to log all query errors?

Thanks, done.

michae2

Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @knz and @yuzefovich)

pkg/util/log/eventpb/sql_audit_events.go, line 63 at r2 (raw file):

func (d *CommonTxnRowsLimitDetails) Error(kind string) string {
	return fmt.Sprintf(
		"txn reached the number of rows %s (%d): TxnID %v SessionID %v",

nit: Now that we're using NumRows instead of LimitValue we should change this message to something more like "txn %s %s rows, which is above the limit: TxnID %v SessionID %v".

pkg/util/log/eventpb/sql_audit_events.go, line 79 at r2 (raw file):

	if p.Detail() {
		p.Printf(
			"txn reached the number of rows %s (%d): TxnID %v SessionID %v",

Same nit here (message should be changed now that we're using NumRows instead of LimitValue).

pkg/util/log/eventpb/sql_audit_events.proto, line 131 at r2 (raw file):

  // SessionID is the ID of the session that initiated the transaction.
  string session_id = 2 [(gogoproto.customname) = "SessionID", (gogoproto.jsontag) = ",omitempty", (gogoproto.moretags) = "redact:\"nonsensitive\""];
  // NumRows is the number of rows written/read (depending on the context) by

nit: Instead of "context" could we say "event type"?

michae2

Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @knz and @yuzefovich)

pkg/sql/conn_executor_exec.go, line 1102 at r2 (raw file):

}

const writtenKind = "written"

nit: Is this used anywhere? I would be fine with using string literals directly in both SafeFormatError methods. Also if you take my message-changing suggestion then you should also change this to "wrote" 🙂

Release note (ops change): the meaning of the recently introduced `transaction_rows_written_err` and `transaction_rows_read_err` (as well as the corresponding `_log` variables) have been adjusted a bit to indicate the largest number of rows that is still allowed. In other words, originally reaching the limit would result in an error, and now only exceeding the limit would. Release justification: low-risk adjustment to new functionality.

yuzefovich

Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @knz and @michae2)

pkg/sql/conn_executor_exec.go, line 1102 at r2 (raw file):

Previously, michae2 (Michael Erickson) wrote…

nit: Is this used anywhere? I would be fine with using string literals directly in both SafeFormatError methods. Also if you take my message-changing suggestion then you should also change this to "wrote" 🙂

Inlined the literals but kept "written" by rephrasing your suggestion.

pkg/util/log/eventpb/sql_audit_events.go, line 63 at r2 (raw file):

Previously, michae2 (Michael Erickson) wrote…

nit: Now that we're using NumRows instead of LimitValue we should change this message to something more like "txn %s %s rows, which is above the limit: TxnID %v SessionID %v".

Good point, done. I added another commit to adjust the meaning of the rows written/read guardrails to be more intuitive (which allow for being precise while using your suggestion).

This commit refactors the recently introduced `CommonTxnRowsLimitDetails` protobuf struct to make it more user-friendly (since it is used for logging and errors). Namely, the following changes are made: - `Limit` is changed to `NumRows` making it easier to understand - `ViolatesTxnRowsLimitErr` is removed since it is confusing - `IsRead` is removed too since it is redundant (the types of logged events contain the necessary "kind" information and this commit extends the logic of errors to contain the "kind" info too) and is confusing. Additionally, we will no longer log twice if `_err` guardrail is reached after `_log` was reached previously. Release note: None (no release since this was introduced) Release justification: an update to the new functionality.

michae2

Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @knz)

yuzefovich · 2021-09-09T22:16:37Z

I think the concerns raised by Raphael have been addressed, so I'm merging this.

TFTR!

bors r+

craig · 2021-09-10T02:02:09Z

Build failed (retrying...):

GitHub CI (Cockroach)

craig · 2021-09-10T04:17:44Z

Build failed (retrying...):

GitHub CI (Cockroach)

knz

LGTM on the logging event, but I just noticed something

incorrect/superfluous use of p.Detail in a SafeFormat method
naming for the cluster settings

Reviewed 11 of 11 files at r3, 8 of 8 files at r4, 2 of 2 files at r5, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @yuzefovich)

pkg/sql/conn_executor_exec.go, line 1110 at r5 (raw file):

// implements.
func (e *txnRowsWrittenLimitErr) Error() string {
	return e.CommonTxnRowsLimitDetails.Error("written")

return errors.Sprintf("%v", e)

pkg/sql/exec_util.go, line 602 at r3 (raw file):

var txnRowsWrittenLog = settings.RegisterIntSetting(
	"sql.defaults.transaction_rows_written_log",

sql.defaults.transaction_max_rows.writes.logging_threshold

pkg/sql/exec_util.go, line 611 at r3 (raw file):

var txnRowsWrittenErr = settings.RegisterIntSetting(
	"sql.defaults.transaction_rows_written_err",

sql.defaults.transaction_max_rows.writes.force_abort_threshold

pkg/sql/exec_util.go, line 620 at r3 (raw file):

var txnRowsReadLog = settings.RegisterIntSetting(
	"sql.defaults.transaction_rows_read_log",

sql.defaults.transaction_max_rows.reads.logging_threshold

pkg/sql/exec_util.go, line 629 at r3 (raw file):

var txnRowsReadErr = settings.RegisterIntSetting(
	"sql.defaults.transaction_rows_read_err",

sql.defaults.transaction_max_rows.reads.force_abort_threshold

pkg/util/log/eventpb/sql_audit_events.go, line 73 at r3 (raw file):

// Error is part of the error interface, which CommonTxnRowsLimitDetails
// implements.
func (d *CommonTxnRowsLimitDetails) Error() string {

simpler: return fmt.Sprintf("%v", d)
which will automatically cascade to Format and SafeFormatError below.

pkg/util/log/eventpb/sql_audit_events.go, line 95 at r3 (raw file):

// CommonTxnRowsLimitDetails implements.
func (d *CommonTxnRowsLimitDetails) SafeFormatError(p errors.Printer) (next error) {
	if p.Detail() {

I'd remove if p.Detail() since the details are meant to be included in the base case.

pkg/util/log/eventpb/sql_audit_events.go, line 98 at r3 (raw file):

		p.Printf(
			"txn exceeded the number of rows %s (%d): TxnID %v SessionID %v",
			d.kind(), d.Limit, redact.SafeString(d.TxnID), redact.SafeString(d.SessionID),

I'm pretty sure you do not need redact.SafeString here? I thought the transaction and session IDs are already safe types?

craig · 2021-09-10T10:52:57Z

Build succeeded:

GitHub CI (Cockroach)

michae2

Reviewable status: complete! 1 of 0 LGTMs obtained

pkg/sql/exec_util.go, line 602 at r3 (raw file):

Previously, knz (kena) wrote…

sql.defaults.transaction_max_rows.writes.logging_threshold

This would make the corresponding session setting transaction_max_rows.writes.logging_threshold which would be the first session setting with periods in its name.

Some background: the names originally came from this document and were then discussed here and here and here. We decided to keep these names to avoid adding session settings with periods.

These cluster settings were not named in isolation, they match:

the corresponding session settings (e.g. transaction_rows_written_log)
the corresponding metrics (e.g. sql.guardrails.transaction_rows_written_log.count)
the large row cluster settings (e.g. sql.guardrails.max_row_size_log) which were renamed to match these names
the large row metrics (e.g. sql.guardrails.max_row_size_log.count)

We've already had sign-offs from @vy-ton and @rytaft on the existing names, and the customer who requested the feature has already seen them. And I've already started backporting the large row guardrails. I'm reluctant for us to do additional bikeshedding at this point.

pkg/util/log/eventpb/sql_audit_events.go, line 98 at r3 (raw file):

Previously, knz (kena) wrote…

I'm pretty sure you do not need redact.SafeString here? I thought the transaction and session IDs are already safe types?

TxnID and SessionID are plain old strings in CommonTxnRowsLimitDetails.

knz · 2021-09-10T18:23:26Z

ok thanks for explaining

knz · 2021-09-10T18:23:42Z

I'm still interested in a followup to the p.Detail call though

michae2 · 2021-09-10T18:30:36Z

I'm still interested in a followup to the p.Detail call though

I'm creating a PR to remove CommonLargeRowDetails.ViolatesMaxRowSizeErr, I will take a look at it in that PR, since CommonLargeRowDetails.SafeFormatError likely has the same issues.

Addresses: cockroachdb#67400, cockroachdb#69477 Remove ViolatesMaxRowSizeErr from CommonLargeRowDetails, as was done for CommonTxnRowsLimitDetails in cockroachdb#69945. Also remove the SafeDetails methods from CommonLargeRowDetails, txnRowsReadLimitErr, and txnRowsWrittenLimitErr, as I don't think we need them. Release note: None (there was no release between the introduction of the LargeRow and LargeRowInternal events and this commit).

69300: jobs: retry non-cancelable running and all reverting jobs r=ajwerner a=sajjadrizvi Previously, non-cancelable jobs were retried in running state only if their errors were marked as retryable. Moreover, only non-cancelable reverting jobs were retried by default. This commit makes non-cancelable jobs always retry in running state unless their error is marked as a permanent error. In addition, this commit makes all reverting jobs to retry when they fail. As a result, non-cancelable running jobs and all reverting jobs do not fail due to transient errors. Release justification: low-risk updates to new functionality. Release note (general change): Non-cancelable jobs, such as schema-change GC jobs, now do not fail unless they fail with a permanent error. They retry with exponential-backoff if they fail due to a transient error. Furthermore, Jobs that perform reverting tasks do not fail. Instead, they are retried with exponential-backoff if an error is encountered while reverting. As a result, transient errors do not impact the jobs that are reverting. Fixes: #66685 69982: docs/tech-notes: admission control overview r=sumeerbhola a=sumeerbhola Release justification: Non-production code change. Release note: None 70094: tenantcostserver: fix erroneous panic in tests r=RaduBerinde a=RaduBerinde The test-only code that checks the invariants of the `tenant_usage` table inadvertently panics if the query hits an error (such as one that would be expected if the server is shutting down). We now just log the error instead. Fixes #70089. Release note: None Release justification: non-production code change to fix test failure. 70095: tenantcostclient: restrict allowed configuration from the tenant side r=RaduBerinde a=RaduBerinde This change restricts the configuration of tenant cost control from the tenant side. In the future, we will want to have settings where the values come from the host cluster but we don't have that infrastructure today. With tenants being able to set their own settings, they could easily sabotage the cost control mechanisms. This change restricts the allowed values for the target period and the CPU usage allowance, and fixes the cost model configuration to the default. Release note: None Release justification: Necessary fix for the distributed rate limiting functionality, which is vital for the upcoming Serverless MVP release. It allows CRDB to throttle clusters that have run out of free or paid request units (which measure CPU and I/O usage). This functionality is only enabled in multi-tenant scenarios and should have no impact on our dedicated customers. 70102: sql: clean up large row errors and events r=knz,yuzefovich a=michae2 Addresses: #67400, #69477 Remove ViolatesMaxRowSizeErr from CommonLargeRowDetails, as was done for CommonTxnRowsLimitDetails in #69945. Also remove the SafeDetails methods from CommonLargeRowDetails, txnRowsReadLimitErr, and txnRowsWrittenLimitErr, as I don't think we need them. Release note: None (there was no release between the introduction of the LargeRow and LargeRowInternal events and this commit). 70118: kv: lock mutexes for `TxnCoordSender.Epoch()` and `Txn.status()` r=ajwerner a=erikgrinaker ### kvcoord: lock mutex in `TxnCoordSender.Epoch()` Methods that access `TxnCoordSender.mu` fields must lock the mutex first. `Epoch()` didn't. Resolves #70071. Release note: None ### kv: fix mutex locking for `Txn.status` `Txn.status()` fetches the transaction status from the mutex-protected `Txn.mu.sender` field, but callers did not take out the mutex lock when calling it. This patch renames the method to `Txn.statusLocked()`, and updates all callers to take out the lock before calling it. Release note: None Co-authored-by: Sajjad Rizvi <sajjad@cockroachlabs.com> Co-authored-by: sumeerbhola <sumeer@cockroachlabs.com> Co-authored-by: Radu Berinde <radu@cockroachlabs.com> Co-authored-by: Michael Erickson <michae2@cockroachlabs.com> Co-authored-by: Erik Grinaker <grinaker@cockroachlabs.com>

Addresses: cockroachdb#67400, cockroachdb#69477 Remove ViolatesMaxRowSizeErr from CommonLargeRowDetails, as was done for CommonTxnRowsLimitDetails in cockroachdb#69945. Also remove the SafeDetails methods from CommonLargeRowDetails, txnRowsReadLimitErr, and txnRowsWrittenLimitErr, as I don't think we need them. Release justification: low-risk update to new functionality. Release note: None (there was no release between the introduction of the LargeRow and LargeRowInternal events and this commit).

Addresses: cockroachdb#67400, cockroachdb#69477 Remove ViolatesMaxRowSizeErr from CommonLargeRowDetails, as was done for CommonTxnRowsLimitDetails in cockroachdb#69945. Also remove the SafeDetails methods from CommonLargeRowDetails, txnRowsReadLimitErr, and txnRowsWrittenLimitErr, as I don't think we need them. Release note: None (there was no release between the introduction of the LargeRow and LargeRowInternal events and this commit).

yuzefovich requested review from knz and michae2 September 8, 2021 22:58

yuzefovich commented Sep 8, 2021

View reviewed changes

michae2 reviewed Sep 9, 2021

View reviewed changes

yuzefovich force-pushed the fix-guardrails branch from ae562cb to 33e1340 Compare September 9, 2021 15:47

yuzefovich commented Sep 9, 2021

View reviewed changes

michae2 approved these changes Sep 9, 2021

View reviewed changes

michae2 reviewed Sep 9, 2021

View reviewed changes

yuzefovich force-pushed the fix-guardrails branch from 33e1340 to 363b3ab Compare September 9, 2021 19:22

yuzefovich requested a review from a team as a code owner September 9, 2021 19:22

yuzefovich commented Sep 9, 2021

View reviewed changes

yuzefovich force-pushed the fix-guardrails branch from 363b3ab to 3460a19 Compare September 9, 2021 20:49

michae2 reviewed Sep 9, 2021

View reviewed changes

yuzefovich added the backport-21.2.x label Sep 9, 2021

knz reviewed Sep 10, 2021

View reviewed changes

craig bot merged commit a321f6d into cockroachdb:master Sep 10, 2021

blathers-crl bot mentioned this pull request Sep 10, 2021

release-21.2: sql: adjust recently added logging events for txn row count guardrails #70014

Merged

cockroach-teamcity mentioned this pull request Sep 10, 2021

sql: adjust recently added logging events for txn row count guardrails cockroachdb/docs#11628

Closed

yuzefovich deleted the fix-guardrails branch September 10, 2021 16:30

michae2 reviewed Sep 10, 2021

View reviewed changes

michae2 mentioned this pull request Sep 13, 2021

sql: clean up large row errors and events #70102

Merged

michae2 mentioned this pull request Sep 13, 2021

release-21.2: sql: clean up large row errors and events #70157

Merged

yuzefovich mentioned this pull request Sep 14, 2021

release-21.1: sql: add txn rows written/read guardrails #70175

Merged

yuzefovich mentioned this pull request Sep 16, 2021

release-21.1: sql: clean up large row errors and events #70322

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

sql: adjust recently added logging events for txn row count guardrails #69945

sql: adjust recently added logging events for txn row count guardrails #69945

yuzefovich commented Sep 8, 2021 •

edited

Loading

cockroach-teamcity commented Sep 8, 2021

yuzefovich left a comment

michae2 left a comment

yuzefovich left a comment

michae2 left a comment

michae2 left a comment

yuzefovich left a comment

michae2 left a comment

yuzefovich commented Sep 9, 2021

craig bot commented Sep 10, 2021

craig bot commented Sep 10, 2021

knz left a comment

craig bot commented Sep 10, 2021

michae2 left a comment

knz commented Sep 10, 2021

knz commented Sep 10, 2021

michae2 commented Sep 10, 2021

sql: adjust recently added logging events for txn row count guardrails #69945

sql: adjust recently added logging events for txn row count guardrails #69945

Conversation

yuzefovich commented Sep 8, 2021 • edited Loading

cockroach-teamcity commented Sep 8, 2021

yuzefovich left a comment

Choose a reason for hiding this comment

michae2 left a comment

Choose a reason for hiding this comment

yuzefovich left a comment

Choose a reason for hiding this comment

michae2 left a comment

Choose a reason for hiding this comment

michae2 left a comment

Choose a reason for hiding this comment

yuzefovich left a comment

Choose a reason for hiding this comment

michae2 left a comment

Choose a reason for hiding this comment

yuzefovich commented Sep 9, 2021

craig bot commented Sep 10, 2021

craig bot commented Sep 10, 2021

knz left a comment

Choose a reason for hiding this comment

craig bot commented Sep 10, 2021

michae2 left a comment

Choose a reason for hiding this comment

knz commented Sep 10, 2021

knz commented Sep 10, 2021

michae2 commented Sep 10, 2021

yuzefovich commented Sep 8, 2021 •

edited

Loading