Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sql: adjust recently added logging events for txn row count guardrails #69945

Merged
merged 2 commits into from
Sep 10, 2021

Conversation

yuzefovich
Copy link
Member

@yuzefovich yuzefovich commented Sep 8, 2021

sql: adjust txn rows written/read guardrails a bit

Release note (ops change): the meaning of the recently introduced
transaction_rows_written_err and transaction_rows_read_err (as well
as the corresponding _log variables) have been adjusted a bit to
indicate the largest number of rows that is still allowed. In other
words, originally reaching the limit would result in an error, and now
only exceeding the limit would.

Release justification: low-risk adjustment to new functionality.

sql: adjust recently added logging events for txn row count guardrails

This commit refactors the recently introduced
CommonTxnRowsLimitDetails protobuf struct to make it more
user-friendly (since it is used for logging and errors). Namely, the
following changes are made:

  • Limit is changed to NumRows making it easier to understand
  • ViolatesTxnRowsLimitErr is removed since it is confusing
  • IsRead is removed too since it is redundant (the types of logged
    events contain the necessary "kind" information and this commit
    extends the logic of errors to contain the "kind" info too) and is
    confusing.

Additionally, we will no longer log twice if _err guardrail is reached
after _log was reached previously.

Fixes: #69477.

Release note: None (no release since this was introduced)

Release justification: an update to the new functionality.

@cockroach-teamcity
Copy link
Member

This change is Reviewable

Copy link
Member Author

@yuzefovich yuzefovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @knz and @michae2)


pkg/sql/conn_executor_exec.go, line 1168 at r1 (raw file):

// handleTxnRowsGuardrails handles either "written" or "read" rows guardrails.
func (ex *connExecutor) handleTxnRowsGuardrails(

I've stared at this method for quite some time, but I don't see how to make it less complex without duplicating logic (i.e. extracting separate methods for "external" vs "internal" executors, etc), so I'm looking forward to suggestions on how to improve here. Any comments that I could think of seemed quite obvious to me (maybe because I stared for too long).


pkg/util/log/eventpb/sql_audit_events.proto, line 133 at r1 (raw file):

  // LimitValue indicates the value of the transaction row count limit that was
  // reached.
  int64 limit_value = 3 [(gogoproto.jsontag) = ",omitempty"];

I removed two of the fields Raphael mentioned in the issue, but I think it is worth keeping the "limit" field, so I renamed it, but I'm open to suggestions on the naming and/or comments.

Copy link
Collaborator

@michae2 michae2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that IsRead was redundant with the event types, but I think I might have actually done the opposite 😆 (that is, I would have unified the read/written event types into a single type, and kept a boolean field indicating whether this was rows read or written). That would simplify handleTxnRowsGuardrails a bit and also simplify any readers of the structured events. (But that's pretty subjective, I think what you've done here is also fine. Maybe @knz has a stronger opinion?)

Another idea for simplifying handleTxnRowsGuardrails: replace LimitValue with NumRows which then wouldn't require all the logLimit = and might be easier for a customer to understand.

@knz If you took issue with these structured events, then I wonder what you think of CommonLargeRowDetails used by LargeRow and LargeRowInternal? 🤔 Maybe we should also get rid of ViolatesMaxRowSizeErr in those events?

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @knz and @yuzefovich)


pkg/sql/conn_executor_exec.go, line 1205 at r1 (raw file):

Quoted 9 lines of code…
		if shouldErr {
			// But this time we also reached the error limit, so we want to log
			// an event again (it will have ViolatesTxnRowsLimitErr set to
			// true). Note that we couldn't have reached the error limit when we
			// logged the event the previous time because that would have
			// aborted the execution of the transaction.
			shouldLog = true
			logLimit = errLimit
		} else {

We could consider deleting this logic and moving the rest of this alreadyLogged check up above the executorTypeInternal check. I don't really think we have to log if shouldErr is true. And don't we already have a way to log all query errors?

Copy link
Member Author

@yuzefovich yuzefovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the suggestions. I changed LimitValue to NumRows, but I prefer to keep the distinction between different error types so that the errors were more useful.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @knz and @michae2)


pkg/sql/conn_executor_exec.go, line 1168 at r1 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

I've stared at this method for quite some time, but I don't see how to make it less complex without duplicating logic (i.e. extracting separate methods for "external" vs "internal" executors, etc), so I'm looking forward to suggestions on how to improve here. Any comments that I could think of seemed quite obvious to me (maybe because I stared for too long).

Ok, I think the logic here is now pretty simple.


pkg/sql/conn_executor_exec.go, line 1205 at r1 (raw file):

Previously, michae2 (Michael Erickson) wrote…
		if shouldErr {
			// But this time we also reached the error limit, so we want to log
			// an event again (it will have ViolatesTxnRowsLimitErr set to
			// true). Note that we couldn't have reached the error limit when we
			// logged the event the previous time because that would have
			// aborted the execution of the transaction.
			shouldLog = true
			logLimit = errLimit
		} else {

We could consider deleting this logic and moving the rest of this alreadyLogged check up above the executorTypeInternal check. I don't really think we have to log if shouldErr is true. And don't we already have a way to log all query errors?

Thanks, done.

Copy link
Collaborator

@michae2 michae2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @knz and @yuzefovich)


pkg/util/log/eventpb/sql_audit_events.go, line 63 at r2 (raw file):

func (d *CommonTxnRowsLimitDetails) Error(kind string) string {
	return fmt.Sprintf(
		"txn reached the number of rows %s (%d): TxnID %v SessionID %v",

nit: Now that we're using NumRows instead of LimitValue we should change this message to something more like "txn %s %s rows, which is above the limit: TxnID %v SessionID %v".


pkg/util/log/eventpb/sql_audit_events.go, line 79 at r2 (raw file):

	if p.Detail() {
		p.Printf(
			"txn reached the number of rows %s (%d): TxnID %v SessionID %v",

Same nit here (message should be changed now that we're using NumRows instead of LimitValue).


pkg/util/log/eventpb/sql_audit_events.proto, line 131 at r2 (raw file):

  // SessionID is the ID of the session that initiated the transaction.
  string session_id = 2 [(gogoproto.customname) = "SessionID", (gogoproto.jsontag) = ",omitempty", (gogoproto.moretags) = "redact:\"nonsensitive\""];
  // NumRows is the number of rows written/read (depending on the context) by

nit: Instead of "context" could we say "event type"?

Copy link
Collaborator

@michae2 michae2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @knz and @yuzefovich)


pkg/sql/conn_executor_exec.go, line 1102 at r2 (raw file):

}

const writtenKind = "written"

nit: Is this used anywhere? I would be fine with using string literals directly in both SafeFormatError methods. Also if you take my message-changing suggestion then you should also change this to "wrote" 🙂

Release note (ops change): the meaning of the recently introduced
`transaction_rows_written_err` and `transaction_rows_read_err` (as well
as the corresponding `_log` variables) have been adjusted a bit to
indicate the largest number of rows that is still allowed. In other
words, originally reaching the limit would result in an error, and now
only exceeding the limit would.

Release justification: low-risk adjustment to new functionality.
Copy link
Member Author

@yuzefovich yuzefovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @knz and @michae2)


pkg/sql/conn_executor_exec.go, line 1102 at r2 (raw file):

Previously, michae2 (Michael Erickson) wrote…

nit: Is this used anywhere? I would be fine with using string literals directly in both SafeFormatError methods. Also if you take my message-changing suggestion then you should also change this to "wrote" 🙂

Inlined the literals but kept "written" by rephrasing your suggestion.


pkg/util/log/eventpb/sql_audit_events.go, line 63 at r2 (raw file):

Previously, michae2 (Michael Erickson) wrote…

nit: Now that we're using NumRows instead of LimitValue we should change this message to something more like "txn %s %s rows, which is above the limit: TxnID %v SessionID %v".

Good point, done. I added another commit to adjust the meaning of the rows written/read guardrails to be more intuitive (which allow for being precise while using your suggestion).

This commit refactors the recently introduced
`CommonTxnRowsLimitDetails` protobuf struct to make it more
user-friendly (since it is used for logging and errors). Namely, the
following changes are made:
- `Limit` is changed to `NumRows` making it easier to understand
- `ViolatesTxnRowsLimitErr` is removed since it is confusing
- `IsRead` is removed too since it is redundant (the types of logged
events contain the necessary "kind" information and this commit
extends the logic of errors to contain the "kind" info too) and is
confusing.

Additionally, we will no longer log twice if `_err` guardrail is reached
after `_log` was reached previously.

Release note: None (no release since this was introduced)

Release justification: an update to the new functionality.
Copy link
Collaborator

@michae2 michae2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @knz)

@yuzefovich
Copy link
Member Author

I think the concerns raised by Raphael have been addressed, so I'm merging this.

TFTR!

bors r+

@craig
Copy link
Contributor

craig bot commented Sep 10, 2021

Build failed (retrying...):

@craig
Copy link
Contributor

craig bot commented Sep 10, 2021

Build failed (retrying...):

Copy link
Contributor

@knz knz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM on the logging event, but I just noticed something

  • incorrect/superfluous use of p.Detail in a SafeFormat method
  • naming for the cluster settings

Reviewed 11 of 11 files at r3, 8 of 8 files at r4, 2 of 2 files at r5, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @yuzefovich)


pkg/sql/conn_executor_exec.go, line 1110 at r5 (raw file):

// implements.
func (e *txnRowsWrittenLimitErr) Error() string {
	return e.CommonTxnRowsLimitDetails.Error("written")

return errors.Sprintf("%v", e)


pkg/sql/exec_util.go, line 602 at r3 (raw file):

var txnRowsWrittenLog = settings.RegisterIntSetting(
	"sql.defaults.transaction_rows_written_log",

sql.defaults.transaction_max_rows.writes.logging_threshold


pkg/sql/exec_util.go, line 611 at r3 (raw file):

var txnRowsWrittenErr = settings.RegisterIntSetting(
	"sql.defaults.transaction_rows_written_err",

sql.defaults.transaction_max_rows.writes.force_abort_threshold


pkg/sql/exec_util.go, line 620 at r3 (raw file):

var txnRowsReadLog = settings.RegisterIntSetting(
	"sql.defaults.transaction_rows_read_log",

sql.defaults.transaction_max_rows.reads.logging_threshold


pkg/sql/exec_util.go, line 629 at r3 (raw file):

var txnRowsReadErr = settings.RegisterIntSetting(
	"sql.defaults.transaction_rows_read_err",

sql.defaults.transaction_max_rows.reads.force_abort_threshold


pkg/util/log/eventpb/sql_audit_events.go, line 73 at r3 (raw file):

// Error is part of the error interface, which CommonTxnRowsLimitDetails
// implements.
func (d *CommonTxnRowsLimitDetails) Error() string {

simpler: return fmt.Sprintf("%v", d)
which will automatically cascade to Format and SafeFormatError below.


pkg/util/log/eventpb/sql_audit_events.go, line 95 at r3 (raw file):

// CommonTxnRowsLimitDetails implements.
func (d *CommonTxnRowsLimitDetails) SafeFormatError(p errors.Printer) (next error) {
	if p.Detail() {

I'd remove if p.Detail() since the details are meant to be included in the base case.


pkg/util/log/eventpb/sql_audit_events.go, line 98 at r3 (raw file):

		p.Printf(
			"txn exceeded the number of rows %s (%d): TxnID %v SessionID %v",
			d.kind(), d.Limit, redact.SafeString(d.TxnID), redact.SafeString(d.SessionID),

I'm pretty sure you do not need redact.SafeString here? I thought the transaction and session IDs are already safe types?

@craig
Copy link
Contributor

craig bot commented Sep 10, 2021

Build succeeded:

Copy link
Collaborator

@michae2 michae2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained


pkg/sql/exec_util.go, line 602 at r3 (raw file):

Previously, knz (kena) wrote…

sql.defaults.transaction_max_rows.writes.logging_threshold

This would make the corresponding session setting transaction_max_rows.writes.logging_threshold which would be the first session setting with periods in its name.

Some background: the names originally came from this document and were then discussed here and here and here. We decided to keep these names to avoid adding session settings with periods.

These cluster settings were not named in isolation, they match:

  • the corresponding session settings (e.g. transaction_rows_written_log)
  • the corresponding metrics (e.g. sql.guardrails.transaction_rows_written_log.count)
  • the large row cluster settings (e.g. sql.guardrails.max_row_size_log) which were renamed to match these names
  • the large row metrics (e.g. sql.guardrails.max_row_size_log.count)

We've already had sign-offs from @vy-ton and @rytaft on the existing names, and the customer who requested the feature has already seen them. And I've already started backporting the large row guardrails. I'm reluctant for us to do additional bikeshedding at this point.


pkg/util/log/eventpb/sql_audit_events.go, line 98 at r3 (raw file):

Previously, knz (kena) wrote…

I'm pretty sure you do not need redact.SafeString here? I thought the transaction and session IDs are already safe types?

TxnID and SessionID are plain old strings in CommonTxnRowsLimitDetails.

@knz
Copy link
Contributor

knz commented Sep 10, 2021

ok thanks for explaining

@knz
Copy link
Contributor

knz commented Sep 10, 2021

I'm still interested in a followup to the p.Detail call though

@michae2
Copy link
Collaborator

michae2 commented Sep 10, 2021

I'm still interested in a followup to the p.Detail call though

I'm creating a PR to remove CommonLargeRowDetails.ViolatesMaxRowSizeErr, I will take a look at it in that PR, since CommonLargeRowDetails.SafeFormatError likely has the same issues.

michae2 added a commit to michae2/cockroach that referenced this pull request Sep 13, 2021
Addresses: cockroachdb#67400, cockroachdb#69477

Remove ViolatesMaxRowSizeErr from CommonLargeRowDetails, as was done
for CommonTxnRowsLimitDetails in cockroachdb#69945.

Also remove the SafeDetails methods from CommonLargeRowDetails,
txnRowsReadLimitErr, and txnRowsWrittenLimitErr, as I don't think we
need them.

Release note: None (there was no release between the introduction of the
LargeRow and LargeRowInternal events and this commit).
craig bot pushed a commit that referenced this pull request Sep 13, 2021
69300: jobs: retry non-cancelable running and all reverting jobs r=ajwerner a=sajjadrizvi

Previously, non-cancelable jobs were retried in running state
only if their errors were marked as retryable. Moreover,  only 
non-cancelable reverting jobs were retried by default. This 
commit makes non-cancelable jobs always retry in running 
state unless their error is marked as a permanent error. In
addition, this commit makes all reverting jobs to retry when
they fail. As a result, non-cancelable running jobs and all
reverting jobs do not fail due to transient errors.

Release justification: low-risk updates to new functionality.

Release note (general change): Non-cancelable jobs, such as
schema-change GC jobs, now do not fail unless they fail with
a permanent error. They retry with exponential-backoff if
they fail due to a transient error. Furthermore, Jobs that 
perform reverting tasks do not fail. Instead, they are retried 
with exponential-backoff if an error is encountered while 
reverting. As a result, transient errors do not impact the jobs 
that are reverting.

Fixes: #66685

69982: docs/tech-notes: admission control overview r=sumeerbhola a=sumeerbhola

Release justification: Non-production code change.
Release note: None

70094: tenantcostserver: fix erroneous panic in tests r=RaduBerinde a=RaduBerinde

The test-only code that checks the invariants of the `tenant_usage`
table inadvertently panics if the query hits an error (such as one
that would be expected if the server is shutting down). We now just
log the error instead.

Fixes #70089.

Release note: None

Release justification: non-production code change to fix test failure.

70095: tenantcostclient: restrict allowed configuration from the tenant side r=RaduBerinde a=RaduBerinde

This change restricts the configuration of tenant cost control from
the tenant side. In the future, we will want to have settings where
the values come from the host cluster but we don't have that
infrastructure today.

With tenants being able to set their own settings, they could easily
sabotage the cost control mechanisms. This change restricts the
allowed values for the target period and the CPU usage allowance, and
fixes the cost model configuration to the default.

Release note: None

Release justification: Necessary fix for the distributed rate limiting
functionality, which is vital for the upcoming Serverless MVP release.
It allows CRDB to throttle clusters that have run out of free or paid
request units (which measure CPU and I/O usage). This functionality is
only enabled in multi-tenant scenarios and should have no impact on
our dedicated customers.

70102: sql: clean up large row errors and events r=knz,yuzefovich a=michae2

Addresses: #67400, #69477

Remove ViolatesMaxRowSizeErr from CommonLargeRowDetails, as was done
for CommonTxnRowsLimitDetails in #69945.

Also remove the SafeDetails methods from CommonLargeRowDetails,
txnRowsReadLimitErr, and txnRowsWrittenLimitErr, as I don't think we
need them.

Release note: None (there was no release between the introduction of the
LargeRow and LargeRowInternal events and this commit).

70118: kv: lock mutexes for `TxnCoordSender.Epoch()` and `Txn.status()` r=ajwerner a=erikgrinaker

### kvcoord: lock mutex in `TxnCoordSender.Epoch()`

Methods that access `TxnCoordSender.mu` fields must lock the mutex
first. `Epoch()` didn't.

Resolves #70071.

Release note: None

### kv: fix mutex locking for `Txn.status`

`Txn.status()` fetches the transaction status from the mutex-protected
`Txn.mu.sender` field, but callers did not take out the mutex lock when
calling it.

This patch renames the method to `Txn.statusLocked()`, and updates all
callers to take out the lock before calling it.

Release note: None

Co-authored-by: Sajjad Rizvi <sajjad@cockroachlabs.com>
Co-authored-by: sumeerbhola <sumeer@cockroachlabs.com>
Co-authored-by: Radu Berinde <radu@cockroachlabs.com>
Co-authored-by: Michael Erickson <michae2@cockroachlabs.com>
Co-authored-by: Erik Grinaker <grinaker@cockroachlabs.com>
michae2 added a commit to michae2/cockroach that referenced this pull request Sep 13, 2021
Addresses: cockroachdb#67400, cockroachdb#69477

Remove ViolatesMaxRowSizeErr from CommonLargeRowDetails, as was done
for CommonTxnRowsLimitDetails in cockroachdb#69945.

Also remove the SafeDetails methods from CommonLargeRowDetails,
txnRowsReadLimitErr, and txnRowsWrittenLimitErr, as I don't think we
need them.

Release justification: low-risk update to new functionality.

Release note: None (there was no release between the introduction of the
LargeRow and LargeRowInternal events and this commit).
yuzefovich pushed a commit to yuzefovich/cockroach that referenced this pull request Sep 16, 2021
Addresses: cockroachdb#67400, cockroachdb#69477

Remove ViolatesMaxRowSizeErr from CommonLargeRowDetails, as was done
for CommonTxnRowsLimitDetails in cockroachdb#69945.

Also remove the SafeDetails methods from CommonLargeRowDetails,
txnRowsReadLimitErr, and txnRowsWrittenLimitErr, as I don't think we
need them.

Release note: None (there was no release between the introduction of the
LargeRow and LargeRowInternal events and this commit).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants