roachtest: follower-reads/survival=region/locality=regional/reads=bounded-staleness failed #70110

Closed
cockroach-teamcity opened this issue Sep 13, 2021 · 17 comments
Labels
C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. T-kv KV Team

Comments

@cockroach-teamcity
Member

cockroach-teamcity commented Sep 13, 2021

roachtest.follower-reads/survival=region/locality=regional/reads=bounded-staleness failed with artifacts on release-21.2 @ 8d491ced731e13da2377cda9961f577d7487d6a0:

The test failed on branch=release-21.2, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/follower-reads/survival=region/locality=regional/reads=bounded-staleness/run_1
	follower_reads.go:657,follower_reads.go:319,follower_reads.go:57,test_runner.go:777: too many intervals with more than 2 nodes with low follower read ratios: 5 intervals > 4 threshold. Bad intervals:
		interval 09:39:50-09:40:00: n1 ratio: 0.229 n2 ratio: 0.000 n3 ratio: 0.132 n4 ratio: 1.000 n5 ratio: 1.000 n6 ratio: 1.000 
		interval 09:40:30-09:40:40: n1 ratio: 0.342 n2 ratio: 0.163 n3 ratio: 0.795 n4 ratio: 1.000 n5 ratio: 1.000 n6 ratio: 1.000 
		interval 09:41:10-09:41:20: n1 ratio: 0.187 n2 ratio: 0.000 n3 ratio: 0.812 n4 ratio: 1.000 n5 ratio: 1.000 n6 ratio: 0.994 
		interval 09:41:30-09:41:40: n1 ratio: 0.000 n2 ratio: 0.000 n3 ratio: 0.812 n4 ratio: 1.000 n5 ratio: 1.000 n6 ratio: 1.000 
		interval 09:42:00-09:42:10: n1 ratio: 0.000 n2 ratio: 0.000 n3 ratio: 0.767 n4 ratio: 1.000 n5 ratio: 1.000 n6 ratio: 1.000 
Reproduce

See: roachtest README

Same failure on other branches

/cc @cockroachdb/kv-triage

This test on roachdash | Improve this report!

Epic CRDB-10559

@cockroach-teamcity cockroach-teamcity added branch-release-21.2 C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Sep 13, 2021
@cockroach-teamcity
Member Author

roachtest.follower-reads/survival=region/locality=regional/reads=bounded-staleness failed with artifacts on release-21.2 @ ed81878bb4697c3a0a09e69b8a5053566297b142:

The test failed on branch=release-21.2, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/follower-reads/survival=region/locality=regional/reads=bounded-staleness/run_1
	follower_reads.go:657,follower_reads.go:319,follower_reads.go:57,test_runner.go:777: too many intervals with more than 2 nodes with low follower read ratios: 5 intervals > 4 threshold. Bad intervals:
		interval 09:44:10-09:44:20: n1 ratio: 0.599 n2 ratio: 0.000 n3 ratio: 1.378 n4 ratio: 0.831 n5 ratio: 1.000 n6 ratio: 1.000 
		interval 09:44:20-09:44:30: n1 ratio: 2.254 n2 ratio: 0.409 n3 ratio: 0.509 n4 ratio: 0.000 n5 ratio: 1.000 n6 ratio: 1.000 
		interval 09:45:20-09:45:30: n1 ratio: 0.069 n2 ratio: 1.000 n3 ratio: 0.170 n4 ratio: 1.000 n5 ratio: 0.802 n6 ratio: 1.000 
		interval 09:46:10-09:46:20: n1 ratio: 0.221 n2 ratio: 1.187 n3 ratio: 0.146 n4 ratio: 1.000 n5 ratio: 0.818 n6 ratio: 1.000 
		interval 09:47:10-09:47:20: n1 ratio: 0.013 n2 ratio: 1.000 n3 ratio: 0.159 n4 ratio: 1.000 n5 ratio: 0.792 n6 ratio: 1.000 
Reproduce

See: roachtest README

Same failure on other branches

/cc @cockroachdb/kv-triage

This test on roachdash | Improve this report!

@cockroach-teamcity
Member Author

roachtest.follower-reads/survival=region/locality=regional/reads=bounded-staleness failed with artifacts on release-21.2 @ 0babf97f52ed9e44036851b2a9868e17eeee95ed:

The test failed on branch=release-21.2, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/follower-reads/survival=region/locality=regional/reads=bounded-staleness/run_1
	follower_reads.go:657,follower_reads.go:319,follower_reads.go:57,test_runner.go:777: too many intervals with more than 2 nodes with low follower read ratios: 5 intervals > 4 threshold. Bad intervals:
		interval 10:05:40-10:05:50: n1 ratio: 0.219 n2 ratio: 0.086 n3 ratio: 0.131 n4 ratio: 1.000 n5 ratio: 1.000 n6 ratio: 1.000 
		interval 10:06:00-10:06:10: n1 ratio: 0.000 n2 ratio: 1.179 n3 ratio: 0.000 n4 ratio: 1.000 n5 ratio: 1.000 n6 ratio: 0.827 
		interval 10:06:10-10:06:20: n1 ratio: 0.183 n2 ratio: 1.000 n3 ratio: 0.480 n4 ratio: 1.000 n5 ratio: 1.000 n6 ratio: 0.859 
		interval 10:07:10-10:07:20: n1 ratio: 0.457 n2 ratio: 1.100 n3 ratio: 0.000 n4 ratio: 1.000 n5 ratio: 1.000 n6 ratio: 0.640 
		interval 10:08:10-10:08:20: n1 ratio: 0.162 n2 ratio: 1.000 n3 ratio: 0.418 n4 ratio: 1.000 n5 ratio: 1.000 n6 ratio: 0.791 
Reproduce

See: roachtest README

Same failure on other branches

/cc @cockroachdb/kv-triage

This test on roachdash | Improve this report!

@nvanbenschoten nvanbenschoten removed the release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. label Sep 21, 2021
@cockroach-teamcity
Member Author

roachtest.follower-reads/survival=region/locality=regional/reads=bounded-staleness failed with artifacts on release-21.2 @ 95c3da44a6c38aa1ae22fe9de5549632b7527aa5:

The test failed on branch=release-21.2, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/follower-reads/survival=region/locality=regional/reads=bounded-staleness/run_1
	follower_reads.go:545,follower_reads.go:327,follower_reads.go:57,test_runner.go:777: 11 latency values ([20.840442ms 20.840442ms 20.316154ms 20.250618ms 20.250618ms 20.447226ms 20.512762ms 20.840442ms 21.299194ms 20.643834ms 20.643834ms]) are above target latency 20ms, 4 permitted
Reproduce

See: roachtest README

Same failure on other branches

/cc @cockroachdb/kv-triage

This test on roachdash | Improve this report!

@cockroach-teamcity
Member Author

roachtest.follower-reads/survival=region/locality=regional/reads=bounded-staleness failed with artifacts on release-21.2 @ 82e8782453b1cf14460d93b6bf8328a7b2964575:

The test failed on branch=release-21.2, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/follower-reads/survival=region/locality=regional/reads=bounded-staleness/run_1
	follower_reads.go:545,follower_reads.go:327,follower_reads.go:57,test_runner.go:777: 22 latency values ([22.544378ms 21.889018ms 22.478842ms 22.544378ms 22.544378ms 22.02009ms 22.02009ms 21.495802ms 21.430266ms 20.971514ms 20.38169ms 20.971514ms 21.495802ms 21.561338ms 21.561338ms 21.561338ms 21.03705ms 21.03705ms 21.430266ms 21.430266ms 21.561338ms 21.561338ms]) are above target latency 20ms, 4 permitted
Reproduce

See: roachtest README

Same failure on other branches

/cc @cockroachdb/kv-triage

This test on roachdash | Improve this report!

@cockroach-teamcity
Member Author

roachtest.follower-reads/survival=region/locality=regional/reads=bounded-staleness failed with artifacts on release-21.2 @ 3cf3b0ea3b08cb24e4d6b84c6c237b856ce6b411:

The test failed on branch=release-21.2, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/follower-reads/survival=region/locality=regional/reads=bounded-staleness/run_1
	follower_reads.go:545,follower_reads.go:327,follower_reads.go:57,test_runner.go:777: 22 latency values ([21.495802ms 21.626874ms 21.561338ms 21.69241ms 21.495802ms 21.823482ms 21.823482ms 21.823482ms 21.889018ms 21.954554ms 21.430266ms 21.299194ms 21.430266ms 21.561338ms 21.561338ms 22.085626ms 21.823482ms 21.823482ms 22.34777ms 22.216698ms 22.282234ms 21.757946ms]) are above target latency 20ms, 4 permitted
Reproduce

See: roachtest README

Same failure on other branches

/cc @cockroachdb/kv-triage

This test on roachdash | Improve this report!

@exalate-issue-sync exalate-issue-sync bot added the T-kv KV Team label Oct 11, 2021
@cockroach-teamcity
Member Author

roachtest.follower-reads/survival=region/locality=regional/reads=bounded-staleness failed with artifacts on release-21.2 @ 2ef24098cc88bab3958ad905f46b0ed7e74bd7db:

The test failed on branch=release-21.2, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/follower-reads/survival=region/locality=regional/reads=bounded-staleness/run_1
	follower_reads.go:545,follower_reads.go:327,follower_reads.go:57,test_runner.go:777: 14 latency values ([20.447226ms 20.05401ms 21.233658ms 21.102586ms 20.70937ms 20.70937ms 20.774906ms 20.774906ms 20.971514ms 21.102586ms 21.495802ms 21.495802ms 21.03705ms 20.971514ms]) are above target latency 20ms, 4 permitted
Reproduce

See: roachtest README

Same failure on other branches

/cc @cockroachdb/kv-triage

This test on roachdash | Improve this report!

@nvanbenschoten
Member

My bisect for this just completed, and is pointing pretty conclusively at 5519e6d, which landed on the release-21.2 branch in #70759, 15 days ago. That lines up with the timing of this failure. Interestingly, it also lines up with the timing of #70977. We left that test with an understanding of how to improve the situation but without an understanding of what had caused the regression in the first place.

So what's the common thread between this commit and these two tests? Could they all be related? I think so. The common thread is that both of these tests enable tracing:

kv/contention/nodes=4

SET CLUSTER SETTING trace.debug.enable = true;

follower-reads/survival=region/locality=regional/reads=bounded-staleness

_, err := db.ExecContext(ctx, "SET CLUSTER SETTING sql.trace.stmt.enable_threshold = '50ms'")

And the commit in question modified tracing in a way that likely made it more expensive. @tbg do you have thoughts on why this commit would make tracing more expensive, and how much more expensive?

As a small additional data point (which I'm surprised I still had lying around), while investigating #70977 I did grab the following CPU profile at one point, which shows a significant amount of time in log.vEventf:

[CPU profile screenshot, 2021-10-08]

@tbg
Member

tbg commented Oct 13, 2021

If we wanted to fix this (I think we have to, as some users are likely running with tracing on across the board):

// LogRecord is a log message recorded in a traced span.
type LogRecord struct {
	// Time of the log record.
	Time time.Time `protobuf:"bytes,1,opt,name=time,proto3,stdtime" json:"time"`
	// Fields with values converted to strings. In 22.1, the `message` field
	// contains the log message, and this field is only used for compatibility
	// with 21.2 nodes.
	DeprecatedFields []LogRecord_Field `protobuf:"bytes,2,rep,name=deprecated_fields,json=deprecatedFields,proto3" json:"deprecated_fields"`
	// The log message.
	Message github_com_cockroachdb_redact.RedactableString `protobuf:"bytes,3,opt,name=message,proto3,customtype=github.com/cockroachdb/redact.RedactableString" json:"message"`
}

We need to introduce two Message fields here, one for redacted information and one for unredacted information. We also add a cluster setting that determines whether traces are to be redacted. Depending on the setting, we populate one field or the other (never both). The code that returns the contents of the verbose trace selects whichever field is nonzero.
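
A rough plain-Go sketch of that shape, with placeholder field names (not the final proto definition):

```go
package tracingsketch

import (
	"time"

	"github.com/cockroachdb/redact"
)

// LogRecord sketch: carries either a redactable or a plain message, with a
// (hypothetical) cluster setting deciding which one gets populated at record
// time. Only one of the two message fields is ever set.
type LogRecord struct {
	// Time of the log record.
	Time time.Time
	// RedactableMessage is populated when trace redaction is enabled.
	RedactableMessage redact.RedactableString
	// UnsafeMessage is populated when trace redaction is disabled.
	UnsafeMessage string
}

// VerboseMessage returns whichever message field is nonzero, mirroring the
// "select whichever field is nonzero" behavior described above.
func (r LogRecord) VerboseMessage() string {
	if r.RedactableMessage != "" {
		return string(r.RedactableMessage)
	}
	return r.UnsafeMessage
}
```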

@tbg
Member

tbg commented Oct 13, 2021

Note that the cluster setting would be KV-level. Tenants don't get the choice - they get whatever the KV-side setting is and need to live with that. In the future, tenants may also redact their own traces and use the setting for that, but none of the infra exists. The only redaction that exists right now is at the BatchRequest boundary on the KV side:

cockroach/pkg/server/node.go

Lines 1025 to 1040 in a814577

if grpcSpan != nil {
	// If our local span descends from a parent on the other
	// end of the RPC (i.e. the !isLocalRequest) case,
	// attach the span recording to the batch response.
	// Tenants get a redacted recording, i.e. with anything
	// sensitive stripped out of the verbose messages. However,
	// structured payloads stay untouched.
	if rec := grpcSpan.GetRecording(); rec != nil {
		err := redactRecordingForTenant(tenID, rec)
		if err == nil {
			br.CollectedSpans = append(br.CollectedSpans, rec...)
		} else {
			log.Errorf(ctx, "error redacting trace recording: %s", err)
		}
	}
}

If the cluster setting is disabled, tenants will get nothing from the KV layer except for messages that have been redacted in their entirety (i.e. contain no information except the timestamp). So in Serverless, we expect to run with redaction enabled.
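
For illustration only, "redacted in their entirety" could look like the following sketch, which blanks every message and keeps only the timestamp (the types and the helper are made up, not the actual redactRecordingForTenant code):

```go
package tracingsketch

import "time"

// logEntry is a minimal stand-in for a log record in a span recording.
type logEntry struct {
	Time    time.Time
	Message string
}

// redactEntirely strips all message content from a recording destined for a
// tenant, leaving only the timestamps.
func redactEntirely(rec []logEntry) {
	for i := range rec {
		// "‹×›" stands in for a fully redacted message; the marker choice
		// here is illustrative only.
		rec[i].Message = "‹×›"
	}
}
```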

@andy-kimball
Contributor

Can we back off the level and/or frequency of tracing to address this?

CC @knz, @logston

@logston
Contributor

logston commented Oct 13, 2021

I am 👍 for reducing the frequency of tracing to address this. However, I need @knz to confirm that such a reduction would adequately solve the issue, as they are more familiar with this logic than I am.

@andy-kimball
Contributor

@tbg, what's missing here is an explanation of why your change caused these regressions, as well as what the impact on customers would be if we do whatever it is you're suggesting. Can you give a more detailed analysis that doesn't assume knowledge of this code?

@tbg
Member

tbg commented Oct 14, 2021

The regression here is for verbose tracing and is due to the newly added functionality to redact log messages. I'm unsure at what level Serverless uses tracing - ideally not at all, and ideally tenants also cannot turn it on for all of their transactions. Turning tracing on globally was never a good idea due to the overhead, but some customers get away with it if they have enough headroom.

If we turn off redactability (we can do that by reverting the PR that introduces it), we have to in turn completely blank out the KV side of traces (for tenants). I was originally trying to avoid this, but at this point that seems like the way to go, at least until we have a better understanding of how expensive redaction is.

Even with that done, it remains useful to restrict how much tenants can enable verbose tracing (via cluster settings or even just SET tracing = on in each session). This has always been expensive, and while it's limited to a few tenants each and thus unlikely to overwhelm our system the way it could on a dedicated cluster, it's something to keep an eye on. My suggestion is that if we're going to strip the KV-side traces anyway, we may as well make them non-verbose, so that the perf overhead is limited to the SQL pods, where it affects the tenant only.

In summary, I'm planning to do the following tomorrow:

@andreimatei
Contributor

I think we should have some sort of measurement of the performance impact of the redaction when verbose tracing is on before reverting. From the profile that Nathan posted, it doesn't look that great... Some latency percentile in the follower reads test has shifted - which sounds bad - but I think we should quantify by how much.
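
One way to quantify just the formatting side of the overhead (separate from the end-to-end latency shift the test measures) would be a micro-benchmark along these lines; this is a sketch, not an existing benchmark in the repo:

```go
package redactbench

import (
	"fmt"
	"testing"

	"github.com/cockroachdb/redact"
)

// BenchmarkPlainSprintf measures ordinary formatting, roughly what a
// non-redactable trace message costs to build.
func BenchmarkPlainSprintf(b *testing.B) {
	for i := 0; i < b.N; i++ {
		_ = fmt.Sprintf("read completed: key=%s, bytes=%d", "/Table/53/1/42", 512)
	}
}

// BenchmarkRedactSprintf measures redactable formatting, which additionally
// wraps unsafe arguments in redaction markers.
func BenchmarkRedactSprintf(b *testing.B) {
	for i := 0; i < b.N; i++ {
		_ = redact.Sprintf("read completed: key=%s, bytes=%d", "/Table/53/1/42", 512)
	}
}
```

Comparing the two with benchstat, the same tool used for the numbers quoted below, would at least put a number on the per-message cost.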

@tbg
Member

tbg commented Oct 15, 2021

I think it might have to be the opposite. If we sit down now and take measurements on the regression and discuss what is acceptable and what isn't, all while we are afraid to release, we are stopping the train for no reason. (Note that we're not primarily worried about Serverless here, but about anyone who will run 21.2 GA). We should move the problem out of the way and then quantify the overhead of tracing.

tbg added a commit to tbg/cockroach that referenced this issue Oct 15, 2021
When we discovered that tenants could "trace into" KV, we considered this a
security problem since no mechanism ensured that tenants could only receive
data they are allowed to see. The KV team reacted to this with cockroachdb#70562,
which makes all verbose traces *redactable* and redacts them at the KV-Tenant boundary.

Unfortunately, making traces redactable had a [significant] performance
overhead. On simple selects, this overhead is in the 10-20% area (when
verbose tracing is enabled) and it does not seem likely that we can get
it to the low single digits, at any rate not in time for 21.2.0.
This presents a stability risk, since some customers run with verbose
tracing enabled for all queries. Once they switch to 21.2, their system
may be underprovisioned as a result of the newly grown tracing overhead
in the `log.{Ve,E}vent{,f}` family of methods.

This PR makes trace redactability a choice. Because we needed to
support mixed-version clusters in cockroachdb#70562 this is not as big a lift as
might have been expected.

We still slightly regress in the (fairly sensitive) "select single row"
benchmark, but it's a lot more palatable now:

```
$ benchstat old.txt new.txt
name                                           old time/op    new time/op    delta
Tracing/MultinodeCockroach/tracing=t/Scan1-16    1.06ms ± 1%    1.07ms ± 1%  +1.25%  (p=0.002 n=7+7)

name                                           old alloc/op   new alloc/op   delta
Tracing/MultinodeCockroach/tracing=t/Scan1-16     155kB ± 2%     164kB ± 1%  +5.50%  (p=0.001 n=7+6)

name                                           old allocs/op  new allocs/op  delta
Tracing/MultinodeCockroach/tracing=t/Scan1-16     2.27k ± 1%     2.33k ± 1%  +2.39%  (p=0.001 n=7+7)
```

I did take a quick glance at the CPU profiles but nothing obvious turned
up. I might have to look at an allocation profile.

Touches cockroachdb#70562.

[significant]: cockroachdb#70110 (comment)

Release note: None
@tbg
Member

tbg commented Oct 19, 2021

There are some perf measurements over in #71610 now. I'd say it's pretty bad. We have moved to revert anyway, so the rest of the action is going to be on master.

@tbg
Member

tbg commented Oct 20, 2021

The perf regression has been removed on 21.2. It is still present on master here.

@tbg tbg closed this as completed Oct 20, 2021
dhartunian pushed a commit to dhartunian/cockroach that referenced this issue Nov 24, 2021
dhartunian pushed a commit to dhartunian/cockroach that referenced this issue Dec 2, 2021
dhartunian pushed a commit to dhartunian/cockroach that referenced this issue Dec 3, 2021
When we discovered that tenants could "trace into" KV, we considered this a
security problem since no mechanism ensured that tenants could only receive
data they are allowed to see. The KV team reacted to this with cockroachdb#70562,
which makes all verbose traces *redactable* and redacts them at the KV-Tenant boundary.

Serverless has shipped with this redactability in place in order to
prevent data leakage across the boundary with KV and now needs to get
back on the `release-21.2` branch to sync with the rest of development.

Unfortunately, making traces redactable had a [significant] performance
overhead. On simple selects, this overhead is in the 10-20% area (when
verbose tracing is enabled).

This presents a stability risk, since some customers run with verbose
tracing enabled for all queries. Once they switch to 21.2, their system
may be underprovisioned as a result of the newly grown tracing overhead
in the `log.{Ve,E}vent{,f}` family of methods. The performance hit comes
via the additional costs of redacting log strings via the
`redact.Sprintf` function.

This PR makes trace redactability a choice. Because we needed to
support mixed-version clusters in cockroachdb#70562 this is not as big a lift as
might have been expected.

We still slightly regress in the (fairly sensitive) "select single row"
benchmark, but it's a lot more palatable now:

```
$ benchstat old.txt new.txt
name                                           old time/op    new time/op    delta
Tracing/MultinodeCockroach/tracing=t/Scan1-16    1.06ms ± 1%    1.07ms ± 1%  +1.25%  (p=0.002 n=7+7)

name                                           old alloc/op   new alloc/op   delta
Tracing/MultinodeCockroach/tracing=t/Scan1-16     155kB ± 2%     164kB ± 1%  +5.50%  (p=0.001 n=7+6)

name                                           old allocs/op  new allocs/op  delta
Tracing/MultinodeCockroach/tracing=t/Scan1-16     2.27k ± 1%     2.33k ± 1%  +2.39%  (p=0.001 n=7+7)
```

I did take a quick glance at the CPU profiles but nothing obvious turned
up. I might have to look at an allocation profile.

Touches cockroachdb#70562.

Original issue investigation that kicked off concerns about the performance
impact of redaction:
cockroachdb#70110 (comment)

Release note: None
dhartunian pushed a commit to dhartunian/cockroach that referenced this issue Dec 3, 2021
dhartunian pushed a commit to dhartunian/cockroach that referenced this issue Dec 6, 2021
When we discovered that tenants could "trace into" KV, we considered this a
security problem since no mechanism ensured that tenants could only receive
data they are allowed to see. The KV team reacted to this with cockroachdb#70562,
which makes all verbose traces *redactable* and redacts them at the KV-Tenant boundary.

Serverless has shipped with this redactability in place in order to
prevent data leakage across the boundary with KV and now needs to get
back on the `release-21.2` branch to sync with the rest of development.

Unfortunately, making traces redactable had a [significant] performance
overhead. On simple selects, this overhead is in the 10-20% area (when
verbose tracing is enabled).

This presents a stability risk, since some customers run with verbose
tracing enabled for all queries. Once they switch to 21.2, their system
may be underprovisioned as a result of the newly grown tracing overhead
in the `log.{Ve,E}vent{,f}` family of methods. The performance hit comes
via the additional costs of redacting log strings via the
`redact.Sprintf` function.

This PR makes trace redactability a choice. Because we needed to
support mixed-version clusters in cockroachdb#70562 this is not as big a lift as
might have been expected.

The approach taken is that tracers and the spans they generate can be
marked as "redactable". If the flag is enabled, logging events to the
span will use full redaction via `redact.Sprintf`. If the flag is not
enabled, we implement a manual "coarse redaction" of the entire message
by surrounding it with redaction markers. We take a "fast path" by
checking to see if there are any existing redaction markers to escape in
the message, making the assumption that usually there aren't. This
approach maintains redactability (to varying degrees) of traces while
reducing the performance hit of tracing in the case where we don't want
detailed redactability. A further commit will enable redactability on
tenant-bound traces since we do want to maintain that functionality in
multi-tenant deployments.

Original issue investigation that kicked off concerns about performance
impact of redaction:
cockroachdb#70110 (comment)

----
Benchmarks

The base used for all 3 benchmarks below is the code on `release-21.2`
with the "drop traces for tenants" commit reverted. This
reproduces a reasonable "base" state where tracing redaction isn't
really introduced at all.

Additionally, all benchmarks are run with modifications introduced in
the first two commits in cockroachdb#71610 which force tracing on in benchmarks by
running `SET CLUSTER SETTING trace.debug.enable = true` in the regular
and multinode scenarios shown below.

Below are tracing benchmarks for the conditional redaction that
is in **this** commit.
```
~/go/src/github.com/cockroachdb/cockroach make-traces-redactable-and-default-to-off--v2* ≡
19:36:43 ❯ benchstat 83fc452-no-tracing-bench.txt 2bea5a2ac5-conditional-redaction-simple.txt
name                                           old time/op    new time/op    delta
Tracing/Cockroach/tracing=f/Scan1-8               612µs ± 9%     591µs ±16%     ~     (p=0.393 n=10+10)
Tracing/Cockroach/tracing=f/Insert-8              785µs ± 4%     767µs ± 6%     ~     (p=0.360 n=8+10)
Tracing/Cockroach/tracing=t/Scan1-8               597µs ± 1%     542µs ± 1%   -9.23%  (p=0.000 n=10+9)
Tracing/Cockroach/tracing=t/Insert-8              762µs ± 3%     739µs ± 8%     ~     (p=0.315 n=10+10)
Tracing/MultinodeCockroach/tracing=f/Scan1-8     1.05ms ±30%    0.92ms ± 5%  -12.57%  (p=0.000 n=9+9)
Tracing/MultinodeCockroach/tracing=f/Insert-8    1.19ms ±15%    1.13ms ± 9%   -5.39%  (p=0.035 n=10+10)
Tracing/MultinodeCockroach/tracing=t/Scan1-8      979µs ± 3%     945µs ± 6%   -3.42%  (p=0.024 n=9+9)
Tracing/MultinodeCockroach/tracing=t/Insert-8    1.54ms ±48%    1.19ms ±16%  -22.76%  (p=0.011 n=10+10)

name                                           old alloc/op   new alloc/op   delta
Tracing/Cockroach/tracing=f/Scan1-8              86.5kB ± 1%    90.2kB ± 3%   +4.29%  (p=0.000 n=9+9)
Tracing/Cockroach/tracing=f/Insert-8              109kB ± 3%     111kB ± 0%   +2.68%  (p=0.005 n=9+7)
Tracing/Cockroach/tracing=t/Scan1-8              86.0kB ± 1%    89.8kB ± 0%   +4.42%  (p=0.000 n=10+9)
Tracing/Cockroach/tracing=t/Insert-8              108kB ± 3%     112kB ± 3%   +3.05%  (p=0.000 n=9+9)
Tracing/MultinodeCockroach/tracing=f/Scan1-8      156kB ±37%     149kB ± 1%     ~     (p=0.050 n=9+9)
Tracing/MultinodeCockroach/tracing=f/Insert-8     154kB ±10%     154kB ± 3%     ~     (p=0.382 n=8+8)
Tracing/MultinodeCockroach/tracing=t/Scan1-8      145kB ± 1%     150kB ± 5%   +3.61%  (p=0.000 n=9+9)
Tracing/MultinodeCockroach/tracing=t/Insert-8     157kB ±13%     183kB ±40%     ~     (p=0.237 n=8+10)

name                                           old allocs/op  new allocs/op  delta
Tracing/Cockroach/tracing=f/Scan1-8               1.44k ± 0%     1.50k ± 0%   +4.22%  (p=0.000 n=9+8)
Tracing/Cockroach/tracing=f/Insert-8              1.64k ± 0%     1.72k ± 0%   +4.39%  (p=0.000 n=9+9)
Tracing/Cockroach/tracing=t/Scan1-8               1.44k ± 0%     1.50k ± 0%   +4.33%  (p=0.000 n=8+7)
Tracing/Cockroach/tracing=t/Insert-8              1.64k ± 0%     1.72k ± 0%   +4.34%  (p=0.000 n=9+9)
Tracing/MultinodeCockroach/tracing=f/Scan1-8      2.41k ±11%     2.38k ± 1%     ~     (p=0.150 n=10+9)
Tracing/MultinodeCockroach/tracing=f/Insert-8     2.15k ± 1%     2.22k ± 1%   +3.35%  (p=0.000 n=8+8)
Tracing/MultinodeCockroach/tracing=t/Scan1-8      2.31k ± 1%     2.40k ± 4%   +4.19%  (p=0.000 n=9+9)
Tracing/MultinodeCockroach/tracing=t/Insert-8     2.17k ± 6%     2.33k ±12%   +7.43%  (p=0.012 n=8+10)
```

These statistics are the *base* hit we take on the redaction work
**prior** to any conditional redaction being added. This is what we're
trying to improve on.
```
~/go/src/github.com/cockroachdb/cockroach make-traces-redactable-and-default-to-off--v2* ≡
19:36:49 ❯ benchstat 83fc452-no-tracing-bench.txt dc42e37-after-redaction.txt
name                                           old time/op    new time/op    delta
Tracing/Cockroach/tracing=f/Scan1-8               612µs ± 9%     693µs ± 2%  +13.22%  (p=0.000 n=10+10)
Tracing/Cockroach/tracing=f/Insert-8              785µs ± 4%     878µs ± 1%  +11.86%  (p=0.000 n=8+9)
Tracing/Cockroach/tracing=t/Scan1-8               597µs ± 1%     687µs ± 3%  +15.05%  (p=0.000 n=10+9)
Tracing/Cockroach/tracing=t/Insert-8              762µs ± 3%     878µs ± 1%  +15.35%  (p=0.000 n=10+8)
Tracing/MultinodeCockroach/tracing=f/Scan1-8     1.05ms ±30%    1.06ms ± 2%     ~     (p=0.050 n=9+9)
Tracing/MultinodeCockroach/tracing=f/Insert-8    1.19ms ±15%    1.28ms ±11%   +7.85%  (p=0.035 n=10+10)
Tracing/MultinodeCockroach/tracing=t/Scan1-8      979µs ± 3%    1066µs ± 1%   +8.87%  (p=0.000 n=9+9)
Tracing/MultinodeCockroach/tracing=t/Insert-8    1.54ms ±48%    1.26ms ± 2%     ~     (p=0.515 n=10+8)

name                                           old alloc/op   new alloc/op   delta
Tracing/Cockroach/tracing=f/Scan1-8              86.5kB ± 1%    98.0kB ± 1%  +13.33%  (p=0.000 n=9+9)
Tracing/Cockroach/tracing=f/Insert-8              109kB ± 3%     125kB ± 1%  +15.09%  (p=0.000 n=9+9)
Tracing/Cockroach/tracing=t/Scan1-8              86.0kB ± 1%    97.9kB ± 1%  +13.90%  (p=0.000 n=10+9)
Tracing/Cockroach/tracing=t/Insert-8              108kB ± 3%     125kB ± 1%  +15.53%  (p=0.000 n=9+9)
Tracing/MultinodeCockroach/tracing=f/Scan1-8      156kB ±37%     162kB ± 1%     ~     (p=0.059 n=9+8)
Tracing/MultinodeCockroach/tracing=f/Insert-8     154kB ±10%     197kB ±43%  +28.54%  (p=0.000 n=8+10)
Tracing/MultinodeCockroach/tracing=t/Scan1-8      145kB ± 1%     162kB ± 1%  +11.81%  (p=0.000 n=9+9)
Tracing/MultinodeCockroach/tracing=t/Insert-8     157kB ±13%     172kB ± 5%   +9.84%  (p=0.007 n=8+8)

name                                           old allocs/op  new allocs/op  delta
Tracing/Cockroach/tracing=f/Scan1-8               1.44k ± 0%     1.66k ± 0%  +15.59%  (p=0.000 n=9+9)
Tracing/Cockroach/tracing=f/Insert-8              1.64k ± 0%     1.95k ± 0%  +18.48%  (p=0.000 n=9+8)
Tracing/Cockroach/tracing=t/Scan1-8               1.44k ± 0%     1.66k ± 0%  +15.46%  (p=0.000 n=8+9)
Tracing/Cockroach/tracing=t/Insert-8              1.64k ± 0%     1.95k ± 1%  +18.52%  (p=0.000 n=9+9)
Tracing/MultinodeCockroach/tracing=f/Scan1-8      2.41k ±11%     2.58k ± 1%     ~     (p=0.105 n=10+8)
Tracing/MultinodeCockroach/tracing=f/Insert-8     2.15k ± 1%     2.58k ±12%  +20.04%  (p=0.000 n=8+10)
Tracing/MultinodeCockroach/tracing=t/Scan1-8      2.31k ± 1%     2.59k ± 1%  +12.09%  (p=0.000 n=9+9)
Tracing/MultinodeCockroach/tracing=t/Insert-8     2.17k ± 6%     2.47k ± 1%  +13.76%  (p=0.000 n=8+8)
```

Touches cockroachdb#70562.

Release note: None
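
A minimal sketch of the conditional redaction described in the commit message above, assuming a per-span "redactable" flag (the helper and the marker handling are illustrative, not the actual implementation):

```go
package redactsketch

import (
	"fmt"
	"strings"

	"github.com/cockroachdb/redact"
)

// formatTraceMessage applies full per-argument redaction via redact.Sprintf
// when the span is marked redactable, and otherwise falls back to "coarse"
// redaction that wraps the whole formatted message in redaction markers.
func formatTraceMessage(redactable bool, format string, args ...interface{}) redact.RedactableString {
	if redactable {
		// Full redaction: unsafe arguments are individually wrapped in markers.
		return redact.Sprintf(format, args...)
	}
	msg := fmt.Sprintf(format, args...)
	// Fast path: most messages contain no redaction markers, so the message
	// can be wrapped as-is without any escaping work.
	if !strings.ContainsAny(msg, "‹›") {
		return redact.RedactableString("‹" + msg + "›")
	}
	// Slow path: neutralize pre-existing markers before wrapping (placeholder
	// escaping; the redact library has its own escaping helpers).
	msg = strings.NewReplacer("‹", "?", "›", "?").Replace(msg)
	return redact.RedactableString("‹" + msg + "›")
}
```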
dhartunian pushed a commit to dhartunian/cockroach that referenced this issue Dec 8, 2021
When we discovered that tenants could "trace into" KV, we considered
this a security problem since no mechanism ensured that tenants could
only receive data they are allowed to see. The KV team reacted to this
with cockroachdb#70562, which makes all verbose traces *redactable* and
redacts them at the KV-Tenant boundary.

Serverless has shipped with this redactability in place in order to
prevent data leakage across the boundary with KV and now needs to get
back on the `release-21.2` branch to sync with the rest of development.

Unfortunately, making traces redactable had a [significant] performance
overhead. On simple selects, this overhead is in the 10-20% area (when
verbose tracing is enabled).

This presents a stability risk, since some customers run with verbose
tracing enabled for all queries. Once they switch to 21.2, their system
may be underprovisioned as a result of the newly grown tracing overhead
in the `log.{Ve,E}vent{,f}` family of methods. The performance hit comes
via the additional costs of redacting log strings via the
`redact.Sprintf` function.

This PR makes trace redactability a choice. Because we needed to
support mixed-version clusters in cockroachdb#70562 this is not as big a lift as
might have been expected.

The approach taken is that tracers and the spans they generate can be
marked as "redactable". If the flag is enabled, logging events to the
span will use full redaction via `redact.Sprintf`. If the flag is not
enabled, a coarse redaction of the entire message is done instead. A
further commit will enable redactability on tenant-bound traces since we
do want to maintain that functionality in multi-tenant deployments.

Original issue investigation that kicked off concerns about performance
impact of redaction:
cockroachdb#70110 (comment)

----
Benchmarks

TODO(davidh): Need to re-do these on `master`

Touches cockroachdb#70562.

Release note: None
dhartunian pushed a commit to dhartunian/cockroach that referenced this issue Dec 8, 2021
dhartunian pushed a commit to dhartunian/cockroach that referenced this issue Dec 9, 2021
dhartunian pushed a commit to dhartunian/cockroach that referenced this issue Dec 9, 2021
dhartunian pushed a commit to dhartunian/cockroach that referenced this issue Dec 9, 2021