Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

tracing: perf regression due to trace redactability #71694

Closed
tbg opened this issue Oct 19, 2021 · 6 comments · Fixed by #73405
Closed

tracing: perf regression due to trace redactability #71694

tbg opened this issue Oct 19, 2021 · 6 comments · Fixed by #73405
Assignees
Labels
C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.

Comments

@tbg
Copy link
Member

tbg commented Oct 19, 2021

Describe the problem

In #58610 we discussed trace data sent to tenants from the KV layer. This data
was not redactable and there was a concern that tenants could learn privileged
information (such as keys belonging to another tenant, for example).

We reacted to this by adding redactability for traces (#70562). Unfortunately,
this introduced inacceptably high overhead in the case in which tracing is
enabled across the board (as will be the case with the
sql.trace.*.enable_threshold cluster setting, which is known to see some
usage) and, on release-21.2, we switched to a blunter approach with degraded
observability, #71606. This degraded observability is tracked in #58610 and is
out of scope here as it does not affect master (at the time of writing).

Details on the regression can be found in
#71610. Concretely, in a simple
"select single row" end-to-end local benchmark, we see approximately a 20%
regression in average request latency.

We have to either mitigate the regression or revert the work but are then
forced to add the redactability again in some way, as for better or worse
traces as observed by tenants are an invaluable tool in CockroachDB support
today.

Epic CRDB-2574

@tbg tbg added the C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. label Oct 19, 2021
craig bot pushed a commit that referenced this issue Oct 19, 2021
70792: server: expect duplicate session cookies on HTTP requests r=catj-cockroach,knz a=dhartunian

Previously, the golang HTTP library was used to retrieve a cookie with
the specific name we use to store our authentication token. This API
method retrieves the *first* matching cookie in the header and ignores
the rest. This can cause issues with requests that contain multiple
matching cookies with the same name (if other applications are
potentially running on the same domain).

This change adjusts the HTTP API we use to retrieve cookies (we now
prefer to call `req.Cookies()` and iterate through the entire list
ourselves, instead of relying on the API to give us the matching cookie
since it only supports returning a single result.

Resolves #70376.

Release note (security update): Authenticated HTTP requests to nodes can
now contain additional cookies with the same name as the one we use
("session"). Duplicates like this are permitted by the HTTP spec and our
code will now attempt to parse all cookies with a matching name before
giving up. This can resolve issues with customers who run other services
on the same domain as their CRDB nodes.

71503: server: Tenant pods fan out to cancel SQL queries r=matthewtodd a=matthewtodd

Partially addresses #70832.

Previously, the tenant `/_status/cancel_query` endpoint was not
available over HTTP and could not cancel a query not running locally.
This change opens it up, so that we can call it from the cloud console,
and makes sure that that we can cancel queries when a tenant has more
than one SQL pod running.

Release note (api change): The tenant `/_status/cancel_query` endpoint
was made available over HTTP and augmented to fan out requests in a
multi-pod world.

71677: row: pass in account instead of monitor for KV fetcher r=yuzefovich a=yuzefovich

This commit switches the KV fetcher to expect a memory account rather
than a memory monitor. It additionally creates a separate memory account
for the KV fetcher when it is used by the cFetcher (previously, we were
reusing the monitor from the allocator).

Note that we don't need to update multiple places where we call methods
on possibly nil memory account because all methods have been taught to
be noops when the receiver is `nil`.

Fixes: #55481.

Release note: None

71698: roachtest: add kv95/enc=false/nodes=3/tracing r=[dhartunian,stevendanna] a=tbg

We don't have a good grasp on what it means for performance to turn on
tracing globally. A kv95 (read-mostly) workload is a good baseline here
as we expect it to be among the most severely impacted.

Touches #71694.
Touches #58610.

Release note: None


Co-authored-by: David Hartunian <davidh@cockroachlabs.com>
Co-authored-by: Matthew Todd <todd@cockroachlabs.com>
Co-authored-by: Yahor Yuzefovich <yahor@cockroachlabs.com>
Co-authored-by: Tobias Grieger <tobias.b.grieger@gmail.com>
@dhartunian
Copy link
Collaborator

After discussing with @knz and @andreimatei we've decided on the following two-part solution to this problem:

  1. Reintroduce the original fix from @tbg in release-21.2: default redactability to "off" #71610. This approach was set aside in favor of release-21.2: revert trace redactability, drop tenant-bound recordings instead #71606 during the stability period since we favored a simpler solution instead of one that introduced more conditional logic. This will ensure that the redaction code is utilized in the release branch, but not on single-tenant deployments where it caused performance regressions we didn't want to ship with. These performance hits will be easier to manage on multi-tenant deployments where we will never enable tracing on all queries anyway.

  2. In situations where we do want redaction enabled between kv and tenant layers, we will add redactabilty to structured payloads. We will ensure that contention payloads are passed through properly to keep sql observability features that use those payloads operational in multi-tenant deployments. After that, we will progressively add more granular redaction to structured payloads to facilitate debugging. Since usage of log.VEventf is difficult to improve performance-wise, we may choose to omit those log statements from logs that pass from kv to the tenant layer in situations where we enforce redaction.

@dhartunian
Copy link
Collaborator

After discussing this issue with more folks, it seems that the original need for #71606 is not as strong as initially assumed. We do not want to keep a "wall" between tenants and KV when tracing and customers can expect to learn and interpret information from the KV layer to debug query execution on multi-tenant clusters.

In that spirit, I will be simplifying the immediate work on the release-21.2 branch: simply reverting the final commit in #71606 (fc95ad8) to drop tenant-bound trace recordings. Our existing boundaries between tenants should do a good job of preventing leakage of information across the tenant boundary since all range-related info will be confined to the tenant.

Separately, we will continue to pursue trace redaction on master for the 22.1 release for customers with the intent of reducing PII captured in traces. That concern maps equally over both single and multi-tenant customers and the code currently on release-21.2 does not offer new functionality here that the revert I described will regress on.

@andreimatei
Copy link
Contributor

Our existing boundaries between tenants should do a good job of preventing leakage of information across the tenant boundary since all range-related info will be confined to the tenant.

I imagine that the original reason we panicked about these recording coming from tenants was about RPCs that internally cross range and tenant boundaries. Like I'm thinking a scan that internally does a PushTxn on a system transaction, or even on a transaction from the same tenant but where we have to scan the meta ranges to figure out where the respective txn record is. Is all this a ruse?

dhartunian added a commit to dhartunian/cockroach that referenced this issue Dec 1, 2021
This reverts commit fc95ad8.

This PR restores the ability for tenant traces to see unstructured
recordings from the KV layer.

Please see note in cockroachdb#71694: cockroachdb#71694 (comment)

The need to enforce such strict redaction of traces for tenant queries
is not necessary. Customer views into the KV layer will be necessary in
order to debug query execution in multi-tenant settings. Redaction is
certainly needed at this boundary but must be configured more carefully.
Further PRs as necessitated by cockroachdb#58610 and cockroachdb#71694 will tackle these
issues on master (where trace redaction is still available) and will be
backported to the release branch as needed.

This PR restores tracing into KV for tenants and unblocks the ability
for serverless to use the release-21.2 branch.

Release note (ops change): Tracing queries from tenants will now include
unstructured log messages from the kv layer.
@dhartunian
Copy link
Collaborator

I imagine that the original reason we panicked about these recording coming from tenants was about RPCs that internally cross range and tenant boundaries. Like I'm thinking a scan that internally does a PushTxn on a system transaction, or even on a transaction from the same tenant but where we have to scan the meta ranges to figure out where the respective txn record is. Is all this a ruse?

How viable is it to audit all VEvent calls for stuff like this? It seems like the majority stay within Range boundaries so that's why a globally optimal redaction solution is harder to get to right now.

@tbg
Copy link
Member Author

tbg commented Dec 1, 2021

We need redaction on that boundary for sure (if the proposal is to leak unredacted KV traces to tenants then I think we need to talk about this again, certainly our security team would want to sign off on that and I'd be surprised if they did). But we do not need to prevent redacted KV trace recordings from reaching the tenant. This isn't thought to be useful for users directly but it is important for our ability to support users, as we don't currently have infrastructure to capture these traces internally.

dhartunian added a commit to dhartunian/cockroach that referenced this issue Dec 2, 2021
Prior commits added the ability to set a flag on the Tracer to determine
whether Span's should be marked redactable or not. This commit adds
logic to selectively force redactability enabled on Span's for requests
that come from a non-system tenant.

The desired outcome is to preserve redactability for tenant traces while
preserving a non-redactable code path for requests using the system
tenant. This preserves existing functionality in production while
keeping the current performance hit of redaction from propagating to
non-tenant deployments.

Resolves cockroachdb#71694

Release note: None
dhartunian added a commit to dhartunian/cockroach that referenced this issue Dec 2, 2021
Prior commits added the ability to set a flag on the Tracer to determine
whether Span's should be marked redactable or not. This commit adds
logic to selectively force redactability enabled on Span's for requests
that come from a non-system tenant.

The desired outcome is to preserve redactability for tenant traces while
preserving a non-redactable code path for requests using the system
tenant. This preserves existing functionality in production while
keeping the current performance hit of redaction from propagating to
non-tenant deployments.

Resolves cockroachdb#71694

Release note: None
dhartunian added a commit to dhartunian/cockroach that referenced this issue Dec 2, 2021
Prior commits added the ability to set a flag on the Tracer to determine
whether Span's should be marked redactable or not. This commit adds
logic to selectively force redactability enabled on Span's for requests
that come from a non-system tenant.

The desired outcome is to preserve redactability for tenant traces while
preserving a non-redactable code path for requests using the system
tenant. This preserves existing functionality in production while
keeping the current performance hit of redaction from propagating to
non-tenant deployments.

Resolves cockroachdb#71694

Release note: None
@dhartunian
Copy link
Collaborator

There’s been a lot of back and forth on how to address this ticket in light of Serverless needs on the release-21.2 branch so I wanted to quickly summarize the reasoning behind all the commits that are in the PR I have up on the release branch right now: #73117

Here’s a summary of conflicting needs:

  1. Trace redactability as implemented on master via log,kvserver: hand redacted KV traces to tenants #70562 created performance regressions that we could not ship in 21.2. Hence, this work was reverted on release-21.2.

  2. Trace redactability is critical for Serverless needs due to the fact that KV traces can contain sensitive information that crosses range (and therefore tenant) boundaries. Currently, the branch that Serverless launched with contains the PR above and takes the performance hit in order to maintain redactability for traces.

  3. Serverless needs to switch to using the release-21.2 branch in order to get back in line with the rest of the codebase. It branched during launch for stability. It currently cannot do this because of release-21.2: revert trace redactability, drop tenant-bound recordings instead #71606 which dropped redactability.

  4. We cannot “turn off” KV tracing data for Serverless tenants as the PR mentioned in (3) does because we need them for debugging customer queries.

What are we doing?

The way forward is contained in #73117 which:

Consequences of the above PR:

  • Serverless can use release-21.2 branch without losing redactable traces
  • Dedicated/On-Prem customers can continue using 21.2 patches without redactability being enabled on their traces (avoiding the performance cost associated)

Alternative paths considered

  • Don’t redact traces.
    • This creates a security issue where sensitive info from other tenants might become visible to Serverless customers
  • Don’t redact traces. Turn off KV traces for Serverless
    • At this point this would be a feature regression on Serverless
    • We need these traces to debug customer queries ourselves and there is no easy mechanism for managing different kinds of redaction for different users
  • Don’t redact traces. Omit the KV trace data in the SQL shell when “SHOW TRACE FOR SESSION” is used (but continue showing in logs)
    • This would keep the trace data only where we have access (logs) and omit from customer eyes (customers can see SQL, but not logs)
    • This is a hacky solution that breaks the way a feature works for customers
    • Relies on customers not having access to logs which may not be true in the future
    • Requires debugging queries to look at logs which likely requires SRE intervention or something
  • Make trace redactability a cluster setting
    • Don’t want to add a cluster setting on release branch…is this even allowed?

dhartunian added a commit to dhartunian/cockroach that referenced this issue Dec 2, 2021
Prior commits added the ability to set a flag on the Tracer to determine
whether Span's should be marked redactable or not. This commit adds
logic to selectively force redactability enabled on Span's for requests
that come from a non-system tenant.

The desired outcome is to preserve redactability for tenant traces while
preserving a non-redactable code path for requests using the system
tenant. This preserves existing functionality in production while
keeping the current performance hit of redaction from propagating to
non-tenant deployments.

Resolves cockroachdb#71694

Release note: None
dhartunian added a commit to dhartunian/cockroach that referenced this issue Dec 3, 2021
Prior commits added the ability to set a flag on the Tracer to determine
whether Span's should be marked redactable or not. This commit adds
logic to selectively force redactability enabled on Span's for requests
that come from a non-system tenant.

The desired outcome is to preserve redactability for tenant traces while
preserving a non-redactable code path for requests using the system
tenant. This preserves existing functionality in production while
keeping the current performance hit of redaction from propagating to
non-tenant deployments.

Resolves cockroachdb#71694

Release note: None
dhartunian added a commit to dhartunian/cockroach that referenced this issue Dec 3, 2021
Prior commits added the ability to set a flag on the Tracer to determine
whether Span's should be marked redactable or not. This commit adds
logic to selectively force redactability enabled on Span's for requests
that come from a non-system tenant.

The desired outcome is to preserve redactability for tenant traces while
preserving a non-redactable code path for requests using the system
tenant. This preserves existing functionality in production while
keeping the current performance hit of redaction from propagating to
non-tenant deployments.

Resolves cockroachdb#71694

Release note: None
dhartunian added a commit to dhartunian/cockroach that referenced this issue Dec 6, 2021
Prior commits added the ability to set a flag on the Tracer to determine
whether Span's should be marked redactable or not. This commit adds
logic to selectively force redactability enabled on Spans for requests
that come from a non-system tenant.

The desired outcome is to preserve redactability for tenant traces while
preserving a non-redactable code path for requests using the system
tenant. This preserves existing functionality in production while
keeping the current performance hit of fine grained redaction from
propagating to non-tenant deployments.

Resolves cockroachdb#71694

Release note: None
dhartunian added a commit to dhartunian/cockroach that referenced this issue Dec 8, 2021
Prior commits added the ability to set a flag on the Tracer to determine
whether Span's should be marked redactable or not. This commit adds
logic to selectively force redactability enabled on Spans for requests
that come from a non-system tenant.

The desired outcome is to preserve redactability for tenant traces while
preserving a non-redactable code path for requests using the system
tenant. This preserves existing functionality in production while
keeping the current performance hit of fine grained redaction from
propagating to non-tenant deployments.

Resolves cockroachdb#71694

Release note: None
dhartunian added a commit to dhartunian/cockroach that referenced this issue Dec 8, 2021
Prior commits added the ability to set a flag on the Tracer to determine
whether Span's should be marked redactable or not. This commit adds
logic to selectively force redactability enabled on Spans for requests
that come from a non-system tenant.

The desired outcome is to preserve redactability for tenant traces while
preserving a non-redactable code path for requests using the system
tenant. This preserves existing functionality in production while
keeping the current performance hit of fine grained redaction from
propagating to non-tenant deployments.

Benchmarking this against master results in the following improvements.
Verbose redaction is enabled in the test via SET tracing = on.

* verbose-redaction-always.txt refers to master @ a64af2c
which has redaction turned on always
* verbose-redaction-conditional-off.txt refers to the top of this PR @
51a8c72 which implements conditional redaction and turns it off for the
test

We gain back a significant portion of the performance hit.

```
10:39:06 ❯ benchstat verbose-redaction-always.txt  verbose-redaction-conditional-off.txt
name                                                        old time/op    new time/op    delta
Tracing/1node/scan/trace=100%/redactable=on/verbose=on-8       890µs ± 8%     795µs ±10%  -10.68%  (p=0.000 n=10+10)
Tracing/1node/insert/trace=100%/redactable=on/verbose=on-8    1.05ms ± 4%    0.95ms ± 1%   -9.17%  (p=0.000 n=9+8)
Tracing/3node/scan/trace=100%/redactable=on/verbose=on-8       921µs ± 4%     875µs ± 9%   -5.09%  (p=0.006 n=9+9)
Tracing/3node/insert/trace=100%/redactable=on/verbose=on-8    1.12ms ±21%    1.02ms ± 5%   -9.35%  (p=0.010 n=10+9)

name                                                        old alloc/op   new alloc/op   delta
Tracing/1node/scan/trace=100%/redactable=on/verbose=on-8       133kB ± 0%     135kB ± 0%   +1.76%  (p=0.000 n=9+9)
Tracing/1node/insert/trace=100%/redactable=on/verbose=on-8     165kB ± 0%     164kB ± 0%   -0.47%  (p=0.000 n=10+9)
Tracing/3node/scan/trace=100%/redactable=on/verbose=on-8       170kB ± 0%     170kB ± 1%     ~     (p=0.796 n=10+10)
Tracing/3node/insert/trace=100%/redactable=on/verbose=on-8     154kB ± 2%     155kB ± 1%     ~     (p=0.796 n=10+10)

name                                                        old allocs/op  new allocs/op  delta
Tracing/1node/scan/trace=100%/redactable=on/verbose=on-8       2.00k ± 0%     1.76k ± 0%  -12.02%  (p=0.000 n=9+8)
Tracing/1node/insert/trace=100%/redactable=on/verbose=on-8     2.36k ± 0%     2.01k ± 0%  -14.64%  (p=0.000 n=10+10)
Tracing/3node/scan/trace=100%/redactable=on/verbose=on-8       1.92k ± 0%     1.81k ± 1%   -5.69%  (p=0.000 n=10+10)
Tracing/3node/insert/trace=100%/redactable=on/verbose=on-8     1.68k ± 1%     1.56k ± 1%   -7.45%  (p=0.000 n=10+10)
```

Resolves cockroachdb#71694

Release note: None
dhartunian added a commit to dhartunian/cockroach that referenced this issue Dec 8, 2021
Prior commits added the ability to set a flag on the Tracer to determine
whether Span's should be marked redactable or not. This commit adds
logic to selectively force redactability enabled on Spans for requests
that come from a non-system tenant.

The desired outcome is to preserve redactability for tenant traces while
preserving a non-redactable code path for requests using the system
tenant. This preserves existing functionality in production while
keeping the current performance hit of fine grained redaction from
propagating to non-tenant deployments.

Benchmarking this against master results in the following improvements.
Verbose redaction is enabled in the test via SET tracing = on.

* verbose-redaction-always.txt refers to master @ a64af2c
which has redaction turned on always
* verbose-redaction-conditional-off.txt refers to the top of this PR @
51a8c72 which implements conditional redaction and turns it off for the
test

We gain back a significant portion of the performance hit.

```
10:39:06 ❯ benchstat verbose-redaction-always.txt  verbose-redaction-conditional-off.txt
name                                                        old time/op    new time/op    delta
Tracing/1node/scan/trace=100%/redactable=on/verbose=on-8       890µs ± 8%     795µs ±10%  -10.68%  (p=0.000 n=10+10)
Tracing/1node/insert/trace=100%/redactable=on/verbose=on-8    1.05ms ± 4%    0.95ms ± 1%   -9.17%  (p=0.000 n=9+8)
Tracing/3node/scan/trace=100%/redactable=on/verbose=on-8       921µs ± 4%     875µs ± 9%   -5.09%  (p=0.006 n=9+9)
Tracing/3node/insert/trace=100%/redactable=on/verbose=on-8    1.12ms ±21%    1.02ms ± 5%   -9.35%  (p=0.010 n=10+9)

name                                                        old alloc/op   new alloc/op   delta
Tracing/1node/scan/trace=100%/redactable=on/verbose=on-8       133kB ± 0%     135kB ± 0%   +1.76%  (p=0.000 n=9+9)
Tracing/1node/insert/trace=100%/redactable=on/verbose=on-8     165kB ± 0%     164kB ± 0%   -0.47%  (p=0.000 n=10+9)
Tracing/3node/scan/trace=100%/redactable=on/verbose=on-8       170kB ± 0%     170kB ± 1%     ~     (p=0.796 n=10+10)
Tracing/3node/insert/trace=100%/redactable=on/verbose=on-8     154kB ± 2%     155kB ± 1%     ~     (p=0.796 n=10+10)

name                                                        old allocs/op  new allocs/op  delta
Tracing/1node/scan/trace=100%/redactable=on/verbose=on-8       2.00k ± 0%     1.76k ± 0%  -12.02%  (p=0.000 n=9+8)
Tracing/1node/insert/trace=100%/redactable=on/verbose=on-8     2.36k ± 0%     2.01k ± 0%  -14.64%  (p=0.000 n=10+10)
Tracing/3node/scan/trace=100%/redactable=on/verbose=on-8       1.92k ± 0%     1.81k ± 1%   -5.69%  (p=0.000 n=10+10)
Tracing/3node/insert/trace=100%/redactable=on/verbose=on-8     1.68k ± 1%     1.56k ± 1%   -7.45%  (p=0.000 n=10+10)
```

Resolves cockroachdb#71694

Release note: None
dhartunian added a commit to dhartunian/cockroach that referenced this issue Dec 9, 2021
Prior commits added the ability to set a flag on the Tracer to determine
whether Span's should be marked redactable or not. This commit adds
a cluster setting that will enable fine-grained trace redactability. The
intended use is for multi-tenant deployments where KV traces should be
redacted prior to being returned to tenants. Enabling the flag will enable
fine-grained redaction so that tenant operators can continue to debug
their queries via redacted kv trace information instead of having that
information stripped out.

The desired outcome is to preserve redactability for tenant traces while
preserving a non-redactable code path for requests using the system
tenant. This preserves existing functionality in production while
keeping the current performance hit of fine grained redaction from
propagating to non-tenant deployments.

Benchmarking this against master results in the following improvements.
Verbose redaction is enabled in the test via SET tracing = on.

* verbose-redaction-always.txt refers to master @ a64af2c
which has redaction turned on always
* verbose-redaction-conditional-off.txt refers to the top of this PR @
51a8c72 which implements conditional redaction and turns it off for the
test

We gain back a significant portion of the performance hit.

```
10:39:06 ❯ benchstat verbose-redaction-always.txt  verbose-redaction-conditional-off.txt
name                                                        old time/op    new time/op    delta
Tracing/1node/scan/trace=100%/redactable=on/verbose=on-8       890µs ± 8%     795µs ±10%  -10.68%  (p=0.000 n=10+10)
Tracing/1node/insert/trace=100%/redactable=on/verbose=on-8    1.05ms ± 4%    0.95ms ± 1%   -9.17%  (p=0.000 n=9+8)
Tracing/3node/scan/trace=100%/redactable=on/verbose=on-8       921µs ± 4%     875µs ± 9%   -5.09%  (p=0.006 n=9+9)
Tracing/3node/insert/trace=100%/redactable=on/verbose=on-8    1.12ms ±21%    1.02ms ± 5%   -9.35%  (p=0.010 n=10+9)

name                                                        old alloc/op   new alloc/op   delta
Tracing/1node/scan/trace=100%/redactable=on/verbose=on-8       133kB ± 0%     135kB ± 0%   +1.76%  (p=0.000 n=9+9)
Tracing/1node/insert/trace=100%/redactable=on/verbose=on-8     165kB ± 0%     164kB ± 0%   -0.47%  (p=0.000 n=10+9)
Tracing/3node/scan/trace=100%/redactable=on/verbose=on-8       170kB ± 0%     170kB ± 1%     ~     (p=0.796 n=10+10)
Tracing/3node/insert/trace=100%/redactable=on/verbose=on-8     154kB ± 2%     155kB ± 1%     ~     (p=0.796 n=10+10)

name                                                        old allocs/op  new allocs/op  delta
Tracing/1node/scan/trace=100%/redactable=on/verbose=on-8       2.00k ± 0%     1.76k ± 0%  -12.02%  (p=0.000 n=9+8)
Tracing/1node/insert/trace=100%/redactable=on/verbose=on-8     2.36k ± 0%     2.01k ± 0%  -14.64%  (p=0.000 n=10+10)
Tracing/3node/scan/trace=100%/redactable=on/verbose=on-8       1.92k ± 0%     1.81k ± 1%   -5.69%  (p=0.000 n=10+10)
Tracing/3node/insert/trace=100%/redactable=on/verbose=on-8     1.68k ± 1%     1.56k ± 1%   -7.45%  (p=0.000 n=10+10)
```

Resolves cockroachdb#71694

Release note: None
craig bot pushed a commit that referenced this issue Dec 9, 2021
73405: server, tracing: Make traces redactable and default to "off" r=andreimatei a=dhartunian

Multi-tenant deployments require tracing support
in order to help customers and engineers debug query performance
issues. This requires that trace information recorded on the KV layer
be redacted prior to being transmitted to the tenant in order to
guard against data leakage through the multi-tenant abstractions.

This change makes tracing redactability configurable and adds a cluster
setting to enable fine-grained redactability that we can use for
multi-tenant deployments.

The desired outcome is to preserve redactability for tenant traces while
preserving a non-redactable code path for requests using the system
tenant. This preserves existing functionality in production while
keeping the current performance hit of redaction from propagating to
non-tenant deployments.

This PR brings `master` in parity with changes in `release-21.2` via #73117.

Benchmarking this against master results in the following improvements.
old.txt is using `master` @ 65920cf
new.txt is using this commit @ 7c768ac
Benchmark was run using `gceworker`.

We gain back a significant portion of the performance hit:
```
david@davidh:~/go/src/github.com/cockroachdb/cockroach$ benchstat old.txt new.txt
name                                              old time/op    new time/op    delta
Tracing/1node/scan/trace=100%/threshold=1ms-24      1.20ms ± 4%    1.15ms ± 2%   -4.83%  (p=0.000 n=10+9)
Tracing/1node/insert/trace=100%/threshold=1ms-24    1.51ms ± 3%    1.40ms ± 2%   -6.71%  (p=0.000 n=10+10)
Tracing/3node/scan/trace=100%/threshold=1ms-24      2.17ms ± 3%    2.10ms ± 4%   -3.26%  (p=0.003 n=10+10)
Tracing/3node/insert/trace=100%/threshold=1ms-24    2.41ms ± 3%    2.30ms ± 3%   -4.45%  (p=0.000 n=8+9)

name                                              old alloc/op   new alloc/op   delta
Tracing/1node/scan/trace=100%/threshold=1ms-24       115kB ± 1%     120kB ± 1%   +4.39%  (p=0.000 n=10+10)
Tracing/1node/insert/trace=100%/threshold=1ms-24     149kB ± 1%     152kB ± 0%   +1.92%  (p=0.000 n=10+9)
Tracing/3node/scan/trace=100%/threshold=1ms-24       276kB ± 3%     282kB ± 3%   +2.18%  (p=0.035 n=10+10)
Tracing/3node/insert/trace=100%/threshold=1ms-24     265kB ± 1%     264kB ± 2%     ~     (p=0.963 n=8+9)

name                                              old allocs/op  new allocs/op  delta
Tracing/1node/scan/trace=100%/threshold=1ms-24       1.68k ± 0%     1.47k ± 0%  -12.16%  (p=0.000 n=9+9)
Tracing/1node/insert/trace=100%/threshold=1ms-24     2.05k ± 0%     1.76k ± 0%  -14.46%  (p=0.000 n=10+8)
Tracing/3node/scan/trace=100%/threshold=1ms-24       3.10k ± 0%     2.85k ± 0%   -8.21%  (p=0.000 n=10+10)
Tracing/3node/insert/trace=100%/threshold=1ms-24     3.00k ± 0%     2.66k ± 0%  -11.30%  (p=0.000 n=9+10)
```

Resolves #71694

Release note: None

Resolves #71694

Release note: None

Co-authored-by: Tobias Grieger <tobias.b.grieger@gmail.com>
Co-authored-by: David Hartunian <davidh@cockroachlabs.com>
dhartunian added a commit to dhartunian/cockroach that referenced this issue Dec 9, 2021
Prior commits added the ability to set a flag on the Tracer to determine
whether Span's should be marked redactable or not. This commit adds
a cluster setting that will enable fine-grained trace redactability. The
intended use is for multi-tenant deployments where KV traces should be
redacted prior to being returned to tenants. Enabling the flag will enable
fine-grained redaction so that tenant operators can continue to debug
their queries via redacted kv trace information instead of having that
information stripped out.

The desired outcome is to preserve redactability for tenant traces while
preserving a non-redactable code path for requests using the system
tenant. This preserves existing functionality in production while
keeping the current performance hit of fine grained redaction from
propagating to non-tenant deployments.

Benchmarking this against master results in the following improvements.
old.txt is using `master` @ 65920cf
new.txt is using this commit @ 7c768ac
Benchmark was run using `gceworker`.

We gain back a significant portion of the performance hit:
```
david@davidh:~/go/src/github.com/cockroachdb/cockroach$ benchstat old.txt new.txt
name                                              old time/op    new time/op    delta
Tracing/1node/scan/trace=100%/threshold=1ms-24      1.20ms ± 4%    1.15ms ± 2%   -4.83%  (p=0.000 n=10+9)
Tracing/1node/insert/trace=100%/threshold=1ms-24    1.51ms ± 3%    1.40ms ± 2%   -6.71%  (p=0.000 n=10+10)
Tracing/3node/scan/trace=100%/threshold=1ms-24      2.17ms ± 3%    2.10ms ± 4%   -3.26%  (p=0.003 n=10+10)
Tracing/3node/insert/trace=100%/threshold=1ms-24    2.41ms ± 3%    2.30ms ± 3%   -4.45%  (p=0.000 n=8+9)

name                                              old alloc/op   new alloc/op   delta
Tracing/1node/scan/trace=100%/threshold=1ms-24       115kB ± 1%     120kB ± 1%   +4.39%  (p=0.000 n=10+10)
Tracing/1node/insert/trace=100%/threshold=1ms-24     149kB ± 1%     152kB ± 0%   +1.92%  (p=0.000 n=10+9)
Tracing/3node/scan/trace=100%/threshold=1ms-24       276kB ± 3%     282kB ± 3%   +2.18%  (p=0.035 n=10+10)
Tracing/3node/insert/trace=100%/threshold=1ms-24     265kB ± 1%     264kB ± 2%     ~     (p=0.963 n=8+9)

name                                              old allocs/op  new allocs/op  delta
Tracing/1node/scan/trace=100%/threshold=1ms-24       1.68k ± 0%     1.47k ± 0%  -12.16%  (p=0.000 n=9+9)
Tracing/1node/insert/trace=100%/threshold=1ms-24     2.05k ± 0%     1.76k ± 0%  -14.46%  (p=0.000 n=10+8)
Tracing/3node/scan/trace=100%/threshold=1ms-24       3.10k ± 0%     2.85k ± 0%   -8.21%  (p=0.000 n=10+10)
Tracing/3node/insert/trace=100%/threshold=1ms-24     3.00k ± 0%     2.66k ± 0%  -11.30%  (p=0.000 n=9+10)
```

Resolves cockroachdb#71694

Release note: None
@craig craig bot closed this as completed in 5cbdec8 Dec 9, 2021
dhartunian added a commit to dhartunian/cockroach that referenced this issue Dec 9, 2021
Prior commits added the ability to set a flag on the Tracer to determine
whether Span's should be marked redactable or not. This commit adds
a cluster setting that will enable fine-grained trace redactability. The
intended use is for multi-tenant deployments where KV traces should be
redacted prior to being returned to tenants. Enabling the flag will enable
fine-grained redaction so that tenant operators can continue to debug
their queries via redacted kv trace information instead of having that
information stripped out.

The desired outcome is to preserve redactability for tenant traces while
preserving a non-redactable code path for requests using the system
tenant. This preserves existing functionality in production while
keeping the current performance hit of fine grained redaction from
propagating to non-tenant deployments.

Benchmarking this against master results in the following improvements.
old.txt is using `master` @ 65920cf
new.txt is using this commit @ 7c768ac
Benchmark was run using `gceworker`.

We gain back a significant portion of the performance hit:
```
david@davidh:~/go/src/github.com/cockroachdb/cockroach$ benchstat old.txt new.txt
name                                              old time/op    new time/op    delta
Tracing/1node/scan/trace=100%/threshold=1ms-24      1.20ms ± 4%    1.15ms ± 2%   -4.83%  (p=0.000 n=10+9)
Tracing/1node/insert/trace=100%/threshold=1ms-24    1.51ms ± 3%    1.40ms ± 2%   -6.71%  (p=0.000 n=10+10)
Tracing/3node/scan/trace=100%/threshold=1ms-24      2.17ms ± 3%    2.10ms ± 4%   -3.26%  (p=0.003 n=10+10)
Tracing/3node/insert/trace=100%/threshold=1ms-24    2.41ms ± 3%    2.30ms ± 3%   -4.45%  (p=0.000 n=8+9)

name                                              old alloc/op   new alloc/op   delta
Tracing/1node/scan/trace=100%/threshold=1ms-24       115kB ± 1%     120kB ± 1%   +4.39%  (p=0.000 n=10+10)
Tracing/1node/insert/trace=100%/threshold=1ms-24     149kB ± 1%     152kB ± 0%   +1.92%  (p=0.000 n=10+9)
Tracing/3node/scan/trace=100%/threshold=1ms-24       276kB ± 3%     282kB ± 3%   +2.18%  (p=0.035 n=10+10)
Tracing/3node/insert/trace=100%/threshold=1ms-24     265kB ± 1%     264kB ± 2%     ~     (p=0.963 n=8+9)

name                                              old allocs/op  new allocs/op  delta
Tracing/1node/scan/trace=100%/threshold=1ms-24       1.68k ± 0%     1.47k ± 0%  -12.16%  (p=0.000 n=9+9)
Tracing/1node/insert/trace=100%/threshold=1ms-24     2.05k ± 0%     1.76k ± 0%  -14.46%  (p=0.000 n=10+8)
Tracing/3node/scan/trace=100%/threshold=1ms-24       3.10k ± 0%     2.85k ± 0%   -8.21%  (p=0.000 n=10+10)
Tracing/3node/insert/trace=100%/threshold=1ms-24     3.00k ± 0%     2.66k ± 0%  -11.30%  (p=0.000 n=9+10)
```

Resolves cockroachdb#71694

Release note: None
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants