Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

release-21.2: revert trace redactability, drop tenant-bound recordings instead #71606

Merged
merged 4 commits into from
Oct 19, 2021

Conversation

tbg
Copy link
Member

@tbg tbg commented Oct 15, 2021

When we discovered that tenants could "trace into" KV, we considered this a
security problem since no mechanism ensured that tenants could receive data
they are allowed to see. The KV team reacted to this with #70562, which makes
all verbose traces redactable and by redacting them at the KV-Tenant boundary.

Unfortunately, making traces redactable had a performance overhead. We have not
properly quantified this at this moment but it appears significant. More so,
it's always present, even in single-tenant deployments. This presents a stability
risk, since some customers run with tracing enabled across the bank. Once they
switch to 21.2, their system may be underprovisioned as a result of the newly grown
tracing overhead.

We can't take that risk and the release is close, so this PR pursues the
alternative of reverting the redactability and instead switching to a very
crude method of sanitizing the recordings before handing them to tenants: just
not sending any recordings.

This leads to a loss of observability on Serverless. However, incremental
improvements can be made, including re-applying the reverted commits in the
future, perhaps with performance mitigations. However, none of this should
happen in the critical path of the release at this point.

Touches #70562.

  • Revert "kvserver: redact tenant-bound trace recordings"
  • Revert "log,tracing: make traces redactable"
  • Revert "tracing: lint against non-const string to Span.Record[f]"
  • server: don't return trace recordings to tenants

@tbg tbg requested a review from a team as a code owner October 15, 2021 09:56
@blathers-crl
Copy link

blathers-crl bot commented Oct 15, 2021

Thanks for opening a backport.

Please check the backport criteria before merging:

  • Patches should only be created for serious issues.
  • Patches should not break backwards-compatibility.
  • Patches should change as little code as possible.
  • Patches should not change on-disk formats or node communication protocols.
  • Patches should not add new functionality.
  • Patches must not add, edit, or otherwise modify cluster versions; or add version gates.
If some of the basic criteria cannot be satisfied, ensure that the exceptional criteria are satisfied within.
  • There is a high priority need for the functionality that cannot wait until the next release and is difficult to address in another way.
  • The new functionality is additive-only and only runs for clusters which have specifically “opted in” to it (e.g. by a cluster setting).
  • New code is protected by a conditional check that is trivial to verify and ensures that it only runs for opt-in clusters.
  • The PM and TL on the team that owns the changed code have signed off that the change obeys the above rules.

Add a brief release justification to the body of your PR to justify this backport.

Some other things to consider:

  • What did we do to ensure that a user that doesn’t know & care about this backport, has no idea that it happened?
  • Will this work in a cluster of mixed patch versions? Did we test that?
  • If a user upgrades a patch version, uses this feature, and then downgrades, what happens?

@cockroach-teamcity
Copy link
Member

This change is Reviewable

@tbg
Copy link
Member Author

tbg commented Oct 15, 2021

Still mulling this over, don't look yet

Copy link
Contributor

@abarganier abarganier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reverts :lgtm:

Just want to note it here - the one commit that was part of the original PR that's not being reverted is this one - kvtenantccl: test that tenants can't see KV-level trace messages : e0aceda

It makes sense to me that we keep this test around to verify that the temporary replacement fix in 5e94cec is sufficient.

One question on the new way of avoiding sending KV traces across the RPC boundary to non-system tenants

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @dhartunian, @nvanbenschoten, and @tbg)


pkg/server/node.go, line 1086 at r4 (raw file):

if ok

Further up the call stack in Batch() we handle !ok by setting tenantID = roachpb.SystemTenantID.

Should we do the same here and treat !ok as the system tenant?

One question I have is: do we know of any scenarios where we'd expect to be missing a tenantID in the Context, where we wouldn't want to treat it as the system tenant?

cc @andy-kimball for a serverless perspective re: the question posed above

Copy link
Collaborator

@dhartunian dhartunian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tbg we decided during release triage that we would move ahead w/ reverting the commits instead of adding the new optional redaction feature. The motivation was to reduce configuration churn.

I noticed during testing that I can still see traces in tenant SQL via show trace for session, am I missing some understanding or should we prevent the traces from showing up in there as well?

Reviewed 5 of 5 files at r1, 19 of 19 files at r2, 2 of 2 files at r3.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @dhartunian, @nvanbenschoten, and @tbg)

Touches cockroachdb#58610. The preceding commits reverted redactability of traces,
so we need to find another way to prevent sensitive information from
leaking to tenants. At this point, the method is crude: just don't
return any trace recording to tenants at the KV boundary. This is a loss
of observability that could backfire, but we can improve upon it.

Note that we have coverage of this through TestTenantTracesAreRedacted.

Release note: None
Copy link
Member Author

@tbg tbg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I noticed during testing that I can still see traces in tenant SQL via show trace for session, am I missing some understanding or should we prevent the traces from showing up in there as well?

We're only suppressing KV traces, not general trace messages (SQL itself also puts a bunch of stuff in the trace spans, I assume that is what you were seeing).

I adjusted the commit a bit to strip out only the tags and verbose log messages. This is from #70406 and is necessary because we want to preserve structured payloads (in particular contention events). We know that the redaction story for those similarly doesn't exist, but contention events should be OK and it would be too invasive to take them from tenants.

TFTRs! I updated the PR.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @andy-kimball, @dhartunian, and @nvanbenschoten)


pkg/server/node.go, line 1086 at r4 (raw file):

Previously, abarganier (Alex Barganier) wrote…
if ok

Further up the call stack in Batch() we handle !ok by setting tenantID = roachpb.SystemTenantID.

Should we do the same here and treat !ok as the system tenant?

One question I have is: do we know of any scenarios where we'd expect to be missing a tenantID in the Context, where we wouldn't want to treat it as the system tenant?

cc @andy-kimball for a serverless perspective re: the question posed above

That's a good point, yes, the absence of a tenant indicates the system tenant. Whether we like this or not could be debated (conferring privilege based on the absence of a marker is... yucky), but it seems to be the status quo and tests forced me to make that change anyway.

I doubt Andy has much to add here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants