-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
release-21.2: revert trace redactability, drop tenant-bound recordings instead #71606
Conversation
Thanks for opening a backport. Please check the backport criteria before merging:
If some of the basic criteria cannot be satisfied, ensure that the exceptional criteria are satisfied within.
Add a brief release justification to the body of your PR to justify this backport. Some other things to consider:
|
Still mulling this over, don't look yet |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just want to note it here - the one commit that was part of the original PR that's not being reverted is this one - kvtenantccl: test that tenants can't see KV-level trace messages
: e0aceda
It makes sense to me that we keep this test around to verify that the temporary replacement fix in 5e94cec is sufficient.
One question on the new way of avoiding sending KV traces across the RPC boundary to non-system tenants
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @dhartunian, @nvanbenschoten, and @tbg)
pkg/server/node.go, line 1086 at r4 (raw file):
if ok
Further up the call stack in Batch() we handle !ok
by setting tenantID = roachpb.SystemTenantID
.
Should we do the same here and treat !ok
as the system tenant?
One question I have is: do we know of any scenarios where we'd expect to be missing a tenantID
in the Context
, where we wouldn't want to treat it as the system tenant?
cc @andy-kimball for a serverless perspective re: the question posed above
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tbg we decided during release triage that we would move ahead w/ reverting the commits instead of adding the new optional redaction feature. The motivation was to reduce configuration churn.
I noticed during testing that I can still see traces in tenant SQL via show trace for session
, am I missing some understanding or should we prevent the traces from showing up in there as well?
Reviewed 5 of 5 files at r1, 19 of 19 files at r2, 2 of 2 files at r3.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @dhartunian, @nvanbenschoten, and @tbg)
Touches cockroachdb#58610. The preceding commits reverted redactability of traces, so we need to find another way to prevent sensitive information from leaking to tenants. At this point, the method is crude: just don't return any trace recording to tenants at the KV boundary. This is a loss of observability that could backfire, but we can improve upon it. Note that we have coverage of this through TestTenantTracesAreRedacted. Release note: None
5e94cec
to
fc95ad8
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I noticed during testing that I can still see traces in tenant SQL via show trace for session, am I missing some understanding or should we prevent the traces from showing up in there as well?
We're only suppressing KV traces, not general trace messages (SQL itself also puts a bunch of stuff in the trace spans, I assume that is what you were seeing).
I adjusted the commit a bit to strip out only the tags and verbose log messages. This is from #70406 and is necessary because we want to preserve structured payloads (in particular contention events). We know that the redaction story for those similarly doesn't exist, but contention events should be OK and it would be too invasive to take them from tenants.
TFTRs! I updated the PR.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @andy-kimball, @dhartunian, and @nvanbenschoten)
pkg/server/node.go, line 1086 at r4 (raw file):
Previously, abarganier (Alex Barganier) wrote…
if ok
Further up the call stack in Batch() we handle
!ok
by settingtenantID = roachpb.SystemTenantID
.Should we do the same here and treat
!ok
as the system tenant?One question I have is: do we know of any scenarios where we'd expect to be missing a
tenantID
in theContext
, where we wouldn't want to treat it as the system tenant?cc @andy-kimball for a serverless perspective re: the question posed above
That's a good point, yes, the absence of a tenant indicates the system tenant. Whether we like this or not could be debated (conferring privilege based on the absence of a marker is... yucky), but it seems to be the status quo and tests forced me to make that change anyway.
I doubt Andy has much to add here.
When we discovered that tenants could "trace into" KV, we considered this a
security problem since no mechanism ensured that tenants could receive data
they are allowed to see. The KV team reacted to this with #70562, which makes
all verbose traces redactable and by redacting them at the KV-Tenant boundary.
Unfortunately, making traces redactable had a performance overhead. We have not
properly quantified this at this moment but it appears significant. More so,
it's always present, even in single-tenant deployments. This presents a stability
risk, since some customers run with tracing enabled across the bank. Once they
switch to 21.2, their system may be underprovisioned as a result of the newly grown
tracing overhead.
We can't take that risk and the release is close, so this PR pursues the
alternative of reverting the redactability and instead switching to a very
crude method of sanitizing the recordings before handing them to tenants: just
not sending any recordings.
This leads to a loss of observability on Serverless. However, incremental
improvements can be made, including re-applying the reverted commits in the
future, perhaps with performance mitigations. However, none of this should
happen in the critical path of the release at this point.
Touches #70562.