Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

tracing: make all ways to associate data with a Span redactable #58610

Closed
tbg opened this issue Jan 7, 2021 · 21 comments
Closed

tracing: make all ways to associate data with a Span redactable #58610

tbg opened this issue Jan 7, 2021 · 21 comments
Labels
A-cc-enablement Pertains to current CC production issues or short-term projects A-logging In and around the logging infrastructure. A-tracing Relating to tracing in CockroachDB. branch-release-21.2 Used to mark GA and release blockers, technical advisories, and bugs for 21.2 C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)

Comments

@tbg
Copy link
Member

tbg commented Jan 7, 2021

Is your feature request related to a problem? Please describe.
Tracing information can be sent to the SQL root user, debug.zip, and external tracers. It is desirable to apply our budding redactability best practices here as well.

Describe the solution you'd like
Make all ways to associated data with a Span redactable. This primarily includes free-form text messages (log.Event and co) as well as structured payloads attached to a Span. The former should get the treatment our regular logging APIs have already received, and the latter should be required to implement SafeFormatter.

Other data associated with spans are the operation name, tags, log tags, and baggage items. Tags in particular tend to contain PII and thus need redaction.

Describe alternatives you've considered

Additional context

Jira issue: CRDB-3379

@tbg tbg added the A-tracing Relating to tracing in CockroachDB. label Jan 7, 2021
@blathers-crl
Copy link

blathers-crl bot commented Jan 7, 2021

Hi @tbg, I've guessed the C-ategory of your issue and suitably labeled it. Please re-label if inaccurate.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.

@blathers-crl blathers-crl bot added the C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. label Jan 7, 2021
@tbg tbg added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) and removed C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. labels Jan 7, 2021
@jlinder jlinder added the T-server-and-security DB Server & Security label Jun 16, 2021
@knz knz added A-logging In and around the logging infrastructure. A-cc-enablement Pertains to current CC production issues or short-term projects labels Jul 29, 2021
@knz
Copy link
Contributor

knz commented Aug 16, 2021

I understand this was considered to fall under obs infra.
cc @maryliag

@tbg
Copy link
Member Author

tbg commented Sep 20, 2021

This has popped up on my radar again in the context of #70406. I don't know how urgent it is, but at some point I suspect the serverless UX will require redacted traces so that a) we can be sure we don't leak PII to SQL tenants and b) we can continue to rely on tracing to troubleshoot problems.
It's unclear (to me) what weight b) carries. Users generally don't use the KV-level details in verbose traces very much (I think) and if so, then only in conjunction with our support teams. However, if our support term generally has to "impersonate" a tenant to get a trace for an operation it is having trouble with, and that trace is completely clipped (as in #70406), then the traces will not provide they value they do today for on-prem clusters. In other words, whatever tools we use to make customers successful, they should work on Serverless. This one, as of #70406, is lagging behind.
cc @andy-kimball do decide on what timelines changes need to be implemented here. Obs-inf will likely have to do some hard thinking here. Is the right solution to extend the concept of redactability to include the tenantID (i.e. you can say something like: this piece of information is redacted unless the recipient is tenant X) or to simply rely on KV-side tracing less? Today I think it is mostly used for timing information ("why is there such a gap between those two lines"), and perhaps regularly redacted data is already just fine. And, this being serverless, maybe there's a third way where the use of SQL session tracing isn't even something we want to further encourage.

edit: some (internal) discussion in https://cockroachlabs.slack.com/archives/C01CDD4HRC5/p1631905968047900

@andy-kimball
Copy link
Contributor

I'm not sure what you mean by "decide on what timeline changes need to be implemented here"?

@tbg tbg added branch-release-21.2 release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Sep 21, 2021
@tbg
Copy link
Member Author

tbg commented Sep 21, 2021

Discussed privately on Slack with @aaron-crl:

  • we'll patch free tier via regular scheduled upgrade
  • paid tier needs to pick up a beta that has the fix (hopefully beta.2, but perhaps beta.3)

I'll prioritize getting #70406 merged and backported to release-21.2 branch as well as 21.1.

tbg added a commit to tbg/cockroach that referenced this issue Sep 22, 2021
This commit changes the `(*Node).Batch` endpoint to redact
trace recordings that were collected on behalf of a tenant.

This notably does not affect the structured payloads. We
need at least one of them - ContentionEvent - to remain intact
as the keys within are used by SQL observability.

Touches cockroachdb#58610. It does not close it because there may be other
endpoints that support tracing, and even if there aren't, there
is nothing that keeps us from accidentally re-introducing this
problem again in the future.

Release note: None
@tbg
Copy link
Member Author

tbg commented Sep 23, 2021

I'm going to go ahead with #70562. After that PR is merged, we still have to develop a story for structured payloads. At the time of writing, (*Span).RecordStructured has dozens of callers. Unfortunately, the problem is not as easy as mandating that all payloads implement redact.SafeFormatter. For example, the key in a ContentionEvent are expected to reach the tenant to enable its SQL Observability features centered around contention, but redaction would prevent that. This either means some complication in what we ask of redaction - for example, the redaction step could be aware of the tenantID and would know to leave keys owned by this tenant intact - or a reworking of the upstream consumer features to work with redacted data. (Note that keys can be partially redacted, we can keep the descriptor and index IDs intact, stripping just the row-level data).

@tbg
Copy link
Member Author

tbg commented Sep 23, 2021

An additional question that came up in the PR is around redaction when logging to external opentracing tracers. As of #70562, no redaction is performed for these backends, but ultimately we may want some. This seems of lower priority though, lest we start wanting to do that in Serverless.

tbg added a commit to tbg/cockroach that referenced this issue Sep 23, 2021
This commit changes the `(*Node).Batch` endpoint to redact
trace recordings that were collected on behalf of a tenant.

This notably does not affect the structured payloads. We
need at least one of them - ContentionEvent - to remain intact
as the keys within are used by SQL observability.

Touches cockroachdb#58610. It does not close it because there may be other
endpoints that support tracing, and even if there aren't, there
is nothing that keeps us from accidentally re-introducing this
problem again in the future.

Release note: None
blathers-crl bot pushed a commit that referenced this issue Sep 27, 2021
This change removes the string representation of the span
being exported from the structured recording written to the
tracing span.

The change was motivated by discussions around redactabilty
and tenants being able to see unredacted traces from the KV
host server. More information can be found
#58610.

These traces are not very critical and so until there is a
larger change for structured recordings to support redactability
we can do away with them.

Informs: #58610

Release note: None
@dhartunian dhartunian changed the title tracing: redactability tracing: make all ways to associate data with a Span redactable Oct 4, 2021
andy-kimball pushed a commit that referenced this issue Oct 6, 2021
This commit changes the `(*Node).Batch` endpoint to redact
trace recordings that were collected on behalf of a tenant.

This notably does not affect the structured payloads. We
need at least one of them - ContentionEvent - to remain intact
as the keys within are used by SQL observability.

Touches #58610. It does not close it because there may be other
endpoints that support tracing, and even if there aren't, there
is nothing that keeps us from accidentally re-introducing this
problem again in the future.

Release note: None
tbg added a commit to tbg/cockroach that referenced this issue Oct 15, 2021
Touches cockroachdb#58610. The preceding commits reverted redactability of traces,
so we need to find another way to prevent sensitive information from
leaking to tenants. At this point, the method is crude: just don't
return any trace recording to tenants at the KV boundary. This is a loss
of observability that could backfire, but we can improve upon it.

Note that we have coverage of this through TestTenantTracesAreRedacted.

Release note: None
@tbg tbg added the GA-blocker label Oct 15, 2021
@tbg
Copy link
Member Author

tbg commented Oct 15, 2021

Unfortunately I have to re-add GA-blocker since the redactability work that originally addressed it is causing performance issues and will need to be reverted/mitigated.

Discussions about the options is here: #71610 (comment)
There is partial work, but either approach will not be done before mid-next week at the earliest.

@exalate-issue-sync exalate-issue-sync bot removed the T-server-and-security DB Server & Security label Oct 18, 2021
tbg added a commit to tbg/cockroach that referenced this issue Oct 19, 2021
Touches cockroachdb#58610. The preceding commits reverted redactability of traces,
so we need to find another way to prevent sensitive information from
leaking to tenants. At this point, the method is crude: just don't
return any trace recording to tenants at the KV boundary. This is a loss
of observability that could backfire, but we can improve upon it.

Note that we have coverage of this through TestTenantTracesAreRedacted.

Release note: None
@tbg
Copy link
Member Author

tbg commented Oct 19, 2021

Now that #71606 is merged, here's where we are with release-21.2:

  • the performance is ok, and tenants don't receive sensitive data, so no GA blocker
  • observability is reduced: we cannot use traces to learn internals from the KV layer, which may bite us as we support serverless.

On master, the redactability work is still present, including the performance problems. I'm going to open a separate issue about that.

@tbg
Copy link
Member Author

tbg commented Oct 19, 2021

Filed the issue for master: #71694

craig bot pushed a commit that referenced this issue Oct 19, 2021
70792: server: expect duplicate session cookies on HTTP requests r=catj-cockroach,knz a=dhartunian

Previously, the golang HTTP library was used to retrieve a cookie with
the specific name we use to store our authentication token. This API
method retrieves the *first* matching cookie in the header and ignores
the rest. This can cause issues with requests that contain multiple
matching cookies with the same name (if other applications are
potentially running on the same domain).

This change adjusts the HTTP API we use to retrieve cookies (we now
prefer to call `req.Cookies()` and iterate through the entire list
ourselves, instead of relying on the API to give us the matching cookie
since it only supports returning a single result.

Resolves #70376.

Release note (security update): Authenticated HTTP requests to nodes can
now contain additional cookies with the same name as the one we use
("session"). Duplicates like this are permitted by the HTTP spec and our
code will now attempt to parse all cookies with a matching name before
giving up. This can resolve issues with customers who run other services
on the same domain as their CRDB nodes.

71503: server: Tenant pods fan out to cancel SQL queries r=matthewtodd a=matthewtodd

Partially addresses #70832.

Previously, the tenant `/_status/cancel_query` endpoint was not
available over HTTP and could not cancel a query not running locally.
This change opens it up, so that we can call it from the cloud console,
and makes sure that that we can cancel queries when a tenant has more
than one SQL pod running.

Release note (api change): The tenant `/_status/cancel_query` endpoint
was made available over HTTP and augmented to fan out requests in a
multi-pod world.

71677: row: pass in account instead of monitor for KV fetcher r=yuzefovich a=yuzefovich

This commit switches the KV fetcher to expect a memory account rather
than a memory monitor. It additionally creates a separate memory account
for the KV fetcher when it is used by the cFetcher (previously, we were
reusing the monitor from the allocator).

Note that we don't need to update multiple places where we call methods
on possibly nil memory account because all methods have been taught to
be noops when the receiver is `nil`.

Fixes: #55481.

Release note: None

71698: roachtest: add kv95/enc=false/nodes=3/tracing r=[dhartunian,stevendanna] a=tbg

We don't have a good grasp on what it means for performance to turn on
tracing globally. A kv95 (read-mostly) workload is a good baseline here
as we expect it to be among the most severely impacted.

Touches #71694.
Touches #58610.

Release note: None


Co-authored-by: David Hartunian <davidh@cockroachlabs.com>
Co-authored-by: Matthew Todd <todd@cockroachlabs.com>
Co-authored-by: Yahor Yuzefovich <yahor@cockroachlabs.com>
Co-authored-by: Tobias Grieger <tobias.b.grieger@gmail.com>
dhartunian added a commit to dhartunian/cockroach that referenced this issue Dec 1, 2021
This reverts commit fc95ad8.

This PR restores the ability for tenant traces to see unstructured
recordings from the KV layer.

Please see note in cockroachdb#71694: cockroachdb#71694 (comment)

The need to enforce such strict redaction of traces for tenant queries
is not necessary. Customer views into the KV layer will be necessary in
order to debug query execution in multi-tenant settings. Redaction is
certainly needed at this boundary but must be configured more carefully.
Further PRs as necessitated by cockroachdb#58610 and cockroachdb#71694 will tackle these
issues on master (where trace redaction is still available) and will be
backported to the release branch as needed.

This PR restores tracing into KV for tenants and unblocks the ability
for serverless to use the release-21.2 branch.

Release note (ops change): Tracing queries from tenants will now include
unstructured log messages from the kv layer.
dhartunian added a commit to dhartunian/cockroach that referenced this issue Dec 3, 2021
This reverts commit 22ab4ea.

This commit changes the `(*Node).Batch` endpoint to redact
trace recordings that were collected on behalf of a tenant.

This notably does not affect the structured payloads. We
need at least one of them - ContentionEvent - to remain intact
as the keys within are used by SQL observability.

Touches cockroachdb#58610. It does not close it because there may be other
endpoints that support tracing, and even if there aren't, there
is nothing that keeps us from accidentally re-introducing this
problem again in the future.

Release note: None
dhartunian added a commit to dhartunian/cockroach that referenced this issue Dec 8, 2021
This reverts commit fc95ad8.

Touches cockroachdb#58610. The preceding commits reverted redactability of traces,
so we need to find another way to prevent sensitive information from
leaking to tenants. At this point, the method is crude: just don't
return any trace recording to tenants at the KV boundary. This is a loss
of observability that could backfire, but we can improve upon it.

Note that we have coverage of this through TestTenantTracesAreRedacted.

Release note: None
dhartunian added a commit to dhartunian/cockroach that referenced this issue Dec 8, 2021
This reverts commit 22ab4ea.

This commit changes the `(*Node).Batch` endpoint to redact
trace recordings that were collected on behalf of a tenant.

This notably does not affect the structured payloads. We
need at least one of them - ContentionEvent - to remain intact
as the keys within are used by SQL observability.

Touches cockroachdb#58610. It does not close it because there may be other
endpoints that support tracing, and even if there aren't, there
is nothing that keeps us from accidentally re-introducing this
problem again in the future.

Release note: None
@exalate-issue-sync exalate-issue-sync bot added the branch-release-21.2 Used to mark GA and release blockers, technical advisories, and bugs for 21.2 label Mar 5, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-cc-enablement Pertains to current CC production issues or short-term projects A-logging In and around the logging infrastructure. A-tracing Relating to tracing in CockroachDB. branch-release-21.2 Used to mark GA and release blockers, technical advisories, and bugs for 21.2 C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)
Projects
None yet
Development

No branches or pull requests

7 participants