-
Notifications
You must be signed in to change notification settings - Fork 164
Sensitive Data Handling #100
Changes from all commits
bc78353
b95e94c
509e83b
7349386
f142c93
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,106 @@ | ||
# Sensitive Data Handling | ||
|
||
By default, OpenTelemetry libraries never capture potentially-sensitive data, except for the full URL. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. What are "OpenTelemetry libraries"? Are these the API + SDK implementations or instrumentations/integrations that are provided by the OpenTelemetry project? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This would include any current and future auto-instrumenter (like Java, .NET, node.js) as well as any manually wrapped libraries(e.g., in a contrib package). There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I see. It would make sense to explicitly mention the scope to which this OTEP and its design goals apply to in the document. |
||
|
||
## Motivation | ||
|
||
Many businesses are under regulatory or contractual constraints on how they process sensitive data (e.g., GDPR, | ||
PCI, HIPPA). If we are "safe by default" and manage this issue well, it will help greatly with | ||
OpenTelemetry adoption in many markets. | ||
|
||
## Explanation | ||
|
||
By default, our java auto-instrumentation library does not capture your app's credit card numbers, taxpayer IDs, | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Is the Java auto-instrumentation mentioned as an example or does this only apply to the Java auto-instrumentation? Please clarify. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Sticking with the sql example, it would also apply to the node.js instrumentation in opentelemetry-js, including the opentelemetry-plugin-mysql and opentelemetry-plugin-postgres packages. It doesn't look like the nascent .NET auto agent would be affected (yet). |
||
personal medical data, users' passwords, etc. Unless you put them in the URL. If you have put them in the URL, we provide a way to fix that. | ||
|
||
How do we do that? For example, we don't know the semantics of the SQL your app sends to its database(s). Some of the data might be | ||
sensitive, and some of it might not. What we do is remove all | ||
*potentially* sensitive values. This means that sql like `SELECT * FROM USERS WHERE ID=123456789` gets turned into | ||
`SELECT * FROM USERS WHERE ID=?` when it gets put into our `db.statement` attribute. Since our library doesn't know anything about the | ||
meaning of `ID` values, we decided to be "safe by default" and drop all numeric and string values entirely from all SQL. So what gets | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Does this work for all sorts of SQL dialects out there? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yes, a sql lexer can easily broadly apply to many vendors' dialects, because it's only interested in tokenization and not in syntax/grammar. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
I don't see how this can possibly be done efficiently inside the language SDKs, as opposed to scrubbing this in the ingestion pipeline. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Only seeing this otep now as it was linked to from another newer otep -- I'd like to see something like this added so going over this one now. I don't think this is a good example since the db statement should already be parameterized, like an extended query in postgres. Are there many databases that don't support a form of query + args like found in SQL databases? So for db instrumentation we should document to the user that they should not manually construct queries with dynamic data in general but most importantly for the case of instrumentation, to not do it with sensitive data. And instead the query parameter attributes need to be either not included at all or have the ability to be filtered by key. This would require no SQL parsing library and be doable in all of the implementations. Of course best practices aren't always an option so having the SQL parsing that can scrub all predicates in the collector would be a must have for some. I also think this relates to span processors and how they need updating to provide functionality like this. I planned for that to be a follow up that builds on my OTEP #186 but since it has not been met with a positive reception so far I may have to rethink that :) This OTEP is about a "broad agreement" and I agree with that but put the rest here because I think the broad agreement should include span transformations in the export pipeline that doesn't yet exist, and that one of those transformations should be a required configurable attribute scrubber. |
||
captured in our system is just table names, field names, clause structures, etc. (metadata about your data's structure and how you access it). | ||
|
||
Yes, this also has the pleasant side effect of reducing the cardinality of these dimensions, making it easier to build metrics or indices on them. | ||
If you want to reduce it further, our collector has features for that. Again, our design goal is to protect you from | ||
liability or exposure due to captured sensitive data. | ||
|
||
We use similar techniques for all other protocols/query/command mechanisms. One exception is how we treat URLs. We will by | ||
default capture the full URL (including the query and fragment) in `http.url`. Note that this will not include | ||
the contents of form submissions or XHR posts. Since many products/features log the entire URL already | ||
(like in Apache's access log), developers already need to consider this. If you are putting sensitive data in the URL, | ||
we offer configuration options to work around this. | ||
|
||
If your app doesn't process any sensitive data at all and you want to see the raw sql values, you can configure that. The documentation | ||
will have some bold/red warnings telling you to be sure you know what you're doing. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Is this something to be configured in the UI of a backend or at the place where the instrumentation is set up? Are these bold red warnings supposed to show up in a log produced by the instrumentation or in the tracing backend's UI? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Config would remain local to each library (a manual wrapper might provide a dictionary of options as an API parameter, an auto agent might use an environment variable or config file). The warning would appear in the documentation of this config knob. |
||
|
||
I'd like to repeat here that I'm using SQL as an *example* of how we treat potentially-sensitive data in our ecosystem. We treat every other protocol/query format | ||
with the same care. | ||
|
||
## Internal details | ||
|
||
I'm looking here for broad agreement of a design principle, rather than feedback on any specific low-level string manipulation technique to | ||
achieve these goals. | ||
|
||
That said, I have worked on sql normalization at three prior APM companies and am working on contributing a simple first version of one for the | ||
opentelemetry-auto-instr-java repo. It is based on using a lexer to parse out sql numeric and string literals and replacing them with a `?`, exactly | ||
as described above and done by many APM products on the market. | ||
|
||
One interesting technical point to raise is what to do when our normalization/scrubbing knows it isn't working properly (for | ||
example, encountering an ambiguous string it doesn't know how to parse). In some cases a safe fallback approach can be found (e.g., apply a | ||
less-sophisticated regex that bluntly removes too much from the query), but in other cases we may need to, e.g., throw an internal exception and | ||
interpret that as a requirement to not add the | ||
attribute in the first place. In pseudo-code: | ||
``` | ||
try { | ||
span.setAttribute("sensitive.field", normalize(sensitiveValue)); | ||
} catch (CannotNormalizeException e) { | ||
// don't emit sensitive.field since we can't normalize it safely | ||
} | ||
``` | ||
|
||
This work will affect all auto-instrumenting libraries, as well as any official (or recommended) manually wrapped libraries that are | ||
produced as part of the ecosystem. As appropriate to each such library, features designed to properly handle sensitive data should | ||
be highlighted in the documentation. Sticking with the SQL example, for the Java auto library this could take the form of an additional column | ||
on the "Supported Frameworks" table with a hyperlink to the documentation describing exactly what we do: | ||
|
||
| Library/Framework | Versions | Notes | | ||
|------------------------------------------------------------------------------------------------------|--------------------------------|--------------------------------------------| | ||
| [JDBC](https://docs.oracle.com/en/java/javase/11/docs/api/java.sql/java/sql/package-summary.html) | Java 7+ | (link to docs about sql normalization) | | ||
|
||
|
||
## Trade-offs and mitigations | ||
|
||
In some cases, proper "safe by default" behavior might involve more intensive processing than can reasonably be performed inline with the application. There have | ||
already been some discussions about having span (and attribute) processing logic run in-process but off-thread in the application (i.e., by a configurable span | ||
processing pipeline, which would use separate threads for its work). For sql normalization (my first/example concern), I don't believe this is necessary, but | ||
further work on this issue may put pressure here. | ||
|
||
One drawback of this "safe by default" approach is that it puts a configuration burden on some users who are not handling sensitive data but | ||
want to have as much visibility as possible into their system. Staying with the sql normalization example however, the vast majority of sql in the world doesn't | ||
have different performance characteristics based on the exact values passed in, but on the number of them, or the complexity of the `WHERE` clauses, | ||
the number of values in an `IN` list, presence of an index on fields, etc. | ||
|
||
This "safe by default" behavior should be applied for a "default install", including the collector. While individual instrumentation | ||
libraries can (and are encouraged to) apply appropriate scrubbing themselves, the collector will act as a cross-language backstop, with appropriate | ||
semantic transformations based on received data (e.g., looking at the`db.type` field to determine whether to treat a string | ||
as sql or a mongo query). This approach should reduce maintenance costs across the ecosystem as well as reduce pressure on needing | ||
the in-app span processing pipeline mentioned above. | ||
|
||
## Prior art and alternatives | ||
|
||
Nearly every commercial APM product on the planet has features like the sql normalization discussed above (though some take a different approach in the name of reduced | ||
cardinality). Most of them have landed on a similar answer for their treatment of URLs (though, again, some do different things for metrics features vs. tracing features). | ||
|
||
## Open questions | ||
|
||
I am not personally familiar with the OpenTelemetry stance on one value that is potentially-sensitive under some regulatory frameworks: client IP addresses | ||
of users accessing a system. This should be discussed, perhaps separately. | ||
|
||
I am not an expert at all the protocols/query formats that our libraries support. Doing this well will require a thorough audit and cooperation/discussion across | ||
the community (again, ideally in separate issues/PRs for each such feature). | ||
|
||
## Future possibilities | ||
|
||
We might want to migrate some of this functionality into external libraries such that manual instrumentation (and manually-constructed library wrappers) can take advantage of these capabilities. | ||
|
||
Additionally, work on more advanced configuration around this puts pressure on having our span processors be highly flexible, whether these are in-process or in the collector. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a pretty bold statement. Are we sure we will be able to ensure this 100%?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a design goal. Bugs are probably inevitable 😄
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think armin was asking if this is a design goal we want. The term
libraries
could mean anything from API to SDK to Instrumentation Libraries. I think the strongest statement we could make would be: