Semantic conventions vs GDPR #128

pellared · 2023-06-22T18:21:35Z

Per https://github.com/open-telemetry/semantic-conventions/blob/9b455310519ec511656f91d1db0e30f5e32acd2a/specification/trace/semantic_conventions/http.md#http-client

url.full is currently Required.

However, the URL can contain sensitive data, e.g. personal data in GDPR terminology (PII), such as a login or an ID.

GDPR grants data subjects a lot of rights, and failing to honor them may be very problematic (e.g. https://www.digitalguardian.com/blog/google-fined-57m-data-protection-watchdog-over-gdpr-violations).

Maybe it should be Recommended similarly to device.id which also has the following notice:

semantic-conventions/specification/resource/semantic_conventions/device.md

Line 17 in 9b45531

    
> **[1]:** The device identifier MUST only be defined using the values outlined below. This value is not an advertising identifier and MUST NOT be used as such. On iOS (Swift or Objective-C), this value MUST be equal to the [vendor identifier](https://developer.apple.com/documentation/uikit/uidevice/1620059-identifierforvendor). On Android (Java or Kotlin), this value MUST be equal to the Firebase Installation ID or a globally unique UUID which is persisted across sessions in your application. More information can be found [here](https://developer.android.com/training/articles/user-data-ids) on best practices and exact implementation details. Caution should be taken when storing personal data or anything which can identify a user. GDPR and data protection laws may apply, ensure you do your own due diligence.

?

Maybe we should add a notice that the URL can contain sensitive/personal data and that one may consider deleting it using the OTel Collector's attributesprocessor?

Maybe the collection of this attribute should be configurable?

The same concerns apply to db.statement.

PS. I am sorry that the issue has more questions than answers.

pellared · 2023-06-23T10:28:17Z

What I suggest is:

1. Identify attributes that have a non-negligible probability of containing PII (Personally Identifiable Information), e.g. url.full, url.path, url.query, client.address, client.socket.address, db.statement.
2. For each of these:
   2.1. Add a caution note, e.g. "The value may store sensitive data (e.g. personal data, personally identifiable information). GDPR and data protection laws may apply, ensure you do your own due diligence."
   2.2. Make sure that it is NOT Required, so that people who do not use the Collector can choose not to collect it.

We could also consider adding some env var like OTEL_SEMCONV_HARDEN_OPT_IN to make it easier to disable such attributes.
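A minimal sketch of how such a switch could behave at the instrumentation level. The semantics are an assumption: the thread only proposes the variable name, so treating the value `true` as "do not collect attributes that may carry PII" is hypothetical, as are the helper names `hardening_enabled` and `collect_attributes`.

```python
import os

# Attributes this thread lists as likely to carry PII.
SENSITIVE_ATTRIBUTES = frozenset({
    "url.full", "url.path", "url.query",
    "client.address", "client.socket.address", "db.statement",
})


def hardening_enabled(environ=None):
    """Hypothetical reading of OTEL_SEMCONV_HARDEN_OPT_IN: 'true' turns hardening on."""
    env = os.environ if environ is None else environ
    return env.get("OTEL_SEMCONV_HARDEN_OPT_IN", "").strip().lower() == "true"


def collect_attributes(candidates, environ=None):
    """Return the attributes an instrumentation would actually record."""
    if not hardening_enabled(environ):
        return dict(candidates)
    # Hardened mode: never record the sensitive attributes in the first place.
    return {k: v for k, v in candidates.items() if k not in SENSITIVE_ATTRIBUTES}
```

With hardening on, `url.full` is never recorded, so nothing has to be removed later by a Collector. This only works if the attribute is not Required.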

Oberon00 · 2023-06-23T10:57:43Z

I see you had already contributed in that area in the past: open-telemetry/opentelemetry-specification#1502
Could it be that these considerations got lost in the current rewrite?
CC @lmolkova

pellared · 2023-06-23T11:07:34Z

@Oberon00 In open-telemetry/opentelemetry-specification#1502 I was mostly concerned about leaking login+password (I find it a lot more critical).

I felt that mentioning GDPR/sensitive data would not bring a lot of value 2 years ago as semantic conventions were in a very early stage. I was afraid that it would cause more confusion and paralysis. Right now I see a lot of contributions in semantic conventions. I think that it is a good moment to start doing something with privacy/GDPR/data protection.

reyang · 2023-06-23T20:30:49Z

I want to add one point - exception callstacks are commonly considered as privacy data.

pellared · 2023-06-23T20:30:59Z

Today I learned that one of our customers uses client (mobile) instrumentation (Android, iOS). The PII emitted via HTTP client instrumentation is very problematic for them. Because the instrumentation runs on end-user devices, it is not easy to get rid of the data afterwards. Ideally they would prefer not to emit it from the devices at all (via network).

pellared · 2023-06-23T20:37:48Z

> I want to add one point - exception callstacks are commonly considered as privacy data.

I think an exception callstack usually leaks "internal details" of the instrumented system (most telemetry leaks some internal details). I do not think callstacks would easily leak "personal data" UNLESS they also contain parameter values. I am not aware of any language/ecosystem which dumps parameter values in exception callstacks.

reyang · 2023-06-23T20:39:15Z

> @Oberon00 In open-telemetry/opentelemetry-specification#1502 I was mostly concerned about leaking login+password (I find it a lot more critical).

One extra point - a login/password leak can happen anywhere. I'll give an example from Microsoft Azure - in many places we allow users to put "tags" (very generic, arbitrary strings that can be associated with some entities) which can be used to group or search things. And we know that some users could put sensitive information there (e.g. it could be their emails or even passwords), so we consider all of these privacy data and put a lot of effort into redaction, classification, isolation and access control.

Another typical issue is that people might put something wrong by mistake when they instrument. In a large system we do see developers making mistakes (e.g. putting a user email address in an attribute named "ResourceType"). I feel schema cannot solve these problems; a centralized scanning/redaction system can provide a consistent, reliable guarantee (which comes with a perf cost for sure).

pellared · 2023-06-23T20:44:26Z

> I feel schema cannot solve these problems

For sure it cannot solve them. It is more about Defense in Depth: making the hardening more straightforward and adding protection on more layers.

utezduyar · 2023-06-24T05:12:08Z

I really like the idea of an env variable or something similar to make a best effort at anonymizing PII. I think it is crucial to align on a best effort at not sending this data outside of client devices.

pellared · 2023-06-26T07:10:22Z

We could also consider adding some configuration/feature to the SDK that would allow attribute retraction (e.g. via an exporter decorator). This could help users for whom retraction using the OTel Collector's attributesprocessor is problematic (like here). But without additional hints the user may still have trouble finding telemetry that has a "not low" probability of containing PII or other sensitive data.
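A rough sketch of the exporter decorator idea. It deliberately does not use the real OpenTelemetry SDK types: spans are modeled as plain dicts, and `RedactingExporter` and `ListExporter` are hypothetical names invented for illustration.

```python
class RedactingExporter:
    """Wraps any object with an export(spans) method and drops deny-listed attributes first."""

    def __init__(self, inner, denied_keys):
        self._inner = inner
        self._denied = frozenset(denied_keys)

    def export(self, spans):
        # Build cleaned copies so the caller's span data is left untouched.
        cleaned = [
            {
                **span,
                "attributes": {
                    k: v for k, v in span["attributes"].items() if k not in self._denied
                },
            }
            for span in spans
        ]
        return self._inner.export(cleaned)


class ListExporter:
    """Stand-in exporter that just remembers what it was given."""

    def __init__(self):
        self.exported = []

    def export(self, spans):
        self.exported.extend(spans)
```

The deny-list still has to come from somewhere, which is the point made above: without hints in the semantic conventions, the user has to guess which keys belong in `denied_keys`.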

pellared · 2023-07-27T16:04:31Z

I had a conversation with @trask, and here is the summary:

Our current focus lies on the HTTP semantic convention as it's planned to be stable and can serve as an example for future semantic conventions.

During our discussion, we identified the following attributes as problematic:

url.full
url.path
url.query

Regarding url.path and url.query, we both agreed that these attributes could be changed from Required to Recommended. We propose adding a note like: The value SHOULD be captured by default. The value may store sensitive data. GDPR and data protection laws may apply, ensure you do your own due diligence. Instrumentations SHOULD offer a way to not capture this attribute.

However, for url.full, we haven't reached a satisfactory solution yet. We discussed two possible options:

A) Apply the same approach as for url.path and url.query. Nevertheless, this would currently cause the telemetry to miss information, such as the URL scheme.

B) Implement an opt-in functionality to scrub/retract the path and query parts from the URL. However, this means that most of the remaining data is redundant (except for the otherwise missing URL scheme).

Personally, I lean towards option A since I don't consider the URL scheme to be critical telemetry. Moreover, we can always add url.scheme to HTTP Client later, for example, as Recommended or even Required. Lastly, adding additional scrub/retraction functionality would be more complex and bug-prone than simply not collecting an attribute.
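For comparison, option B could look roughly like the sketch below. It uses only the Python standard library; the function name `scrub_url` is hypothetical, and also dropping the userinfo and fragment is an assumption that goes slightly beyond the "path and query" wording above.

```python
from urllib.parse import urlsplit, urlunsplit


def scrub_url(url):
    """Keep scheme, host and port; drop userinfo, path, query and fragment."""
    parts = urlsplit(url)
    host = parts.hostname or ""  # note: hostname is returned lowercased
    if ":" in host:
        # IPv6 literals need their brackets back when rebuilding the netloc.
        host = f"[{host}]"
    netloc = host if parts.port is None else f"{host}:{parts.port}"
    return urlunsplit((parts.scheme, netloc, "", "", ""))


print(scrub_url("https://user:secret@example.com:8443/users/42?email=a%40b.c"))
```

Even this small helper has edge cases (IPv6 hosts, invalid ports raise `ValueError`), which illustrates why scrubbing is more bug-prone than simply not collecting the attribute.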

On the other hand, neither of us is sure that such attribute retraction should happen at the instrumentation level. We both think that maybe the SDK should offer a more general way to suppress specific attributes. Maybe SpanProcessors (rather impossible) for spans and Views (they only offer an "allow-list", no "deny-list") for metrics are the way to go? EDIT: I got feedback from @MrAlias that filtering at a higher level would add resource consumption overhead that may not be acceptable. Not collecting unwanted attributes at the instrumentation level would be more efficient.

jsuereth · 2023-07-31T17:15:36Z

I'd like to have a broader OTEL-wide discussion on handling GDPR concerns. Specifically:

- I'd like to separate the concern of semantic convention instrumentation author recommendations from that of providing users a solution for o11y in the context of GDPR. This means that I think from a semconv perspective we should focus on outlining where sensitive data can be written, using something like a sensitivity annotation in the YAML.
- I think we should have SDK/collector features that allow users to customize their GDPR story w/ o11y. There is no one-size-fits-all solution. Just removing the data is the easiest but not necessarily the best. A compliant o11y datastore for GDPR could also work, depending on how/where the data is sent and stored. As such, we should give users flexibility.

I'm going to raise the discussion in the next TC meeting to see if we can get alignment on a direction here and what thoughts exist from others. I suggest this is worth some good brainstorming.

pellared · 2023-08-09T12:04:42Z

Related issue open-telemetry/opentelemetry-go-contrib#3895

pellared · 2023-09-05T07:11:09Z

@jsuereth Any update?

pellared · 2023-09-25T11:42:13Z

I just spotted

> In some semantic conventions, the data collected as a span attribute could
> contain PII (Personally Identifiable Information). As a general guideline, do
> not collect this data by default.

in open-telemetry/opentelemetry.io#3309

trask · 2023-10-10T14:39:46Z

related, we have had PII data reported a couple of times now in exception.message: open-telemetry/opentelemetry-java-instrumentation#3039

jsuereth · 2023-10-23T16:41:51Z

I think, for o11y, it's impossible to NOT collect PII. At this point our approach should be the following:

- Let's make sure we can appropriately FLAG things that could be PII with annotations.
- OTel users need CHOICE for whether they want PII generated at the instrumentation level to leave that machine. This means SDK features or Collector features to ensure redaction of PII where necessary.
- We expect users of OpenTelemetry to understand that it is better to provide backend systems compliant with GDPR rather than give up all aspects of o11y that are needed for debugging (e.g. URL). In lieu of this, the CHOICE to redact should be sufficient. Both situations will be provided for via annotations.

The guidance to not generate PII by default is temporary, pending better enable/disable features in the SDK for full user control.

pellared · 2023-10-30T15:46:54Z

SIG meeting notes.

It appears that the general agreement on how to handle the problem is as follows:

1. Include a marker for the attributes that might contain personally identifiable information (PII) or sensitive data. This way we can clearly document which attributes may need special care.
2. Establish the _RETRACTED value, which can be employed by instrumentation libraries to remove attribute values. This way there is still a value for the Required attributes, and moreover it is transparent which attributes are retracted.
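The two points could fit together roughly as sketched below. Only the `_RETRACTED` value comes from the notes above; the registry `MAY_CONTAIN_SENSITIVE_DATA`, its contents, and the `retract` helper are hypothetical and only stand in for whatever marker the conventions end up using.

```python
RETRACTED = "_RETRACTED"

# Hypothetical registry: attribute name -> flagged as possibly sensitive.
MAY_CONTAIN_SENSITIVE_DATA = {
    "url.full": True,
    "url.path": True,
    "url.query": True,
    "http.request.method": False,
}


def retract(attributes, keys_to_retract):
    """Replace the values of flagged attributes the user asked to retract.

    Keys are kept, so Required attributes are still present and it stays
    visible which values were retracted.
    """
    out = {}
    for key, value in attributes.items():
        flagged = MAY_CONTAIN_SENSITIVE_DATA.get(key, False)
        out[key] = RETRACTED if flagged and key in keys_to_retract else value
    return out
```

In this sketch an attribute that is not flagged is never retracted, even if the user lists it; whether that restriction is desirable is a design choice the notes do not settle.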

pellared · 2023-10-30T21:09:09Z

After some thought, I find that

point 1 from Semantic conventions vs GDPR #128 (comment)
as well as the first bullet from Semantic conventions vs GDPR #128 (comment)
and the first bullet from Semantic conventions vs GDPR #128 (comment)

are good enough for solving the issue.

It is also in line with my original proposal: #128 (comment)

github-actions bot assigned reyang Jun 22, 2023

pellared changed the title ~~url.full and db.statement attributes vs GDPR~~ Semantic conventions vs GDPR Jun 23, 2023

pellared mentioned this issue Jun 23, 2023

Mark initial set of HTTP semantic conventions as frozen #105

Merged

trask mentioned this issue Oct 23, 2023

Capture identity attributes (enduser.*) for applications using Spring Security open-telemetry/opentelemetry-java-instrumentation#9400

Open

joaopgrassi mentioned this issue Nov 2, 2023

Document/warn about cardinality issue with attributes where input "comes from the user" #485

Open

trask added this to Database Client Semantic Conventions Feb 7, 2024

github-actions bot added the Stale label Feb 17, 2024

joaopgrassi removed the Stale label Feb 26, 2024

vishweshbankwar mentioned this issue Mar 29, 2024

Http client and server span default collection behavior for url.full and url.query attributes #860

Open

lmolkova mentioned this issue Apr 3, 2024

Guidelines for redacting sensitive information #877

Open

pellared mentioned this issue Apr 17, 2024

Instrumentation.AspNetCore and Instrumentation.Http 1.8.1 have illegal breaking changes in the produced telemetry open-telemetry/opentelemetry-dotnet#5541

Closed

trask moved this to Post Stability in Database Client Semantic Conventions Apr 24, 2024

trask mentioned this issue Apr 24, 2024

Evaluate requirement level of db.statement #754

Open

pellared mentioned this issue Oct 28, 2024

Sensitive Data Redaction open-telemetry/oteps#255

Draft
