From 4f6c3a5fd387cf93c2af67fca550c439eea4eb69 Mon Sep 17 00:00:00 2001
From: Joshua MacDonald
Date: Fri, 10 May 2024 16:23:45 -0700
Subject: [PATCH 1/9] Draft invalid UTF-8 error handling policy

---
 text/0000-utf8-handling.md | 205 +++++++++++++++++++++++++++++++++++++
 1 file changed, 205 insertions(+)
 create mode 100644 text/0000-utf8-handling.md

diff --git a/text/0000-utf8-handling.md b/text/0000-utf8-handling.md
new file mode 100644
index 000000000..bcfd7b36c
--- /dev/null
+++ b/text/0000-utf8-handling.md
@@ -0,0 +1,205 @@
+# Consistent position towards UTF-8 validation in OpenTelemetry

The OpenTelemetry project has not taken a strong position on how to handle
invalid UTF-8 in its SDKs and collector components.

## Motivation

When handling of invalid UTF-8 is left unspecified, anything can
happen. As a project, OpenTelemetry should give its component authors
guidance on how to treat invalid UTF-8 so that users can confidently
rely on OpenTelemetry software/systems.

## Explanation

OpenTelemetry has existing [error handling
guidelines](https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/error-handling.md)
which dictate the project's general posture towards error handling. Users
expect telemetry libraries to be harmless and safe to use, so that they
can confidently instrument their application without unnecessary risk.
Users also make this demand of the OpenTelemetry collector, which is
expected to safely transport telemetry data.

OpenTelemetry's OTLP protocol specifically dictates how [valid and
invalid UTF-8 strings should be mapped into attribute
values](https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/common/attribute-type-mapping.md#string-values),
but there is no guidance on how to treat many other `string` fields
found in the OTLP protocol. The Span `Name` field, for example,
SHOULD be a valid UTF-8 string according to the Protocol Buffer
`proto3`-syntax specification. What is expected from the SDK when a
presumptive UTF-8 field is actually invalid UTF-8? What is expected
from the Collector? Without a clear definition, users are likely to
experience harm through data loss.

This document proposes to update OpenTelemetry error handling
guidelines to require SDKs and Collectors to protect users from loss
of telemetry caused by string-valued fields that are not well-formed.
A number of options are available, all of which are better than doing
nothing about this problem.

## Internal details

Text handling is such a key aspect of programming languages that
practically, they all present a different approach. While most
languages provide guardrails that maintain a well-formed character
encoding in memory at runtime, there is usually a way for a programmer
handling bytes incorrectly to yield an invalid string encoding.

These are honest mistakes. The OpenTelemetry project should declare
invalid UTF-8 as a non-error condition, consistent with its error
handling principles.

> The API and SDK SHOULD provide safe defaults for missing or invalid arguments.

This document proposes that OpenTelemetry's default approach should be
the one taken by the Rust `String::from_utf8_lossy` method, which
converts invalid UTF-8 byte sequences to the Unicode replacement
character, `�` (U+FFFD). Taking this approach by default will avoid
data loss for users when the inevitable happens.

### Scenarios covered

In an example scenario, a user creates a span with an invalid name.
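As an illustration only (this sketch is not part of the OTEP; it assumes the
OpenTelemetry Go tracing API and omits SDK setup), one way such an invalid
name can arise is by converting arbitrary bytes to a string:

```go
package main

import (
	"context"

	"go.opentelemetry.io/otel"
)

func main() {
	tracer := otel.Tracer("example")

	// Raw bytes that are not valid UTF-8, accidentally used as a span name.
	raw := []byte{0x68, 0x69, 0xff, 0xfe}

	// Go performs no validation here: string(raw) copies the bytes as-is,
	// so the span name silently carries the invalid sequence.
	_, span := tracer.Start(context.Background(), string(raw))
	span.End()
}
```
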
Without further intervention, the trace SDK then exports this through
OTLP over gRPC. What happens next depends on which language, RPC
client library, and protocol buffer library are used. For this
document, we examined the official [Golang
gRPC](https://github.com/grpc/grpc-go) implementation and the [Rust
hyperium/tonic](https://github.com/hyperium/tonic) implementation.

For demonstration purposes, gRPC libraries usually provide a Hello
World client and server program. This study modifies the Hello World
client available in each language.

#### Golang client

With a Golang gRPC client modified to send an invalid string, e.g.,

```
    var buf bytes.Buffer
    buf.WriteString(*name)
    buf.Write([]byte{0xff, 0xff})
    // ...
    r, err := c.SayHello(ctx, &pb.HelloRequest{Name: buf.String()})
```

the client program exits:

```
2024/05/10 15:09:23 could not greet: rpc error: code = Internal desc = grpc: error while marshaling: string field contains invalid UTF-8
```

In this case, validation is performed on the client side. If a client
somehow bypasses UTF-8 validation, the server will respond with an
error:

```
2024/05/10 15:28:13 could not greet: rpc error: code = Internal desc = grpc: error unmarshalling request: string field contains invalid UTF-8
```

It appears possible to configure gRPC and/or the protocol buffer
library to bypass UTF-8 validation, but by default gRPC-Go will
validate on both sides.

#### Rust client

With a Rust gRPC client modified to send an invalid string, e.g.,

```
    let bytes = vec![255, 255];

    let request = tonic::Request::new(HelloRequest {
        name: unsafe { String::from_utf8_unchecked(bytes) },
    });
```

the client program exits:

```
Error: Status { code: Internal, message: "failed to decode Protobuf message: HelloRequest.name: invalid string value: data is not UTF-8 encoded", metadata: MetadataMap { headers: {"content-type": "application/grpc", "date": "Fri, 10 May 2024 16:10:46 GMT", "content-length": "0"} }, source: None }
```

In this case, validation is only performed on the server side.

#### OpenTelemetry Collector `pdata`

The OpenTelemetry Collector uses the Gogo protobuf package to unmarshal
the structs that underlie its `pdata` objects. When the "gogofaster"
compiler option is selected, its `Unmarshal()` logic does not check
for valid UTF-8 strings.

The potential for user harm is clear at this point. gRPC libraries
may intentionally bypass UTF-8 validation in some cases, and this may
go unnoticed until a small amount of invalid UTF-8 data leads to data
loss in a different part of the system where validation is strictly
performed. When this happens in a telemetry pipeline, it can lead to
loss of valid data that was inadvertently combined with invalid data,
possibly through batching.

### Responsibility to the user

A user adopts a telemetry SDK to gain observability into their
application, but invalid UTF-8 that causes loss of telemetry produces
quite the opposite result. When a user turns to a telemetry
system to observe their system and makes a UTF-8 mistake, the
consequent total loss of telemetry is a decidedly bad outcome. For the
user to be well-served in this situation, OpenTelemetry SDKs and
Collectors SHOULD correct invalid UTF-8 using the Unicode replacement
character, `�` (U+FFFD), in place of all invalid substrings.
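
A minimal sketch of this correction, using only Go's standard library (an
illustration of the replacement-character approach, not a required
implementation):

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

func main() {
	// A span name that accidentally contains an invalid UTF-8 sequence.
	name := "checkout\xff\xfe-handler"

	fmt.Println(utf8.ValidString(name)) // false

	// Replace each run of invalid bytes with U+FFFD instead of dropping data.
	fixed := strings.ToValidUTF8(name, "\uFFFD")

	fmt.Println(fixed)                   // checkout�-handler
	fmt.Println(utf8.ValidString(fixed)) // true
}
```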

We take the position that OpenTelemetry SDKs and Collectors SHOULD
explicitly make an effort to preserve telemetry data when
presumed-UTF-8 data is invalid, rather than allowing it to drop with
an error as a result. Accepting this proposal means this policy will
be added to the OpenTelemetry error handling guidelines.

#### Give users what they want

The OpenTelemetry API specification does not permit the use of
byte-slice valued attributes. This is one reason why invalid UTF-8
arises, because OpenTelemetry users have not been given an
alternative.

When faced with a desire to record invalid UTF-8 in a stream of
telemetry, users have no good options. They can use a base64
encoding, but they will have to do so manually, and with extra code,
and they are likely to learn this only after first making the mistake
of using a string value to encode bytes data. When users see
corrupted telemetry data containing � characters, they will return to
OpenTelemetry with a request: how should they report bytes data?

Ironically, in a prototype Collector processor to repair invalid
UTF-8, we have used the Zap Logger's `zap.Binary([]byte)` field
constructor to report the invalid data, and this is something an
OpenTelemetry API user cannot do.

### Alternatives considered

There are scenarios, such as in the Rust SDK, where there is a strong
expectation of safety and correctness. In these settings, where the
use of `unsafe` is possible but rare and invalid UTF-8 is unlikely, it
can make sense to bypass UTF-8 validation in the client to save a few
CPU cycles.

We could imagine a more permissive OpenTelemetry policy, in which SDKs
and Collectors are advised to treat all String-valued fields in the
OTLP protocol buffer as "presumptive" UTF-8, meaning data that SHOULD
be encoded as UTF-8 but that has not been checked for validity. Under
this policy, SDKs and Collectors would propagate invalid data and
leave it to the consumer to decide how to handle it.

## Trade-offs and mitigations

TODO

## Prior art and alternatives

TODO

## Open questions

TODO

## Future possibilities

TODO

From 6df233db634c4bfd9e12dc0b6ffa64e048c20083 Mon Sep 17 00:00:00 2001
From: Joshua MacDonald
Date: Mon, 13 May 2024 14:58:37 -0700
Subject: [PATCH 2/9] update, lint, pr num

---
 text/0000-utf8-handling.md | 205 ----------------------------------
 text/0257-utf8-handling.md | 223 +++++++++++++++++++++++++++++++++++++
 2 files changed, 223 insertions(+), 205 deletions(-)
 delete mode 100644 text/0000-utf8-handling.md
 create mode 100644 text/0257-utf8-handling.md

diff --git a/text/0000-utf8-handling.md b/text/0000-utf8-handling.md
deleted file mode 100644
index bcfd7b36c..000000000
--- a/text/0000-utf8-handling.md
+++ /dev/null
@@ -1,205 +0,0 @@
-# Consistent position towards UTF-8 validation in OpenTelemetry
-
-The OpenTelemetry project has not taken a strong position on how to handle
-invalid UTF-8 in its SDKs and collector components.
-
-## Motivation
-
-When handling of invalid UTF-8 is left unspecified, anything can
-happen. As a project, OpenTelemetry should give its component authors
-guidance on how to treat invalid UTF-8 so that users can confidently
-rely on OpenTelemetry software/systems.
-
-## Explanation
-
-OpenTelemetry has existing [error handling
-guidelines](https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/error-handling.md)
-which dictate the project's general posture towards handling.
Users -expect telemety libraries to be harmless and safe to use, so that they -can confidently instrument their application without unnecessary risk. -Users also make this demand of the OpenTelemetry collector, which is -expected to safely transport telemetry data. - -OpenTelemetry's OTLP protocol, specifically dictates how [valid and -invalid UTF-8 strings should be mapped into attribute -values](https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/common/attribute-type-mapping.md#string-values), -but there is no guidance on how to treat many other `string` fields -found in the OTLP protocol. The Span `Name` field, for example, -SHOULD be a valid UTF-8 string according to the Protocol Buffer -`proto3`-syntax specification. What is expected from the SDK when a -presumptive UTF-8 field is actually invalid UTF-8? What is expected -from the Collector? Without a clear definition, users are likely to -be experience harm through data loss. - -This document proposes to update OpenTelemetry error handling -guidelines to require SDKs and Collectors to protect users from loss -of telemetry caused by string-valued fields that are not well-formed. -A number of options are available, all of which are better than doing -nothing about this problem. - -## Internal details - -Text handling is such a key aspect of programming languages that -practically, they all present different a approach. While most -languages provide guardrails that maintain a well-formed character -encoding in memory at runtime, there is usually a way for a programmer -handling bytes incorrectly to yield an invalid string encoding. - -These are honest mistakes. The OpenTelemetry project should declare -invalid UTF-8 as a non-error condition, consistent with its error -handling principles. - -> The API and SDK SHOULD provide safe defaults for missing or invalid arguments. - -This document proposes that OpenTelemetry's default approach should be -the one taken by the Rust `String::from_utf8_lossy` method, which -converts invalid UTF-8 byte sequences to the Unicode replacement -character, `�` (U+FFFD). Taking this approach by default will avoid -data loss for users when the inevitable happens. - -### Scenarios covered - -In an example scenario, a user creates a span with an invalid name. -Without further intervention, the trace SDK then exports this through -OTLP over gRPC. What happens next depends on which language, RPC -client library, and protocol buffer library is used. For this -document, we examined the official [Golang -gRPC](https://github.com/grpc/grpc-go) implementation and the [Rust -hyperium/tonic](https://github.com/hyperium/tonic) implementation. - -For demonstration purposes, gRPC libraries usually provide a Hello -World client and server program. This study modifies the Hello World -client available in each language. - -#### Golang client - -With a Golang gRPC client modified to send an invalid string, e.g., - -``` - var buf bytes.Buffer - buf.WriteString(*name) - buf.Write([]byte{0xff, 0xff}) - // ... - r, err := c.SayHello(ctx, &pb.HelloRequest{Name: buf.String()}) - ``` - -the client program exits: - -``` -2024/05/10 15:09:23 could not greet: rpc error: code = Internal desc = grpc: error while marshaling: string field contains invalid UTF-8 -``` - -In this case, validation is performed on the client side. If a client -somehow bypasses UTF-8 validation, the server will respond with an -error. 
- -``` -2024/05/10 15:28:13 could not greet: rpc error: code = Internal desc = grpc: error unmarshalling request: string field contains invalid UTF-8 -``` - -It appears possible to configure gRPC and/or the protocol buffer -library to bypass UTF-8 validation, but by default gRPC-Go will -validate on both sides. - -#### Rust client - -With a Rust gRPC client modified to send an invalid string, e.g., - -``` - let bytes = vec![255, 255]; - - let request = tonic::Request::new(HelloRequest { - name: unsafe { String::from_utf8_unchecked(bytes) }, - }); -``` - -the client program exits: - -``` -Error: Status { code: Internal, message: "failed to decode Protobuf message: HelloRequest.name: invalid string value: data is not UTF-8 encoded", metadata: MetadataMap { headers: {"content-type": "application/grpc", "date": "Fri, 10 May 2024 16:10:46 GMT", "content-length": "0"} }, source: None } -``` - -In this case, validation is only performed on the server side. - -#### OpenTelemetry Collector `pdata` - -The OpenTelemetry Collector uses Gogo protobuf package to unmarshal -the structs that underly its `pdata` objects. With the "gogofaster" -compiler option is selected, its `Unmarshal()` logic does not check -for valid UTF-8 strings. - -The potential for user harm is clear at this point. gRPC libraries -may intentionally bypass UTF-8 validation in some cases, and this may -go unnoticed until a small amount of invalid UTF-8 data leads to data -loss in a different part of the system where validation is strictly -performed. When this happens in a telemetry pipeline, it can lead -loss of valid data that was inadvertently combined with invalid data, -possibly through batching. - -### Responsibility to the user - -When a user uses a telemetry SDK to gain observability into their -application, it can lead to quite the opposite result when invalid -UTF-8 causes loss of telemetry. When a user turns to a telemetry -system to obseve their system and makes a UTF-8 mistake, the -consequent total loss of telemetry is a definite bad outcome. For the -user to be well-served in this situation, OpenTelemetry SDKs and -Collectors SHOULD correct invalid UTF-8 using the Unicode replacement -character, `�` (U+FFFD) as a replacement for all invalid substrings. - -We take the position that OpenTelemety SDKs and Collectors SHOULD -explicitly make an effort to preserve telemetry data when -presumed-UTF-8 data is invalid, rather than allowing it to drop with -an error as a result. Accepting this proposal means this policy will -be added to the OpenTelemetry error handling guidelines. - -#### Give users what they want - -The OpenTelemetry API specification does not permit the use of -byte-slice valued attributes. This is one reason why invalid UTF-8 -arises, because OpenTelemetry users have not been given an -alternative. - -When faced with a desire to use record invalid UTF-8 in a stream of -telemetry, users have no good options. They can use a base64 -encoding, but they will have to do so manually, and with extra code, -and they are likely to learn this only after first making the mistake -of using a string-value to encode bytes data. When users see -corrupted telemetry data containing � characters, they will return to -OpenTelemetry and with a request, how should they report bytes data. - -Ironically, in a prototype Collector processor to repair invalid -UTF-8, we have used the Zap Logger's `zap.Binary([]byte)` field -constructor to report the invalid data, and this is something an -OpenTelemetry API user cannot do. 
-
-### Alternatives considered
-
-There are scenarios, such as in the Rust SDK, where there is a strong
-expectation of safety and correctness. In these settings, where the
-use of `unsafe` is possible but rare and invalid UTF-8 is unlikely, it
-can make sense to bypass UTF-8 validation in the client to save a few
-CPU cycles.
-
-We could imagine a more permissive OpenTelemetry policy, in which SDKs
-and Collectors are advised to treat all String-valued fields in the
-OTLP protocol buffer as "presumptive" UTF-8, meaning data that SHOULD
-be encoded as UTF-8 but that has not been checked for validity. Under
-this policy, SDKs and Collectors would propagate invalid data and
-leave it to the consumer to decide how to handle it.
-
-## Trade-offs and mitigations
-
-TODO
-
-## Prior art and alternatives
-
-TODO
-
-## Open questions
-
-TODO
-
-## Future possibilities
-
-TODO

diff --git a/text/0257-utf8-handling.md b/text/0257-utf8-handling.md
new file mode 100644
index 000000000..4f1ba2f98
--- /dev/null
+++ b/text/0257-utf8-handling.md
@@ -0,0 +1,223 @@
+# Consistent position towards UTF-8 validation in OpenTelemetry
+
+The OpenTelemetry project has not taken a strong position on how to handle
+invalid UTF-8 in its SDKs and Collector components.
+
+## Motivation
+
+When handling of invalid UTF-8 is left unspecified, anything can
+happen. As a project, OpenTelemetry should give its component authors
+guidance on how to treat invalid UTF-8 so that users can confidently
+rely on OpenTelemetry software/systems.
+
+## Explanation
+
+OpenTelemetry has existing [error handling
+guidelines](https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/error-handling.md)
+which dictate the project's general posture towards handling. Users
+expect telemety libraries to be harmless and safe to use, so that they
+can confidently instrument their application without unnecessary risk.
+
+> The API and SDK SHOULD provide safe defaults for missing or invalid arguments.
+
+Generally, users should not have to check for errors from potentially
+fallible instrumentation APIs, because doing so burdens the user;
+instead, OpenTelemetry SDKs should resort to corrective action.
+Quoting the error handling guidelines,
+
+> For instance, a name like empty may be used if the user passes in
+> null as the span name argument during Span construction.
+
+Users also make this demand of the OpenTelemetry collector, which is
+expected to safely transport telemetry data.
+
+OpenTelemetry's OTLP protocol specifically dictates how [valid and
+invalid UTF-8 strings should be mapped into attribute
+values](https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/common/attribute-type-mapping.md#string-values),
+but there is no guidance on how to treat many other `string` fields
+found in the OTLP protocol. The Span `Name` field, for example,
+SHOULD be a valid UTF-8 string according to the Protocol Buffer
+`proto3`-syntax specification. What is expected from the SDK when a
+presumptive UTF-8 field is actually invalid UTF-8? What is expected
+from the Collector? Without a clear definition, users are likely to
+experience harmful loss of data.
+
+This document proposes to update OpenTelemetry error handling
+guidelines to require SDKs and Collectors to protect users from loss
+of telemetry caused by string-valued fields that are not well-formed.
+A number of options are available, all of which are better than doing
+nothing about this problem.
+
+This document also proposes to extend the OpenTelemetry API to support
+byte-slice valued attributes. Otherwise, users have no good
+alterntives to handling this kind of data, which is often how invalid
+UTF-8 arises.
+
+## Internal details
+
+Text handling is such a key aspect of programming languages that
+practically, they all present a different approach. While most
+languages provide guardrails that maintain a well-formed character
+encoding in memory at runtime, there is usually a way for a programmer
+handling bytes incorrectly to yield an invalid string encoding.
+
+These are honest mistakes. If the telemetry will pass through an OTLP
+representation and UTF-8 validation is left unchecked, there is an
+opportunity for an entire batch of data to be rejected by the protocol
+buffer subsystem used in both gRPC and HTTP code paths.
+
+This is especially harmful to users, as the observability tools they
+have in place suddenly can't help them. Users can't log, span, or
+metric about the invalid UTF-8 data, and terminals may also have
+trouble displaying the data. For users to learn the source of their
+problem, some component has to correct the invalid data before data
+loss happens.
+
+### Possible resolutions
+
+Here are three options:
+
+1. Automatically correct invalid UTF-8.
+2. Drop partial batches of data containing invalid UTF-8.
+3. Permissively allow invalid UTF-8 to propagate (further) into the pipeline.
+
+In this proposal, we reject the third option. See alternatives considered, below.
+
+#### Automatic correction
+
+This document proposes that OpenTelemetry's default approach should be
+the one taken by the Rust `String::from_utf8_lossy` method, which
+converts invalid UTF-8 byte sequences to the Unicode replacement
+character, `�` (U+FFFD). This is the most preferable outcome as it is
+simple and preserves what valid content can be recovered from the
+data.
+
+#### Dropping data
+
+Among the options available, we view dropping whole batches of
+telemetry as an unacceptable outcome. OpenTelemetry SDKs and
+Collectors SHOULD prefer to drop individual items of telemetry as
+opposed to dropping whole batches of telemetry.
+
+When a Collector drops individual items of telemetry as a result of
+invalid UTF-8, it SHOULD return a `rejected_spans`,
+`rejected_data_points`, or `rejected_log_records` count along with a
+message indicating data loss.
+
+### Survey of existing systems
+
+The gRPC-Go implementation, with its default protocol buffer library,
+checks for valid UTF-8 in both the client (`rpc error: code = Internal desc =
+grpc: error while marshaling: string field contains invalid UTF-8`)
+and the server (`rpc error: code = Internal desc = grpc: error
+unmarshalling request: string field contains invalid UTF-8`).
+
+The Rust Hyperium gRPC implementation checks for valid UTF-8 on the
+server but not on the client (`Error: Status { code: Internal,
+message: "failed to decode Protobuf message: HelloRequest.name:
+invalid string value: data is not UTF-8 encoded"...`).
+
+The OpenTelemetry Collector OTLP exporter and receiver do not validate
+UTF-8, as a result of choosing the "Gogofaster" protocol buffer
+library option.
+
+Given these out-of-the-box configurations, it would be possible for a
+Rust SDK to send invalid UTF-8 to an OpenTelemetry Collector's OTLP
+receiver. That collector could forward to another OpenTelemetry
+Collector over OTLP successfully, but a non-OTLP exporter using a
+diferent protocol buffer implementation would likely be the first to
+observe the invalid UTF-8, and by that time, a large batch of
+telemetry could fail as a result.
+
+### Responsibility to the API user
+
+When a user uses a telemetry SDK to gain observability into their
+application, it can lead to quite the opposite result when invalid
+UTF-8 causes loss of telemetry. When a user turns to a telemetry
+system to obseve their system and makes a UTF-8 mistake, the
+consequent total loss of telemetry is a definite bad outcome. While
+the above discussion addresses attempts to correct a problem,
+OpenTelemetry has not given the API user what they want in this
+situation.
+
+#### Byte-slice valued attribute API
+
+The OpenTelemetry API specification does not permit the use of
+byte-slice valued attributes. This is one reason why invalid UTF-8
+arises, because OpenTelemetry users have not been given an
+alternative.
+
+When faced with a desire to record invalid UTF-8 in a stream of
+telemetry, users have no good options. They can use a base64
+encoding, but they will have to do so manually, and with extra code,
+and they are likely to learn this only after first making the mistake
+of using a string-value to encode bytes data. When users see
+corrupted telemetry data containing � characters, they will return to
+OpenTelemetry with a request: how should they report bytes data?
+
+Ironically, in a prototype Collector processor to repair invalid
+UTF-8, we have used the Zap Logger's `zap.Binary([]byte)` field
+constructor to report the invalid data, and this is something an
+OpenTelemetry API user cannot do.
+
+This OTEP proposes to add an attribute-value type for byte slices, to
+represent an array of bytes with unknown encoding. Examples data
+values where this attribute could be useful:
+
+- checksum values (e.g., a sha256 checksum represented as 32 bytes)
+- hash values (e.g., a 56-bit hash value represented as 7 bytes)
+- register contents (e.g., a memory region ... 64 bytes)
+- invalid UTF-8 (e.g., data lost because of UTF-8 validation).
+
+### Responsibility to the Collector user
+
+Users need to trust that the OpenTelemetry Collector will not cause
+unintended data loss, and we observe this can easily result from a
+mismatch of permissive receiver and strict exporter. However, we are
+aware that a strict receiver will cause whole batches of telemetry to
+be lost, therefore it seems better for Collector pipelines to
+explicitly handle UTF-8 validation, rather than leave it to the
+protocol buffer library.
+
+OpenTelemetry Collector SHOULD support automatic UTF-8 validation to
+protect users. There are several places this could be done:
+
+1. Following a receiver, with data coming from an external source,
+2. Following a mutating processor, with data modified by potentially
+   untrustworthy code,
+3. Prior to an exporter, with data coming from either an external
+   source and/or modified by potentially untrustworthy code.
+
+Depending on user preferences, any of these outcomes might be best.
+Every mutating processor could force the pipeline to re-check for
+validity, in the strictest of configurations.
+
+#### New collector component Capabilities for UTF-8 strictness
+
+Each collector component that strongly requires valid UTF-8 will
+declare so in an Capabilities field, and this will cause the Collector
+to re-validate UTF-8 before invoking that component. By default,
+exporters will require valid UTF-8.
+
+Provided that no processors are strict about UTF-8 validation, UTF-8
+validation will happen only once per pipeline segment. This means
+UTF-8 validation can be deferred until after sampling and filtering
+operations have finsihed. Each pipeline segment will have at least
+one UTF-8 validation step, which will either automatically correct the
+problem (default) or reject the individual item of data (optional).
+
+### Alternatives considered
+
+We observe that some protocol buffer implementations are permissive
+with respect to invalid UTF-8, while others are strict. It can be
+risky to combine dissimilar protocol buffer implementations in a
+pipeline, since a receiver might accept data that an exporter cannot
+send, simply resulting from invalid UTF-8.
+
+Considering whether components could support permissive data
+transport, it appears to be risky and for little reward. If
+OpenTelemetry decided to allow this, it would not be a sensible
+default. On the other hand, existing OpenTelemetry Collectors have
+this behavior, and changing it will require time. In the period of
+time while this situation persists, we effectively have permissive
+data transport (and the associated risks).

From c30ff3d32901d81d70064dfc08629f84f8891677 Mon Sep 17 00:00:00 2001
From: Joshua MacDonald
Date: Wed, 22 May 2024 09:13:10 -0700
Subject: [PATCH 3/9] Apply suggestions from code review
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Co-authored-by: Gerhard Stöbich
Co-authored-by: Joao Grassi <5938087+joaopgrassi@users.noreply.github.com>
---
 text/0257-utf8-handling.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/text/0257-utf8-handling.md b/text/0257-utf8-handling.md
index 4f1ba2f98..7efb46070 100644
--- a/text/0257-utf8-handling.md
+++ b/text/0257-utf8-handling.md
@@ -15,7 +15,7 @@ rely on OpenTelemetry software/systems.
 OpenTelemetry has existing [error handling
 guidelines](https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/error-handling.md)
 which dictate the project's general posture towards handling. Users
-expect telemety libraries to be harmless and safe to use, so that they
+expect telemetry libraries to be harmless and safe to use, so that they
 can confidently instrument their application without unnecessary risk.
 
 > The API and SDK SHOULD provide safe defaults for missing or invalid arguments.
@@ -50,7 +50,7 @@ nothing about this problem.
 
 This document also proposes to extend the OpenTelemetry API to support
 byte-slice valued attributes. Otherwise, users have no good
-alterntives to handling this kind of data, which is often how invalid
+alternatives to handling this kind of data, which is often how invalid
 UTF-8 arises.
 
 ## Internal details

From 47b656f05b9ade7fe2db1eb13ac8c4eb9b4c28bc Mon Sep 17 00:00:00 2001
From: Joshua MacDonald
Date: Thu, 10 Oct 2024 14:46:54 -0700
Subject: [PATCH 4/9] Edits.

---
 text/0257-utf8-handling.md | 93 ++++++++++++++------------------------
 1 file changed, 34 insertions(+), 59 deletions(-)

diff --git a/text/0257-utf8-handling.md b/text/0257-utf8-handling.md
index 4f1ba2f98..c71a6755d 100644
--- a/text/0257-utf8-handling.md
+++ b/text/0257-utf8-handling.md
@@ -48,11 +48,6 @@ of telemetry caused by string-valued fields that are not well-formed.
 A number of options are available, all of which are better than doing
 nothing about this problem.
-This document also proposes to extend the OpenTelemetry API to support -byte-slice valued attributes. Otherwise, users have no good -alterntives to handling this kind of data, which is often how invalid -UTF-8 arises. - ## Internal details Text handling is such a key aspect of programming languages that @@ -129,45 +124,17 @@ diferent protocol buffer implementation would likely be the first to observe the invalid UTF-8, and by that time, a large batch of telemetry could fail as a result. -### Responsibility to the API user - -When a user uses a telemetry SDK to gain observability into their -application, it can lead to quite the opposite result when invalid -UTF-8 causes loss of telemetry. When a user turns to a telemetry -system to obseve their system and makes a UTF-8 mistake, the -consequent total loss of telemetry is a definite bad outcome. While -the above discussion addresses attempts to correct a problem, -OpenTelemetry has not given the API user what they want in this -situation. - -#### Byte-slice valued attribute API - -The OpenTelemetry API specification does not permit the use of -byte-slice valued attributes. This is one reason why invalid UTF-8 -arises, because OpenTelemetry users have not been given an -alternative. - -When faced with a desire to record invalid UTF-8 in a stream of -telemetry, users have no good options. They can use a base64 -encoding, but they will have to do so manually, and with extra code, -and they are likely to learn this only after first making the mistake -of using a string-value to encode bytes data. When users see -corrupted telemetry data containing � characters, they will return to -OpenTelemetry with a request: how should they report bytes data? - -Ironically, in a prototype Collector processor to repair invalid -UTF-8, we have used the Zap Logger's `zap.Binary([]byte)` field -constructor to report the invalid data, and this is something an -OpenTelemetry API user cannot do. - -This OTEP proposes to add an attribute-value type for byte slices, to -represent an array of bytes with unknown encoding. Examples data -values where this attribute could be useful: - -- checksum values (e.g., a sha256 checksum represented as 32 bytes) -- hash values (e.g., a 56-bit hash value represented as 7 bytes) -- register contents (e.g., a memory region ... 64 bytes) -- invalid UTF-8 (e.g., data lost because of UTF-8 validation). +### No byte-slice valued attribute API + +As a caveat, the OpenTelemetry project has previously debated and +rejected the potential to support a byte-slice typed attribute. This +potential feature was rejected because it allows API users a way to +record a possibly uninterpretable sequence of bytes. Users with +invalid UTF-8 are left with a few options, for example: + +- Base64-encode the invalid data wrapped in human-readable syntax + (e.g., `base64:WNhbHF1ZWVuY29hY2hod`). +- Transmute the byte slice to an integer slice, which is supported. ### Responsibility to the Collector user @@ -188,26 +155,34 @@ protect users. There are several places this could be done: 3. Prior to an exporter, with data coming from either an external source and/or modified by potentially untrustworhty code. -Depending on user preferences, any of these outcomes might be best. -Every mutating processor could force the pipeline to re-check for -validity, in the strictest of configurations. 
-
-#### New collector component Capabilities for UTF-8 strictness
-
-Each collector component that strongly requires valid UTF-8 will
-declare so in an Capabilities field, and this will cause the Collector
-to re-validate UTF-8 before invoking that component. By default,
-exporters will require valid UTF-8.
-
-Provided that no processors are strict about UTF-8 validation, UTF-8
-validation will happen only once per pipeline segment. This means
-UTF-8 validation can be deferred until after sampling and filtering
-operations have finsihed. Each pipeline segment will have at least
-one UTF-8 validation step, which will either automatically correct the
-problem (default) or reject the individual item of data (optional).
+Each of these approaches will take significant effort and cost the
+user at runtime, therefore:
+
+- UTF-8 validation SHOULD be enabled by default
+- Users SHOULD be able to opt out of UTF-8 validation
+- A `receiverhelper` library SHOULD offer a function to correct
+  invalid UTF-8 in-place, with two configurable outcomes (1) reject
+  individual items containing invalid UTF-8, (2) repair invalid UTF-8.
+
+When an OpenTelemetry collector receives telemetry data in any
+protocol, in cases where the underlying RPC or protocol buffer
+libraries do not guarantee valid UTF-8, the `receiverhelper` library
+SHOULD be called to optionally validate UTF-8 according to the
+configuration described above.
 
 ### Alternatives considered
 
+#### Exhaustive validation
+
+The Collector behavior proposed above only validates data after it is
+received, not after it is modified by a processor. We consider the
+risk of a malfunctioning processor to be acceptable. If this happens,
+it will be considered a bug in the Collector component. In other
+words, the Collector SHOULD NOT perform internal data validation and
+it SHOULD perform external data validation.
+
+#### Permissive trasnport
+
 We observe that some protocol buffer implementations are permissive
 with respect to invalid UTF-8, while others are strict. It can be
 risky to combine dissimilar protocol buffer implementations in a
 pipeline, since a receiver might accept data that an exporter cannot
 send, simply resulting from invalid UTF-8.

From f23f8b07d0786ae518f6a9ec99e2aded72656021 Mon Sep 17 00:00:00 2001
From: Joshua MacDonald
Date: Thu, 10 Oct 2024 14:52:38 -0700
Subject: [PATCH 5/9] lint

---
 text/0257-utf8-handling.md | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/text/0257-utf8-handling.md b/text/0257-utf8-handling.md
index 9846e32a3..9e9773751 100644
--- a/text/0257-utf8-handling.md
+++ b/text/0257-utf8-handling.md
@@ -120,7 +120,7 @@ Given these out-of-the-box configurations, it would be possible for a
 Rust SDK to send invalid UTF-8 to an OpenTelemetry Collector's OTLP
 receiver. That collector could forward to another OpenTelemetry
 Collector over OTLP successfully, but a non-OTLP exporter using a
-diferent protocol buffer implementation would likely be the first to
+different protocol buffer implementation would likely be the first to
 observe the invalid UTF-8, and by that time, a large batch of
 telemetry could fail as a result.
@@ -132,7 +132,7 @@ potential feature was rejected because it allows API users a way to
 record a possibly uninterpretable sequence of bytes. Users with
 invalid UTF-8 are left with a few options, for example:
 
-- Base64-encode the invalid data wrapped in human-readable syntax 
+- Base64-encode the invalid data wrapped in human-readable syntax
  (e.g., `base64:WNhbHF1ZWVuY29hY2hod`).
 - Transmute the byte slice to an integer slice, which is supported.
@@ -146,9 +146,9 @@ mismatch of permissive receiver and strict exporter. However, we are
 aware that a strict receiver will cause whole batches of telemetry to
 be lost, therefore it seems better for Collector pipelines to
 explicitly handle UTF-8 validation, rather than leave it to the
 protocol buffer library.
 
 OpenTelemetry Collector SHOULD support automatic UTF-8 validation to
 protect users. There are several places this could be done:
 
-1. Following a receiver, with data coming from an external source, 
+1. Following a receiver, with data coming from an external source,
 2. Following a mutating processor, with data modified by potentially
-   untrustworthy code, 
+   untrustworthy code,
 3. Prior to an exporter, with data coming from either an external
    source and/or modified by potentially untrustworthy code.
@@ -161,7 +161,7 @@ user at runtime, therefore:
 
 - UTF-8 validation SHOULD be enabled by default
 - Users SHOULD be able to opt out of UTF-8 validation
 - A `receiverhelper` library SHOULD offer a function to correct
-  invalid UTF-8 in-place, with two configurable outcomes (1) reject 
+  invalid UTF-8 in-place, with two configurable outcomes (1) reject
   individual items containing invalid UTF-8, (2) repair invalid UTF-8.
@@ -181,7 +181,7 @@ it will be considered a bug in the Collector component. In other
 words, the Collector SHOULD NOT perform internal data validation and
 it SHOULD perform external data validation.
 
-#### Permissive trasnport
+#### Permissive transport

From d79c75f780ccf62538d209a5bfdda93de7945ccf Mon Sep 17 00:00:00 2001
From: Joshua MacDonald
Date: Thu, 10 Oct 2024 16:25:53 -0700
Subject: [PATCH 6/9] apply feedback; be specific

---
 text/0257-utf8-handling.md | 60 +++++++++++++++++++++++++++++++-------
 1 file changed, 50 insertions(+), 10 deletions(-)

diff --git a/text/0257-utf8-handling.md b/text/0257-utf8-handling.md
index 9e9773751..7457163a3 100644
--- a/text/0257-utf8-handling.md
+++ b/text/0257-utf8-handling.md
@@ -124,12 +124,46 @@ different protocol buffer implementation would likely be the first to
 observe the invalid UTF-8, and by that time, a large batch of
 telemetry could fail as a result.
 
-### No byte-slice valued attribute API
+### Responsibility to the SDK user
+
+The existing specification [dictates a safety mechanism for exporting
+invalid string-valued attributes safely][SAFETY], however it only
+applies to attribute strings:
+
+[SAFETY]: https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/common/attribute-type-mapping.md#string-values
+
+```
+## String Values
+String values which are valid UTF-8 sequences SHOULD be converted to AnyValue's string_value field.
+
+String values which are not valid Unicode sequences SHOULD be converted to AnyValue's bytes_value with the bytes representing the string in the original order and format of the source string.
+```
+
+To satisfy the existing error-handling requirements, the OpenTelemetry
+SDK specifications will be modified for all signals with an opt-out
+validation feature.
+
+#### Proposed SDK behavior change
+
+The SDK SHOULD in its default configuration validate all string-valued
+telemetry data fields. Each run of invalid UTF-8 (i.e., any invalid
+UTF-8 sequences) will be replaced by a single Unicode replacement
+character, `�` (U+FFFD). The exact behavior of this correction is
+undefined. When possible, SDKs SHOULD use a built-in library for this
+repair (for example, [Golang's
+`strings.ToValidUTF8()`](https://pkg.go.dev/strings#ToValidUTF8) or
+[Rust's
+`String::to_utf8_lossy()`](https://doc.rust-lang.org/std/string/struct.String.html#method.from_utf8_lossy)
+satisfy this requirement).
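+
+As a hedged illustration only (this sketch is not part of the OTEP, and the
+`sanitizeString` helper is hypothetical), the proposed repair could be built
+on Go's standard library:
+
+```go
+package main
+
+import (
+	"fmt"
+	"strings"
+	"unicode/utf8"
+)
+
+// sanitizeString is a hypothetical helper: it returns s unchanged when s is
+// already valid UTF-8, and otherwise replaces each run of invalid bytes with
+// a single U+FFFD, matching the behavior described above.
+func sanitizeString(s string) string {
+	if utf8.ValidString(s) {
+		return s // fast path: no copy for already-valid input
+	}
+	return strings.ToValidUTF8(s, "\uFFFD")
+}
+
+func main() {
+	fmt.Println(sanitizeString("span-name\xff\xfe")) // span-name�
+}
+```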
+ +#### No byte-slice valued attribute API As a caveat, the OpenTelemetry project has previously debated and -rejected the potential to support a byte-slice typed attribute. This -potential feature was rejected because it allows API users a way to -record a possibly uninterpretable sequence of bytes. Users with +rejected the potential to support a byte-slice typed attribute in +OpenTelemetry APIs. + +This potential feature was rejected because it allows API users a way +to record a possibly uninterpretable sequence of bytes. Users with invalid UTF-8 are left with a few options, for example: - Base64-encode the invalid data wrapped in human-readable syntax @@ -146,8 +180,8 @@ be lost, therefore it seems better for Collector pipelines to explicitly handle UTF-8 validation, rather than leave it to the protocol buffer library. -OpenTelemetry Collector SHOULD support automatic UTF-8 validation to -protect users. There are several places this could be done: +OpenTelemetry Collector should support automatic UTF-8 validation to +protect users, however there are several places this could be done: 1. Following a receiver, with data coming from an external source, 2. Following a mutating processor, with data modified by potentially @@ -155,14 +189,20 @@ protect users. There are several places this could be done: 3. Prior to an exporter, with data coming from either an external source and/or modified by potentially untrustworhty code. -Each of these approaches will take significant effort and cost the -user at runtime, therefore: +Each of these approaches will take significant effort and vary in cost +at runtime. + +#### Proposed Collector behavior change + +To reduce the cost of UTF-8 validation to a minimum, we propose: -- UTF-8 validation SHOULD be enabled by default +- UTF-8 validation SHOULD be enabled by default for all Receiver components - Users SHOULD be able to opt out of UTF-8 validation - A `receiverhelper` library SHOULD offer a function to correct invalid UTF-8 in-place, with two configurable outcomes (1) reject - individual items containing invalid UTF-8, (2) repair invalid UTF-8. + individual items containing invalid UTF-8, meaning to count them as + rejected spans/points/logs, and (2) repair invalid UTF-8 as specified + for SDKs above. When an OpenTelemetry collector receives telemetry data in any protocol, in cases where the underlying RPC or protocol buffer From e84d1e9dcb2ef6e3d7331c9201ac457077872d77 Mon Sep 17 00:00:00 2001 From: Joshua MacDonald Date: Thu, 10 Oct 2024 16:26:24 -0700 Subject: [PATCH 7/9] typo --- text/0257-utf8-handling.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/text/0257-utf8-handling.md b/text/0257-utf8-handling.md index 7457163a3..4448a7fa5 100644 --- a/text/0257-utf8-handling.md +++ b/text/0257-utf8-handling.md @@ -153,7 +153,7 @@ undefined. When possible, SDKs SHOULD use a built-in library for this repair (for example, [Golang's `strings.ToValidUTF8()`](https://pkg.go.dev/strings#ToValidUTF8) or [Rust's -`String::to_utf8_lossy()`](https://doc.rust-lang.org/std/string/struct.String.html#method.from_utf8_lossy) +`String::from_utf8_lossy()`](https://doc.rust-lang.org/std/string/struct.String.html#method.from_utf8_lossy) satisfy this requirement). 
#### No byte-slice valued attribute API From 10721f9fa85ad276ffbbea7f0db2921bf2bc0cc4 Mon Sep 17 00:00:00 2001 From: Joshua MacDonald Date: Thu, 10 Oct 2024 16:29:59 -0700 Subject: [PATCH 8/9] edits --- text/0257-utf8-handling.md | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/text/0257-utf8-handling.md b/text/0257-utf8-handling.md index 4448a7fa5..faf7c8dc6 100644 --- a/text/0257-utf8-handling.md +++ b/text/0257-utf8-handling.md @@ -156,6 +156,8 @@ repair (for example, [Golang's `String::from_utf8_lossy()`](https://doc.rust-lang.org/std/string/struct.String.html#method.from_utf8_lossy) satisfy this requirement). +SDKs MAY permit users to opt-out of UTF-8 validation for performance reasons. + #### No byte-slice valued attribute API As a caveat, the OpenTelemetry project has previously debated and @@ -180,7 +182,7 @@ be lost, therefore it seems better for Collector pipelines to explicitly handle UTF-8 validation, rather than leave it to the protocol buffer library. -OpenTelemetry Collector should support automatic UTF-8 validation to +The OpenTelemetry Collector should support automatic UTF-8 validation to protect users, however there are several places this could be done: 1. Following a receiver, with data coming from an external source, @@ -197,7 +199,7 @@ at runtime. To reduce the cost of UTF-8 validation to a minimum, we propose: - UTF-8 validation SHOULD be enabled by default for all Receiver components -- Users SHOULD be able to opt out of UTF-8 validation +- Users SHOULD be able to opt-out of UTF-8 validation - A `receiverhelper` library SHOULD offer a function to correct invalid UTF-8 in-place, with two configurable outcomes (1) reject individual items containing invalid UTF-8, meaning to count them as From 4dbb16f6b60d1a54ee76f31a027ab59324a12c94 Mon Sep 17 00:00:00 2001 From: Joshua MacDonald Date: Thu, 10 Oct 2024 16:37:14 -0700 Subject: [PATCH 9/9] edit --- text/0257-utf8-handling.md | 5 +++++ 1 file changed, 5 insertions(+) diff --git a/text/0257-utf8-handling.md b/text/0257-utf8-handling.md index faf7c8dc6..ed6a0731a 100644 --- a/text/0257-utf8-handling.md +++ b/text/0257-utf8-handling.md @@ -158,6 +158,11 @@ satisfy this requirement). SDKs MAY permit users to opt-out of UTF-8 validation for performance reasons. +Optional UTF-8 validation MUST be applied before the attribute +transformation rule stated for invalid UTF-8 strings. This means +UTF-8 validation prevents downgrading to a `bytes_value` encoding +for string attributes. + #### No byte-slice valued attribute API As a caveat, the OpenTelemetry project has previously debated and