From 4f6c3a5fd387cf93c2af67fca550c439eea4eb69 Mon Sep 17 00:00:00 2001
From: Joshua MacDonald
Date: Fri, 10 May 2024 16:23:45 -0700
Subject: [PATCH 1/9] Draft invalid UTF-8 error handling policy

---
 text/0000-utf8-handling.md | 205 +++++++++++++++++++++++++++++++++++++
 1 file changed, 205 insertions(+)
 create mode 100644 text/0000-utf8-handling.md

diff --git a/text/0000-utf8-handling.md b/text/0000-utf8-handling.md
new file mode 100644
index 000000000..bcfd7b36c
--- /dev/null
+++ b/text/0000-utf8-handling.md
@@ -0,0 +1,205 @@
+# Consistent position towards UTF-8 validation in OpenTelemetry

The OpenTelemetry project has not taken a strong position on how to handle
invalid UTF-8 in its SDKs and collector components.

## Motivation

When handling of invalid UTF-8 is left unspecified, anything can
happen. As a project, OpenTelemetry should give its component authors
guidance on how to treat invalid UTF-8 so that users can confidently
rely on OpenTelemetry software/systems.

## Explanation

OpenTelemetry has existing [error handling
guidelines](https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/error-handling.md)
which dictate the project's general posture towards error handling. Users
expect telemetry libraries to be harmless and safe to use, so that they
can confidently instrument their application without unnecessary risk.
Users also make this demand of the OpenTelemetry collector, which is
expected to safely transport telemetry data.

OpenTelemetry's OTLP protocol specifically dictates how [valid and
invalid UTF-8 strings should be mapped into attribute
values](https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/common/attribute-type-mapping.md#string-values),
but there is no guidance on how to treat many other `string` fields
found in the OTLP protocol. The Span `Name` field, for example,
SHOULD be a valid UTF-8 string according to the Protocol Buffer
`proto3`-syntax specification. What is expected from the SDK when a
presumptive UTF-8 field is actually invalid UTF-8? What is expected
from the Collector? Without a clear definition, users are likely to
experience harm through data loss.

This document proposes to update OpenTelemetry error handling
guidelines to require SDKs and Collectors to protect users from loss
of telemetry caused by string-valued fields that are not well-formed.
A number of options are available, all of which are better than doing
nothing about this problem.

## Internal details

Text handling is such a key aspect of programming languages that
practically, they all present a different approach. While most
languages provide guardrails that maintain a well-formed character
encoding in memory at runtime, there is usually a way for a programmer
handling bytes incorrectly to yield an invalid string encoding.

These are honest mistakes. The OpenTelemetry project should declare
invalid UTF-8 as a non-error condition, consistent with its error
handling principles.

> The API and SDK SHOULD provide safe defaults for missing or invalid arguments.

This document proposes that OpenTelemetry's default approach should be
the one taken by the Rust `String::from_utf8_lossy` method, which
converts invalid UTF-8 byte sequences to the Unicode replacement
character, `�` (U+FFFD). Taking this approach by default will avoid
data loss for users when the inevitable happens.

### Scenarios covered

In an example scenario, a user creates a span with an invalid name.
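As an illustration only (this sketch is not part of the OTEP; it assumes the
OpenTelemetry Go tracing API and omits SDK setup), one way such an invalid
name can arise is by converting arbitrary bytes to a string:

```go
package main

import (
	"context"

	"go.opentelemetry.io/otel"
)

func main() {
	tracer := otel.Tracer("example")

	// Raw bytes that are not valid UTF-8, accidentally used as a span name.
	raw := []byte{0x68, 0x69, 0xff, 0xfe}

	// Go performs no validation here: string(raw) copies the bytes as-is,
	// so the span name silently carries the invalid sequence.
	_, span := tracer.Start(context.Background(), string(raw))
	span.End()
}
```
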
Without further intervention, the trace SDK then exports this through
OTLP over gRPC. What happens next depends on which language, RPC
client library, and protocol buffer library are used. For this
document, we examined the official [Golang
gRPC](https://github.com/grpc/grpc-go) implementation and the [Rust
hyperium/tonic](https://github.com/hyperium/tonic) implementation.

For demonstration purposes, gRPC libraries usually provide a Hello
World client and server program. This study modifies the Hello World
client available in each language.

#### Golang client

With a Golang gRPC client modified to send an invalid string, e.g.,

```
    var buf bytes.Buffer
    buf.WriteString(*name)
    buf.Write([]byte{0xff, 0xff})
    // ...
    r, err := c.SayHello(ctx, &pb.HelloRequest{Name: buf.String()})
```

the client program exits:

```
2024/05/10 15:09:23 could not greet: rpc error: code = Internal desc = grpc: error while marshaling: string field contains invalid UTF-8
```

In this case, validation is performed on the client side. If a client
somehow bypasses UTF-8 validation, the server will respond with an
error:

```
2024/05/10 15:28:13 could not greet: rpc error: code = Internal desc = grpc: error unmarshalling request: string field contains invalid UTF-8
```

It appears possible to configure gRPC and/or the protocol buffer
library to bypass UTF-8 validation, but by default gRPC-Go will
validate on both sides.

#### Rust client

With a Rust gRPC client modified to send an invalid string, e.g.,

```
    let bytes = vec![255, 255];

    let request = tonic::Request::new(HelloRequest {
        name: unsafe { String::from_utf8_unchecked(bytes) },
    });
```

the client program exits:

```
Error: Status { code: Internal, message: "failed to decode Protobuf message: HelloRequest.name: invalid string value: data is not UTF-8 encoded", metadata: MetadataMap { headers: {"content-type": "application/grpc", "date": "Fri, 10 May 2024 16:10:46 GMT", "content-length": "0"} }, source: None }
```

In this case, validation is only performed on the server side.

#### OpenTelemetry Collector `pdata`

The OpenTelemetry Collector uses the Gogo protobuf package to unmarshal
the structs that underlie its `pdata` objects. When the "gogofaster"
compiler option is selected, its `Unmarshal()` logic does not check
for valid UTF-8 strings.

The potential for user harm is clear at this point. gRPC libraries
may intentionally bypass UTF-8 validation in some cases, and this may
go unnoticed until a small amount of invalid UTF-8 data leads to data
loss in a different part of the system where validation is strictly
performed. When this happens in a telemetry pipeline, it can lead to
loss of valid data that was inadvertently combined with invalid data,
possibly through batching.

### Responsibility to the user

A user adopts a telemetry SDK to gain observability into their
application, but invalid UTF-8 that causes loss of telemetry produces
quite the opposite result. When a user turns to a telemetry
system to observe their system and makes a UTF-8 mistake, the
consequent total loss of telemetry is a decidedly bad outcome. For the
user to be well-served in this situation, OpenTelemetry SDKs and
Collectors SHOULD correct invalid UTF-8 using the Unicode replacement
character, `�` (U+FFFD), in place of all invalid substrings.
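
A minimal sketch of this correction, using only Go's standard library (an
illustration of the replacement-character approach, not a required
implementation):

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

func main() {
	// A span name that accidentally contains an invalid UTF-8 sequence.
	name := "checkout\xff\xfe-handler"

	fmt.Println(utf8.ValidString(name)) // false

	// Replace each run of invalid bytes with U+FFFD instead of dropping data.
	fixed := strings.ToValidUTF8(name, "\uFFFD")

	fmt.Println(fixed)                   // checkout�-handler
	fmt.Println(utf8.ValidString(fixed)) // true
}
```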

We take the position that OpenTelemetry SDKs and Collectors SHOULD
explicitly make an effort to preserve telemetry data when
presumed-UTF-8 data is invalid, rather than allowing it to drop with
an error as a result. Accepting this proposal means this policy will
be added to the OpenTelemetry error handling guidelines.

#### Give users what they want

The OpenTelemetry API specification does not permit the use of
byte-slice valued attributes. This is one reason why invalid UTF-8
arises, because OpenTelemetry users have not been given an
alternative.

When faced with a desire to record invalid UTF-8 in a stream of
telemetry, users have no good options. They can use a base64
encoding, but they will have to do so manually, and with extra code,
and they are likely to learn this only after first making the mistake
of using a string value to encode bytes data. When users see
corrupted telemetry data containing � characters, they will return to
OpenTelemetry with a request: how should they report bytes data?

Ironically, in a prototype Collector processor to repair invalid
UTF-8, we have used the Zap Logger's `zap.Binary([]byte)` field
constructor to report the invalid data, and this is something an
OpenTelemetry API user cannot do.

### Alternatives considered

There are scenarios, such as in the Rust SDK, where there is a strong
expectation of safety and correctness. In these settings, where the
use of `unsafe` is possible but rare and invalid UTF-8 is unlikely, it
can make sense to bypass UTF-8 validation in the client to save a few
CPU cycles.

We could imagine a more permissive OpenTelemetry policy, in which SDKs
and Collectors are advised to treat all String-valued fields in the
OTLP protocol buffer as "presumptive" UTF-8, meaning data that SHOULD
be encoded as UTF-8 but that has not been checked for validity. Under
this policy, SDKs and Collectors would propagate invalid data and
leave it to the consumer to decide how to handle it.

## Trade-offs and mitigations

TODO

## Prior art and alternatives

TODO

## Open questions

TODO

## Future possibilities

TODO

From 6df233db634c4bfd9e12dc0b6ffa64e048c20083 Mon Sep 17 00:00:00 2001
From: Joshua MacDonald
Date: Mon, 13 May 2024 14:58:37 -0700
Subject: [PATCH 2/9] update, lint, pr num

---
 text/0000-utf8-handling.md | 205 ----------------------------------
 text/0257-utf8-handling.md | 223 +++++++++++++++++++++++++++++++++++++
 2 files changed, 223 insertions(+), 205 deletions(-)
 delete mode 100644 text/0000-utf8-handling.md
 create mode 100644 text/0257-utf8-handling.md

diff --git a/text/0000-utf8-handling.md b/text/0000-utf8-handling.md
deleted file mode 100644
index bcfd7b36c..000000000
--- a/text/0000-utf8-handling.md
+++ /dev/null
@@ -1,205 +0,0 @@
-# Consistent position towards UTF-8 validation in OpenTelemetry
-
-The OpenTelemetry project has not taken a strong position on how to handle
-invalid UTF-8 in its SDKs and collector components.
-
-## Motivation
-
-When handling of invalid UTF-8 is left unspecified, anything can
-happen. As a project, OpenTelemetry should give its component authors
-guidance on how to treat invalid UTF-8 so that users can confidently
-rely on OpenTelemetry software/systems.
-
-## Explanation
-
-OpenTelemetry has existing [error handling
-guidelines](https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/error-handling.md)
-which dictate the project's general posture towards handling.
Users -expect telemety libraries to be harmless and safe to use, so that they -can confidently instrument their application without unnecessary risk. -Users also make this demand of the OpenTelemetry collector, which is -expected to safely transport telemetry data. - -OpenTelemetry's OTLP protocol, specifically dictates how [valid and -invalid UTF-8 strings should be mapped into attribute -values](https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/common/attribute-type-mapping.md#string-values), -but there is no guidance on how to treat many other `string` fields -found in the OTLP protocol. The Span `Name` field, for example, -SHOULD be a valid UTF-8 string according to the Protocol Buffer -`proto3`-syntax specification. What is expected from the SDK when a -presumptive UTF-8 field is actually invalid UTF-8? What is expected -from the Collector? Without a clear definition, users are likely to -be experience harm through data loss. - -This document proposes to update OpenTelemetry error handling -guidelines to require SDKs and Collectors to protect users from loss -of telemetry caused by string-valued fields that are not well-formed. -A number of options are available, all of which are better than doing -nothing about this problem. - -## Internal details - -Text handling is such a key aspect of programming languages that -practically, they all present different a approach. While most -languages provide guardrails that maintain a well-formed character -encoding in memory at runtime, there is usually a way for a programmer -handling bytes incorrectly to yield an invalid string encoding. - -These are honest mistakes. The OpenTelemetry project should declare -invalid UTF-8 as a non-error condition, consistent with its error -handling principles. - -> The API and SDK SHOULD provide safe defaults for missing or invalid arguments. - -This document proposes that OpenTelemetry's default approach should be -the one taken by the Rust `String::from_utf8_lossy` method, which -converts invalid UTF-8 byte sequences to the Unicode replacement -character, `�` (U+FFFD). Taking this approach by default will avoid -data loss for users when the inevitable happens. - -### Scenarios covered - -In an example scenario, a user creates a span with an invalid name. -Without further intervention, the trace SDK then exports this through -OTLP over gRPC. What happens next depends on which language, RPC -client library, and protocol buffer library is used. For this -document, we examined the official [Golang -gRPC](https://github.com/grpc/grpc-go) implementation and the [Rust -hyperium/tonic](https://github.com/hyperium/tonic) implementation. - -For demonstration purposes, gRPC libraries usually provide a Hello -World client and server program. This study modifies the Hello World -client available in each language. - -#### Golang client - -With a Golang gRPC client modified to send an invalid string, e.g., - -``` - var buf bytes.Buffer - buf.WriteString(*name) - buf.Write([]byte{0xff, 0xff}) - // ... - r, err := c.SayHello(ctx, &pb.HelloRequest{Name: buf.String()}) - ``` - -the client program exits: - -``` -2024/05/10 15:09:23 could not greet: rpc error: code = Internal desc = grpc: error while marshaling: string field contains invalid UTF-8 -``` - -In this case, validation is performed on the client side. If a client -somehow bypasses UTF-8 validation, the server will respond with an -error. 
- -``` -2024/05/10 15:28:13 could not greet: rpc error: code = Internal desc = grpc: error unmarshalling request: string field contains invalid UTF-8 -``` - -It appears possible to configure gRPC and/or the protocol buffer -library to bypass UTF-8 validation, but by default gRPC-Go will -validate on both sides. - -#### Rust client - -With a Rust gRPC client modified to send an invalid string, e.g., - -``` - let bytes = vec![255, 255]; - - let request = tonic::Request::new(HelloRequest { - name: unsafe { String::from_utf8_unchecked(bytes) }, - }); -``` - -the client program exits: - -``` -Error: Status { code: Internal, message: "failed to decode Protobuf message: HelloRequest.name: invalid string value: data is not UTF-8 encoded", metadata: MetadataMap { headers: {"content-type": "application/grpc", "date": "Fri, 10 May 2024 16:10:46 GMT", "content-length": "0"} }, source: None } -``` - -In this case, validation is only performed on the server side. - -#### OpenTelemetry Collector `pdata` - -The OpenTelemetry Collector uses Gogo protobuf package to unmarshal -the structs that underly its `pdata` objects. With the "gogofaster" -compiler option is selected, its `Unmarshal()` logic does not check -for valid UTF-8 strings. - -The potential for user harm is clear at this point. gRPC libraries -may intentionally bypass UTF-8 validation in some cases, and this may -go unnoticed until a small amount of invalid UTF-8 data leads to data -loss in a different part of the system where validation is strictly -performed. When this happens in a telemetry pipeline, it can lead -loss of valid data that was inadvertently combined with invalid data, -possibly through batching. - -### Responsibility to the user - -When a user uses a telemetry SDK to gain observability into their -application, it can lead to quite the opposite result when invalid -UTF-8 causes loss of telemetry. When a user turns to a telemetry -system to obseve their system and makes a UTF-8 mistake, the -consequent total loss of telemetry is a definite bad outcome. For the -user to be well-served in this situation, OpenTelemetry SDKs and -Collectors SHOULD correct invalid UTF-8 using the Unicode replacement -character, `�` (U+FFFD) as a replacement for all invalid substrings. - -We take the position that OpenTelemety SDKs and Collectors SHOULD -explicitly make an effort to preserve telemetry data when -presumed-UTF-8 data is invalid, rather than allowing it to drop with -an error as a result. Accepting this proposal means this policy will -be added to the OpenTelemetry error handling guidelines. - -#### Give users what they want - -The OpenTelemetry API specification does not permit the use of -byte-slice valued attributes. This is one reason why invalid UTF-8 -arises, because OpenTelemetry users have not been given an -alternative. - -When faced with a desire to use record invalid UTF-8 in a stream of -telemetry, users have no good options. They can use a base64 -encoding, but they will have to do so manually, and with extra code, -and they are likely to learn this only after first making the mistake -of using a string-value to encode bytes data. When users see -corrupted telemetry data containing � characters, they will return to -OpenTelemetry and with a request, how should they report bytes data. - -Ironically, in a prototype Collector processor to repair invalid -UTF-8, we have used the Zap Logger's `zap.Binary([]byte)` field -constructor to report the invalid data, and this is something an -OpenTelemetry API user cannot do. 
-
-### Alternatives considered
-
-There are scenarios, such as in the Rust SDK, where there is a strong
-expectation of safety and correctness. In these settings, where the
-use of `unsafe` is possible but rare and invalid UTF-8 is unlikely, it
-can make sense to bypass UTF-8 validation in the client to save a few
-CPU cycles.
-
-We could imagine a more permissive OpenTelemetry policy, in which SDKs
-and Collectors are advised to treat all String-valued fields in the
-OTLP protocol buffer as "presumptive" UTF-8, meaning data that SHOULD
-be encoded as UTF-8 but that has not been checked for validity. Under
-this policy, SDKs and Collectors would propagate invalid data and
-leave it to the consumer to decide how to handle it.
-
-## Trade-offs and mitigations
-
-TODO
-
-## Prior art and alternatives
-
-TODO
-
-## Open questions
-
-TODO
-
-## Future possibilities
-
-TODO

diff --git a/text/0257-utf8-handling.md b/text/0257-utf8-handling.md
new file mode 100644
index 000000000..4f1ba2f98
--- /dev/null
+++ b/text/0257-utf8-handling.md
@@ -0,0 +1,223 @@
+# Consistent position towards UTF-8 validation in OpenTelemetry
+
+The OpenTelemetry project has not taken a strong position on how to handle
+invalid UTF-8 in its SDKs and Collector components.
+
+## Motivation
+
+When handling of invalid UTF-8 is left unspecified, anything can
+happen. As a project, OpenTelemetry should give its component authors
+guidance on how to treat invalid UTF-8 so that users can confidently
+rely on OpenTelemetry software/systems.
+
+## Explanation
+
+OpenTelemetry has existing [error handling
+guidelines](https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/error-handling.md)
+which dictate the project's general posture towards handling. Users
+expect telemety libraries to be harmless and safe to use, so that they
+can confidently instrument their application without unnecessary risk.
+
+> The API and SDK SHOULD provide safe defaults for missing or invalid arguments.
+
+Generally, users should not have to check for errors from potentially
+fallible instrumentation APIs, because doing so burdens the user;
+instead, OpenTelemetry SDKs should resort to corrective action.
+Quoting the error handling guidelines,
+
+> For instance, a name like empty may be used if the user passes in
+> null as the span name argument during Span construction.
+
+Users also make this demand of the OpenTelemetry collector, which is
+expected to safely transport telemetry data.
+
+OpenTelemetry's OTLP protocol specifically dictates how [valid and
+invalid UTF-8 strings should be mapped into attribute
+values](https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/common/attribute-type-mapping.md#string-values),
+but there is no guidance on how to treat many other `string` fields
+found in the OTLP protocol. The Span `Name` field, for example,
+SHOULD be a valid UTF-8 string according to the Protocol Buffer
+`proto3`-syntax specification. What is expected from the SDK when a
+presumptive UTF-8 field is actually invalid UTF-8? What is expected
+from the Collector? Without a clear definition, users are likely to
+experience harmful loss of data.
+
+This document proposes to update OpenTelemetry error handling
+guidelines to require SDKs and Collectors to protect users from loss
+of telemetry caused by string-valued fields that are not well-formed.
+A number of options are available, all of which are better than doing
+nothing about this problem.
+
+This document also proposes to extend the OpenTelemetry API to support
+byte-slice valued attributes. Otherwise, users have no good
+alterntives to handling this kind of data, which is often how invalid
+UTF-8 arises.
+
+## Internal details
+
+Text handling is such a key aspect of programming languages that
+practically, they all present a different approach. While most
+languages provide guardrails that maintain a well-formed character
+encoding in memory at runtime, there is usually a way for a programmer
+handling bytes incorrectly to yield an invalid string encoding.
+
+These are honest mistakes. If the telemetry will pass through an OTLP
+representation and UTF-8 validation is left unchecked, there is an
+opportunity for an entire batch of data to be rejected by the protocol
+buffer subsystem used in both gRPC and HTTP code paths.
+
+This is especially harmful to users, as the observability tools they
+have in place suddenly can't help them. Users can't log, span, or
+metric about the invalid UTF-8 data, and terminals may also have
+trouble displaying the data. For users to learn the source of their
+problem, some component has to correct the invalid data before data
+loss happens.
+
+### Possible resolutions
+
+Here are three options:
+
+1. Automatically correct invalid UTF-8.
+2. Drop partial batches of data containing invalid UTF-8.
+3. Permissively allow invalid UTF-8 to propagate (further) into the pipeline.
+
+In this proposal, we reject the third option. See alternatives considered, below.
+
+#### Automatic correction
+
+This document proposes that OpenTelemetry's default approach should be
+the one taken by the Rust `String::from_utf8_lossy` method, which
+converts invalid UTF-8 byte sequences to the Unicode replacement
+character, `�` (U+FFFD). This is the most preferable outcome as it is
+simple and preserves what valid content can be recovered from the
+data.
+
+#### Dropping data
+
+Among the options available, we view dropping whole batches of
+telemetry as an unacceptable outcome. OpenTelemetry SDKs and
+Collectors SHOULD prefer to drop individual items of telemetry as
+opposed to dropping whole batches of telemetry.
+
+When a Collector drops individual items of telemetry as a result of
+invalid UTF-8, it SHOULD return a `rejected_spans`,
+`rejected_data_points`, or `rejected_log_records` count along with a
+message indicating data loss.
+
+### Survey of existing systems
+
+The gRPC-Go implementation, with its default protocol buffer library,
+checks for valid UTF-8 in both the client (`rpc error: code = Internal desc =
+grpc: error while marshaling: string field contains invalid UTF-8`)
+and the server (`rpc error: code = Internal desc = grpc: error
+unmarshalling request: string field contains invalid UTF-8`).
+
+The Rust Hyperium gRPC implementation checks for valid UTF-8 on the
+server but not on the client (`Error: Status { code: Internal,
+message: "failed to decode Protobuf message: HelloRequest.name:
+invalid string value: data is not UTF-8 encoded"...`).
+
+The OpenTelemetry Collector OTLP exporter and receiver do not validate
+UTF-8, as a result of choosing the "Gogofaster" protocol buffer
+library option.
+
+Given these out-of-the-box configurations, it would be possible for a
+Rust SDK to send invalid UTF-8 to an OpenTelemetry Collector's OTLP
+receiver. That collector could forward to another OpenTelemetry
+Collector over OTLP successfully, but a non-OTLP exporter using a
+diferent protocol buffer implementation would likely be the first to
+observe the invalid UTF-8, and by that time, a large batch of
+telemetry could fail as a result.
+
+### Responsibility to the API user
+
+When a user uses a telemetry SDK to gain observability into their
+application, it can lead to quite the opposite result when invalid
+UTF-8 causes loss of telemetry. When a user turns to a telemetry
+system to obseve their system and makes a UTF-8 mistake, the
+consequent total loss of telemetry is a definite bad outcome. While
+the above discussion addresses attempts to correct a problem,
+OpenTelemetry has not given the API user what they want in this
+situation.
+
+#### Byte-slice valued attribute API
+
+The OpenTelemetry API specification does not permit the use of
+byte-slice valued attributes. This is one reason why invalid UTF-8
+arises, because OpenTelemetry users have not been given an
+alternative.
+
+When faced with a desire to record invalid UTF-8 in a stream of
+telemetry, users have no good options. They can use a base64
+encoding, but they will have to do so manually, and with extra code,
+and they are likely to learn this only after first making the mistake
+of using a string-value to encode bytes data. When users see
+corrupted telemetry data containing � characters, they will return to
+OpenTelemetry with a request: how should they report bytes data?
+
+Ironically, in a prototype Collector processor to repair invalid
+UTF-8, we have used the Zap Logger's `zap.Binary([]byte)` field
+constructor to report the invalid data, and this is something an
+OpenTelemetry API user cannot do.
+
+This OTEP proposes to add an attribute-value type for byte slices, to
+represent an array of bytes with unknown encoding. Examples data
+values where this attribute could be useful:
+
+- checksum values (e.g., a sha256 checksum represented as 32 bytes)
+- hash values (e.g., a 56-bit hash value represented as 7 bytes)
+- register contents (e.g., a memory region ... 64 bytes)
+- invalid UTF-8 (e.g., data lost because of UTF-8 validation).
+
+### Responsibility to the Collector user
+
+Users need to trust that the OpenTelemetry Collector will not cause
+unintended data loss, and we observe this can easily result from a
+mismatch of permissive receiver and strict exporter. However, we are
+aware that a strict receiver will cause whole batches of telemetry to
+be lost, therefore it seems better for Collector pipelines to
+explicitly handle UTF-8 validation, rather than leave it to the
+protocol buffer library.
+
+OpenTelemetry Collector SHOULD support automatic UTF-8 validation to
+protect users. There are several places this could be done:
+
+1. Following a receiver, with data coming from an external source,
+2. Following a mutating processor, with data modified by potentially
+   untrustworthy code,
+3. Prior to an exporter, with data coming from either an external
+   source and/or modified by potentially untrustworthy code.
+
+Depending on user preferences, any of these outcomes might be best.
+Every mutating processor could force the pipeline to re-check for
+validity, in the strictest of configurations.
+
+#### New collector component Capabilities for UTF-8 strictness
+
+Each collector component that strongly requires valid UTF-8 will
+declare so in an Capabilities field, and this will cause the Collector
+to re-validate UTF-8 before invoking that component. By default,
+exporters will require valid UTF-8.
+
+Provided that no processors are strict about UTF-8 validation, UTF-8
+validation will happen only once per pipeline segment. This means
+UTF-8 validation can be deferred until after sampling and filtering
+operations have finsihed. Each pipeline segment will have at least
+one UTF-8 validation step, which will either automatically correct the
+problem (default) or reject the individual item of data (optional).
+
+### Alternatives considered
+
+We observe that some protocol buffer implementations are permissive
+with respect to invalid UTF-8, while others are strict. It can be
+risky to combine dissimilar protocol buffer implementations in a
+pipeline, since a receiver might accept data that an exporter cannot
+send, simply resulting from invalid UTF-8.
+
+Considering whether components could support permissive data
+transport, it appears to be risky and for little reward. If
+OpenTelemetry decided to allow this, it would not be a sensible
+default. On the other hand, existing OpenTelemetry Collectors have
+this behavior, and changing it will require time. In the period of
+time while this situation persists, we effectively have permissive
+data transport (and the associated risks).

From c30ff3d32901d81d70064dfc08629f84f8891677 Mon Sep 17 00:00:00 2001
From: Joshua MacDonald
Date: Wed, 22 May 2024 09:13:10 -0700
Subject: [PATCH 3/9] Apply suggestions from code review
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Co-authored-by: Gerhard Stöbich
Co-authored-by: Joao Grassi <5938087+joaopgrassi@users.noreply.github.com>
---
 text/0257-utf8-handling.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/text/0257-utf8-handling.md b/text/0257-utf8-handling.md
index 4f1ba2f98..7efb46070 100644
--- a/text/0257-utf8-handling.md
+++ b/text/0257-utf8-handling.md
@@ -15,7 +15,7 @@ rely on OpenTelemetry software/systems.
 OpenTelemetry has existing [error handling
 guidelines](https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/error-handling.md)
 which dictate the project's general posture towards handling. Users
-expect telemety libraries to be harmless and safe to use, so that they
+expect telemetry libraries to be harmless and safe to use, so that they
 can confidently instrument their application without unnecessary risk.
 
 > The API and SDK SHOULD provide safe defaults for missing or invalid arguments.
@@ -50,7 +50,7 @@ nothing about this problem.
 
 This document also proposes to extend the OpenTelemetry API to support
 byte-slice valued attributes. Otherwise, users have no good
-alterntives to handling this kind of data, which is often how invalid
+alternatives to handling this kind of data, which is often how invalid
 UTF-8 arises.
 
 ## Internal details

From 47b656f05b9ade7fe2db1eb13ac8c4eb9b4c28bc Mon Sep 17 00:00:00 2001
From: Joshua MacDonald
Date: Thu, 10 Oct 2024 14:46:54 -0700
Subject: [PATCH 4/9] Edits.

---
 text/0257-utf8-handling.md | 93 ++++++++++++++------------------------
 1 file changed, 34 insertions(+), 59 deletions(-)

diff --git a/text/0257-utf8-handling.md b/text/0257-utf8-handling.md
index 4f1ba2f98..c71a6755d 100644
--- a/text/0257-utf8-handling.md
+++ b/text/0257-utf8-handling.md
@@ -48,11 +48,6 @@ of telemetry caused by string-valued fields that are not well-formed.
 A number of options are available, all of which are better than doing
 nothing about this problem.
-This document also proposes to extend the OpenTelemetry API to support -byte-slice valued attributes. Otherwise, users have no good -alterntives to handling this kind of data, which is often how invalid -UTF-8 arises. - ## Internal details Text handling is such a key aspect of programming languages that @@ -129,45 +124,17 @@ diferent protocol buffer implementation would likely be the first to observe the invalid UTF-8, and by that time, a large batch of telemetry could fail as a result. -### Responsibility to the API user - -When a user uses a telemetry SDK to gain observability into their -application, it can lead to quite the opposite result when invalid -UTF-8 causes loss of telemetry. When a user turns to a telemetry -system to obseve their system and makes a UTF-8 mistake, the -consequent total loss of telemetry is a definite bad outcome. While -the above discussion addresses attempts to correct a problem, -OpenTelemetry has not given the API user what they want in this -situation. - -#### Byte-slice valued attribute API - -The OpenTelemetry API specification does not permit the use of -byte-slice valued attributes. This is one reason why invalid UTF-8 -arises, because OpenTelemetry users have not been given an -alternative. - -When faced with a desire to record invalid UTF-8 in a stream of -telemetry, users have no good options. They can use a base64 -encoding, but they will have to do so manually, and with extra code, -and they are likely to learn this only after first making the mistake -of using a string-value to encode bytes data. When users see -corrupted telemetry data containing � characters, they will return to -OpenTelemetry with a request: how should they report bytes data? - -Ironically, in a prototype Collector processor to repair invalid -UTF-8, we have used the Zap Logger's `zap.Binary([]byte)` field -constructor to report the invalid data, and this is something an -OpenTelemetry API user cannot do. - -This OTEP proposes to add an attribute-value type for byte slices, to -represent an array of bytes with unknown encoding. Examples data -values where this attribute could be useful: - -- checksum values (e.g., a sha256 checksum represented as 32 bytes) -- hash values (e.g., a 56-bit hash value represented as 7 bytes) -- register contents (e.g., a memory region ... 64 bytes) -- invalid UTF-8 (e.g., data lost because of UTF-8 validation). +### No byte-slice valued attribute API + +As a caveat, the OpenTelemetry project has previously debated and +rejected the potential to support a byte-slice typed attribute. This +potential feature was rejected because it allows API users a way to +record a possibly uninterpretable sequence of bytes. Users with +invalid UTF-8 are left with a few options, for example: + +- Base64-encode the invalid data wrapped in human-readable syntax + (e.g., `base64:WNhbHF1ZWVuY29hY2hod`). +- Transmute the byte slice to an integer slice, which is supported. ### Responsibility to the Collector user @@ -188,26 +155,34 @@ protect users. There are several places this could be done: 3. Prior to an exporter, with data coming from either an external source and/or modified by potentially untrustworhty code. -Depending on user preferences, any of these outcomes might be best. -Every mutating processor could force the pipeline to re-check for -validity, in the strictest of configurations. 
-
-#### New collector component Capabilities for UTF-8 strictness
-
-Each collector component that strongly requires valid UTF-8 will
-declare so in an Capabilities field, and this will cause the Collector
-to re-validate UTF-8 before invoking that component. By default,
-exporters will require valid UTF-8.
-
-Provided that no processors are strict about UTF-8 validation, UTF-8
-validation will happen only once per pipeline segment. This means
-UTF-8 validation can be deferred until after sampling and filtering
-operations have finsihed. Each pipeline segment will have at least
-one UTF-8 validation step, which will either automatically correct the
-problem (default) or reject the individual item of data (optional).
+Each of these approaches will take significant effort and cost the
+user at runtime, therefore:
+
+- UTF-8 validation SHOULD be enabled by default
+- Users SHOULD be able to opt out of UTF-8 validation
+- A `receiverhelper` library SHOULD offer a function to correct
+  invalid UTF-8 in-place, with two configurable outcomes (1) reject
+  individual items containing invalid UTF-8, (2) repair invalid UTF-8.
+
+When an OpenTelemetry collector receives telemetry data in any
+protocol, in cases where the underlying RPC or protocol buffer
+libraries do not guarantee valid UTF-8, the `receiverhelper` library
+SHOULD be called to optionally validate UTF-8 according to the
+configuration described above.
 
 ### Alternatives considered
 
+#### Exhaustive validation
+
+The Collector behavior proposed above only validates data after it is
+received, not after it is modified by a processor. We consider the
+risk of a malfunctioning processor to be acceptable. If this happens,
+it will be considered a bug in the Collector component. In other
+words, the Collector SHOULD NOT perform internal data validation and
+it SHOULD perform external data validation.
+
+#### Permissive trasnport
+
 We observe that some protocol buffer implementations are permissive
 with respect to invalid UTF-8, while others are strict. It can be
 risky to combine dissimilar protocol buffer implementations in a
 pipeline, since a receiver might accept data that an exporter cannot
 send, simply resulting from invalid UTF-8.

From f23f8b07d0786ae518f6a9ec99e2aded72656021 Mon Sep 17 00:00:00 2001
From: Joshua MacDonald
Date: Thu, 10 Oct 2024 14:52:38 -0700
Subject: [PATCH 5/9] lint

---
 text/0257-utf8-handling.md | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/text/0257-utf8-handling.md b/text/0257-utf8-handling.md
index 9846e32a3..9e9773751 100644
--- a/text/0257-utf8-handling.md
+++ b/text/0257-utf8-handling.md
@@ -120,7 +120,7 @@ Given these out-of-the-box configurations, it would be possible for a
 Rust SDK to send invalid UTF-8 to an OpenTelemetry Collector's OTLP
 receiver. That collector could forward to another OpenTelemetry
 Collector over OTLP successfully, but a non-OTLP exporter using a
-diferent protocol buffer implementation would likely be the first to
+different protocol buffer implementation would likely be the first to
 observe the invalid UTF-8, and by that time, a large batch of
 telemetry could fail as a result.
@@ -132,7 +132,7 @@ potential feature was rejected because it allows API users a way to
 record a possibly uninterpretable sequence of bytes. Users with
 invalid UTF-8 are left with a few options, for example:
 
-- Base64-encode the invalid data wrapped in human-readable syntax 
+- Base64-encode the invalid data wrapped in human-readable syntax
  (e.g., `base64:WNhbHF1ZWVuY29hY2hod`).
 - Transmute the byte slice to an integer slice, which is supported.
@@ -146,9 +146,9 @@ mismatch of permissive receiver and strict exporter. However, we are
 aware that a strict receiver will cause whole batches of telemetry to
 be lost, therefore it seems better for Collector pipelines to
 explicitly handle UTF-8 validation, rather than leave it to the
 protocol buffer library.
 
 OpenTelemetry Collector SHOULD support automatic UTF-8 validation to
 protect users. There are several places this could be done:
 
-1. Following a receiver, with data coming from an external source, 
+1. Following a receiver, with data coming from an external source,
 2. Following a mutating processor, with data modified by potentially
-   untrustworthy code, 
+   untrustworthy code,
 3. Prior to an exporter, with data coming from either an external
    source and/or modified by potentially untrustworthy code.
@@ -161,7 +161,7 @@ user at runtime, therefore:
 
 - UTF-8 validation SHOULD be enabled by default
 - Users SHOULD be able to opt out of UTF-8 validation
 - A `receiverhelper` library SHOULD offer a function to correct
-  invalid UTF-8 in-place, with two configurable outcomes (1) reject 
+  invalid UTF-8 in-place, with two configurable outcomes (1) reject
   individual items containing invalid UTF-8, (2) repair invalid UTF-8.
@@ -181,7 +181,7 @@ it will be considered a bug in the Collector component. In other
 words, the Collector SHOULD NOT perform internal data validation and
 it SHOULD perform external data validation.
 
-#### Permissive trasnport
+#### Permissive transport

From d79c75f780ccf62538d209a5bfdda93de7945ccf Mon Sep 17 00:00:00 2001
From: Joshua MacDonald
Date: Thu, 10 Oct 2024 16:25:53 -0700
Subject: [PATCH 6/9] apply feedback; be specific

---
 text/0257-utf8-handling.md | 60 +++++++++++++++++++++++++++++++-------
 1 file changed, 50 insertions(+), 10 deletions(-)

diff --git a/text/0257-utf8-handling.md b/text/0257-utf8-handling.md
index 9e9773751..7457163a3 100644
--- a/text/0257-utf8-handling.md
+++ b/text/0257-utf8-handling.md
@@ -124,12 +124,46 @@ different protocol buffer implementation would likely be the first to
 observe the invalid UTF-8, and by that time, a large batch of
 telemetry could fail as a result.
 
-### No byte-slice valued attribute API
+### Responsibility to the SDK user
+
+The existing specification [dictates a safety mechanism for exporting
+invalid string-valued attributes safely][SAFETY], however it only
+applies to attribute strings:
+
+[SAFETY]: https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/common/attribute-type-mapping.md#string-values
+
+```
+## String Values
+String values which are valid UTF-8 sequences SHOULD be converted to AnyValue's string_value field.
+
+String values which are not valid Unicode sequences SHOULD be converted to AnyValue's bytes_value with the bytes representing the string in the original order and format of the source string.
+```
+
+To satisfy the existing error-handling requirements, the OpenTelemetry
+SDK specifications will be modified for all signals with an opt-out
+validation feature.
+
+#### Proposed SDK behavior change
+
+The SDK SHOULD in its default configuration validate all string-valued
+telemetry data fields. Each run of invalid UTF-8 (i.e., any invalid
+UTF-8 sequences) will be replaced by a single Unicode replacement
+character, `�` (U+FFFD). The exact behavior of this correction is
+undefined. When possible, SDKs SHOULD use a built-in library for this
+repair (for example, [Golang's
+`strings.ToValidUTF8()`](https://pkg.go.dev/strings#ToValidUTF8) or
+[Rust's
+`String::to_utf8_lossy()`](https://doc.rust-lang.org/std/string/struct.String.html#method.from_utf8_lossy)
+satisfy this requirement).
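+
+As a hedged illustration only (this sketch is not part of the OTEP, and the
+`sanitizeString` helper is hypothetical), the proposed repair could be built
+on Go's standard library:
+
+```go
+package main
+
+import (
+	"fmt"
+	"strings"
+	"unicode/utf8"
+)
+
+// sanitizeString is a hypothetical helper: it returns s unchanged when s is
+// already valid UTF-8, and otherwise replaces each run of invalid bytes with
+// a single U+FFFD, matching the behavior described above.
+func sanitizeString(s string) string {
+	if utf8.ValidString(s) {
+		return s // fast path: no copy for already-valid input
+	}
+	return strings.ToValidUTF8(s, "\uFFFD")
+}
+
+func main() {
+	fmt.Println(sanitizeString("span-name\xff\xfe")) // span-name�
+}
+```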
+ +#### No byte-slice valued attribute API As a caveat, the OpenTelemetry project has previously debated and -rejected the potential to support a byte-slice typed attribute. This -potential feature was rejected because it allows API users a way to -record a possibly uninterpretable sequence of bytes. Users with +rejected the potential to support a byte-slice typed attribute in +OpenTelemetry APIs. + +This potential feature was rejected because it allows API users a way +to record a possibly uninterpretable sequence of bytes. Users with invalid UTF-8 are left with a few options, for example: - Base64-encode the invalid data wrapped in human-readable syntax @@ -146,8 +180,8 @@ be lost, therefore it seems better for Collector pipelines to explicitly handle UTF-8 validation, rather than leave it to the protocol buffer library. -OpenTelemetry Collector SHOULD support automatic UTF-8 validation to -protect users. There are several places this could be done: +OpenTelemetry Collector should support automatic UTF-8 validation to +protect users, however there are several places this could be done: 1. Following a receiver, with data coming from an external source, 2. Following a mutating processor, with data modified by potentially @@ -155,14 +189,20 @@ protect users. There are several places this could be done: 3. Prior to an exporter, with data coming from either an external source and/or modified by potentially untrustworhty code. -Each of these approaches will take significant effort and cost the -user at runtime, therefore: +Each of these approaches will take significant effort and vary in cost +at runtime. + +#### Proposed Collector behavior change + +To reduce the cost of UTF-8 validation to a minimum, we propose: -- UTF-8 validation SHOULD be enabled by default +- UTF-8 validation SHOULD be enabled by default for all Receiver components - Users SHOULD be able to opt out of UTF-8 validation - A `receiverhelper` library SHOULD offer a function to correct invalid UTF-8 in-place, with two configurable outcomes (1) reject - individual items containing invalid UTF-8, (2) repair invalid UTF-8. + individual items containing invalid UTF-8, meaning to count them as + rejected spans/points/logs, and (2) repair invalid UTF-8 as specified + for SDKs above. When an OpenTelemetry collector receives telemetry data in any protocol, in cases where the underlying RPC or protocol buffer From e84d1e9dcb2ef6e3d7331c9201ac457077872d77 Mon Sep 17 00:00:00 2001 From: Joshua MacDonald Date: Thu, 10 Oct 2024 16:26:24 -0700 Subject: [PATCH 7/9] typo --- text/0257-utf8-handling.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/text/0257-utf8-handling.md b/text/0257-utf8-handling.md index 7457163a3..4448a7fa5 100644 --- a/text/0257-utf8-handling.md +++ b/text/0257-utf8-handling.md @@ -153,7 +153,7 @@ undefined. When possible, SDKs SHOULD use a built-in library for this repair (for example, [Golang's `strings.ToValidUTF8()`](https://pkg.go.dev/strings#ToValidUTF8) or [Rust's -`String::to_utf8_lossy()`](https://doc.rust-lang.org/std/string/struct.String.html#method.from_utf8_lossy) +`String::from_utf8_lossy()`](https://doc.rust-lang.org/std/string/struct.String.html#method.from_utf8_lossy) satisfy this requirement). 
#### No byte-slice valued attribute API From 10721f9fa85ad276ffbbea7f0db2921bf2bc0cc4 Mon Sep 17 00:00:00 2001 From: Joshua MacDonald Date: Thu, 10 Oct 2024 16:29:59 -0700 Subject: [PATCH 8/9] edits --- text/0257-utf8-handling.md | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/text/0257-utf8-handling.md b/text/0257-utf8-handling.md index 4448a7fa5..faf7c8dc6 100644 --- a/text/0257-utf8-handling.md +++ b/text/0257-utf8-handling.md @@ -156,6 +156,8 @@ repair (for example, [Golang's `String::from_utf8_lossy()`](https://doc.rust-lang.org/std/string/struct.String.html#method.from_utf8_lossy) satisfy this requirement). +SDKs MAY permit users to opt-out of UTF-8 validation for performance reasons. + #### No byte-slice valued attribute API As a caveat, the OpenTelemetry project has previously debated and @@ -180,7 +182,7 @@ be lost, therefore it seems better for Collector pipelines to explicitly handle UTF-8 validation, rather than leave it to the protocol buffer library. -OpenTelemetry Collector should support automatic UTF-8 validation to +The OpenTelemetry Collector should support automatic UTF-8 validation to protect users, however there are several places this could be done: 1. Following a receiver, with data coming from an external source, @@ -197,7 +199,7 @@ at runtime. To reduce the cost of UTF-8 validation to a minimum, we propose: - UTF-8 validation SHOULD be enabled by default for all Receiver components -- Users SHOULD be able to opt out of UTF-8 validation +- Users SHOULD be able to opt-out of UTF-8 validation - A `receiverhelper` library SHOULD offer a function to correct invalid UTF-8 in-place, with two configurable outcomes (1) reject individual items containing invalid UTF-8, meaning to count them as From 4dbb16f6b60d1a54ee76f31a027ab59324a12c94 Mon Sep 17 00:00:00 2001 From: Joshua MacDonald Date: Thu, 10 Oct 2024 16:37:14 -0700 Subject: [PATCH 9/9] edit --- text/0257-utf8-handling.md | 5 +++++ 1 file changed, 5 insertions(+) diff --git a/text/0257-utf8-handling.md b/text/0257-utf8-handling.md index faf7c8dc6..ed6a0731a 100644 --- a/text/0257-utf8-handling.md +++ b/text/0257-utf8-handling.md @@ -158,6 +158,11 @@ satisfy this requirement). SDKs MAY permit users to opt-out of UTF-8 validation for performance reasons. +Optional UTF-8 validation MUST be applied before the attribute +transformation rule stated for invalid UTF-8 strings. This means +UTF-8 validation prevents downgrading to a `bytes_value` encoding +for string attributes. + #### No byte-slice valued attribute API As a caveat, the OpenTelemetry project has previously debated and