proto: reject invalid UTF-8 in strings #499

dsnet · 2018-01-25T22:07:21Z

The proto2 and proto3 specifications explicitly say that strings must be
composed of valid UTF-8 characters.

The C++ implementation prints an error anytime strings with invalid UTF-8
is present during parsing or serializing. For proto3, it explicitly
errors out when parsing an invalid string.

Thus, we cause marshal/unmarshal to error if any string fields are invalid.
The error returned is fail-fast and indistinguishable. This means that the
we stop the unmarshal process the moment we detect an invalid string,
as opposed to finishing the unmarshal. An indistinguishable error means that
we provide no API for the user to programmatically check specifically for
invalid UTF-8 errors so that they can ignore it.

In conversations with the protobuf team, they felt strongly that there should be
no ability for users to ignore the UTF-8 validation error. However, if this change is
overly problematic for users, we can consider a workaround for users already
depending on strings containing invalid UTF-8.

The proto2 and proto3 specifications explicitly say that strings must be composed of valid UTF-8 characters. Thus, report a marshal/unmarshal error if any string fields are invalid. The C++ implementation prints an error anytime invalid UTF-8 strings is present during parsing or serializing. For proto3, it explicitly errors out when parsing an invalid string. In Go, we'll attempt to apply this check on proto2 and proto3, but may relax that constraint depending on the extent of problems.

puellanivis · 2018-01-26T02:01:57Z

proto/table_unmarshal.go

@@ -1514,6 +1515,9 @@ func unmarshalStringValue(b []byte, f pointer, w int) ([]byte, error) {
 		return nil, io.ErrUnexpectedEOF
 	}
 	v := string(b[:x])
+	if !utf8.ValidString(v) {


Just a thought, but we could test the values with utf8.Valid before converting them to a string.

I mean, the code is essentially the same, but if we can validate it sooner, why not?

There's no performance penalty for utf8.ValidString since it doesn't convert the string to a []byte and call utf8.Valid under the hood.

I'm going to keep it as is since utf8.ValidString(v) is more readable than utf8.Valid(b[:x]), where you don't have to consciously think about slicing b.

Indeed, good call. 👍

dsymonds · 2018-05-23T23:40:35Z

This seems like a mistake. It is correct according to the spec, but, as you noticed, C++ doesn't actually enforce it for proto2, allowing it to interoperate with bad clients. One could make the argument that the spec, in practice, is whatever the C++ implementation does. I suggest you revert this for proto2 string fields until C++ actually enforces it.

dsnet · 2018-05-23T23:58:54Z

\cc @xfxyjwf. Should Go drop UTF-8 enforcement on proto2 strings because that's what C++ does, as opposed to what the spec specifies? This won't be the first time Go follows C++ and finds out what C++ does is actually wrong and should not be emulated.

dsymonds · 2018-05-24T00:03:00Z

It's hard to argue that C++ violating the spec is particular bad to follow. We've been burned too many times from people complaining that something works with C++ but not with Go, despite it not strictly following the spec. It's skating uphill for Go to be stricter than C++ in ways that people will trip over. For most of the same reasons that C++ isn't following the spec, Go shouldn't too.

dsnet · 2018-05-24T04:31:37Z

We've been burned in both directions.

For example, C++ unmarshals known fields with an incorrect wire type as an unknown field, while Go had rejected them as a parse error. People complained and my investigation (including asking the proto team) showed that parsing as an unknown field was the right behavior. So I made the change (#511) to be more compliant with the other languages, but then other Go users started complaining that they were relying on the parse error. Everything we do in protos breaks someone and it's fairly tiring defending our actions to all parties involved.

In this situation, I'm personally convinced holding to the spec is the correct stance. I checked what Python does and they also fail on both proto2 and proto3 for invalid UTF-8. I didn't check what Java does, but I'm nearly certain they fail as well (given that Java strings cannot represent arbitrary binary).

Of course, that's my opinion. I'll let the proto team make the final call.

…lid UTF-8. For more information see: - golang/protobuf#484 - golang/protobuf#499

* Prevent logs agent from submitting protocol buffer payloads with invalid UTF-8. For more information see: - golang/protobuf#484 - golang/protobuf#499 * Reno

dsnet · 2018-05-30T20:40:24Z

Proto team says to drop enforcement for proto2, but keep it for proto3. See #622.

dsnet requested a review from cybrcodr January 25, 2018 22:07

puellanivis reviewed Jan 26, 2018

View reviewed changes

cybrcodr approved these changes Jan 26, 2018

View reviewed changes

dsnet merged commit 3525335 into dev Jan 26, 2018

dsnet deleted the validate-utf8 branch January 26, 2018 23:54

dsnet mentioned this pull request Jan 29, 2018

Invalid UTF-8 in strings are not rejected #484

Closed

sbuss mentioned this pull request May 17, 2018

Proto Error returned when using aetest context -- proto: invalid UTF-8 string golang/appengine#136

Closed

sbuss mentioned this pull request May 25, 2018

datastore: []byte() cannot be saved or loaded if it contains invalid UTF-8 bytes golang/appengine#143

Closed

mnshdw pushed a commit to DataDog/datadog-agent that referenced this pull request May 28, 2018

Prevent logs agent from submitting protocol buffer payloads with inva…

04af62c

…lid UTF-8. For more information see: - golang/protobuf#484 - golang/protobuf#499

AlexJF mentioned this pull request Jun 6, 2018

model/normalizer: Enforce valid utf-8 during normalization. DataDog/datadog-trace-agent#431

Merged

sbuss mentioned this pull request Jul 26, 2018

Create new release #659

Closed

golang locked and limited conversation to collaborators Jun 26, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

proto: reject invalid UTF-8 in strings #499

proto: reject invalid UTF-8 in strings #499

dsnet commented Jan 25, 2018 •

edited

Loading

puellanivis Jan 26, 2018

dsnet Jan 26, 2018

puellanivis Jan 29, 2018

dsymonds commented May 23, 2018

dsnet commented May 23, 2018

dsymonds commented May 24, 2018

dsnet commented May 24, 2018

dsnet commented May 30, 2018

proto: reject invalid UTF-8 in strings #499

proto: reject invalid UTF-8 in strings #499

Conversation

dsnet commented Jan 25, 2018 • edited Loading

puellanivis Jan 26, 2018

Choose a reason for hiding this comment

dsnet Jan 26, 2018

Choose a reason for hiding this comment

puellanivis Jan 29, 2018

Choose a reason for hiding this comment

dsymonds commented May 23, 2018

dsnet commented May 23, 2018

dsymonds commented May 24, 2018

dsnet commented May 24, 2018

dsnet commented May 30, 2018

dsnet commented Jan 25, 2018 •

edited

Loading