Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

proto: reject invalid UTF-8 in strings #499

Merged
merged 1 commit into from
Jan 26, 2018
Merged

proto: reject invalid UTF-8 in strings #499

merged 1 commit into from
Jan 26, 2018

Conversation

dsnet
Copy link
Member

@dsnet dsnet commented Jan 25, 2018

The proto2 and proto3 specifications explicitly say that strings must be
composed of valid UTF-8 characters.

The C++ implementation prints an error anytime strings with invalid UTF-8
is present during parsing or serializing. For proto3, it explicitly
errors out when parsing an invalid string.

Thus, we cause marshal/unmarshal to error if any string fields are invalid.
The error returned is fail-fast and indistinguishable. This means that the
we stop the unmarshal process the moment we detect an invalid string,
as opposed to finishing the unmarshal. An indistinguishable error means that
we provide no API for the user to programmatically check specifically for
invalid UTF-8 errors so that they can ignore it.

In conversations with the protobuf team, they felt strongly that there should be
no ability for users to ignore the UTF-8 validation error. However, if this change is
overly problematic for users, we can consider a workaround for users already
depending on strings containing invalid UTF-8.

The proto2 and proto3 specifications explicitly say that strings must be
composed of valid UTF-8 characters. Thus, report a marshal/unmarshal error
if any string fields are invalid.

The C++ implementation prints an error anytime invalid UTF-8 strings
is present during parsing or serializing. For proto3, it explicitly
errors out when parsing an invalid string.

In Go, we'll attempt to apply this check on proto2 and proto3,
but may relax that constraint depending on the extent of problems.
@dsnet dsnet requested a review from cybrcodr January 25, 2018 22:07
@@ -1514,6 +1515,9 @@ func unmarshalStringValue(b []byte, f pointer, w int) ([]byte, error) {
return nil, io.ErrUnexpectedEOF
}
v := string(b[:x])
if !utf8.ValidString(v) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a thought, but we could test the values with utf8.Valid before converting them to a string.

I mean, the code is essentially the same, but if we can validate it sooner, why not?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's no performance penalty for utf8.ValidString since it doesn't convert the string to a []byte and call utf8.Valid under the hood.

I'm going to keep it as is since utf8.ValidString(v) is more readable than utf8.Valid(b[:x]), where you don't have to consciously think about slicing b.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, good call. 👍

@dsymonds
Copy link
Contributor

This seems like a mistake. It is correct according to the spec, but, as you noticed, C++ doesn't actually enforce it for proto2, allowing it to interoperate with bad clients. One could make the argument that the spec, in practice, is whatever the C++ implementation does. I suggest you revert this for proto2 string fields until C++ actually enforces it.

@dsnet
Copy link
Member Author

dsnet commented May 23, 2018

\cc @xfxyjwf. Should Go drop UTF-8 enforcement on proto2 strings because that's what C++ does, as opposed to what the spec specifies? This won't be the first time Go follows C++ and finds out what C++ does is actually wrong and should not be emulated.

@dsymonds
Copy link
Contributor

It's hard to argue that C++ violating the spec is particular bad to follow. We've been burned too many times from people complaining that something works with C++ but not with Go, despite it not strictly following the spec. It's skating uphill for Go to be stricter than C++ in ways that people will trip over. For most of the same reasons that C++ isn't following the spec, Go shouldn't too.

@dsnet
Copy link
Member Author

dsnet commented May 24, 2018

We've been burned in both directions.

For example, C++ unmarshals known fields with an incorrect wire type as an unknown field, while Go had rejected them as a parse error. People complained and my investigation (including asking the proto team) showed that parsing as an unknown field was the right behavior. So I made the change (#511) to be more compliant with the other languages, but then other Go users started complaining that they were relying on the parse error. Everything we do in protos breaks someone and it's fairly tiring defending our actions to all parties involved.

In this situation, I'm personally convinced holding to the spec is the correct stance. I checked what Python does and they also fail on both proto2 and proto3 for invalid UTF-8. I didn't check what Java does, but I'm nearly certain they fail as well (given that Java strings cannot represent arbitrary binary).

Of course, that's my opinion. I'll let the proto team make the final call.

mnshdw added a commit to DataDog/datadog-agent that referenced this pull request May 28, 2018
mnshdw added a commit to DataDog/datadog-agent that referenced this pull request May 28, 2018
* Prevent logs agent from submitting protocol buffer payloads with invalid UTF-8.

For more information see:
- golang/protobuf#484
- golang/protobuf#499

* Reno
@dsnet
Copy link
Member Author

dsnet commented May 30, 2018

Proto team says to drop enforcement for proto2, but keep it for proto3. See #622.

@sbuss sbuss mentioned this pull request Jul 26, 2018
@golang golang locked and limited conversation to collaborators Jun 26, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants