When producing character literals, surrogate characters should be escaped. #20720

Merged (8 commits, Jul 31, 2017)

Conversation

gafter (Member) commented Jul 7, 2017

Customer scenario

Use the Roslyn compiler APIs to produce a literal for a Unicode character. The generated syntax should be correct (escaped). The bug is that the compiler emits a Unicode surrogate character directly into the generated program text.
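A minimal sketch of the scenario against the public Roslyn API (the specific character U+D83C is an illustrative choice, not taken from the original report):

using System;
using Microsoft.CodeAnalysis.CSharp;

class CustomerScenario
{
    static void Main()
    {
        // U+D83C is an unpaired high surrogate; before this fix the factory
        // copied it verbatim into the literal text instead of escaping it.
        var literal = SyntaxFactory.Literal('\ud83c');
        Console.WriteLine(literal.Text); // after the fix: an escaped form such as '\ud83c'
    }
}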

Bugs this fixes:

Fixes #20693

Workarounds, if any

Client code that produces literals could include a special case to work around the compiler bug; for example:
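A hypothetical sketch of such a workaround (EscapeChar and its behavior are illustrative, not an existing API):

using Microsoft.CodeAnalysis.CSharp;

static class LiteralWorkaround
{
    // Hypothetical client-side helper: build the literal text by hand
    // when the character is a surrogate, otherwise defer to Roslyn.
    public static string EscapeChar(char c) =>
        char.IsSurrogate(c)
            ? $"'\\u{(int)c:x4}'"            // hand-rolled escape, e.g. '\ud83c'
            : SyntaxFactory.Literal(c).Text; // Roslyn handles the rest
}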

Risk

Tiny, as the only effect is to handle this previously unhandled Unicode category.

Performance impact

Tiny, if any, as it adds no new code paths.

Is this a regression from a previous update?

No; this is an old Roslyn bug that was only recently noticed.

Root cause analysis:

There was no test coverage for this category of Unicode character in literals.

How was the bug found?

Customer reported.

@dotnet/roslyn-compiler May I please have a couple of reviews of this tiny bug fix?

@gafter gafter added 4 - In Review A fix for the issue is submitted for review. Area-Compilers Bug labels Jul 7, 2017
@gafter gafter added this to the 15.5 milestone Jul 7, 2017
jcouv (Member) commented Jul 10, 2017

Looks like legitimate test failures.

@gafter gafter self-assigned this Jul 11, 2017
-Assert.Equal(string.Format(format, "🏈"), FormatValue(multiByte));
-Assert.Equal(string.Format(format, "🏈"), FormatValue(multiByte, useHexadecimal: true));
+Assert.Equal(string.Format(format, "\\ud83c\\udfc8"), FormatValue(multiByte));
+Assert.Equal(string.Format(format, "\\ud83c\\udfc8"), FormatValue(multiByte, useHexadecimal: true));
Member

This looks like we're introducing a breaking change. What is the justification here?

Contributor

Emoji are printable characters. I'd like you not to escape them.

Member Author

The justification is that the implementation is incorrect, and the test carefully tests only one case for which the incorrect implementation happens to work. If you test the existing implementation on a random sequence of data it will be wrong more than 1000 times for every time it gets it right.

The escaping logic in the compilers today works character by character. A single surrogate character does not carry enough information to determine whether escaping is necessary, so escaping it is the only correct approach without considering more context.

It might be worth making the escaping logic more complex, so that it scans the sequence of characters and considers each one in the context of the characters that precede and follow it to determine whether escaping is necessary. I don't think that is worth the effort.
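A minimal sketch of that character-by-character approach (a hypothetical helper, not the compiler's actual code):

using System.Text;

static class Escaper
{
    // Hypothetical per-character escaper: examining one char at a time,
    // a surrogate cannot be proven printable, so escaping is the only
    // correct choice without looking at its neighbor.
    public static string Escape(string value)
    {
        var builder = new StringBuilder();
        foreach (char c in value)
        {
            if (char.IsSurrogate(c))
                builder.AppendFormat("\\u{0:x4}", (int)c);
            else
                builder.Append(c);
        }
        return builder.ToString();
    }
}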

Member Author

I should also note that the compilers are, generally, not programmed to handle surrogate pairs. Some surrogate pairs are considered letters, but neither the C# nor the VB compiler recognizes them as such in identifiers. There is, I believe, an open request to fix that. Making the compilers surrogate-aware is a much larger task than just perfecting how we escape them in literals.

Member

> The justification is that the implementation is incorrect, and the test carefully tests only one case for which the incorrect implementation happens to work.

The implementation is incorrect but people may be depending on it. Should we be taking this through compat council?

Member

> The direct translation of Unicode surrogates to UTF-8 fails, which is an obstacle to writing source into a UTF-8 encoded file

Also, I understand this limitation, but not why it is a concern for the language or the compiler. It is a limitation only for the user, in that they have to save the file as UTF-16 or UTF-32; saving it as UTF-8 is invalid and would be an error in the IDE. If the IDE did save it and the C# compiler retrieved such a source file, we would likely error during parsing.

Member Author

@tannergooding The code in question here is not strictly part of the compiler or language implementation. It is part of a set of APIs designed to assist in producing source code. As such it must be concerned with whether the source code it produces can be saved in the most common encodings.

Thanks for calling char.GetUnicodeCategory(string, int) to my attention. That may be helpful here.

Member Author

I should point out that the documentation for char.GetUnicodeCategory(string, int) does not describe its treatment of surrogate pairs representing Unicode code-points. It only describes its behavior for characters. These are different because a surrogate pair is a pair of characters representing a single Unicode code-point.

Member

Yes, the documentation is somewhat lacking. The implementation, however, is here: https://source.dot.net/#System.Private.CoreLib/shared/System/Char.cs,517b834a1717ca04.

It ends up converting the character at index to Utf32 (https://source.dot.net/#System.Private.CoreLib/src/System/Globalization/CharUnicodeInfo.cs,553115e31e8da9c0) and then passes the resulting int to the internal lookup table.

The regular char.GetUnicodeCategory(char) function does effectively the same but cannot handle surrogate pairs due to the limitations of char (which can only represent a single UTF-16 code unit).
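A small demonstration of the difference between the two overloads (🏈 is U+1F3C8, whose Unicode category is OtherSymbol):

using System;

class CategoryDemo
{
    static void Main()
    {
        string football = "\ud83c\udfc8"; // U+1F3C8 as a UTF-16 surrogate pair

        // The (char) overload sees a single code unit and can only say "Surrogate":
        Console.WriteLine(char.GetUnicodeCategory(football[0])); // Surrogate

        // The (string, int) overload combines the pair and classifies U+1F3C8:
        Console.WriteLine(char.GetUnicodeCategory(football, 0)); // OtherSymbol
    }
}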

Member

> It is part of a set of APIs designed to assist in producing source code.

It suddenly makes a lot more sense 😄

gafter (Member Author) commented Jul 13, 2017

@jaredpar Do you have any further comments?

@dotnet/roslyn-compiler May I please have a second review of this small change?

gafter (Member Author) commented Jul 14, 2017

OK, I’ve modified the PR to use a “long Unicode” escape sequence when a nicely-paired set of surrogate characters appears in the input.

gafter (Member Author) commented Jul 15, 2017

@tannergooding suggested using char.GetUnicodeCategory(string, int) to determine if a surrogate pair represents a printable Unicode code-point (which could therefore be placed directly into the resulting string). I'll revise the PR to try that.

gafter (Member Author) commented Jul 16, 2017

OK, printable surrogate pairs are preserved as such in the resulting string.

khyperia (Contributor)

> OK, I’ve modified the PR to use a “long Unicode” escape sequence when a nicely-paired set of surrogate characters appears in the input.

This seems like it has subtle behavior implications that I want to bring up to make sure this is the intended decision. I think it gets complicated because .NET is always(?) UTF-16. Please ask for clarification if my example is unclear (both here and offline work). The two behaviors before/after this change are:

Before:

  1. The user passes in a string containing a high unicode codepoint, requiring surrogates in utf16.
  2. Surrogate pairs are not merged.
  3. They are emitted as \u[high]\u[low].
  4. The compiler later parses this. Nothing changes (still represented as surrogate pair).
  5. The resulting assembly has a surrogate pair.

After:

  1. The user passes in a string containing a high unicode codepoint, requiring surrogates in utf16.
  2. Surrogate pairs are merged.
  3. It is emitted as \U[codepoint].
  4. The compiler later parses this. The high-codepoint is converted to a surrogate pair, to be represented in .net as a utf16 string.
  5. The resulting assembly has a surrogate pair.

The final behavior is identical, but the intermediate steps differ. Imagine a hypothetical C# compiler/runtime that used, say, UTF-8 instead of UTF-16 (is such a thing even possible by spec?): the output binary would differ, because the UTF-8 compiler would not split the code point into a surrogate pair but would encode it directly in UTF-8. I think this is the desired behavior, since the alternative (without surrogate merging) would have the UTF-8 compiler encode the surrogate pair in UTF-8, which is illegal. But I wanted to call it out and make sure I/we understand the change.

(Offtopic: is there a more precise term for "high Unicode code point", meaning a code point that requires surrogate characters when represented in UTF-16? Also, please correct vague or incorrect terminology if you notice it in the above text; I'm still trying to learn the complexities of text encodings.)
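A minimal sketch of the merging step in the "After" sequence above, using the standard UTF-16 decoding formula (variable names are illustrative):

using System;

class MergeDemo
{
    static void Main()
    {
        char high = '\ud83c', low = '\udfc8';

        // Standard UTF-16 decoding: recover the code point from the pair...
        int codepoint = 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);

        // ...and emit it as a single long-form escape (step 3 of "After"):
        Console.WriteLine($"\\U{codepoint:X8}"); // \U0001F3C8
    }
}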

}
else if (NeedsEscaping(category))
{
var unicode = CombineSurrogates(c, value[++i]);
Member

Could use char.ConvertToUtf32(value, i) or char.ConvertToUtf32(value[i], value[i + 1]).

Member Author

Didn't know about that API; thanks!
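For reference, a quick sketch showing that the two suggested overloads agree on a well-formed pair (not the PR's actual code):

using System;

class ConvertDemo
{
    static void Main()
    {
        string value = "\ud83c\udfc8";
        int i = 0;

        // Both overloads recover the same code point from the pair:
        Console.WriteLine(char.ConvertToUtf32(value, i));               // 127944 (0x1F3C8)
        Console.WriteLine(char.ConvertToUtf32(value[i], value[i + 1])); // 127944
    }
}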

@@ -462,6 +469,12 @@ Namespace Microsoft.CodeAnalysis.VisualBasic.ObjectDisplay
Yield Quotes()
Else
Yield Character(c)
If copyPair Then
' copy the second character of a unciode surrogate pair
Member

Typo: Unicode

Assert.Equal("ChrW(8232) & ""x""", literal.Text)
literal = SyntaxFactory.Literal(ChrW(&HDBFF)) ' U+DBFF is a unicode surrogate
Assert.Equal("ChrW(56319)", literal.Text)
End Sub
Member

Consider testing ObjectDisplay.FormatLiteral directly. Perhaps:

Assert.Equal("...",
    ObjectDisplay.FormatLiteral(
        ChrW(&HD83C) & ChrW(&HDFC8),
        ObjectDisplayOptions.UseQuotes Or ObjectDisplayOptions.EscapeNonPrintableCharacters))

And a similar test for C# ObjectDisplayTests:

Assert.Equal("...",
    ObjectDisplay.FormatLiteral("\ud83c", ObjectDisplayOptions.EscapeNonPrintableCharacters));
Assert.Equal("...",
    ObjectDisplay.FormatLiteral("\ud83c\udfc8", ObjectDisplayOptions.EscapeNonPrintableCharacters));

Member Author

ObjectDisplay is not a public API. But it is used in SymbolDisplay.FormatPrimitive, which I can test.

Member

The methods in this class are directly testing ObjectDisplay.

gafter (Member Author) commented Jul 18, 2017

@khyperia You are not correctly stating the "before" behavior. We previously did not escape (or even detect the presence of) surrogates in any way. We just copied them into the resulting string, not even checking if they are properly paired. That was the bug being fixed here.

gafter (Member Author) commented Jul 18, 2017

I should also mention that, conceptually, there is no such thing as a UTF-8 representation of a surrogate character. Surrogates are part of the UTF-16 representation of "wide" characters, and UTF-8 has its own representation for wide characters; a UTF-8 sequence that mimics the UTF-16 surrogate sequence is ill-formed. See http://unicodebook.readthedocs.io/issues.html#strict-utf8-decoder for details.
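This can be observed directly with .NET's encoders; a small sketch (the byte sequences are the standard UTF-8 encodings of U+1F3C8 and U+FFFD):

using System;
using System.Text;

class Utf8Demo
{
    static void Main()
    {
        // A well-formed pair encodes to UTF-8's own four-byte form:
        Console.WriteLine(BitConverter.ToString(Encoding.UTF8.GetBytes("\ud83c\udfc8"))); // F0-9F-8F-88

        // A lone surrogate has no legal UTF-8 form; the default encoder
        // substitutes U+FFFD, and a strict encoder throws instead.
        Console.WriteLine(BitConverter.ToString(Encoding.UTF8.GetBytes("\ud83c")));       // EF-BF-BD

        var strict = new UTF8Encoding(encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: true);
        try { strict.GetBytes("\ud83c"); }
        catch (EncoderFallbackException) { Console.WriteLine("not encodable"); }
    }
}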

khyperia (Contributor)

> You are not correctly stating the "before" behavior.

Right, yes, my bad; I was a bit vague. I meant "before this specific commit in particular", or perhaps rather "this entire PR but with the surrogate joining left out".


I think my comment boils down to: what happens when you round-trip source-to-source? (Parse, pass to this API, emit to source again; at least I am assuming that is a valid operation for this API, I'm not exactly sure what it's doing.)

Imagine we have two strings:

var x = "\ud83c\udfc8";

and

var x = "\U0001F3C8";

(both should be the same football emoji, so just imagine if they're not 🙂)

When converted to binary forms (instead of ASCII escapes), the two strings are indistinguishable. So later, when round-tripping back to source, how do we know which form to emit in this API? This PR's answer is "always choose the merged version".
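A quick sketch confirming that indistinguishability at runtime:

using System;

class RoundTripDemo
{
    static void Main()
    {
        string paired = "\ud83c\udfc8";
        string merged = "\U0001F3C8";

        // Both spellings denote the same UTF-16 code-unit sequence:
        Console.WriteLine(paired == merged); // True
        Console.WriteLine(merged.Length);    // 2 (the \U escape still yields two code units)
    }
}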


The more I think about it, the more I realize that emitting the paired version is never the right thing to do. I'll leave the above thoughts, because it takes a bit to get to that conclusion, and I want to leave a trail for anyone else to follow. (I've still not 100% convinced myself, but I think that's just me being stubborn.)

That conclusion leaves some interesting questions about other places, though. For example, do we do Unicode analysis on string literals? What do we do if we encounter invalid Unicode in a string literal? What do we do when we encounter a valid UTF-16 surrogate pair encoded in UTF-8 source (which seems very interesting if we do not error on invalid UTF-8 source)? (These questions are irrelevant to this PR; I'm just genuinely curious and wondering out loud.)

gafter (Member Author) commented Jul 18, 2017

> I think my comment boils down to: what happens when you round-trip source-to-source?

When the literal comes from source, we do not compute what source should be used for it. We use whatever was written in the actual source, unchanged, because we save it as part of the token. This code only comes into play when you hand-build tokens without providing source, letting the compiler APIs select a source representation for you.

When we are reading a UTF-8 source file, we must decode it into our internal UTF-16 representation. A "UTF-8 representation of a Unicode surrogate" isn't legal UTF-8, so I would expect the UTF-8 decoder to choke on the source. UTF-8 has its own representation for wide Unicode code points, which gets converted to a surrogate pair when decoded into UTF-16.
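A sketch of the distinction (the exact spelling chosen for the hand-built token depends on this PR's final logic, so it is not asserted here):

using System;
using Microsoft.CodeAnalysis.CSharp;

class TokenDemo
{
    static void Main()
    {
        // Parsed from source: the token keeps the original spelling verbatim.
        var parsed = SyntaxFactory.ParseExpression("\"\\ud83c\\udfc8\"");
        Console.WriteLine(parsed.ToFullString()); // "\ud83c\udfc8", exactly as written

        // Hand-built with no source: the API chooses a spelling for us.
        var built = SyntaxFactory.Literal("\ud83c\udfc8");
        Console.WriteLine(built.Text);
    }
}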

cston (Member) commented Jul 28, 2017

LGTM

gafter (Member Author) commented Jul 28, 2017

test windows_debug_vs-integration_prtest please

gafter (Member Author) commented Jul 29, 2017

@khyperia Do you approve of this PR in its current form?

khyperia (Contributor)

LGTM.

> I would expect the UTF-8 decoder to choke on the source.

Some quick testing suggests there are a couple of bugs around this. For example, the docs say "If the source code files were created with the same codepage that is in effect on your computer or if the source code files were created with UNICODE or UTF-8, you need not use /codepage.", but I've found an example where not specifying a codepage and specifying /codepage:65001 (UTF-8) produce different behavior. (I think the eventual logic boils down to crazy stuff happening deep inside the Encoding class in the stdlib.) That's completely irrelevant to this PR, though; I just thought it was interesting.

@gafter gafter merged commit e7869bd into dotnet:master Jul 31, 2017
@gafter gafter added Resolution-Fixed The bug has been fixed and/or the requested behavior has been implemented and removed 4 - In Review A fix for the issue is submitted for review. labels Jul 31, 2017
333fred added a commit to 333fred/roslyn that referenced this pull request Aug 4, 2017
…sion-expression-rewrite

* dotnet/features/ioperation: (229 commits)
  ...
  When producing character literals, surrogate characters should be escaped. (dotnet#20720)
  ...
Labels
Area-Compilers Bug cla-already-signed Resolution-Fixed The bug has been fixed and/or the requested behavior has been implemented

Successfully merging this pull request may close these issues.

Literal should escape surrogate unicode
8 participants