When producing character literals, surrogate characters should be escaped. #20720

Merged (8 commits, Jul 31, 2017)

Conversation

gafter (Member) commented Jul 7, 2017

Customer scenario

Use the Roslyn compiler APIs to produce a literal for a Unicode character. The generated syntax should be correct (escaped). The bug is that the compiler emits a Unicode surrogate character directly into the generated program text.
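A minimal sketch of the scenario against the public Roslyn API (the specific character U+D83C is an illustrative choice, not taken from the original report):

using System;
using Microsoft.CodeAnalysis.CSharp;

class CustomerScenario
{
    static void Main()
    {
        // U+D83C is an unpaired high surrogate; before this fix the factory
        // copied it verbatim into the literal text instead of escaping it.
        var literal = SyntaxFactory.Literal('\ud83c');
        Console.WriteLine(literal.Text); // after the fix: an escaped form such as '\ud83c'
    }
}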

Bugs this fixes:

Fixes #20693

Workarounds, if any

Client code that produces literals could include a special case to work around the compiler bug; for example:
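A hypothetical sketch of such a workaround (EscapeChar and its behavior are illustrative, not an existing API):

using Microsoft.CodeAnalysis.CSharp;

static class LiteralWorkaround
{
    // Hypothetical client-side helper: build the literal text by hand
    // when the character is a surrogate, otherwise defer to Roslyn.
    public static string EscapeChar(char c) =>
        char.IsSurrogate(c)
            ? $"'\\u{(int)c:x4}'"            // hand-rolled escape, e.g. '\ud83c'
            : SyntaxFactory.Literal(c).Text; // Roslyn handles the rest
}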

Risk

Tiny, as the only effect is to handle this previously unhandled Unicode category.

Performance impact

Tiny, if any, as it adds no new code paths.

Is this a regression from a previous update?

No; this is an old Roslyn bug that was only recently noticed.

Root cause analysis:

There was no test coverage for this category of Unicode character in literals.

How was the bug found?

Customer reported.

@dotnet/roslyn-compiler May I please have a couple of reviews of this tiny bug fix?

@gafter gafter added 4 - In Review A fix for the issue is submitted for review. Area-Compilers Bug labels Jul 7, 2017
@gafter gafter added this to the 15.5 milestone Jul 7, 2017
jcouv (Member) commented Jul 10, 2017

Looks like legitimate test failures.

@gafter gafter self-assigned this Jul 11, 2017
-Assert.Equal(string.Format(format, "🏈"), FormatValue(multiByte));
-Assert.Equal(string.Format(format, "🏈"), FormatValue(multiByte, useHexadecimal: true));
+Assert.Equal(string.Format(format, "\\ud83c\\udfc8"), FormatValue(multiByte));
+Assert.Equal(string.Format(format, "\\ud83c\\udfc8"), FormatValue(multiByte, useHexadecimal: true));
Member

This looks like we're introducing a breaking change. What is the justification here?

Contributor

Emoji are printable characters. I'd like you not to escape them.

Member Author

The justification is that the implementation is incorrect, and the test carefully tests only one case for which the incorrect implementation happens to work. If you test the existing implementation on a random sequence of data it will be wrong more than 1000 times for every time it gets it right.

The escaping logic in the compilers today works character by character. A single surrogate character does not carry enough information to determine whether escaping is necessary, so escaping it is the only correct approach without considering more context.

It might be worth making the escaping logic more complex, so that it scans the sequence of characters and considers each one in the context of the characters that precede and follow it to determine whether escaping is necessary. I don't think that is worth the effort.
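A minimal sketch of that character-by-character approach (a hypothetical helper, not the compiler's actual code):

using System.Text;

static class Escaper
{
    // Hypothetical per-character escaper: examining one char at a time,
    // a surrogate cannot be proven printable, so escaping is the only
    // correct choice without looking at its neighbor.
    public static string Escape(string value)
    {
        var builder = new StringBuilder();
        foreach (char c in value)
        {
            if (char.IsSurrogate(c))
                builder.AppendFormat("\\u{0:x4}", (int)c);
            else
                builder.Append(c);
        }
        return builder.ToString();
    }
}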

Member Author

I should also note that the compilers are, generally, not programmed to handle surrogate pairs. Some surrogate pairs are considered letters, but neither the C# nor the VB compiler recognizes them as such in identifiers. There is, I believe, an open request to fix that. Making the compilers surrogate-aware is a much larger task than just perfecting how we escape them in literals.

Member

> The justification is that the implementation is incorrect, and the test carefully tests only one case for which the incorrect implementation happens to work.

The implementation is incorrect but people may be depending on it. Should we be taking this through compat council?

Member

> The direct translation of Unicode surrogates to UTF-8 fails, which is an obstacle to writing source into a UTF-8 encoded file

Also, I understand this limitation, but not why it is a concern for the language or the compiler. It is a limitation only for the user, in that they have to save the file as UTF-16 or UTF-32; saving it as UTF-8 is invalid and would be an error in the IDE. If the IDE did save it and the C# compiler retrieved such a source file, we would likely error during parsing.

Member Author

@tannergooding The code in question here is not strictly part of the compiler or language implementation. It is part of a set of APIs designed to assist in producing source code. As such it must be concerned with whether the source code it produces can be saved in the most common encodings.

Thanks for calling char.GetUnicodeCategory(string, int) to my attention. That may be helpful here.

Member Author

I should point out that the documentation for char.GetUnicodeCategory(string, int) does not describe its treatment of surrogate pairs representing Unicode code-points. It only describes its behavior for characters. These are different because a surrogate pair is a pair of characters representing a single Unicode code-point.

Member

Yes, the documentation is somewhat lacking. The implementation, however, is here: https://source.dot.net/#System.Private.CoreLib/shared/System/Char.cs,517b834a1717ca04.

It ends up converting the character at index to Utf32 (https://source.dot.net/#System.Private.CoreLib/src/System/Globalization/CharUnicodeInfo.cs,553115e31e8da9c0) and then passes the resulting int to the internal lookup table.

The regular char.GetUnicodeCategory(char) function does effectively the same but cannot handle surrogate pairs due to the limitations of char (which can only represent a single UTF-16 code unit).
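A small demonstration of the difference between the two overloads (🏈 is U+1F3C8, whose Unicode category is OtherSymbol):

using System;

class CategoryDemo
{
    static void Main()
    {
        string football = "\ud83c\udfc8"; // U+1F3C8 as a UTF-16 surrogate pair

        // The (char) overload sees a single code unit and can only say "Surrogate":
        Console.WriteLine(char.GetUnicodeCategory(football[0])); // Surrogate

        // The (string, int) overload combines the pair and classifies U+1F3C8:
        Console.WriteLine(char.GetUnicodeCategory(football, 0)); // OtherSymbol
    }
}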

Member

> It is part of a set of APIs designed to assist in producing source code.

It suddenly makes a lot more sense 😄

gafter (Member Author) commented Jul 13, 2017

@jaredpar Do you have any further comments?

@dotnet/roslyn-compiler May I please have a second review of this small change?

gafter (Member Author) commented Jul 14, 2017

OK, I’ve modified the PR to use a “long Unicode” escape sequence when a nicely-paired set of surrogate characters appears in the input.

gafter (Member Author) commented Jul 15, 2017

@tannergooding suggested using char.GetUnicodeCategory(string, int) to determine if a surrogate pair represents a printable Unicode code-point (which could therefore be placed directly into the resulting string). I'll revise the PR to try that.

gafter (Member Author) commented Jul 16, 2017

OK, printable surrogate pairs are preserved as such in the resulting string.

khyperia (Contributor)

> OK, I’ve modified the PR to use a “long Unicode” escape sequence when a nicely-paired set of surrogate characters appears in the input.

This seems like it has subtle behavior implications that I want to bring up to make sure this is the intended decision. I think it gets complicated because .NET is always(?) UTF-16. Please ask for clarification if my example is unclear (both here and offline work). The two behaviors before/after this change are:

Before:

  1. The user passes in a string containing a high unicode codepoint, requiring surrogates in utf16.
  2. Surrogate pairs are not merged.
  3. They are emitted as \u[high]\u[low].
  4. The compiler later parses this. Nothing changes (still represented as surrogate pair).
  5. The resulting assembly has a surrogate pair.

After:

  1. The user passes in a string containing a high unicode codepoint, requiring surrogates in utf16.
  2. Surrogate pairs are merged.
  3. It is emitted as \U[codepoint].
  4. The compiler later parses this. The high-codepoint is converted to a surrogate pair, to be represented in .net as a utf16 string.
  5. The resulting assembly has a surrogate pair.

The final behavior is identical, but the intermediate steps differ. Imagine a hypothetical C# compiler/runtime that used, say, UTF-8 instead of UTF-16 (is such a thing even possible by spec?): the output binary would differ, because the UTF-8 compiler would not split the code point into a surrogate pair but would encode it directly in UTF-8. I think this is the desired behavior, since the alternative (without surrogate merging) would have the UTF-8 compiler encode the surrogate pair in UTF-8, which is illegal. But I wanted to call it out and make sure I/we understand the change.

(Offtopic: is there a more precise term for "high Unicode code point", meaning a code point that requires surrogate characters when represented in UTF-16? Also, please correct vague or incorrect terminology if you notice it in the above text; I'm still trying to learn the complexities of text encodings.)
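A minimal sketch of the merging step in the "After" sequence above, using the standard UTF-16 decoding formula (variable names are illustrative):

using System;

class MergeDemo
{
    static void Main()
    {
        char high = '\ud83c', low = '\udfc8';

        // Standard UTF-16 decoding: recover the code point from the pair...
        int codepoint = 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);

        // ...and emit it as a single long-form escape (step 3 of "After"):
        Console.WriteLine($"\\U{codepoint:X8}"); // \U0001F3C8
    }
}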

}
else if (NeedsEscaping(category))
{
var unicode = CombineSurrogates(c, value[++i]);
Member

Could use char.ConvertToUtf32(value, i) or char.ConvertToUtf32(value[i], value[i + 1]).

Member Author

Didn't know about that API; thanks!
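For reference, a quick sketch showing that the two suggested overloads agree on a well-formed pair (not the PR's actual code):

using System;

class ConvertDemo
{
    static void Main()
    {
        string value = "\ud83c\udfc8";
        int i = 0;

        // Both overloads recover the same code point from the pair:
        Console.WriteLine(char.ConvertToUtf32(value, i));               // 127944 (0x1F3C8)
        Console.WriteLine(char.ConvertToUtf32(value[i], value[i + 1])); // 127944
    }
}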

@@ -462,6 +469,12 @@ Namespace Microsoft.CodeAnalysis.VisualBasic.ObjectDisplay
Yield Quotes()
Else
Yield Character(c)
If copyPair Then
' copy the second character of a unciode surrogate pair
Member

Typo: Unicode

Assert.Equal("ChrW(8232) & ""x""", literal.Text)
literal = SyntaxFactory.Literal(ChrW(&HDBFF)) ' U+DBFF is a unicode surrogate
Assert.Equal("ChrW(56319)", literal.Text)
End Sub
Member

Consider testing ObjectDisplay.FormatLiteral directly. Perhaps:

Assert.Equal("...",
    ObjectDisplay.FormatLiteral(
        ChrW(&HD83C) & ChrW(&HDFC8),
        ObjectDisplayOptions.UseQuotes Or ObjectDisplayOptions.EscapeNonPrintableCharacters))

And a similar test for C# ObjectDisplayTests:

Assert.Equal("...",
    ObjectDisplay.FormatLiteral("\ud83c", ObjectDisplayOptions.EscapeNonPrintableCharacters));
Assert.Equal("...",
    ObjectDisplay.FormatLiteral("\ud83c\udfc8", ObjectDisplayOptions.EscapeNonPrintableCharacters));

Member Author

ObjectDisplay is not a public API. But it is used in SymbolDisplay.FormatPrimitive, which I can test.

Member

The methods in this class are directly testing ObjectDisplay.

gafter (Member Author) commented Jul 18, 2017

@khyperia You are not correctly stating the "before" behavior. We previously did not escape (or even detect the presence of) surrogates in any way. We just copied them into the resulting string, not even checking if they are properly paired. That was the bug being fixed here.

gafter (Member Author) commented Jul 18, 2017

I should also mention that, conceptually, there is no such thing as a UTF-8 representation of a surrogate character. Surrogates are part of the UTF-16 representation of "wide" characters, and UTF-8 has its own representation for wide characters; a UTF-8 sequence that mimics the UTF-16 surrogate sequence is ill-formed. See http://unicodebook.readthedocs.io/issues.html#strict-utf8-decoder for details.
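This can be observed directly with .NET's encoders; a small sketch (the byte sequences are the standard UTF-8 encodings of U+1F3C8 and U+FFFD):

using System;
using System.Text;

class Utf8Demo
{
    static void Main()
    {
        // A well-formed pair encodes to UTF-8's own four-byte form:
        Console.WriteLine(BitConverter.ToString(Encoding.UTF8.GetBytes("\ud83c\udfc8"))); // F0-9F-8F-88

        // A lone surrogate has no legal UTF-8 form; the default encoder
        // substitutes U+FFFD, and a strict encoder throws instead.
        Console.WriteLine(BitConverter.ToString(Encoding.UTF8.GetBytes("\ud83c")));       // EF-BF-BD

        var strict = new UTF8Encoding(encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: true);
        try { strict.GetBytes("\ud83c"); }
        catch (EncoderFallbackException) { Console.WriteLine("not encodable"); }
    }
}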

khyperia (Contributor)

> You are not correctly stating the "before" behavior.

Right, yes, my bad; I was a bit vague. I meant "before this specific commit in particular", or perhaps rather "this entire PR but with the surrogate joining left out".


I think my comment boils down to: what happens when you round-trip source-to-source? (Parse, pass to this API, emit to source again; at least I am assuming that is a valid operation for this API, I'm not exactly sure what it's doing.)

Imagine we have two strings:

var x = "\ud83c\udfc8";

and

var x = "\U0001F3C8";

(both should be the same football emoji, so just imagine if they're not 🙂)

When converted to binary forms (instead of ASCII escapes), the two strings are indistinguishable. So later, when round-tripping back to source, how do we know which form to emit in this API? This PR's answer is "always choose the merged version".
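A quick sketch confirming that indistinguishability at runtime:

using System;

class RoundTripDemo
{
    static void Main()
    {
        string paired = "\ud83c\udfc8";
        string merged = "\U0001F3C8";

        // Both spellings denote the same UTF-16 code-unit sequence:
        Console.WriteLine(paired == merged); // True
        Console.WriteLine(merged.Length);    // 2 (the \U escape still yields two code units)
    }
}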


The more I think about it, the more I realize that emitting the paired version is never the right thing to do. I'll leave the above thoughts, because it takes a bit to get to that conclusion, and I want to leave a trail for anyone else to follow. (I've still not 100% convinced myself, but I think that's just me being stubborn.)

That conclusion leaves some interesting questions about other places, though. For example, do we do Unicode analysis on string literals? What do we do if we encounter invalid Unicode in a string literal? What do we do when we encounter a valid UTF-16 surrogate pair encoded in UTF-8 source (which seems very interesting if we do not error on invalid UTF-8 source)? (These questions are irrelevant to this PR; I'm just genuinely curious and wondering out loud.)

gafter (Member Author) commented Jul 18, 2017

> I think my comment boils down to: what happens when you round-trip source-to-source?

When the literal comes from source, we do not compute what source should be used for it. We use whatever was written in the actual source, unchanged, because we save it as part of the token. This code only comes into play when you hand-build tokens without providing source, letting the compiler APIs select a source representation for you.

When we are reading a UTF-8 source file, we must decode it into our internal UTF-16 representation. A "UTF-8 representation of a Unicode surrogate" isn't legal UTF-8, so I would expect the UTF-8 decoder to choke on the source. UTF-8 has its own representation for wide Unicode code points, which gets converted to a surrogate pair when decoded into UTF-16.
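A sketch of the distinction (the exact spelling chosen for the hand-built token depends on this PR's final logic, so it is not asserted here):

using System;
using Microsoft.CodeAnalysis.CSharp;

class TokenDemo
{
    static void Main()
    {
        // Parsed from source: the token keeps the original spelling verbatim.
        var parsed = SyntaxFactory.ParseExpression("\"\\ud83c\\udfc8\"");
        Console.WriteLine(parsed.ToFullString()); // "\ud83c\udfc8", exactly as written

        // Hand-built with no source: the API chooses a spelling for us.
        var built = SyntaxFactory.Literal("\ud83c\udfc8");
        Console.WriteLine(built.Text);
    }
}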

cston (Member) commented Jul 28, 2017

LGTM

gafter (Member Author) commented Jul 28, 2017

test windows_debug_vs-integration_prtest please

gafter (Member Author) commented Jul 29, 2017

@khyperia Do you approve of this PR in its current form?

khyperia (Contributor)

LGTM.

> I would expect the UTF-8 decoder to choke on the source.

Some quick testing suggests there are a couple of bugs around this. For example, the docs say "If the source code files were created with the same codepage that is in effect on your computer or if the source code files were created with UNICODE or UTF-8, you need not use /codepage.", but I've found an example where not specifying a codepage and specifying /codepage:65001 (UTF-8) produce different behavior. (I think the eventual logic boils down to crazy stuff happening deep inside the Encoding class in the stdlib.) That's completely irrelevant to this PR, though; I just thought it was interesting.

@gafter gafter merged commit e7869bd into dotnet:master Jul 31, 2017
@gafter gafter added Resolution-Fixed The bug has been fixed and/or the requested behavior has been implemented and removed 4 - In Review A fix for the issue is submitted for review. labels Jul 31, 2017
333fred added a commit to 333fred/roslyn that referenced this pull request Aug 4, 2017
…sion-expression-rewrite

* dotnet/features/ioperation: (229 commits)
  ...
  When producing character literals, surrogate characters should be escaped. (dotnet#20720)
  ...
Labels
Area-Compilers Bug cla-already-signed Resolution-Fixed The bug has been fixed and/or the requested behavior has been implemented

Successfully merging this pull request may close these issues.

Literal should escape surrogate unicode
8 participants