When producing character literals, surrogate characters should be escaped. #20720
Conversation
Looks like legitimate test failures.

```diff
-Assert.Equal(string.Format(format, "🏈"), FormatValue(multiByte));
-Assert.Equal(string.Format(format, "🏈"), FormatValue(multiByte, useHexadecimal: true));
+Assert.Equal(string.Format(format, "\\ud83c\\udfc8"), FormatValue(multiByte));
+Assert.Equal(string.Format(format, "\\ud83c\\udfc8"), FormatValue(multiByte, useHexadecimal: true));
```
This looks like we're introducing a breaking change. What is the justification here?
Emoji are printable characters; I'd like you not to escape them.
The justification is that the implementation is incorrect, and the test carefully tests only one case for which the incorrect implementation happens to work. If you test the existing implementation on a random sequence of data it will be wrong more than 1000 times for every time it gets it right.
The escaping logic in the compilers today works on a character-by-character basis. A single surrogate character does not provide enough information to determine whether escaping is necessary, so escaping it is the only correct approach without considering more context.
It might be worth making the escaping logic more complex, so that it scans the sequence of characters and considers each character in the context of those that precede and follow it to determine whether escaping is necessary. I don't think that is worth the effort.
I should also note that the compilers are, generally, not programmed to handle surrogate pairs. Some surrogate pairs are considered letters, but neither the C# nor the VB compiler recognizes them as such in identifiers. There is, I believe, an open request to fix that. Making the compilers surrogate-aware is a much larger task than just perfecting how we escape surrogates in literals.
> The justification is that the implementation is incorrect, and the test carefully tests only one case for which the incorrect implementation happens to work.
The implementation is incorrect but people may be depending on it. Should we be taking this through compat council?
> The direct translation of Unicode surrogates to UTF-8 fails, which is an obstacle to writing source into a UTF-8 encoded file
Also, I understand this limitation, but not why it is a concern for the language or the compiler. It becomes a limitation only for the user, in that they have to save the file as UTF-16 or UTF-32. Saving it as UTF-8 would be invalid and would be an error in the IDE. If the IDE did save it and the C# compiler retrieved such a source file, we would likely error during parsing.
@tannergooding The code in question here is not strictly part of the compiler or language implementation. It is part of a set of APIs designed to assist in producing source code. As such, it must be concerned with whether the source code it produces can be saved in the most common encodings.
Thanks for calling `char.GetUnicodeCategory(string, int)` to my attention. That may be helpful here.
I should point out that the documentation for `char.GetUnicodeCategory(string, int)` does not describe its treatment of surrogate pairs representing Unicode code points; it only describes its behavior for characters. These are different, because a surrogate pair is a pair of characters representing a single Unicode code point.
Yes, the documentation is somewhat lacking. The implementation, however, is here: https://source.dot.net/#System.Private.CoreLib/shared/System/Char.cs,517b834a1717ca04. It ends up converting the character at `index` to UTF-32 (https://source.dot.net/#System.Private.CoreLib/src/System/Globalization/CharUnicodeInfo.cs,553115e31e8da9c0) and then passes the resulting `int` to the internal lookup table. The regular `char.GetUnicodeCategory(char)` overload does effectively the same thing, but cannot handle surrogate pairs due to the limitations of `char` (which can only represent a single UTF-16 code unit).
> It is part of a set of APIs designed to assist in producing source code.
It suddenly makes a lot more sense 😄
@jaredpar Do you have any further comments? @dotnet/roslyn-compiler May I please have a second review of this small change?
OK, I’ve modified the PR to use a “long Unicode” escape sequence when a nicely-paired set of surrogate characters appears in the input.
@tannergooding suggested using `char.GetUnicodeCategory(string, int)`.
OK, printable surrogate pairs are preserved as such in the resulting string.
This seems like it has subtle behavior implications that I want to bring up to make sure it's the intended decision; I think it gets complicated because .NET is always(?) UTF-16. Please ask for clarification if my example is unclear (both here and offline work). The two behaviors before/after this change are:

Before: the escaper treats the code point as its two UTF-16 surrogate characters and escapes each one individually (two `\u` escapes).

After: a properly paired set of surrogates is merged into a single code point and escaped with one long `\U` escape.

The final behavior is identical, but the intermediate steps differ. Imagining a hypothetical C# compiler/runtime that used, say, UTF-8 instead of UTF-16 (is such a thing even possible by spec?), the output binary would differ: the UTF-8 compiler would not split the code point into surrogate pairs, but rather into its UTF-8 encoding. I think this is the desired behavior (it's the sane behavior, since the alternative, without surrogate merging, is for the UTF-8 compiler to encode the surrogate pairs in UTF-8, which is illegal), but I wanted to call it out and make sure I/we understand the change.

(Off-topic: is there a more precise term for "high Unicode code point", meaning "a code point that requires surrogate characters when represented in UTF-16"? Also, please correct vague/incorrect terminology if you notice it in the above text; I'm still trying to learn the complexities of text encodings.)
```csharp
}
else if (NeedsEscaping(category))
{
    var unicode = CombineSurrogates(c, value[++i]);
```
Could use `char.ConvertToUtf32(value, i)` or `char.ConvertToUtf32(value[i], value[i + 1])`.
Didn't know about that API; thanks!
```
@@ -462,6 +469,12 @@ Namespace Microsoft.CodeAnalysis.VisualBasic.ObjectDisplay
    Yield Quotes()
Else
    Yield Character(c)
If copyPair Then
    ' copy the second character of a unciode surrogate pair
```
Typo: Unicode
```vb
Assert.Equal("ChrW(8232) & ""x""", literal.Text)
literal = SyntaxFactory.Literal(ChrW(&HDBFF)) ' U+DBFF is a unicode surrogate
Assert.Equal("ChrW(56319)", literal.Text)
End Sub
```
Consider testing `ObjectDisplay.FormatLiteral` directly. Perhaps:

```vb
Assert.Equal("...",
    ObjectDisplay.FormatLiteral(
        ChrW(&HD83C) & ChrW(&HDFC8),
        ObjectDisplayOptions.UseQuotes Or ObjectDisplayOptions.EscapeNonPrintableCharacters))
```

And a similar test for the C# `ObjectDisplayTests`:

```csharp
Assert.Equal("...",
    ObjectDisplay.FormatLiteral("\ud83c", ObjectDisplayOptions.EscapeNonPrintableCharacters));
Assert.Equal("...",
    ObjectDisplay.FormatLiteral("\ud83c\udfc8", ObjectDisplayOptions.EscapeNonPrintableCharacters));
```
`ObjectDisplay` is not a public API. But it is used in `SymbolDisplay.FormatPrimitive`, which I can test.
The methods in this class are directly testing `ObjectDisplay`.
@khyperia You are not correctly stating the "before" behavior. We previously did not escape (or even detect the presence of) surrogates in any way; we just copied them into the resulting string, without even checking whether they were properly paired. That was the bug being fixed here.
I should also mention that, conceptually, there is no such thing as a UTF-8 representation of a surrogate character. Surrogates are part of the UTF-16 representation of "wide" characters, and UTF-8 has its own representation for wide characters. A UTF-8 sequence representing that UTF-16 sequence is problematic; see http://unicodebook.readthedocs.io/issues.html#strict-utf8-decoder for details.
Right, yes, my bad. I was a bit vague; I meant "before this specific commit in particular", or perhaps rather "this entire PR but with the surrogate joining left out".

I think my comment boils down to: what happens when you round-trip source to source? (Parse, pass to this API, emit to source again. At least I am assuming that is a valid operation for this API; I'm not exactly sure what it's doing.) Imagine we have two strings:

`"\ud83c\udfc8"`

and

`"\U0001F3C8"`

(both should be the same football emoji, so just imagine if they're not 🙂). When converted to binary forms (instead of ASCII escapes), the two strings are indistinguishable. So later, when round-tripping back to source, how do we know which form to emit in this API? This PR's answer is "always choose the merged version".

The more I think about it, the more I realize that emitting the paired version is never the right thing to do. I'll leave the above thoughts, because it takes a bit to get to that conclusion, and I want to leave a trail for anyone else to follow along. (I've still not 100% convinced myself, but I think that's just me being stubborn.)

That conclusion leaves some interesting questions surrounding other places, though. E.g., do we do Unicode analysis on string literals? What do we do if we encounter invalid Unicode in a string literal?
When the literal comes from source, we do not compute what source should be used for it; we use whatever was written in the actual source, unchanged, because we save it as part of the token. This code only comes into play when you hand-build tokens without providing source, letting the compiler APIs select a source representation for you.

When we are reading a UTF-8 source file, we must decode it into our internal UTF-16 representation. A "UTF-8 representation of a Unicode surrogate" isn't legal UTF-8, so I would expect the UTF-8 decoder to choke on the source. UTF-8 has its own representation for wide Unicode code points, which gets converted to a surrogate pair when decoded into UTF-16.
LGTM
test windows_debug_vs-integration_prtest please
@khyperia Do you approve of this PR in its current form? |
LGTM.
Some quick testing suggests there are a couple of bugs around this. For example, we have the docs
Customer scenario
Use the Roslyn compiler APIs to produce a literal for a Unicode character. The generated syntax should be correct (escaped). The bug is that the compiler emits a Unicode surrogate directly into the generated program text.
Bugs this fixes:
Fixes #20693
Workarounds, if any
Client code that produces literals could have a special case to work around the compiler bug.
Risk
Tiny, as the only effect is to handle this previously unhandled Unicode category.
Performance impact
Tiny, if any, as it adds no new code paths.
Is this a regression from a previous update?
No; this is an old Roslyn bug that was only recently noticed.
Root cause analysis:
No test coverage for this category of Unicode character in literals.
How was the bug found?
Customer reported.
@dotnet/roslyn-compiler May I please have a couple of reviews of this tiny bug fix?