-
Notifications
You must be signed in to change notification settings - Fork 4.2k
Unify raw string lexing and parsing #80817
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Unify raw string lexing and parsing #80817
Conversation
- Create RawStringIndentationHelper with shared logic for both lexer and parser - Move CheckForSpaceDifference, CharToString, and StartsWith to helper class - Update Lexer_RawStringLiteral.cs to use helper methods - Update LanguageParser_InterpolatedString.cs to use helper methods - Tests pass: 165 RawStringLiteralLexingTests + 37 RawStringLiteralCompilingTests Co-authored-by: CyrusNajmabadi <4564579+CyrusNajmabadi@users.noreply.github.com>
Add UTF-8 BOM to match encoding requirements of other C# files in the project Co-authored-by: CyrusNajmabadi <4564579+CyrusNajmabadi@users.noreply.github.com>
|
@dotnet/roslyn-compiler this is ready for review. |
| @@ -11720,10 +11720,6 @@ ExpressionSyntax parsePrimaryExpressionWithoutPostfix(Precedence precedence) | |||
| case SyntaxKind.NumericLiteralToken: | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Disable whitespace diff to understand best what changed (some methods were very stripped down, and without that, git thinks they're entirely removed, and then readded).
| /// <summary> | ||
| /// Converts a whitespace character to its string representation for error messages. | ||
| /// </summary> | ||
| private static string CharToString(char ch) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is jsut a move. the lexer doesn't need it anymore
|
@dotnet/roslyn-compiler this is ready for review. |
|
@jjonescz ptal. |
|
@dotnet/roslyn-compiler this is ready for review. |
| Debug.Assert(originalText[0] is '"'); | ||
| Debug.Assert(originalText[1] is '"'); | ||
| Debug.Assert(originalText[2] is '"'); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit
| Debug.Assert(originalText[0] is '"'); | |
| Debug.Assert(originalText[1] is '"'); | |
| Debug.Assert(originalText[2] is '"'); | |
| Debug.Assert(originalText is ['"', '"', '"', ..]); |
src/Compilers/CSharp/Portable/Parser/LanguageParser_InterpolatedString.cs
Outdated
Show resolved
Hide resolved
|
|
||
| var diagnosticsBuilder = ArrayBuilder<DiagnosticInfo>.GetInstance(); | ||
| // Move any diagnostics on the original token to the new token. | ||
| // diagnosticsBuilder.AddRange(token.GetDiagnostics()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this code commented out?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great question. See the asserts and explanations in e112a7e for more detail.
Basically, this line was never necessary (given an assert a couple lines above that this token cannot have diagnostcs on it). I've also beefed up all teh code to be clearer and assert these invariants in a few places.
src/Compilers/CSharp/Portable/Parser/LanguageParser_InterpolatedString.cs
Outdated
Show resolved
Hide resolved
…edString.cs Co-authored-by: Jan Jones <jan.jones.cz@gmail.com>
…edString.cs Co-authored-by: Jan Jones <jan.jones.cz@gmail.com>
Co-authored-by: Jan Jones <jan.jones.cz@gmail.com>
…edString.cs Co-authored-by: Jan Jones <jan.jones.cz@gmail.com>
src/Compilers/CSharp/Portable/Parser/LanguageParser_InterpolatedString.cs
Outdated
Show resolved
Hide resolved
| } | ||
|
|
||
| internal void ScanInterpolatedStringLiteralTop( | ||
| internal void ScanStringLiteralTop( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't follow why 'Interpolated' was deleted from this name. Wouldn't it make sense to use 'InterpolatedOrRaw' like many of the other methods are doing? Or did I miss something?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure! I can def rename.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah. this is already in the type InterpolatedOrRawStringScanner. So i think it's fine for this to be ScanStringLiteralTop since this is already implied.
| ScanRawInterpolatedStringLiteralEnd(kind, startingQuoteCount); | ||
|
|
||
| if (!_isInterpolatedString) | ||
| _lexer.ScanUtf8Suffix(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is needed because non-interpolated raw strings can now go through this path, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correct
Fixes #80731
Fixes #59044
The issue this is addressing is that both the lexer and the parser have a copy of all the code that reports errors on raw strings. And they both have a copy that determines the content meaning of the raw-string (including figuring out what the meaning is after dedentation).
This is a lot of duplication. And, as it turns out, they were slightly different in a couple of places (like where diagnostics are placed).
This PR unifies all that logic in the following manner:
A good mental model of this is that a raw-normal-string is just a raw-interpolated-string with no interpolations (no amount of
{s starts an interpolation).By doing that, errors are for normal vs interpolated raw strings are unified. As is string value/dedentation computation.
Note: we have exactly 500 tests around lexing/parsing. Only 22 had a tiny change in diagnostic location placement due to the unification. I looked into preserving the exact same semantics. but it was often very strange code that had to be written. Especially due to the nature of all the different string lexing/parsing routines we have. So i opted to ensure the vast majority were the same, while also giving totally fine, but slightly different errors, for a few cases.
Relates to feature #55306