@CyrusNajmabadi
Member

@CyrusNajmabadi CyrusNajmabadi commented Oct 20, 2025

Fixes #80731
Fixes #59044

The issue this addresses is that both the lexer and the parser have their own copy of all the code that reports errors on raw strings. They also each have a copy of the code that determines the content of the raw string (including figuring out its value after dedentation).

This is a lot of duplication. And, as it turns out, the two copies were slightly different in a couple of places (like where diagnostics are placed).

This PR unifies all that logic in the following manner:

  1. Similar to 'interpolated strings', we have the lexer be dumb, just determining the start/end of the string token.
  2. In the parsing phase, we then process the raw-string tokens produced by the lexer.
  3. The parser uses the same core logic for parsing a raw interpolated string as it does for a raw normal string. However, of course, during raw-normal parsing it doesn't actually produce interpolations.

A good mental model of this is that a raw-normal-string is just a raw-interpolated-string with no interpolations (no number of { characters starts an interpolation).

By doing that, errors for normal vs interpolated raw strings are unified, as is string value/dedentation computation.
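
To make the mental model concrete, here is a small sketch (not taken from the PR; the variable names are illustrative) of a raw normal string and a raw interpolated string that contains no interpolation holes. It only demonstrates the language-level behavior that both forms follow the same dedentation rules and so produce the same value; it says nothing about how the compiler implements that internally.

```csharp
using System;

// A raw string literal: the column of the closing """ determines how much
// leading whitespace is stripped from every content line.
string normal = """
    {
      "name": "value"
    }
    """;

// The same text as a raw interpolated string. With $$, only {{ starts an
// interpolation, so the single braces below are plain content.
string interpolated = $$"""
    {
      "name": "value"
    }
    """;

// Both literals are dedented the same way, so their values are equal.
Console.WriteLine(normal == interpolated); // prints True
```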

Note: we have exactly 500 tests around lexing/parsing. Only 22 had a tiny change in diagnostic location placement due to the unification. I looked into preserving the exact same semantics, but it often required very strange code, especially given all the different string lexing/parsing routines we have. So I opted to ensure the vast majority stayed the same, while giving totally fine, but slightly different, errors for a few cases.

Relates to feature #55306

Copilot AI and others added 27 commits October 15, 2025 11:58
- Create RawStringIndentationHelper with shared logic for both lexer and parser
- Move CheckForSpaceDifference, CharToString, and StartsWith to helper class
- Update Lexer_RawStringLiteral.cs to use helper methods
- Update LanguageParser_InterpolatedString.cs to use helper methods
- Tests pass: 165 RawStringLiteralLexingTests + 37 RawStringLiteralCompilingTests

Co-authored-by: CyrusNajmabadi <4564579+CyrusNajmabadi@users.noreply.github.com>
Add UTF-8 BOM to match encoding requirements of other C# files in the project

Co-authored-by: CyrusNajmabadi <4564579+CyrusNajmabadi@users.noreply.github.com>
@CyrusNajmabadi CyrusNajmabadi marked this pull request as ready for review October 20, 2025 12:38
@CyrusNajmabadi CyrusNajmabadi requested a review from a team as a code owner October 20, 2025 12:38
@CyrusNajmabadi
Member Author

@dotnet/roslyn-compiler this is ready for review.

@@ -11720,10 +11720,6 @@ ExpressionSyntax parsePrimaryExpressionWithoutPostfix(Precedence precedence)
case SyntaxKind.NumericLiteralToken:
Member Author

View the diff with whitespace changes hidden to best understand what changed (some methods were stripped down heavily, and without that, git thinks they were entirely removed and then re-added).

/// <summary>
/// Converts a whitespace character to its string representation for error messages.
/// </summary>
private static string CharToString(char ch)
Member Author

This is just a move. The lexer doesn't need it anymore.

@CyrusNajmabadi
Member Author

@dotnet/roslyn-compiler this is ready for review.

@jcouv jcouv self-assigned this Oct 20, 2025
@CyrusNajmabadi
Member Author

@jjonescz ptal.

@CyrusNajmabadi
Member Author

@dotnet/roslyn-compiler this is ready for review.

Comment on lines 29 to 31
Debug.Assert(originalText[0] is '"');
Debug.Assert(originalText[1] is '"');
Debug.Assert(originalText[2] is '"');
Member

nit

Suggested change
- Debug.Assert(originalText[0] is '"');
- Debug.Assert(originalText[1] is '"');
- Debug.Assert(originalText[2] is '"');
+ Debug.Assert(originalText is ['"', '"', '"', ..]);
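
For readers unfamiliar with list patterns, the following sketch (not from the PR) compares the two forms. It assumes originalText is a plain string, which the snippet above does not confirm; the sample values and the helper variable names are hypothetical. One difference worth noting is that the list pattern simply fails to match on input shorter than three characters, whereas the indexed form would throw.

```csharp
using System;

// Assumption: originalText is a string; the real type in the PR may differ.
string originalText = "\"\"\"abc\"\"\"";

// Three separate indexed checks, as in the original asserts.
bool byIndex = originalText[0] is '"'
    && originalText[1] is '"'
    && originalText[2] is '"';

// One list pattern: the first three elements are quotes, followed by anything.
bool byPattern = originalText is ['"', '"', '"', ..];

Console.WriteLine(byIndex == byPattern); // prints True

// On a too-short input the pattern is just false; indexing [2] would throw.
string tooShort = "\"\"";
bool shortMatches = tooShort is ['"', '"', '"', ..];
Console.WriteLine(shortMatches); // prints False
```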


var diagnosticsBuilder = ArrayBuilder<DiagnosticInfo>.GetInstance();
// Move any diagnostics on the original token to the new token.
// diagnosticsBuilder.AddRange(token.GetDiagnostics());
Member

Why is this code commented out?

Member Author

Great question. See the asserts and explanations in e112a7e for more detail.

Basically, this line was never necessary (given an assert a couple of lines above that this token cannot have diagnostics on it). I've also beefed up all the code to be clearer and to assert these invariants in a few places.

}

- internal void ScanInterpolatedStringLiteralTop(
+ internal void ScanStringLiteralTop(
Member

I didn't follow why 'Interpolated' was deleted from this name. Wouldn't it make sense to use 'InterpolatedOrRaw' like many of the other methods are doing? Or did I miss something?

Member Author

Sure! I can def rename.

Member Author

Ah. This is already in the type InterpolatedOrRawStringScanner, so I think it's fine for this to be ScanStringLiteralTop since that is already implied.

ScanRawInterpolatedStringLiteralEnd(kind, startingQuoteCount);

if (!_isInterpolatedString)
    _lexer.ScanUtf8Suffix();
Member

This is needed because non-interpolated raw strings can now go through this path, right?

Member Author

Correct

@CyrusNajmabadi CyrusNajmabadi merged commit dc3c2e5 into dotnet:main Nov 13, 2025
25 checks passed
@CyrusNajmabadi CyrusNajmabadi deleted the unifyRawStringLexingMore branch November 13, 2025 08:10
@dotnet-policy-service dotnet-policy-service bot added this to the Next milestone Nov 13, 2025

Development

Successfully merging this pull request may close these issues.

raw string lexing/parsing contains lots of duplication.
Investigate reducing raw-string lexing/parsing logic duplication.

4 participants