Unify raw string lexing and parsing #80817

CyrusNajmabadi · 2025-10-20T10:54:36Z

The issue this is addressing is that both the lexer and the parser have a copy of all the code that reports errors on raw strings. And they both have a copy that determines the content meaning of the raw-string (including figuring out what the meaning is after dedentation).

This is a lot of duplication. And, as it turns out, they were slightly different in a couple of places (like where diagnostics are placed).

This PR unifies all that logic in the following manner:

1. Similar to 'interpolated strings', we have the lexer be dumb, just determining the start/end of the string token.
2. In the parsing phase we then process the raw-string tokens produced by the lexer.
3. The parser uses the same core logic for parsing a raw interpolated string as it does for a raw normal string. However, during raw-normal parsing, it of course doesn't actually produce interpolations.

A good mental model of this is that a raw-normal-string is just a raw-interpolated-string with no interpolations (no amount of {s starts an interpolation).

By doing that, errors for normal vs interpolated raw strings are unified, as is string value/dedentation computation.
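The mental model can also be seen from the language side. Below is a minimal sketch, assuming a C# 11 or later compiler; the variable names are illustrative and do not come from the PR. It shows a plain raw string and a raw interpolated string with no interpolations producing the same dedented value.

```csharp
using System;

// A plain raw string: braces are ordinary text, and the indentation of the
// closing """ line is removed from every content line.
string plain = """
    {
      "name": "raw"
    }
    """;

// The same text as a raw interpolated string. With two $ signs, a single
// { or } is still ordinary text; only {{ ... }} would start an interpolation.
string interpolated = $$"""
    {
      "name": "raw"
    }
    """;

Console.WriteLine(plain == interpolated);   // True
Console.WriteLine(plain.StartsWith('{'));   // True (leading indentation was removed)
```

Since both forms yield the same value and follow the same indentation rules, it is plausible for one parsing routine to serve both, which is the direction this PR takes.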

Note: we have exactly 500 tests around lexing/parsing. Only 22 had a tiny change in diagnostic location placement due to the unification. I looked into preserving the exact same semantics, but that often required very strange code, especially given all the different string lexing/parsing routines we have. So I opted to keep the vast majority the same, while giving perfectly fine, but slightly different, errors in a few cases.

Relates to feature #55306

- Create RawStringIndentationHelper with shared logic for both lexer and parser
- Move CheckForSpaceDifference, CharToString, and StartsWith to helper class
- Update Lexer_RawStringLiteral.cs to use helper methods
- Update LanguageParser_InterpolatedString.cs to use helper methods
- Tests pass: 165 RawStringLiteralLexingTests + 37 RawStringLiteralCompilingTests

Co-authored-by: CyrusNajmabadi <4564579+CyrusNajmabadi@users.noreply.github.com>

Add UTF-8 BOM to match encoding requirements of other C# files in the project Co-authored-by: CyrusNajmabadi <4564579+CyrusNajmabadi@users.noreply.github.com>

CyrusNajmabadi · 2025-10-20T12:38:58Z

@dotnet/roslyn-compiler this is ready for review.

CyrusNajmabadi · 2025-10-20T12:41:31Z

src/Compilers/CSharp/Portable/Parser/LanguageParser.cs

@@ -11720,10 +11720,6 @@ ExpressionSyntax parsePrimaryExpressionWithoutPostfix(Precedence precedence)
                    case SyntaxKind.NumericLiteralToken:


Disable whitespace diffing to best understand what changed (some methods were stripped down heavily, and without that, git thinks they were entirely removed and then re-added).

CyrusNajmabadi · 2025-10-20T12:42:32Z

src/Compilers/CSharp/Portable/Parser/LanguageParser_InterpolatedString.cs

+        /// <summary>
+        /// Converts a whitespace character to its string representation for error messages.
+        /// </summary>
+        private static string CharToString(char ch)


This is just a move. The lexer doesn't need it anymore.

CyrusNajmabadi · 2025-10-20T12:43:37Z

@dotnet/roslyn-compiler this is ready for review.

CyrusNajmabadi · 2025-10-23T16:40:41Z

@jjonescz ptal.

CyrusNajmabadi · 2025-11-12T06:31:35Z

@dotnet/roslyn-compiler this is ready for review.


jjonescz · 2025-11-12T09:48:43Z

src/Compilers/CSharp/Portable/Parser/LanguageParser_InterpolatedString.cs

+            Debug.Assert(originalText[0] is '"');
+            Debug.Assert(originalText[1] is '"');
+            Debug.Assert(originalText[2] is '"');


nit

Suggested change

-            Debug.Assert(originalText[0] is '"');
-            Debug.Assert(originalText[1] is '"');
-            Debug.Assert(originalText[2] is '"');
+            Debug.Assert(originalText is ['"', '"', '"', ..]);
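For readers less familiar with list patterns, here is a small sketch of how the suggested form behaves. It assumes `originalText` is a `string` (the excerpt does not show its type), requires C# 11 or later, and uses a hypothetical helper `StartsWithThreeQuotes` that is not part of the PR.

```csharp
using System;

// Hypothetical helper, not from the PR. A list pattern checks the length
// before looking at elements, so short inputs yield false instead of throwing.
static bool StartsWithThreeQuotes(string text) => text is ['"', '"', '"', ..];

Console.WriteLine(StartsWithThreeQuotes("\"\"\"abc\"\"\"")); // True
Console.WriteLine(StartsWithThreeQuotes("\"\""));            // False
```

Unlike three separate indexer checks, the single pattern fails the assert cleanly on input shorter than three characters rather than raising an out-of-range exception.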


jjonescz · 2025-11-12T09:51:37Z

src/Compilers/CSharp/Portable/Parser/LanguageParser_InterpolatedString.cs

+
+            var diagnosticsBuilder = ArrayBuilder<DiagnosticInfo>.GetInstance();
+            // Move any diagnostics on the original token to the new token.
+            // diagnosticsBuilder.AddRange(token.GetDiagnostics());


Why is this code commented out?

Great question. See the asserts and explanations in e112a7e for more detail.

Basically, this line was never necessary (given an assert a couple of lines above that this token cannot have diagnostics on it). I've also beefed up all the code to be clearer and to assert these invariants in a few places.


RikkiGibson · 2025-11-12T23:13:14Z

src/Compilers/CSharp/Portable/Parser/Lexer_StringLiteral.cs

            }

-            internal void ScanInterpolatedStringLiteralTop(
+            internal void ScanStringLiteralTop(


I didn't follow why 'Interpolated' was deleted from this name. Wouldn't it make sense to use 'InterpolatedOrRaw' like many of the other methods are doing? Or did I miss something?

Sure! I can def rename.

Ah, this is already in the type InterpolatedOrRawStringScanner. So I think it's fine for this to be ScanStringLiteralTop, since the rest is already implied.

RikkiGibson · 2025-11-12T23:16:05Z

src/Compilers/CSharp/Portable/Parser/Lexer_StringLiteral.cs

                    ScanRawInterpolatedStringLiteralEnd(kind, startingQuoteCount);
+
+                    if (!_isInterpolatedString)
+                        _lexer.ScanUtf8Suffix();


This is needed because non-interpolated raw strings can now go through this path, right?
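As background for the question above: the `u8` suffix is valid on non-interpolated string literals, including raw ones, but cannot be combined with interpolated strings, which is presumably why the scan is guarded by `!_isInterpolatedString`. A minimal sketch, assuming C# 11 or later; the variable name is illustrative and not from the PR.

```csharp
using System;

// A raw string literal with the u8 suffix is a UTF-8 literal
// whose natural type is ReadOnlySpan<byte>.
ReadOnlySpan<byte> bytes = """abc"""u8;

Console.WriteLine(bytes.Length);           // 3
Console.WriteLine(bytes[0] == (byte)'a');  // True
```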

Copilot AI and others added 27 commits October 15, 2025 11:58

Initial plan

cbed0a1

Fix UTF-8 BOM in RawStringIndentationHelper.cs

899b8fd

Add UTF-8 BOM to match encoding requirements of other C# files in the project Co-authored-by: CyrusNajmabadi <4564579+CyrusNajmabadi@users.noreply.github.com>

In progress

20cab7c

Share codE

ad880a1

in progress

eb3a025

Share more code

7a45c73

Simplofy

06cc290

revert

a3bfa39

DeletE

83b2d36

Docs

ae9ad2a

Inference

7311075

Simplify

7284a55

REvefrt

3a70fc7

Less generics

c11c285

Revert

9276cbf

Make local function

fde59a5

make local function

10b03d3

Share code

c56173c

in progress

fed55ba

In progress

b341ead

In progress

b341f26

Always make text token in the raw case

3cad1ff

Fux utf8

beb2a20

Diagnostic location

43640a9

Consistent location for diagnostics

049ba76

Update test helper

ce81246

github-actions bot added the Area-Compilers label Oct 20, 2025

CyrusNajmabadi added 2 commits October 20, 2025 13:11

Fix location of end error

3d7ab1e

Fix location of end error

ff78486

CyrusNajmabadi marked this pull request as ready for review October 20, 2025 12:38

CyrusNajmabadi requested a review from a team as a code owner October 20, 2025 12:38

Inline function

9a776d9

CyrusNajmabadi commented Oct 20, 2025


Debug assert

7bf3599

CyrusNajmabadi commented Oct 20, 2025


Fixup errors

1c39f11

jcouv self-assigned this Oct 20, 2025

jjonescz reviewed Nov 12, 2025


CyrusNajmabadi and others added 6 commits November 12, 2025 11:31

Update src/Compilers/CSharp/Portable/Parser/LanguageParser_Interpolat…

ace5efb

…edString.cs Co-authored-by: Jan Jones <jan.jones.cz@gmail.com>

Update src/Compilers/CSharp/Portable/Parser/LanguageParser_Interpolat…

b6b1c24

…edString.cs Co-authored-by: Jan Jones <jan.jones.cz@gmail.com>

Update src/Compilers/CSharp/Portable/Parser/Lexer_RawStringLiteral.cs

357585f

Co-authored-by: Jan Jones <jan.jones.cz@gmail.com>

Update src/Compilers/CSharp/Portable/Parser/LanguageParser_Interpolat…

189c21f

…edString.cs Co-authored-by: Jan Jones <jan.jones.cz@gmail.com>

Merge remote-tracking branch 'upstream/main' into unifyRawStringLexin…

d801109

…gMore

Clarify and add asserts

e112a7e

CyrusNajmabadi requested a review from jjonescz November 12, 2025 11:21

jjonescz approved these changes Nov 12, 2025


CyrusNajmabadi added 2 commits November 12, 2025 12:43

Cleanup diagnostic computation

1a88596

Tweak assert

2f31307

RikkiGibson reviewed Nov 12, 2025


RikkiGibson approved these changes Nov 12, 2025


CyrusNajmabadi merged commit dc3c2e5 into dotnet:main Nov 13, 2025
25 checks passed

CyrusNajmabadi deleted the unifyRawStringLexingMore branch November 13, 2025 08:10

dotnet-policy-service bot added this to the Next milestone Nov 13, 2025

dotnet-bot mentioned this pull request Nov 15, 2025

[Automated] PRs inserted in VS build main-11215.03 #81262

Closed
