Merged
15 changes: 12 additions & 3 deletions src/Compilers/CSharp/Portable/Parser/AbstractLexer.cs
@@ -4,8 +4,6 @@

using System;
using System.Collections.Generic;
- using Microsoft.CodeAnalysis.CSharp.Symbols;
- using Microsoft.CodeAnalysis.CSharp.Syntax;
using Microsoft.CodeAnalysis.Text;

namespace Microsoft.CodeAnalysis.CSharp.Syntax.InternalSyntax
@@ -21,6 +19,8 @@ protected AbstractLexer(SourceText text)
this.TextWindow = new SlidingTextWindow(text);
}

+ protected int LexemeStartPosition => this.TextWindow.LexemeStartPosition;
Member Author

The intent is to move LexemeStartPosition into the lexer, so that only the lexer cares about lexemes and the text window only cares about being a fast streaming sequence of chars.

Member

What is a lexeme? Is that like a token?

Member Author

@CyrusNajmabadi Jul 1, 2025

Sort of, and I can probably doc it. It's "the entity the lexer is currently producing". This is commonly the text of BOTH trivia AND tokens (a token's text without its trivia).

It's what you generally expect to get back if you ask the Token/Trivia for its .Text property (not .FullText, and not .ValueText).

Ignoring things like directives, the lexer is generally pointing at some position in the source, and it will 'start' lexing a 'lexeme' at that point. It consumes forward, based on certain rules about what it is currently consuming, until it 'finishes' that lexeme, at which point it generates a result (a token or trivia in the majority of cases). That result is given a Kind, Text, and potentially other bits and bobs attached to it.

The goal here is to make the sliding-text-window care absolutely not one whit about lexer concepts, and keep itself only in the domain of making character-retrieval efficient. So lexemes and the like move up entirely to the lexer. This actually simplifies a bunch, and makes it harder to get things wrong.

For example, in the last year, there was a tweak to the sliding text window to allow it to look backwards. However, because the window itself was tracking lexemes, it could get into a corrupt state when it did that, leading to bad results being returned upwards in edge-case scenarios. This split would help avoid that.
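
A minimal sketch of that start/consume/finish loop, assuming hypothetical `CharWindow` and `SketchLexer` types. These are illustrative stand-ins only, not the real `SlidingTextWindow` or `AbstractLexer`, and the whitespace-based splitting rule is an assumption chosen to keep the example short:

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for a text window: it only streams characters
// and tracks its own position. It knows nothing about lexemes.
public sealed class CharWindow
{
    private readonly string _text;
    public CharWindow(string text) => _text = text;

    public int Position { get; private set; }
    public bool AtEnd => Position >= _text.Length;
    public char Peek() => AtEnd ? '\0' : _text[Position];
    public void Advance() => Position++;

    // The caller supplies the start; the window does not remember it.
    public string GetText(int start) => _text.Substring(start, Position - start);
}

// Hypothetical lexer: it alone owns the lexeme start position.
public sealed class SketchLexer
{
    private readonly CharWindow _window;
    private int _lexemeStart;

    public SketchLexer(string text) => _window = new CharWindow(text);

    // Splits the input into whitespace runs ("Trivia") and
    // non-whitespace runs ("Token"). Each run is one lexeme.
    public IEnumerable<(string Kind, string Text)> Lex()
    {
        while (!_window.AtEnd)
        {
            _lexemeStart = _window.Position;               // 'start' a lexeme
            bool isSpace = char.IsWhiteSpace(_window.Peek());
            while (!_window.AtEnd && char.IsWhiteSpace(_window.Peek()) == isSpace)
            {
                _window.Advance();                         // consume forward
            }
            // 'finish' the lexeme: its text runs from the start to the current position.
            yield return (isSpace ? "Trivia" : "Token", _window.GetText(_lexemeStart));
        }
    }
}
```

Because the window in this sketch holds no lexeme state, moving its position backwards or forwards cannot leave lexeme bookkeeping out of sync; only the lexer's `_lexemeStart` field decides where the current lexeme begins.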

Member

Thanks!

Member Author

TLDR:

It's the smallest piece of text the lexer grabs out as an individual string to jam into either a Token or Trivia. It is indivisible.


public virtual void Dispose()
{
this.TextWindow.Dispose();
@@ -131,9 +131,18 @@ protected XmlSyntaxDiagnosticInfo MakeError(int position, int width, XmlParseErr

private int GetLexemeOffsetFromPosition(int position)
{
- return position >= TextWindow.LexemeStartPosition ? position - TextWindow.LexemeStartPosition : position;
+ return position >= LexemeStartPosition ? position - LexemeStartPosition : position;
}

+ protected string GetNonInternedLexemeText()
+ => TextWindow.GetText(intern: false);

+ protected string GetInternedLexemeText()
+ => TextWindow.GetText(intern: true);
Member Author

These helpers are here because GetText implicitly uses LexemeStartPosition. Once that is removed from the text window itself, it will need to be passed in (as the start position to read from, up to the text window's current position). This way, instead of having to update a huge number of call sites, only this one is updated.
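
Roughly what that follow-up could look like, as a sketch under stated assumptions: `PlainWindow`, `LexerBase`, and `DemoLexer` are hypothetical names, and the `GetText(int start, bool intern)` signature is a guess at the eventual shape, not the actual API. Interning here uses `string.Intern` purely for illustration:

```csharp
using System;

// Hypothetical window after the split: GetText needs an explicit start.
public sealed class PlainWindow
{
    private readonly string _text;
    public PlainWindow(string text) => _text = text;

    public int Position { get; set; }

    // Returns the text from 'start' up to the current position.
    public string GetText(int start, bool intern)
    {
        string s = _text.Substring(start, Position - start);
        return intern ? string.Intern(s) : s;
    }
}

public abstract class LexerBase
{
    protected readonly PlainWindow Window;

    // Tracked by the lexer, not by the window.
    protected int LexemeStartPosition;

    protected LexerBase(string text) => Window = new PlainWindow(text);

    // The only members that know how lexeme text is fetched from the window.
    protected string GetNonInternedLexemeText()
        => Window.GetText(LexemeStartPosition, intern: false);

    protected string GetInternedLexemeText()
        => Window.GetText(LexemeStartPosition, intern: true);

    protected int CurrentLexemeWidth
        => Window.Position - LexemeStartPosition;
}

public sealed class DemoLexer : LexerBase
{
    public DemoLexer(string text) : base(text) { }

    // Pretends a lexeme spans [start, end) and reads it through the helpers.
    public (string Text, int Width) Read(int start, int end)
    {
        LexemeStartPosition = start;
        Window.Position = end;
        return (GetInternedLexemeText(), CurrentLexemeWidth);
    }
}
```

Callers such as `DemoLexer.Read` never mention the start position when asking for text, so changing how the window is queried only touches the two helper bodies.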

Member Author

Note: I wanted all lexeme-oriented operations to have that in their name. It's not at all evident what "TextWindow.GetText" or "TextWindow.Width" even means. Names like "CurrentLexemeWidth" make it much clearer that they refer to the length of the current token being lexed out.


+ protected int CurrentLexemeWidth
+ => this.TextWindow.Position - LexemeStartPosition;

protected static SyntaxDiagnosticInfo MakeError(ErrorCode code)
{
return new SyntaxDiagnosticInfo(code);