
Linter plugins: Token-related SourceCode APIs #14829

@overlookmotel


As noted in JS Plugins Milestone 2 (#14826), our first priority is to implement all APIs related to "fix code"-type linter plugins.

Later on, we'll also want to support stylistic/formatting rules, which rely on Token-related APIs in SourceCode. These methods are currently stubbed out in tokens.ts.

JS implementation

Unlike many of the other APIs, I'm unclear whether it's feasible to implement the Token-related APIs by pulling in a JS dependency.

@typescript-eslint/parser does provide an array of Tokens from its parser, but it uses TypeScript's parser under the hood. Parsing every file again on the JS side using TypeScript would be really slow.

espree (ESLint's parser) has a tokenize() method. That doesn't parse, only tokenizes, so its performance might be acceptable (according to Claude, anyway). However, will it work for TS? Espree can't parse TS, but maybe that doesn't matter when you're only asking it to tokenize? (Claude says yes, but I don't entirely believe him!)

If espree can work for us, I think this should be our first port of call, in the interests of getting a working implementation out quickly.

Laziness

Like we do with comments, we'd want to lazily call tokenize(), only when one of the token-related APIs is first used, so that this work is skipped when no JS rule requires tokens.

Rust implementation

This is what we'll want to do in the end, as it'll be faster. However, it's challenging.

Altering the lexer

Oxc's lexer generates a stream of Tokens; however, they're not stored anywhere. We'd need to adapt it to also store all the Tokens in an oxc_allocator::Vec in the arena. That Vec can then be shared with JS through the same "raw transfer" mechanism as is used for the AST.

One difficulty is that the lexer sometimes has to rewind where syntax is ambiguous reading forwards through the file. e.g. is (a, b) a SequenceExpression, or the start of an arrow function (a, b) => ...? When rewinding occurs, we'd need to also rewind the stream of Tokens, so they don't get included twice in these cases (the checkpoint/rewind methods in the sketch below handle this).

Avoiding perf hit

Oxc's parser is not only used in the linter - it's the first step of the parse-transform-minify-print and parse-format pipelines. Its performance in these pipelines is critical, and hurting perf while implementing what we need for the linter would be unacceptable.

We'll need some kind of conditional compilation, so the linter uses a different version of the parser from the other pipelines.

Cargo features are problematic, due to feature unification in tests, so I would propose an approach using generics:

// Sketch assumes these imports; `Token` is oxc's existing token type
// (exact import path illustrative)
use std::marker::PhantomData;
use oxc_allocator::{Allocator, Vec};

// ----- Traits -----

trait ParserConfig {
    type Tokens<'a>: TokenStore<'a>;
}

trait TokenStore<'a> {
    type Checkpoint;

    fn new(allocator: &'a Allocator) -> Self;
    fn push(&mut self, token: Token);
    fn checkpoint(&self) -> Self::Checkpoint;
    fn rewind(&mut self, checkpoint: Self::Checkpoint);
}

// ----- parse-transform-minify-print pipeline -----

struct StandardParserConfig;

impl ParserConfig for StandardParserConfig {
    type Tokens<'a> = NoopTokenStore;
}

struct NoopTokenStore;

impl<'a> TokenStore<'a> for NoopTokenStore {
    type Checkpoint = ();

    fn new(_allocator: &'a Allocator) -> Self { Self }
    fn push(&mut self, _token: Token) {}
    fn checkpoint(&self) {}
    fn rewind(&mut self, _checkpoint: ()) {}
}

// ----- Oxlint -----

struct OxlintParserConfig;

impl ParserConfig for OxlintParserConfig {
    type Tokens<'a> = RealTokenStore<'a>;
}

struct RealTokenStore<'a> {
    tokens: Vec<'a, Token>,
}

impl<'a> TokenStore<'a> for RealTokenStore<'a> {
    type Checkpoint = u32;

    fn new(allocator: &'a Allocator) -> Self {
        Self { tokens: Vec::new_in(allocator) }
    }
    fn push(&mut self, token: Token) {
        self.tokens.push(token);
    }
    fn checkpoint(&self) -> u32 {
        self.tokens.len() as u32
    }
    fn rewind(&mut self, checkpoint: u32) {
        self.tokens.truncate(checkpoint as usize);
    }
}

// Parser becomes generic over `ParserConfig`

pub struct Parser<'a, Config: ParserConfig> {
    // ... (existing fields, which borrow from the `'a` arena, elided)
    marker: PhantomData<Config>,
}

struct ParserImpl<'a, Config: ParserConfig> {
    // ... (existing fields elided)
    tokens: Config::Tokens<'a>,
}

impl<'a, Config: ParserConfig> ParserImpl<'a, Config> {
    fn somewhere_in_parser_or_lexer(&mut self, token: Token) {
        // This is a complete no-op when `Config` is `StandardParserConfig`
        self.tokens.push(token);
    }
}
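
To make the intent concrete, here's a hedged usage sketch. It assumes the config is selected via a type parameter on Parser, and that Parser::new keeps its current (allocator, source_text, source_type) shape - the exact wiring is up for grabs:

use oxc_allocator::Allocator;
use oxc_span::SourceType;

fn example(source_text: &str, source_type: SourceType) {
    let allocator = Allocator::default();

    // parse-transform-minify-print pipeline: `NoopTokenStore`, so every
    // token-storage call compiles to nothing
    let _ret = Parser::<'_, StandardParserConfig>::new(&allocator, source_text, source_type)
        .parse();

    // oxlint: `RealTokenStore` collects tokens into the arena as a side effect
    let _ret = Parser::<'_, OxlintParserConfig>::new(&allocator, source_text, source_type)
        .parse();
}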

Token type

Important note: We do not want to increase the size of the Rust Token type unless absolutely unavoidable. Keeping Token as a single u128 is critical for performance.
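
To guard that invariant, a compile-time size check could be added. The field layout below is purely illustrative, not oxc's actual Token definition:

// Illustrative layout only - not oxc's actual `Token` definition
#[repr(C)]
pub struct Token {
    start: u32,    // byte offset of token start
    end: u32,      // byte offset of token end
    kind: u8,      // token kind
    flags: u8,     // lexer flags
    _pad: [u8; 6], // spare space
}

// Compile-time check that `Token` stays the size of a single `u128`
const _: () = assert!(std::mem::size_of::<Token>() == 16);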

Why the additional ParserConfig abstraction?

The above code example would work without ParserConfig. Parser and ParserImpl could both be generic over TokenStore, and we could "cut out the middleman" of ParserConfig - for this immediate purpose, it adds no further value.

The reason for ParserConfig is that I think it could be useful for other purposes further down the line. e.g. at some point fairly soon, we'll want to move construction of the UTF-8 to UTF-16 translation table into the lexer. We'd want that to be generic too, so its cost is only paid by the linter. We can add it to ParserConfig, and so avoid 2 generic params everywhere (Parser<T: TokenStore, U: UtfTranslationTable>).

It'd be nice to be able to make that change later on without loads of code churn - introducing ParserConfig now enables that, as sketched below.
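
A hedged sketch of what that might look like - Utf16TranslationTable and the Noop*/Real* types are hypothetical names, not real oxc APIs:

// Hypothetical trait: builds the UTF-8 to UTF-16 offset translation table
trait Utf16TranslationTable { /* construction methods elided */ }

struct NoopUtf16Table;
impl Utf16TranslationTable for NoopUtf16Table {}

struct RealUtf16Table { /* table data elided */ }
impl Utf16TranslationTable for RealUtf16Table {}

// `ParserConfig` grows a second associated type; `Parser<'a, Config>`'s
// public signature doesn't change at all
trait ParserConfig {
    type Tokens<'a>: TokenStore<'a>;
    type Utf16Table: Utf16TranslationTable;
}

impl ParserConfig for StandardParserConfig {
    type Tokens<'a> = NoopTokenStore;
    type Utf16Table = NoopUtf16Table; // zero cost for non-linter pipelines
}

impl ParserConfig for OxlintParserConfig {
    type Tokens<'a> = RealTokenStore<'a>;
    type Utf16Table = RealUtf16Table; // built during lexing, only paid for by oxlint
}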

Translation to ESLint Token format

Similar to how we had to build a translation mechanism from Oxc's Rust AST to the ESTree standard, we'll likely need to translate from Oxc's Token format to ESLint's format. What's needed here requires research.

The big question is: our current Token type doesn't include the content of tokens, only their start and end positions, the token's Kind (type), and some flags. Can we avoid altering the Token type, and get token content (e.g. the content of a string literal token) on the JS side by slicing the source text using start and end?
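
If the answer is yes, the Rust side needs nothing beyond the spans it already has. A minimal sketch (token_text is a hypothetical helper, not an existing API):

// Recover a token's text from its span alone - no content stored in `Token`.
// `start`/`end` are UTF-8 byte offsets, so the JS side would need to map
// them through the UTF-8 to UTF-16 translation table mentioned above
// before slicing with `sourceText.slice(start16, end16)`.
fn token_text<'a>(source_text: &'a str, start: u32, end: u32) -> &'a str {
    &source_text[start as usize..end as usize]
}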
