feat(regular_expression): Intro `ConstructorParser`(and `LiteralParser`) to handle escape sequence in RegExp('pat') #6635

leaysgur · 2024-10-17T02:59:55Z

Preparation for #6141

oxc_regular_expression can already parse and validate both /regexp-literal/ and new RegExp("string-literal").

But one thing that is not well-supported was reporting Span for the RegExp("string-literal-with-\\escape") case.

For example, these two cases produce the same RegExp instances in JavaScript:

/\d+/
new RegExp("\\d+")

For now, mainly in oxc_linter, the latter case is parsed with oxc_parser -> ast::literal::StringLiteral AST node -> value property.

At this point, escape sequences are resolved(!), oxc_regular_expression can handle aligned &str as an argument without any problem in both cases.

However, in terms of Span representation, these cases should be handled differently because of the \\ in string literals...

As a result, the parsed AST's Span for new RegExp("string-literal") is not accurate if it contains escape sequences.

e.g.

oxc/crates/oxc_linter/src/snapshots/no_invalid_regexp.snap

Lines 118 to 122 in a01a5df

    
            ⚠ eslint(no-invalid-regexp): Invalid regular expression: Invalid braced quantifier 
        
             ╭─[no_invalid_regexp.tsx:1:14] 
        
           1 │ RegExp('\\u{0}*'); 
        
             ·              ─ 
        
             ╰────

Each time the \ appears, the subsequent position is shifted. _ should be placed under * in this case.

So... to resolve this issue, we need to implement string_literal_parser first, and use them as reading units of oxc_regular_expression.

graphite-app · 2024-10-17T03:00:00Z

Your org has enabled the Graphite merge queue for merging into main

Add the label “0-merge” to the PR and Graphite will automatically add it to the merge queue when it’s ready to merge. Or use the label “hotfix” to add to the merge queue as a hot fix.

You must have a Graphite account and log in to Graphite in order to use the merge queue. Sign up using this link.

codspeed-hq · 2024-10-17T03:06:47Z

CodSpeed Performance Report

Merging #6635 will not alter performance

_{Comparing issue-6141 (f8e1907) with main (a7dd5aa)}

Summary

✅ 30 untouched benchmarks

leaysgur · 2024-10-18T01:35:22Z

In the current implementation, the string_literal_parser mod is encapsulated in the oxc_regular_expression crate and only used for pattern text.

But I noticed..., regular expression flags can also be string literal. 😮

new RegExp("x", "\165")
// is equivalent to
new RegExp("x", "u")

(I've never seen this before, but.)

Q1. Should we support string literal flags?

If we want to support this,

Q2. Should we turns out string_literal_parser a crate?
- and oxc_regular_expression expect to be passed parsed results, instead of &str?
- or still encapsulated?

@Boshen Plese take a look when you have time~!

Boshen · 2024-10-18T04:44:09Z

This feels weird, the is exactly what the original lexer did 🤔 Are we supposed to parse the original raw regex text instead?

leaysgur · 2024-10-18T05:59:36Z

Are we supposed to parse the original raw regex text instead?

Yes. #6141 (comment)

We will use StringLiteral(sl) => &sl.span.source_text(&self.source_text) instead of StringLiteral(sl) => &sl.value for RegExp("pat") case.

exactly what the original lexer did

Is there a way to get the detailed position(or token itself?) inside the string literal?

Boshen · 2024-10-18T11:23:12Z

We will use StringLiteral(sl) => &sl.span.source_text(&self.source_text) instead of StringLiteral(sl) => &sl.value for RegExp("pat") case.

This is correct, how should we proceed with this PR?

Is there a way to get the detailed position(or token itself?) inside the string literal?

The lexer does not have them :-(

Boshen · 2024-10-18T11:29:51Z

I checked the spec: https://tc39.es/ecma262/multipage/text-processing.html#sec-regexp-pattern-flags

There is a note:

If pattern is supplied using a StringLiteral, the usual escape sequence substitutions are performed before the String is processed by this function. If pattern must contain an escape sequence to be recognized by this function, any U+005C (REVERSE SOLIDUS) code points must be escaped within the StringLiteral to prevent them being removed when the contents of the StringLiteral are formed.

And in https://tc39.es/ecma262/multipage/text-processing.html#sec-regexpinitialize

If u is true or v is true, then
a. Let patternText be StringToCodePoints(P).
Else,
a. Let patternText be the result of interpreting each of P's 16-bit elements as a Unicode BMP code point. UTF-16 decoding is not applied to the elements.

I don't have the time to understand all these right now, but hope these context can help you to make a better decision.

When i doubt, check the spec 😅

leaysgur · 2024-10-18T13:46:38Z

Sorry; maybe there was a problem with the way I asked. 😓

First, do you think introducing string literal parser is appropriate and worth merging?

The current issue is only the position of the Span in the error reporting.
Since syntax errors themselves are correctly detected, it might not be a problem in most cases.

(It may seem strange to say this after implementing, but) Do you think it's worth continuously maintaining this amount of code?
(I noticed you were concerned that the Lexer is already doing similar things, so I'm confirming just in case.)

Next, about the flags.

If introducing the string literal parser is the way to go, I think we should thoroughly support the flags as well.

So, I wanted to consult on how API design should be.

Should the existence of the string literal parser be encapsulated within the regular expression parser?
- Regular expression parser continues to receive only &str and mode from the use case side.
- Or split APIs like parse_regexp_literal(), parse_string_literal(), etc
Or should we expose the string literal parser and make its usage a responsibility of the use case side?
- Regular expression parser expects Vec<StringLiteralParser::ast::CodePoint> to be passed.

Personally, I think the former is fine, but if there are better ideas, please let me know.

BTW, I hope you enjoy your stay in Japan! 🍣 🍵 🇯🇵

Boshen · 2024-10-18T14:12:01Z

I'm fine with merging this PR and maintain it together, but we need to figure out the proper way of doing this to make it spec compliant in the longer run.

From the API perspective, I think we should expose what's written in the spec, i.e. an API for new Regex, and an API for /regex/. These are the only cases for constructing regexes.

leaysgur · 2024-10-19T00:54:41Z

Thanks! I will reconsider how the API should be.

leaysgur · 2024-10-21T06:07:30Z

I finally settled on this API.

let options = Options { pattern_span_offset, flags_span_offset }; // Both optional

LiteralParser::new(allocator, "year\d+", Some("v"), options).parse()
ConstructorParser::new(allocator, "\"year\\d+\"", Some("\"v\""), options).parse()

To minimize diff, old APIs(Parser and ParserOptions) are still available for outer usages in this PR.

https://github.com/oxc-project/oxc/pull/6635/files#diff-1df796b787321a3cb29c2ffc0a2b5550efab069cbe7a3c60ddb82668b132097bR24

Once this PR is merged, I'm going to remove them and migrate usages in oxc_parser, oxc_linter, etc...

After that, let's update oxc_linter diagnostics if wanted.

Boshen · 2024-10-21T07:07:06Z

Merge activity

Oct 21, 3:07 AM EDT: The merge label '0-merge' was detected. This PR will be added to the Graphite merge queue once it meets the requirements.
Oct 21, 3:07 AM EDT: A user added this pull request to the Graphite merge queue.
Oct 21, 3:11 AM EDT: A user merged this pull request with the Graphite merge queue.

…r`) to handle escape sequence in RegExp('pat') (#6635) Preparation for #6141 `oxc_regular_expression` can already parse and validate both `/regexp-literal/` and `new RegExp("string-literal")`. But one thing that is not well-supported was reporting `Span` for the `RegExp("string-literal-with-\\escape")` case. For example, these two cases produce the same `RegExp` instances in JavaScript: - `/\d+/` - `new RegExp("\\d+")` For now, mainly in `oxc_linter`, the latter case is parsed with `oxc_parser` -> `ast::literal::StringLiteral` AST node -> `value` property. At this point, escape sequences are resolved(!), `oxc_regular_expression` can handle aligned `&str` as an argument without any problem in both cases. However, in terms of `Span` representation, these cases should be handled differently because of the `\\` in string literals... As a result, the parsed AST's `Span` for `new RegExp("string-literal")` is not accurate if it contains escape sequences. e.g. https://github.com/oxc-project/oxc/blob/a01a5dfdafb9cd536cb87867697e3ae43b1990e6/crates/oxc_linter/src/snapshots/no_invalid_regexp.snap#L118-L122 Each time the `\` appears, the subsequent position is shifted. `_` should be placed under `*` in this case. So... to resolve this issue, we need to implement `string_literal_parser` first, and use them as reading units of `oxc_regular_expression`.

Follow up #6635 - [x] Remove old APIs - [x] Update linter usage - [x] Update parser usage - [x] Update transformer usage

github-actions bot added the C-bug Category - Bug label Oct 17, 2024

leaysgur marked this pull request as ready for review October 18, 2024 01:34

Boshen self-assigned this Oct 18, 2024

Boshen self-requested a review October 18, 2024 04:42

leaysgur marked this pull request as draft October 19, 2024 00:54

leaysgur changed the title ~~fix(regular_expression): Intro string_literal_parser to handle escape sequence in RegExp('pat')~~ feat(regular_expression): Intro ConstructorParser(and LiteralParser) to handle escape sequence in RegExp('pat') Oct 21, 2024

github-actions bot added the C-enhancement Category - New feature or request label Oct 21, 2024

leaysgur marked this pull request as ready for review October 21, 2024 06:07

Boshen added the 0-merge Merge with Graphite Merge Queue label Oct 21, 2024

Boshen force-pushed the issue-6141 branch from 55911a6 to f8e1907 Compare October 21, 2024 07:07

graphite-app bot merged commit f8e1907 into main Oct 21, 2024
27 checks passed

graphite-app bot deleted the issue-6141 branch October 21, 2024 07:11

leaysgur mentioned this pull request Oct 21, 2024

fix(regular_expression)!: Migrate to new regexp parser API #6741

Merged

4 tasks

Boshen pushed a commit that referenced this pull request Oct 22, 2024

fix(regular_expression)!: Migrate to new regexp parser API (#6741)

8032813

Follow up #6635 - [x] Remove old APIs - [x] Update linter usage - [x] Update parser usage - [x] Update transformer usage

leaysgur mentioned this pull request Oct 22, 2024

regular_expression: report character span offsets not byte offsets #6141

Closed

oxc-bot mentioned this pull request Oct 24, 2024

release(crates): v0.33.0 #6843

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(regular_expression): Intro `ConstructorParser`(and `LiteralParser`) to handle escape sequence in RegExp('pat') #6635

feat(regular_expression): Intro `ConstructorParser`(and `LiteralParser`) to handle escape sequence in RegExp('pat') #6635

leaysgur commented Oct 17, 2024 •

edited

Loading

graphite-app bot commented Oct 17, 2024

codspeed-hq bot commented Oct 17, 2024 •

edited

Loading

leaysgur commented Oct 18, 2024

Boshen commented Oct 18, 2024

leaysgur commented Oct 18, 2024

Boshen commented Oct 18, 2024

Boshen commented Oct 18, 2024 •

edited

Loading

leaysgur commented Oct 18, 2024

Boshen commented Oct 18, 2024

leaysgur commented Oct 19, 2024

leaysgur commented Oct 21, 2024 •

edited

Loading

Boshen commented Oct 21, 2024 •

edited

Loading

	⚠ eslint(no-invalid-regexp): Invalid regular expression: Invalid braced quantifier
	╭─[no_invalid_regexp.tsx:1:14]
	1 │ RegExp('\\u{0}*');
	· ─
	╰────

feat(regular_expression): Intro ConstructorParser(and LiteralParser) to handle escape sequence in RegExp('pat') #6635

feat(regular_expression): Intro ConstructorParser(and LiteralParser) to handle escape sequence in RegExp('pat') #6635

Conversation

leaysgur commented Oct 17, 2024 • edited Loading

graphite-app bot commented Oct 17, 2024

Your org has enabled the Graphite merge queue for merging into main

codspeed-hq bot commented Oct 17, 2024 • edited Loading

CodSpeed Performance Report

Merging #6635 will not alter performance

Summary

leaysgur commented Oct 18, 2024

Boshen commented Oct 18, 2024

leaysgur commented Oct 18, 2024

Boshen commented Oct 18, 2024

Boshen commented Oct 18, 2024 • edited Loading

leaysgur commented Oct 18, 2024

Boshen commented Oct 18, 2024

leaysgur commented Oct 19, 2024

leaysgur commented Oct 21, 2024 • edited Loading

Boshen commented Oct 21, 2024 • edited Loading

Merge activity

feat(regular_expression): Intro `ConstructorParser`(and `LiteralParser`) to handle escape sequence in RegExp('pat') #6635

feat(regular_expression): Intro `ConstructorParser`(and `LiteralParser`) to handle escape sequence in RegExp('pat') #6635

leaysgur commented Oct 17, 2024 •

edited

Loading

codspeed-hq bot commented Oct 17, 2024 •

edited

Loading

Boshen commented Oct 18, 2024 •

edited

Loading

leaysgur commented Oct 21, 2024 •

edited

Loading

Boshen commented Oct 21, 2024 •

edited

Loading