Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(regular_expression): Intro ConstructorParser(and LiteralParser) to handle escape sequence in RegExp('pat') #6635

Merged
merged 1 commit into from
Oct 21, 2024

Conversation

leaysgur
Copy link
Contributor

@leaysgur leaysgur commented Oct 17, 2024

Preparation for #6141

oxc_regular_expression can already parse and validate both /regexp-literal/ and new RegExp("string-literal").

But one thing that is not well-supported was reporting Span for the RegExp("string-literal-with-\\escape") case.

For example, these two cases produce the same RegExp instances in JavaScript:

  • /\d+/
  • new RegExp("\\d+")

For now, mainly in oxc_linter, the latter case is parsed with oxc_parser -> ast::literal::StringLiteral AST node -> value property.

At this point, escape sequences are resolved(!), oxc_regular_expression can handle aligned &str as an argument without any problem in both cases.

However, in terms of Span representation, these cases should be handled differently because of the \\ in string literals...

As a result, the parsed AST's Span for new RegExp("string-literal") is not accurate if it contains escape sequences.

e.g.

eslint(no-invalid-regexp): Invalid regular expression: Invalid braced quantifier
╭─[no_invalid_regexp.tsx:1:14]
1 │ RegExp('\\u{0}*');
· ─
╰────

Each time the \ appears, the subsequent position is shifted. _ should be placed under * in this case.

So... to resolve this issue, we need to implement string_literal_parser first, and use them as reading units of oxc_regular_expression.

Copy link

graphite-app bot commented Oct 17, 2024

Your org has enabled the Graphite merge queue for merging into main

Add the label “0-merge” to the PR and Graphite will automatically add it to the merge queue when it’s ready to merge. Or use the label “hotfix” to add to the merge queue as a hot fix.

You must have a Graphite account and log in to Graphite in order to use the merge queue. Sign up using this link.

@github-actions github-actions bot added the C-bug Category - Bug label Oct 17, 2024
Copy link

codspeed-hq bot commented Oct 17, 2024

CodSpeed Performance Report

Merging #6635 will not alter performance

Comparing issue-6141 (f8e1907) with main (a7dd5aa)

Summary

✅ 30 untouched benchmarks

@leaysgur leaysgur marked this pull request as ready for review October 18, 2024 01:34
@leaysgur
Copy link
Contributor Author

In the current implementation, the string_literal_parser mod is encapsulated in the oxc_regular_expression crate and only used for pattern text.

But I noticed..., regular expression flags can also be string literal. 😮

new RegExp("x", "\165")
// is equivalent to
new RegExp("x", "u")

(I've never seen this before, but.)

  • Q1. Should we support string literal flags?

If we want to support this,

  • Q2. Should we turns out string_literal_parser a crate?
    • and oxc_regular_expression expect to be passed parsed results, instead of &str?
    • or still encapsulated?

@Boshen Plese take a look when you have time~!

@Boshen Boshen self-assigned this Oct 18, 2024
@Boshen Boshen self-requested a review October 18, 2024 04:42
@Boshen
Copy link
Member

Boshen commented Oct 18, 2024

This feels weird, the is exactly what the original lexer did 🤔 Are we supposed to parse the original raw regex text instead?

@leaysgur
Copy link
Contributor Author

Are we supposed to parse the original raw regex text instead?

Yes. #6141 (comment)

We will use StringLiteral(sl) => &sl.span.source_text(&self.source_text) instead of StringLiteral(sl) => &sl.value for RegExp("pat") case.

exactly what the original lexer did

Is there a way to get the detailed position(or token itself?) inside the string literal?

@Boshen
Copy link
Member

Boshen commented Oct 18, 2024

We will use StringLiteral(sl) => &sl.span.source_text(&self.source_text) instead of StringLiteral(sl) => &sl.value for RegExp("pat") case.

This is correct, how should we proceed with this PR?

Is there a way to get the detailed position(or token itself?) inside the string literal?

The lexer does not have them :-(

@Boshen
Copy link
Member

Boshen commented Oct 18, 2024

I checked the spec: https://tc39.es/ecma262/multipage/text-processing.html#sec-regexp-pattern-flags

There is a note:

If pattern is supplied using a StringLiteral, the usual escape sequence substitutions are performed before the String is processed by this function. If pattern must contain an escape sequence to be recognized by this function, any U+005C (REVERSE SOLIDUS) code points must be escaped within the StringLiteral to prevent them being removed when the contents of the StringLiteral are formed.

And in https://tc39.es/ecma262/multipage/text-processing.html#sec-regexpinitialize

  1. If u is true or v is true, then
    a. Let patternText be StringToCodePoints(P).
  2. Else,
    a. Let patternText be the result of interpreting each of P's 16-bit elements as a Unicode BMP code point. UTF-16 decoding is not applied to the elements.

I don't have the time to understand all these right now, but hope these context can help you to make a better decision.


When i doubt, check the spec 😅

@leaysgur
Copy link
Contributor Author

Sorry; maybe there was a problem with the way I asked. 😓

First, do you think introducing string literal parser is appropriate and worth merging?

The current issue is only the position of the Span in the error reporting.
Since syntax errors themselves are correctly detected, it might not be a problem in most cases.

(It may seem strange to say this after implementing, but) Do you think it's worth continuously maintaining this amount of code?
(I noticed you were concerned that the Lexer is already doing similar things, so I'm confirming just in case.)

Next, about the flags.

If introducing the string literal parser is the way to go, I think we should thoroughly support the flags as well.

So, I wanted to consult on how API design should be.

  • Should the existence of the string literal parser be encapsulated within the regular expression parser?
    • Regular expression parser continues to receive only &str and mode from the use case side.
    • Or split APIs like parse_regexp_literal(), parse_string_literal(), etc
  • Or should we expose the string literal parser and make its usage a responsibility of the use case side?
    • Regular expression parser expects Vec<StringLiteralParser::ast::CodePoint> to be passed.

Personally, I think the former is fine, but if there are better ideas, please let me know.


BTW, I hope you enjoy your stay in Japan! 🍣 🍵 🇯🇵

@Boshen
Copy link
Member

Boshen commented Oct 18, 2024

I'm fine with merging this PR and maintain it together, but we need to figure out the proper way of doing this to make it spec compliant in the longer run.

From the API perspective, I think we should expose what's written in the spec, i.e. an API for new Regex, and an API for /regex/. These are the only cases for constructing regexes.

  1. https://tc39.es/ecma262/#sec-abstract-operations-for-regexp-creation
  2. https://tc39.es/ecma262/#sec-regular-expression-literals-runtime-semantics-evaluation

@leaysgur
Copy link
Contributor Author

Thanks! I will reconsider how the API should be.

@leaysgur leaysgur marked this pull request as draft October 19, 2024 00:54
@leaysgur leaysgur changed the title fix(regular_expression): Intro string_literal_parser to handle escape sequence in RegExp('pat') feat(regular_expression): Intro ConstructorParser(and LiteralParser) to handle escape sequence in RegExp('pat') Oct 21, 2024
@github-actions github-actions bot added the C-enhancement Category - New feature or request label Oct 21, 2024
@leaysgur
Copy link
Contributor Author

leaysgur commented Oct 21, 2024

I finally settled on this API.

let options = Options { pattern_span_offset, flags_span_offset }; // Both optional

LiteralParser::new(allocator, "year\d+", Some("v"), options).parse()
ConstructorParser::new(allocator, "\"year\\d+\"", Some("\"v\""), options).parse()

To minimize diff, old APIs(Parser and ParserOptions) are still available for outer usages in this PR.

https://github.com/oxc-project/oxc/pull/6635/files#diff-1df796b787321a3cb29c2ffc0a2b5550efab069cbe7a3c60ddb82668b132097bR24

Once this PR is merged, I'm going to remove them and migrate usages in oxc_parser, oxc_linter, etc...

After that, let's update oxc_linter diagnostics if wanted.

@leaysgur leaysgur marked this pull request as ready for review October 21, 2024 06:07
@Boshen Boshen added the 0-merge Merge with Graphite Merge Queue label Oct 21, 2024
Copy link
Member

Boshen commented Oct 21, 2024

Merge activity

  • Oct 21, 3:07 AM EDT: The merge label '0-merge' was detected. This PR will be added to the Graphite merge queue once it meets the requirements.
  • Oct 21, 3:07 AM EDT: A user added this pull request to the Graphite merge queue.
  • Oct 21, 3:11 AM EDT: A user merged this pull request with the Graphite merge queue.

…r`) to handle escape sequence in RegExp('pat') (#6635)

Preparation for #6141

`oxc_regular_expression` can already parse and validate both `/regexp-literal/` and `new RegExp("string-literal")`.

But one thing that is not well-supported was reporting `Span` for the `RegExp("string-literal-with-\\escape")` case.

For example, these two cases produce the same `RegExp` instances in JavaScript:

- `/\d+/`
- `new RegExp("\\d+")`

For now, mainly in `oxc_linter`, the latter case is parsed with `oxc_parser` -> `ast::literal::StringLiteral` AST node -> `value` property.

At this point, escape sequences are resolved(!), `oxc_regular_expression` can handle aligned `&str` as an argument without any problem in both cases.

However, in terms of `Span` representation, these cases should be handled differently because of the `\\` in string literals...

As a result, the parsed AST's `Span` for `new RegExp("string-literal")` is not accurate if it contains escape sequences.

e.g. https://github.com/oxc-project/oxc/blob/a01a5dfdafb9cd536cb87867697e3ae43b1990e6/crates/oxc_linter/src/snapshots/no_invalid_regexp.snap#L118-L122

Each time the `\` appears, the subsequent position is shifted. `_` should be placed under `*` in this case.

So... to resolve this issue, we need to implement `string_literal_parser` first, and use them as reading units of `oxc_regular_expression`.
@graphite-app graphite-app bot merged commit f8e1907 into main Oct 21, 2024
27 checks passed
@graphite-app graphite-app bot deleted the issue-6141 branch October 21, 2024 07:11
Boshen pushed a commit that referenced this pull request Oct 22, 2024
Follow up #6635

- [x] Remove old APIs
- [x] Update linter usage
- [x] Update parser usage
- [x] Update transformer usage
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
0-merge Merge with Graphite Merge Queue C-bug Category - Bug C-enhancement Category - New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants