feat(regular_expression): Intro `ConstructorParser` (and `LiteralParser`) to handle escape sequence in RegExp('pat') #6635
CodSpeed Performance Report: Merging #6635 will not alter performance.
|
In the current implementation, the

But I noticed that regular expression flags can also be given as a string literal. 😮

new RegExp("x", "\165")
// is equivalent to
new RegExp("x", "u")

(I've never seen this before, though.)

If we want to support this,

@Boshen Please take a look when you have time! |
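To illustrate why `"\165"` and `"u"` end up as the same flags string: `\165` is a legacy octal escape, and octal 165 is code point 117, which is `u`. Below is a minimal sketch of that decoding step, assuming the digit-count rule for legacy octal escapes (up to three digits when the first is 0 to 3, otherwise up to two). The function name `decode_legacy_octal` is hypothetical and is not part of `oxc_regular_expression`.

```rust
/// Decodes one legacy octal escape, given the text right after the backslash.
/// Returns the decoded char and how many digits were consumed.
/// NOTE: hypothetical helper for illustration only, not an oxc API.
fn decode_legacy_octal(digits: &str) -> Option<(char, usize)> {
    let bytes = digits.as_bytes();
    let is_octal = |b: u8| (b'0'..=b'7').contains(&b);

    let first = *bytes.first()?;
    if !is_octal(first) {
        return None;
    }

    // A first digit of 0-3 allows up to three digits; 4-7 allows up to two,
    // so the decoded value never exceeds 0o377 (255).
    let max_len = if first <= b'3' { 3 } else { 2 };

    let mut value: u32 = 0;
    let mut len = 0;
    for &b in bytes.iter().take(max_len) {
        if !is_octal(b) {
            break;
        }
        value = value * 8 + u32::from(b - b'0');
        len += 1;
    }

    char::from_u32(value).map(|c| (c, len))
}
```

Calling `decode_legacy_octal("165")` yields `Some(('u', 3))`, which is why a parser that reads the resolved string value sees the flag `u` even though the source text spells it `\165`.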
This feels weird, the is exactly what the original lexer did 🤔 Are we supposed to parse the original raw regex text instead? |
Yes. #6141 (comment) We will use
Is there a way to get the detailed position (or the token itself) inside the string literal? |
This is correct, how should we proceed with this PR?
The lexer does not have them :-( |
I checked the spec: https://tc39.es/ecma262/multipage/text-processing.html#sec-regexp-pattern-flags

There is a note:

If pattern is supplied using a StringLiteral, the usual escape sequence substitutions are performed before the String is processed by this function. If pattern must contain an escape sequence to be recognized by this function, any U+005C (REVERSE SOLIDUS) code points must be escaped within the StringLiteral to prevent them being removed when the contents of the StringLiteral are formed.

And in https://tc39.es/ecma262/multipage/text-processing.html#sec-regexpinitialize
I don't have the time to understand all of this right now, but I hope this context helps you make a better decision. When in doubt, check the spec 😅 |
Sorry; maybe there was a problem with the way I asked. 😓

First, do you think introducing a string literal parser is appropriate and worth merging? The current issue is only the position of the

(It may seem strange to say this after implementing it, but) do you think it's worth continuously maintaining this amount of code?

Next, about the flags. If introducing the string literal parser is the way to go, I think we should thoroughly support the flags as well. So, I wanted to consult on what the API design should look like.

Personally, I think the former is fine, but if there are better ideas, please let me know.

BTW, I hope you enjoy your stay in Japan! 🍣 🍵 🇯🇵 |
I'm fine with merging this PR and maintaining it together, but we need to figure out the proper way of doing this to make it spec compliant in the long run. From the API perspective, I think we should expose what's written in the spec, i.e. an API for |
Thanks! I will reconsider how the API should be. |
I finally settled on this API.

let options = Options { pattern_span_offset, flags_span_offset }; // Both optional
LiteralParser::new(allocator, "year\d+", Some("v"), options).parse()
ConstructorParser::new(allocator, "\"year\\d+\"", Some("\"v\""), options).parse()

To minimize diff, old APIs(

Once this PR is merged, I'm going to remove them and migrate usages in

After that, let's update |
…r`) to handle escape sequence in RegExp('pat') (#6635)

Preparation for #6141

`oxc_regular_expression` can already parse and validate both `/regexp-literal/` and `new RegExp("string-literal")`.

But one thing that was not well supported is reporting `Span` for the `RegExp("string-literal-with-\\escape")` case.

For example, these two cases produce the same `RegExp` instances in JavaScript:

- `/\d+/`
- `new RegExp("\\d+")`

For now, mainly in `oxc_linter`, the latter case is handled as `oxc_parser` -> `ast::literal::StringLiteral` AST node -> `value` property. At this point, escape sequences are already resolved(!), so `oxc_regular_expression` can handle the resulting `&str` as an argument without any problem in both cases.

However, in terms of `Span` representation, these cases should be handled differently because of the `\\` in string literals. As a result, the parsed AST's `Span` for `new RegExp("string-literal")` is not accurate if it contains escape sequences.

e.g. https://github.com/oxc-project/oxc/blob/a01a5dfdafb9cd536cb87867697e3ae43b1990e6/crates/oxc_linter/src/snapshots/no_invalid_regexp.snap#L118-L122

Each time a `\` appears, the subsequent positions are shifted. `_` should be placed under `*` in this case.

So, to resolve this issue, we need to implement `string_literal_parser` first, and use its output as the reading units of `oxc_regular_expression`.
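The span shift described above can be sketched as follows. This is a simplified illustration, not the actual `string_literal_parser` from this PR: the names `CodePointUnit` and `decode_with_offsets` are hypothetical, and only single-character escapes are handled (hex, Unicode, octal escapes and line continuations are rejected here). The idea is that each decoded code point remembers the byte range it occupied in the source text, so a regex parser reading these units can report spans against the original source.

```rust
/// One decoded code point plus the byte range it occupied in the source text.
/// NOTE: hypothetical type for illustration only, not an oxc API.
#[derive(Debug, PartialEq)]
struct CodePointUnit {
    value: char,
    start: usize,
    end: usize,
}

/// Decodes a quoted string literal (quotes included in `raw`) while keeping
/// source offsets. Returns `None` for malformed input or unsupported escapes.
fn decode_with_offsets(raw: &str) -> Option<Vec<CodePointUnit>> {
    let quote = raw.chars().next()?;
    if (quote != '"' && quote != '\'') || raw.len() < 2 || !raw.ends_with(quote) {
        return None;
    }

    // Byte offset of the closing quote; the body is everything before it.
    let body_end = raw.len() - 1;
    let mut units = Vec::new();
    // Skip the opening quote, then walk the body with byte offsets.
    let mut iter = raw[..body_end].char_indices().skip(1).peekable();

    while let Some((start, ch)) = iter.next() {
        let value = if ch == '\\' {
            // An escape consumes two source characters but yields one value.
            let (_, escaped) = iter.next()?;
            match escaped {
                'n' => '\n',
                't' => '\t',
                'r' => '\r',
                '\\' | '"' | '\'' => escaped,
                _ => return None, // unsupported in this sketch
            }
        } else if ch == quote {
            return None; // unescaped quote inside the body
        } else {
            ch
        };

        // The unit ends where the next source character begins.
        let end = iter.peek().map_or(body_end, |&(next, _)| next);
        units.push(CodePointUnit { value, start, end });
    }

    Some(units)
}
```

For the source text `"\\d+"` (six bytes including the quotes), the decoded value is `\d+`, and the `d` unit covers source bytes 3..4. Its index in the decoded string plus one for the opening quote would give 2, which is exactly the one-position shift per `\` that the snapshot shows.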
Follow up #6635

- [x] Remove old APIs
- [x] Update linter usage
- [x] Update parser usage
- [x] Update transformer usage