regular_expression: report character span offsets not byte offsets #6141

camchenry · 2024-09-28T12:35:47Z

see body of #6129

camchenry · 2024-09-28T22:01:14Z

@leaysgur Maybe you have some insight on this. I was having trouble correctly parsing control characters because the span offsets didn't match the source when unicode escapes were involved. I'm not certain how to fix the issue though.

leaysgur · 2024-09-29T01:03:36Z

Yes, I think I know what is going on here.
I will write down the details with a little verification when the weekend is over.

leaysgur · 2024-09-30T05:09:46Z

TL;DR

This is an issue related to "escape sequences" in JavaScript string literals.

When using the RegExp constructor instead of /regex/ literals
And if an escape sequence is included in the string literal

The corresponding position might not be correctly retrieved from the parsed Span.

Simplified

If there is a "file" containing:

"O\"X\"C"

and you want to slice each O, X, and C, you need to calculate the slice position like this.

fn main() {
    let source_text = std::fs::read_to_string("./test.js").unwrap();
    assert_eq!("O", &source_text[1..2]);
                 // &source_text[2..3]; \ 👈🏻
                 // &source_text[3..4]; "
    assert_eq!("X", &source_text[4..5]);
                 // &source_text[5..6]; \ 👈🏻
                 // &source_text[6..7]; "
    assert_eq!("C", &source_text[7..8]);
}

You need to include \ as part of the slicing index.

However, if you're just handling the "string":

fn main() {
    let source_text = "O\"X\"C";
    assert_eq!("O", &source_text[0..1]);
                 // &source_text[1..2]; "
    assert_eq!("X", &source_text[2..3]);
                 // &source_text[3..4]; "
    assert_eq!("C", &source_text[4..5]);
}

There's no need to consider the \.

This difference is causing problems when a diagnostic tries to slice the file.

Current status

As an implementation detail, oxc_regular_expression is not aware of these escape sequences.

In most cases, the current code works with an AST with oxc_parser's StringLiteral and its value is passed to oxc_regular_expression.

At this point from oxc_regular_expression, /O"X"C/ and new RegExp("O\"X\"C") seems equivalent, and new RegExp('O"X"C') is also the same. (Both "\"'`" and '"\'`' are normalized as "\"'`" in StringLiteral.value.)

Thus, the regular expression can be parsed without needing to be aware of these differences.

However, when attempting to slice the original code string using the parsed Span, issues may arise.

As seen in the original PR:

new RegExp('\\u{1111}*\\x1F', 'u')
            ^         ^

These 2 \ cause a misalignment as a result.

At that time, I was aware of this issue but was only focused on parsing.

#3824 (comment)
#5106

I was satisfied with oxc_parser resolving the escapes.

How to fix

In my view, the only solution is to update oxc_regular_expression to be aware of escape sequences.

When parsing a regular expression from the RegExp constructor, use string_literal.span.source_text(&self.source_text) instead of string_literal.value
- This is the entire string, including quotes
- Then pass inner_text, quote_type, span_offset?
In oxc_regular_expression:
- Introduce a lexer layer in front of the current Reader implementation
- Tokenize the text based on the quote type, similar to oxc_parser/lexer
  - oxc/crates/oxc_parser/src/lexer/string.rs
    
    Line 155 in 21b08ba
    
    impl<'a> Lexer<'a> {
- Should consider template literals surrounded by ` too?
  - oxc/crates/oxc_parser/src/lexer/template.rs
    
    Line 17 in 21b08ba
    
    impl<'a> Lexer<'a> {
- Provide a mode toggle to use the lexer or not?
  - When parsing a regex literal, lexing is unnecessary
  - Also unnecessary when using as a standalone regex parser

Are there any other approaches to consider...?

camchenry · 2024-09-30T17:19:37Z

In my view, the only solution is to update oxc_regular_expression to be aware of escape sequences.

This makes sense to me, I think it would it also align us with the eslint-community/regexpp parser package. Looking at the examples there, it seems like the spans represent the source text, not just its actual value. They also include the raw string for reference, though I don't think we need to do that.

https://github.com/eslint-community/regexpp/blob/4d4787a2479b5c7f4984227c40123749898fb8a2/test/fixtures/parser/literal/basic-valid-2015-u.json#L6133-L6176

Should consider template literals surrounded by ` too?

I don't think we need to handle this necessarily. The ESLint parser doesn't bother to even check template literals for most rules it seems:

Boshen · 2024-10-01T02:37:14Z

I agree, we need to make it parse from raw string instead of the interpreted string value.

leaysgur · 2024-10-01T03:56:44Z

Thank you both. 👍🏻

I push_front this task to my task queue. 📚

Preparation for #6141 - Keep `enum` size + add size asserts tests - Arrange AST related directories - Renaming

leaysgur · 2024-10-03T07:25:14Z

OK..., this might not be as easy as expected.

Does this mean we need oxc_string_literal::Parser to fix this strictly? 🤔

https://tc39.es/ecma262/2024/#prod-StringLiteral

leaysgur · 2024-10-15T09:37:57Z

Progress updates:

Implement StringLiteral parser
- Basic implementation
- Fix up with more tests
Handle /literal/ and new RegExp("string") equally
Update Span handling

…r`) to handle escape sequence in RegExp('pat') (#6635) Preparation for #6141 `oxc_regular_expression` can already parse and validate both `/regexp-literal/` and `new RegExp("string-literal")`. But one thing that is not well-supported was reporting `Span` for the `RegExp("string-literal-with-\\escape")` case. For example, these two cases produce the same `RegExp` instances in JavaScript: - `/\d+/` - `new RegExp("\\d+")` For now, mainly in `oxc_linter`, the latter case is parsed with `oxc_parser` -> `ast::literal::StringLiteral` AST node -> `value` property. At this point, escape sequences are resolved(!), `oxc_regular_expression` can handle aligned `&str` as an argument without any problem in both cases. However, in terms of `Span` representation, these cases should be handled differently because of the `\\` in string literals... As a result, the parsed AST's `Span` for `new RegExp("string-literal")` is not accurate if it contains escape sequences. e.g. https://github.com/oxc-project/oxc/blob/a01a5dfdafb9cd536cb87867697e3ae43b1990e6/crates/oxc_linter/src/snapshots/no_invalid_regexp.snap#L118-L122 Each time the `\` appears, the subsequent position is shifted. `_` should be placed under `*` in this case. So... to resolve this issue, we need to implement `string_literal_parser` first, and use them as reading units of `oxc_regular_expression`.

leaysgur · 2024-10-22T05:52:26Z

Closes by #6635 and #6741 .

@camchenry If you are planning to do something, please refer to this. 😉

oxc/crates/oxc_linter/src/rules/eslint/no_control_regex.rs

Lines 129 to 137 in 8032813

    
           let parser = ConstructorParser::new( 
        
               &allocator, 
        
               pattern_span.source_text(ctx.source_text()), 
        
               flags_text, 
        
               Options { 
        
                   pattern_span_offset: pattern_span.start, 
        
                   flags_span_offset: flags_span.map_or(0, |span| span.start), 
        
               }, 
        
           );

camchenry added the C-bug label Sep 28, 2024

leaysgur self-assigned this Oct 1, 2024

leaysgur mentioned this issue Oct 2, 2024

refactor(regular_expression): Misc fixes #6234

Merged

Boshen pushed a commit that referenced this issue Oct 2, 2024

refactor(regular_expression): Misc fixes (#6234)

Loading
Loading status checks…

acab777

Preparation for #6141 - Keep `enum` size + add size asserts tests - Arrange AST related directories - Renaming

camchenry mentioned this issue Oct 13, 2024

fix(linter/no-control-regex): allow capture group references #6529

Merged

leaysgur mentioned this issue Oct 17, 2024

feat(regular_expression): Intro ConstructorParser(and LiteralParser) to handle escape sequence in RegExp('pat') #6635

Merged

leaysgur closed this as completed Oct 22, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

GitHub Sponsors

regular_expression: report character span offsets not byte offsets #6141

regular_expression: report character span offsets not byte offsets #6141

camchenry commented Sep 28, 2024

camchenry commented Sep 28, 2024

leaysgur commented Sep 29, 2024

leaysgur commented Sep 30, 2024

camchenry commented Sep 30, 2024

Boshen commented Oct 1, 2024

leaysgur commented Oct 1, 2024

leaysgur commented Oct 3, 2024

leaysgur commented Oct 15, 2024 •

edited

Loading

leaysgur commented Oct 22, 2024

regular_expression: report character span offsets not byte offsets #6141

regular_expression: report character span offsets not byte offsets #6141

Comments

camchenry commented Sep 28, 2024

camchenry commented Sep 28, 2024

leaysgur commented Sep 29, 2024

leaysgur commented Sep 30, 2024

TL;DR

Simplified

Current status

How to fix

camchenry commented Sep 30, 2024

Boshen commented Oct 1, 2024

leaysgur commented Oct 1, 2024

leaysgur commented Oct 3, 2024

leaysgur commented Oct 15, 2024 • edited Loading

leaysgur commented Oct 22, 2024

leaysgur commented Oct 15, 2024 •

edited

Loading