Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

regular_expression: report character span offsets not byte offsets #6141

Closed
camchenry opened this issue Sep 28, 2024 · 9 comments
Closed

regular_expression: report character span offsets not byte offsets #6141

camchenry opened this issue Sep 28, 2024 · 9 comments
Assignees
Labels
C-bug Category - Bug

Comments

@camchenry
Copy link
Contributor

see body of #6129

@camchenry camchenry added the C-bug Category - Bug label Sep 28, 2024
@camchenry
Copy link
Contributor Author

@leaysgur Maybe you have some insight on this. I was having trouble correctly parsing control characters because the span offsets didn't match the source when unicode escapes were involved. I'm not certain how to fix the issue though.

@leaysgur
Copy link
Contributor

Yes, I think I know what is going on here.
I will write down the details with a little verification when the weekend is over.

@leaysgur
Copy link
Contributor

TL;DR

This is an issue related to "escape sequences" in JavaScript string literals.

  • When using the RegExp constructor instead of /regex/ literals
  • And if an escape sequence is included in the string literal

The corresponding position might not be correctly retrieved from the parsed Span.

Simplified

If there is a "file" containing:

"O\"X\"C"

and you want to slice each O, X, and C, you need to calculate the slice position like this.

fn main() {
    let source_text = std::fs::read_to_string("./test.js").unwrap();
    assert_eq!("O", &source_text[1..2]);
                 // &source_text[2..3]; \ 👈🏻
                 // &source_text[3..4]; "
    assert_eq!("X", &source_text[4..5]);
                 // &source_text[5..6]; \ 👈🏻
                 // &source_text[6..7]; "
    assert_eq!("C", &source_text[7..8]);
}

You need to include \ as part of the slicing index.

However, if you're just handling the "string":

fn main() {
    let source_text = "O\"X\"C";
    assert_eq!("O", &source_text[0..1]);
                 // &source_text[1..2]; "
    assert_eq!("X", &source_text[2..3]);
                 // &source_text[3..4]; "
    assert_eq!("C", &source_text[4..5]);
}

There's no need to consider the \.

This difference is causing problems when a diagnostic tries to slice the file.

Current status

As an implementation detail, oxc_regular_expression is not aware of these escape sequences.

In most cases, the current code works with an AST with oxc_parser's StringLiteral and its value is passed to oxc_regular_expression.

At this point from oxc_regular_expression, /O"X"C/ and new RegExp("O\"X\"C") seems equivalent, and new RegExp('O"X"C') is also the same. (Both "\"'`" and '"\'`' are normalized as "\"'`" in StringLiteral.value.)

Thus, the regular expression can be parsed without needing to be aware of these differences.

However, when attempting to slice the original code string using the parsed Span, issues may arise.

As seen in the original PR:

new RegExp('\\u{1111}*\\x1F', 'u')
            ^         ^

These 2 \ cause a misalignment as a result.

At that time, I was aware of this issue but was only focused on parsing.

#3824 (comment)
#5106

I was satisfied with oxc_parser resolving the escapes.

How to fix

In my view, the only solution is to update oxc_regular_expression to be aware of escape sequences.

  • When parsing a regular expression from the RegExp constructor, use string_literal.span.source_text(&self.source_text) instead of string_literal.value
    • This is the entire string, including quotes
    • Then pass inner_text, quote_type, span_offset?
  • In oxc_regular_expression:
    • Introduce a lexer layer in front of the current Reader implementation
    • Tokenize the text based on the quote type, similar to oxc_parser/lexer
    • Should consider template literals surrounded by ` too?
    • Provide a mode toggle to use the lexer or not?
      • When parsing a regex literal, lexing is unnecessary
      • Also unnecessary when using as a standalone regex parser

Are there any other approaches to consider...?

@camchenry
Copy link
Contributor Author

In my view, the only solution is to update oxc_regular_expression to be aware of escape sequences.

This makes sense to me, I think it would it also align us with the eslint-community/regexpp parser package. Looking at the examples there, it seems like the spans represent the source text, not just its actual value. They also include the raw string for reference, though I don't think we need to do that.

https://github.com/eslint-community/regexpp/blob/4d4787a2479b5c7f4984227c40123749898fb8a2/test/fixtures/parser/literal/basic-valid-2015-u.json#L6133-L6176

  • Should consider template literals surrounded by ` too?

I don't think we need to handle this necessarily. The ESLint parser doesn't bother to even check template literals for most rules it seems:

Image

@Boshen
Copy link
Member

Boshen commented Oct 1, 2024

I agree, we need to make it parse from raw string instead of the interpreted string value.

@leaysgur
Copy link
Contributor

leaysgur commented Oct 1, 2024

Thank you both. 👍🏻

I push_front this task to my task queue. 📚

@leaysgur leaysgur self-assigned this Oct 1, 2024
Boshen pushed a commit that referenced this issue Oct 2, 2024
Preparation for #6141

- Keep `enum` size + add size asserts tests
- Arrange AST related directories
- Renaming
@leaysgur
Copy link
Contributor

leaysgur commented Oct 3, 2024

OK..., this might not be as easy as expected.

image

Does this mean we need oxc_string_literal::Parser to fix this strictly? 🤔

https://tc39.es/ecma262/2024/#prod-StringLiteral

@leaysgur
Copy link
Contributor

leaysgur commented Oct 15, 2024

Progress updates:

  • Implement StringLiteral parser
    • Basic implementation
    • Fix up with more tests
  • Handle /literal/ and new RegExp("string") equally
  • Update Span handling

Boshen pushed a commit that referenced this issue Oct 21, 2024
…r`) to handle escape sequence in RegExp('pat') (#6635)

Preparation for #6141

`oxc_regular_expression` can already parse and validate both `/regexp-literal/` and `new RegExp("string-literal")`.

But one thing that is not well-supported was reporting `Span` for the `RegExp("string-literal-with-\\escape")` case.

For example, these two cases produce the same `RegExp` instances in JavaScript:

- `/\d+/`
- `new RegExp("\\d+")`

For now, mainly in `oxc_linter`, the latter case is parsed with `oxc_parser` -> `ast::literal::StringLiteral` AST node -> `value` property.

At this point, escape sequences are resolved(!), `oxc_regular_expression` can handle aligned `&str` as an argument without any problem in both cases.

However, in terms of `Span` representation, these cases should be handled differently because of the `\\` in string literals...

As a result, the parsed AST's `Span` for `new RegExp("string-literal")` is not accurate if it contains escape sequences.

e.g. https://github.com/oxc-project/oxc/blob/a01a5dfdafb9cd536cb87867697e3ae43b1990e6/crates/oxc_linter/src/snapshots/no_invalid_regexp.snap#L118-L122

Each time the `\` appears, the subsequent position is shifted. `_` should be placed under `*` in this case.

So... to resolve this issue, we need to implement `string_literal_parser` first, and use them as reading units of `oxc_regular_expression`.
@leaysgur
Copy link
Contributor

Closes by #6635 and #6741 .

@camchenry If you are planning to do something, please refer to this. 😉

let parser = ConstructorParser::new(
&allocator,
pattern_span.source_text(ctx.source_text()),
flags_text,
Options {
pattern_span_offset: pattern_span.start,
flags_span_offset: flags_span.map_or(0, |span| span.start),
},
);

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-bug Category - Bug
Projects
None yet
Development

No branches or pull requests

3 participants