Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

perf(regular_expression): Remove u16 reader, mark surrogate pairs instead #5210

Closed
wants to merge 5 commits into from

Conversation

leaysgur
Copy link
Contributor

@leaysgur leaysgur commented Aug 26, 2024

Currently, in non-unicode mode, the pattern string is handled through encode_utf16().

However,

  • The parsing process itself is performed only on BMP characters
  • Presence of surrogate pairs appears in the AST is limited
  • Find the corresponding Span offset is also complicated
    • Besides, there is no corresponding position for each surrogate pair...

From the above, can't we unify the Reader in unicode mode?
Based on the hypothesis, I'd like to try to find out.

Copy link

graphite-app bot commented Aug 26, 2024

Your org has enabled the Graphite merge queue for merging into main

Add the label “merge” to the PR and Graphite will automatically add it to the merge queue when it’s ready to merge. Or use the label “hotfix” to add to the merge queue as a hot fix.

You must have a Graphite account and log in to Graphite in order to use the merge queue. Sign up using this link.

@Boshen
Copy link
Member

Boshen commented Aug 26, 2024

We don't have a codspeed benchmark for it. Let me enable regex parsing in the benchmark.

Copy link

codspeed-hq bot commented Aug 26, 2024

CodSpeed Performance Report

Merging #5210 will not alter performance

Comparing perf-regexp (b75b388) with main (2fbc283)

Summary

✅ 29 untouched benchmarks

@leaysgur
Copy link
Contributor Author

Oh, really? Thank you! 🙈

Although, I was thinking of withdrawing this PR.

Not using encode_utf16() means we never iterate over the units that consists surrogate pairs in the first place.
This would result in ignoring certain specification syntax.

As a result, while test262 is currently passing (though it might just be an oversight), if there isn't a significant improvement, I think it would be better to keep behavior more spec like.

@Boshen
Copy link
Member

Boshen commented Aug 26, 2024

Oh, really? Thank you! 🙈

Although, I was thinking of withdrawing this PR.

Not using encode_utf16() means we never iterate over the units that consists surrogate pairs in the first place. This would result in ignoring certain specification syntax.

As a result, while test262 is currently passing (though it might just be an oversight), if there isn't a significant improvement, I think it would be better to keep behavior more spec like.

Although, I was thinking of withdrawing this PR.

I don't know about the details, feel free to make your own judgements.

@leaysgur leaysgur closed this Aug 26, 2024
@leaysgur leaysgur deleted the perf-regexp branch August 26, 2024 05:34
leaysgur added a commit that referenced this pull request Aug 26, 2024
Boshen pushed a commit that referenced this pull request Aug 26, 2024
I've never seen but `/a{9007199254740991}/` is valid and this is the maximum value for quantifier.

\+ left comment about #5210 experiment.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants