[Merged by Bors] - Optimize String.prototype.normalize #2848

Closed
wants to merge 2 commits into from

jedel1043
Member

We currently use `unicode_normalization` to implement the `String.prototype.normalize` method. However, the crate doesn't support UTF-16 strings as a first-class input, so we had to work around it: convert the valid parts of a string to UTF-8, normalize each one, encode the results back to UTF-16, and concatenate everything with the unpaired surrogates in between. This is suboptimal for performance, which is why I replaced our current implementation with `icu_normalizer`, which does support UTF-16 input.

Additionally, if the `intl` feature is enabled, users can override the default normalization data by supplying the required data through the `BoaProvider` data provider.
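
For readers unfamiliar with the previous workaround, here is a minimal sketch of the segment-by-segment approach described above. It is not Boa's actual code: `normalize_lossless` and `flush` are hypothetical names, and the normalization step is passed in as a closure because the Rust standard library has no Unicode normalization of its own (in practice it would come from a normalization crate).

```rust
/// Normalizes a possibly ill-formed UTF-16 string one valid segment at a time.
///
/// `normalize` is a stand-in for a real UTF-8 normalization routine
/// (hypothetical parameter; any `&str -> String` function works here).
fn normalize_lossless<F>(input: &[u16], normalize: F) -> Vec<u16>
where
    F: Fn(&str) -> String,
{
    let mut out = Vec::with_capacity(input.len());
    // Accumulates the current run of well-formed text as UTF-8.
    let mut valid = String::new();

    for unit in char::decode_utf16(input.iter().copied()) {
        match unit {
            // Well-formed code point: extend the current valid run.
            Ok(c) => valid.push(c),
            // Unpaired surrogate: normalize the run collected so far,
            // then copy the surrogate through untouched.
            Err(e) => {
                flush(&mut valid, &normalize, &mut out);
                out.push(e.unpaired_surrogate());
            }
        }
    }
    // Normalize whatever valid text remains after the last surrogate.
    flush(&mut valid, &normalize, &mut out);
    out
}

/// Normalizes the pending valid run, re-encodes it as UTF-16, and clears it.
fn flush<F: Fn(&str) -> String>(valid: &mut String, normalize: &F, out: &mut Vec<u16>) {
    if !valid.is_empty() {
        out.extend(normalize(valid.as_str()).encode_utf16());
        valid.clear();
    }
}
```

Every valid run goes through a UTF-16 to UTF-8 to UTF-16 round trip plus an intermediate `String`, which is the overhead that a normalizer accepting UTF-16 input directly avoids.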

jedel1043 added the dependencies and builtins labels Apr 20, 2023
jedel1043 added this to the v0.17.0 milestone Apr 20, 2023
github-actions bot commented Apr 20, 2023

Test262 conformance changes

| Test result | main count | PR count | difference |
|---|---|---|---|
| Total | 94,591 | 94,591 | 0 |
| Passed | 73,161 | 73,161 | 0 |
| Ignored | 17,530 | 17,530 | 0 |
| Failed | 3,900 | 3,900 | 0 |
| Panics | 0 | 0 | 0 |
| Conformance | 77.34% | 77.34% | 0.00% |

codecov bot commented Apr 20, 2023

Codecov Report

Merging #2848 (f5615bc) into main (f97ad0d) will decrease coverage by 0.01%.
The diff coverage is 17.54%.

@@            Coverage Diff             @@
##             main    #2848      +/-   ##
==========================================
- Coverage   50.92%   50.92%   -0.01%     
==========================================
  Files         419      419              
  Lines       41780    41799      +19     
==========================================
+ Hits        21278    21286       +8     
- Misses      20502    20513      +11     

| Impacted Files | Coverage Δ |
|---|---|
| boa_engine/src/builtins/string/mod.rs | 58.02% <0.00%> (ø) |
| boa_engine/src/context/mod.rs | 45.45% <ø> (ø) |
| boa_icu_provider/src/lib.rs | 75.00% <25.00%> (-25.00%) ⬇️ |
| boa_engine/src/context/icu.rs | 34.17% <47.36%> (+4.01%) ⬆️ |

Member

@raskad raskad left a comment

Having the minimal data generated like this seems like a very nice solution. Looks very nice!

Member

raskad commented Apr 22, 2023

@jedel1043 I did not look into it much, do you think we could do the same for the UnicodeProperties ID_Start and ID_Continue that we need in boa_parser and currently generate tables ourselves in boa_unicode?

@jedel1043
Member Author

> @jedel1043 I did not look into it much, do you think we could do the same for the UnicodeProperties ID_Start and ID_Continue that we need in boa_parser and currently generate tables ourselves in boa_unicode?

Yep! There's the icu_properties crate that offers precisely that functionality. I'll make a PR this weekend :)

Member

@HalidOdat HalidOdat left a comment

Nice work! Looks good to me! :)

@jedel1043
Member Author

bors r+

bors bot pushed a commit that referenced this pull request Apr 23, 2023

bors bot commented Apr 23, 2023

Pull request successfully merged into main.

Build succeeded:

bors bot changed the title from "Optimize String.prototype.normalize" to "[Merged by Bors] - Optimize String.prototype.normalize" Apr 23, 2023
bors bot closed this Apr 23, 2023
bors bot deleted the fast-normalizers branch April 23, 2023 08:40
bors bot pushed a commit that referenced this pull request Apr 24, 2023
As mentioned in #2848 (comment), this uses our new default ICU4X data to replace `char::is_start` and `char::is_continue` from the `boa_unicode` crate with the [`icu_properties`](https://crates.io/crates/icu_properties) crate.

Note that this doesn't deprecate `boa_unicode` yet, since that'll require some discussion about how to proceed with a now unused sub-crate.
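
As background on the table-based approach being replaced, here is a rough sketch of one common way to answer a property query such as `ID_Start` from a hand-generated table: a sorted list of inclusive ranges searched with binary search. This is an assumption about the general technique, not a copy of `boa_unicode`; `ID_START_EXCERPT` and `in_table` are hypothetical names, and the table is only a short excerpt of the real property data.

```rust
use std::cmp::Ordering;

/// Sorted, non-overlapping, inclusive code point ranges.
/// Illustrative excerpt only: a full ID_Start table has hundreds of ranges.
const ID_START_EXCERPT: &[(char, char)] = &[
    ('A', 'Z'),
    ('a', 'z'),
    ('\u{AA}', '\u{AA}'),
    ('\u{B5}', '\u{B5}'),
    ('\u{BA}', '\u{BA}'),
    ('\u{C0}', '\u{D6}'),
    ('\u{D8}', '\u{F6}'),
];

/// Returns `true` if `c` falls inside any range of `table`.
/// `table` must be sorted by range start and must not overlap.
fn in_table(table: &[(char, char)], c: char) -> bool {
    table
        .binary_search_by(|&(lo, hi)| {
            if hi < c {
                // The whole range lies below `c`: search to the right.
                Ordering::Less
            } else if lo > c {
                // The whole range lies above `c`: search to the left.
                Ordering::Greater
            } else {
                // `lo <= c <= hi`: found the containing range.
                Ordering::Equal
            }
        })
        .is_ok()
}
```

With this excerpt, `in_table(ID_START_EXCERPT, 'a')` is `true` while `in_table(ID_START_EXCERPT, '1')` is `false`. Moving to shared ICU4X data removes the need to regenerate and maintain such tables by hand.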