Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for parsing HTML numeric entities #645

Merged
merged 1 commit into from
Mar 16, 2024

Conversation

DerZade
Copy link
Contributor

@DerZade DerZade commented Mar 7, 2024

Purpose / Goal

This PR adds a htmlNumericEntities option which adds support for parsing HTML numeric entities.

Type

Please mention the type of PR

  • Bug Fix
  • Refactoring / Technology upgrade
  • New Feature

Benchmarks

Before

Running Suite: XML Parser benchmark
fxp v3 : 43885.453114704054 requests/second
fxp : 28444.884306095257 requests/second
fxp - preserve order : 35582.44370634719 requests/second
xmlbuilder2 : 12279.374624117605 requests/second
xml2js  : 17997.982194339696 requests/second

After

Running Suite: XML Parser benchmark
fxp v3 : 56829.18072321323 requests/second
fxp : 35711.03397186936 requests/second
fxp - preserve order : 35048.53534051172 requests/second
xmlbuilder2 : 12565.714384084644 requests/second
xml2js  : 17898.357168570532 requests/second

@amitguptagwl
Copy link
Member

Thanks for your PR

I've a few suggestions

  1. I believe we don't need another option. It can be done with htmlEntities flag only.
  2. we can add an extra property to htmlEntities object, say num or hex. And the value can be set to a function. In the loop, we can check the type if it is string replace it as today. Otherwise, run the function as per setting in the object. It'll give the chance to disable it easily.
  3. both expressions (Eg /&#([0-9]+);/) has + that make it unsafe to match very long string and will impact performance too.

Let me know your thoughts

@DerZade
Copy link
Contributor Author

DerZade commented Mar 12, 2024

@amitguptagwl I just force pushed and updated my code to reflect your suggestions:

  • Removed + from both expressions. Decimal is now limited to 7 chars and hexadecimal to 6 chars. (The max code point is U+10FFFF or 1,114,112 in decimal)
  • Removed the extra htmlNumericEntities option
  • Added both regex and replacers to htmlEntities-object. I choose num_dec and num_hex as keys, since there is a named char reference num;
  • I didn't even have to check the type for string / funcion as you suggested, since String.prototype.replace already supports functions as the replacer value.

@coveralls
Copy link

Coverage Status

coverage: 98.257% (+0.006%) from 98.251%
when pulling b2f3c96 on DerZade:master
into 072b2b0 on NaturalIntelligence:master.

@amitguptagwl amitguptagwl merged commit 391f24f into NaturalIntelligence:master Mar 16, 2024
6 of 7 checks passed
@amitguptagwl
Copy link
Member

It's live now. You can see the entry in change logs

@DerZade
Copy link
Contributor Author

DerZade commented Mar 17, 2024

Thank you so much!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants