JRuby: UtfHelpper.writeCharToUtf8 cannot handle unicode supplementary character #2410
Comments
@AlexSun1995 Thanks for opening this issue, I'll try to help. Can you please take a moment to provide Ruby code that reproduces the issue you're seeing? Please see https://raw.githubusercontent.com/sparklemotion/nokogiri/main/.github/ISSUE_TEMPLATE/3-bug-report.md for guidance on how to do this.
@AlexSun1995 Can you please help me reproduce what you're seeing?
On CRuby 3.1.0 I see the correct output:
On JRuby 9.3.2.0 I see:
So, yes, this appears to be an issue with the underlying JRuby implementation. I will take a look.
Looks like recent JDK implementations have moved to use
Good news: by moving to Unfortunately, the previous c14n code was a patched version to work within the original c14n API design borrowed from libxml2, and so the block parameter passed to
I've got a branch that I want to push to get more eyeballs on the solution which updates to santuario xmlsec but pulls in multiple dependencies and breaks the block parameter to Will push the branch tomorrow as it's based on the branch in #2546 and I want that to go green and get merged first.
See #2547 for the PR
nokogiri/ext/java/nokogiri/internals/c14n/UtfHelpper.java
Line 51 in 55029bf
The Canonicalizer processes the input String one char at a time. Java uses 16 bits to represent a char, so when the input string contains a Unicode character whose code point is larger than 0xFFFF (65535), that character is split into two chars (a surrogate pair). Since neither char is recognized on its own, the character is written out as two ? (0x3f) bytes instead.
For example, if I canonicalize an input that contains 𡏅 via c14n, then in the output 𡏅 is replaced with ??
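To make the failure mode concrete, here is a minimal Java sketch of the difference between encoding char by char and encoding code point by code point. It is not the Nokogiri or Santuario code: the class name `SupplementaryUtf8Sketch` and its methods are hypothetical, and the per-char variant only assumes the behavior described above (each surrogate half becomes `?`). U+20000 is used as a stand-in supplementary character.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical illustration only; none of these names exist in Nokogiri.
class SupplementaryUtf8Sketch {

    // Assumed buggy behavior: each 16-bit char is handled on its own, so
    // both halves of a surrogate pair degrade to '?' (0x3f).
    static byte[] encodePerChar(String s) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isSurrogate(c)) {
                out.write(0x3f);
            } else {
                writeCodePoint(c, out);
            }
        }
        return out.toByteArray();
    }

    // Surrogate-aware variant: read a whole code point, then advance by
    // one or two chars depending on how many it occupied.
    static byte[] encodePerCodePoint(String s) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (cp >= 0xD800 && cp <= 0xDFFF) {
                out.write(0x3f); // unpaired surrogate: no valid UTF-8 form
            } else {
                writeCodePoint(cp, out);
            }
            i += Character.charCount(cp);
        }
        return out.toByteArray();
    }

    // Standard UTF-8 bit layout for 1 to 4 byte sequences.
    static void writeCodePoint(int cp, ByteArrayOutputStream out) {
        if (cp < 0x80) {
            out.write(cp);
        } else if (cp < 0x800) {
            out.write(0xC0 | (cp >> 6));
            out.write(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out.write(0xE0 | (cp >> 12));
            out.write(0x80 | ((cp >> 6) & 0x3F));
            out.write(0x80 | (cp & 0x3F));
        } else {
            out.write(0xF0 | (cp >> 18));
            out.write(0x80 | ((cp >> 12) & 0x3F));
            out.write(0x80 | ((cp >> 6) & 0x3F));
            out.write(0x80 | (cp & 0x3F));
        }
    }

    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String s = "\uD840\uDC00"; // U+20000 as a surrogate pair
        System.out.println(hex(encodePerChar(s)));
        System.out.println(hex(encodePerCodePoint(s)));
    }
}
```

For U+20000 the per-char path yields `3f 3f`, while the code-point path yields `f0 a0 80 80`, which matches what `String.getBytes(StandardCharsets.UTF_8)` produces.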