JRuby: UtfHelpper.writeCharToUtf8 cannot handle unicode supplementary character #2410

Open
AlexSun1995 opened this issue Jan 5, 2022 · 9 comments

Comments

@AlexSun1995

AlexSun1995 commented Jan 5, 2022

writeCharToUtf8(final char c, final OutputStream out) throws IOException

The Canonicalizer processes the input String character by character. Java uses 16 bits to represent a char, so when the input string contains a Unicode character whose code point is larger than 0xFFFF (65535), that character is split into two chars (a surrogate pair). Since neither char is recognized on its own, the character is written out as two ? bytes (0x3f each) instead.

For example, if I canonicalize an input that contains 𡏅 via c14n, the 𡏅 in the output is replaced with ??
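
To make the split concrete, here is a minimal Ruby sketch (an illustration only, not code from Nokogiri or the JDK). It shows that the character from this report needs two 16-bit UTF-16 units, which is what Java stores in a String, and that a correct UTF-8 encoding of it is four bytes. The variable names are chosen for this example.

```ruby
# The character from the report; it lies outside the Basic Multilingual Plane.
char = "𡏅"
code_point = char.ord

# UTF-16 (what a Java String holds internally) needs two 16-bit units for it.
units = char.encode("UTF-16BE").unpack("n*")
high, low = units

# Recombining the surrogate pair yields the original code point.
recombined = 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

# A correct UTF-8 encoding of the character is four bytes.
utf8_bytes = char.encode("UTF-8").bytes

puts format("code point: U+%X", code_point)
puts format("UTF-16 units: %s", units.map { |u| format("0x%04X", u) }.join(" "))
puts format("UTF-8 bytes: %s", utf8_bytes.map { |b| format("%02X", b) }.join(" "))
```

A writer that handles each 16-bit unit in isolation never sees the full code point, which would explain the two ? bytes in the output.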

@flavorjones
Member

@AlexSun1995 Thanks for opening this issue, I'll try to help.

Can you please take a moment to provide Ruby code that reproduces the issue you're seeing? Please see https://raw.githubusercontent.com/sparklemotion/nokogiri/main/.github/ISSUE_TEMPLATE/3-bug-report.md for guidance on how to do this.

@flavorjones
Member

@AlexSun1995 Can you please help me reproduce what you're seeing?

@AlexSun1995
Author

AlexSun1995 commented Jan 19, 2022

Sorry for the late reply. Please see the code below:

require 'nokogiri'

doc = Nokogiri.XML('<foo><bar />𡏅</foo>', nil, 'EUC-JP')
puts doc.canonicalize()

The output is:
??

I'm running on JRuby 9.3.2.0. I guess it's a bug in one of the Java dependencies.

@flavorjones
Member

On CRuby 3.1.0 I see the correct output:

<foo><bar></bar>陝</foo>

On JRuby 9.3.2.0 I see:

<foo><bar></bar>??</foo>

So, yes, this appears to be an issue with the underlying JRuby implementation. I will take a look.

@flavorjones flavorjones changed the title UtfHelpper.writeCharToUtf8 cannot handle unicode supplementary character JRuby: UtfHelpper.writeCharToUtf8 cannot handle unicode supplementary character Jan 19, 2022
@flavorjones
Member

Looks like recent JDK implementations have moved to use writeCodepointToUtf8; we can probably do the same.
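
As a rough illustration of the code-point-based approach, here is a small Ruby sketch. It is not the JDK or Nokogiri implementation: `write_codepoint_to_utf8` and `utf16_units_to_utf8` are hypothetical helpers written for this example. It also assumes that an unpaired surrogate is emitted as "?", which matches the "??" seen in this report.

```ruby
# Hypothetical helper: append the UTF-8 bytes of one Unicode code point to `out`.
def write_codepoint_to_utf8(code_point, out)
  if code_point < 0x80
    out << code_point
  elsif code_point < 0x800
    out << (0xC0 | (code_point >> 6)) << (0x80 | (code_point & 0x3F))
  elsif code_point < 0x10000
    out << (0xE0 | (code_point >> 12)) <<
      (0x80 | ((code_point >> 6) & 0x3F)) << (0x80 | (code_point & 0x3F))
  else
    out << (0xF0 | (code_point >> 18)) << (0x80 | ((code_point >> 12) & 0x3F)) <<
      (0x80 | ((code_point >> 6) & 0x3F)) << (0x80 | (code_point & 0x3F))
  end
  out
end

# Hypothetical helper: walk UTF-16 code units, pairing surrogates before encoding.
def utf16_units_to_utf8(units)
  out = []
  i = 0
  while i < units.length
    unit = units[i]
    nxt = units[i + 1]
    if (0xD800..0xDBFF).cover?(unit) && nxt && (0xDC00..0xDFFF).cover?(nxt)
      # Valid surrogate pair: combine into one code point, then encode it.
      write_codepoint_to_utf8(0x10000 + ((unit - 0xD800) << 10) + (nxt - 0xDC00), out)
      i += 2
    elsif (0xD800..0xDFFF).cover?(unit)
      out << 0x3F # unpaired surrogate: emit "?" (assumption for this sketch)
      i += 1
    else
      write_codepoint_to_utf8(unit, out)
      i += 1
    end
  end
  out.pack("C*").force_encoding("UTF-8")
end

units = "<foo>𡏅</foo>".encode("UTF-16BE").unpack("n*")
result = utf16_units_to_utf8(units)
puts result
```

The key difference from a char-by-char writer is that the surrogate pair is combined before any bytes are produced, so the supplementary character survives as a four-byte UTF-8 sequence.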

@flavorjones
Member

I'd like to remove the c14n code from Nokogiri and instead use the upstream JDK jars for this. That will be much easier to do once #1253 and #1967 get merged.

@flavorjones
Member

Good news: by moving to org.apache.santuario:xmlsec:2.3.0, the encoding test you provided passes on JRuby.

Unfortunately, the previous c14n code was a patched version made to work within the original c14n API design borrowed from libxml2. As a result, the block parameter passed to Document#canonicalize is broken by this change, and I'll need some time to work around it in order to implement Node#canonicalize for both JRuby and CRuby.

@flavorjones flavorjones added this to the v1.14.0 milestone Jan 31, 2022
flavorjones added a commit that referenced this issue May 11, 2022
@flavorjones flavorjones removed this from the v1.14.0 milestone May 12, 2022
@flavorjones
Member

I've got a branch that I want to push to get more eyeballs on the solution. It updates to santuario xmlsec and fixes some of the encoding problems, but it pulls in multiple dependencies and breaks the block parameter to Document#canonicalize. It won't make it into v1.14.0 but will be a candidate for v1.15.0 if we can work through the changes.

Will push the branch tomorrow as it's based on the branch in #2546 and I want that to go green and get merged first.

flavorjones added a commit that referenced this issue May 12, 2022
flavorjones added a commit that referenced this issue Jun 7, 2022
flavorjones added a commit that referenced this issue Jun 8, 2022
@flavorjones flavorjones added this to the v1.14.0 milestone Jun 8, 2022
flavorjones added a commit that referenced this issue Jun 8, 2022
flavorjones added a commit that referenced this issue Aug 26, 2022
@flavorjones
Member

See #2547 for the PR

@flavorjones flavorjones removed this from the v1.14.0 milestone Aug 26, 2022
@flavorjones flavorjones added this to the v1.15.0 milestone Aug 26, 2022
flavorjones added a commit that referenced this issue Aug 27, 2022
flavorjones added a commit that referenced this issue Oct 16, 2022
flavorjones added a commit that referenced this issue Oct 16, 2022
@flavorjones flavorjones modified the milestones: v1.15.0, v1.16.0 Apr 28, 2023
@flavorjones flavorjones removed this from the v1.17.0 milestone Dec 8, 2024

2 participants