XmlUtil.escape does not handle Unicode correctly for codepoints above U+10000 #267

Closed
ghost opened this issue Mar 17, 2017 · 1 comment

ghost commented Mar 17, 2017

XmlUtil.escape wrongly assumes that every Java char corresponds to one Unicode symbol, so it does not escape characters outside that assumption correctly.

We came across the symbol '😉' today in an Onix file, where it was correctly escaped as &#128521; and correctly parsed as the winking face emoticon. The resulting Java String contains two chars, not one, even though it is only a single symbol. XmlUtil.escape produces the invalid String &#55357;&#56841; from this.

It should have recognized the first char as being part of a two-char pair, reconstructed the codepoint and output &#128521;.
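
For illustration, a minimal sketch of the reconstruction described above, using only standard java.lang.Character methods. The class name SurrogatePairDemo and the helper referenceAt are hypothetical and are not part of XmlUtil; the decimal reference format is an assumption.

```java
// Hypothetical demo class; not part of metafacture.
class SurrogatePairDemo {

    // Builds a decimal numeric character reference for the symbol that
    // starts at index i, joining a high/low surrogate pair if present.
    static String referenceAt(final String s, final int i) {
        final char first = s.charAt(i);
        if (Character.isHighSurrogate(first)
                && i + 1 < s.length()
                && Character.isLowSurrogate(s.charAt(i + 1))) {
            // Two chars form one supplementary code point.
            final int codePoint = Character.toCodePoint(first, s.charAt(i + 1));
            return "&#" + codePoint + ";";
        }
        // Plain BMP char: its value is already the code point.
        return "&#" + (int) first + ";";
    }

    public static void main(final String[] args) {
        final String wink = "\uD83D\uDE09"; // U+1F609 as a surrogate pair
        System.out.println(wink.length());                         // 2 chars
        System.out.println(wink.codePointCount(0, wink.length())); // 1 code point
        System.out.println(referenceAt(wink, 0));                  // &#128521;
    }
}
```

A caller looping over the String with this helper would also have to skip the second char of each pair; iterating by code point avoids that bookkeeping.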

Member

cboehme commented Mar 18, 2017

Thanks for reporting this. It seems that surrogate pairs for supplementary characters are not encoded correctly. I reckon this can be fixed by using String#codePoints() instead of String#chars() in XmlUtil.java line 99.
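
A rough sketch of the difference between the two stream methods, assuming the escaping is done by mapping each stream element to either itself or a decimal character reference. The class EscapeSketch and both method names are hypothetical; this is not the actual XmlUtil code, and the handling of &, <, > and quotes is left out.

```java
import java.util.stream.Collectors;

// Hypothetical sketch; not the actual XmlUtil implementation.
class EscapeSketch {

    // Iterates UTF-16 code units: a surrogate pair yields two references.
    static String escapeByChar(final String s) {
        return s.chars()
                .mapToObj(c -> c > 0x7f
                        ? "&#" + c + ";"
                        : String.valueOf((char) c))
                .collect(Collectors.joining());
    }

    // Iterates code points: a surrogate pair yields one reference.
    static String escapeByCodePoint(final String s) {
        return s.codePoints()
                .mapToObj(cp -> cp > 0x7f
                        ? "&#" + cp + ";"
                        : new String(Character.toChars(cp)))
                .collect(Collectors.joining());
    }

    public static void main(final String[] args) {
        final String input = "a\uD83D\uDE09"; // 'a' followed by U+1F609
        System.out.println(escapeByChar(input));      // a&#55357;&#56841;
        System.out.println(escapeByCodePoint(input)); // a&#128521;
    }
}
```

The first output refers to two lone surrogates, which are not legal XML characters; the second refers to the supplementary code point itself.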

cboehme self-assigned this Mar 18, 2017
cboehme added the Bug label Mar 18, 2017
cboehme added this to the metafacture 4.1.0 milestone Mar 18, 2017
blackwinter added a commit that referenced this issue Dec 13, 2024
blackwinter pushed a commit that referenced this issue Dec 13, 2024