Wrong handling of UTF-8 text in the XML module #13703

Closed
Singond opened this issue Jul 24, 2023 · 4 comments · Fixed by #13705
Labels
kind:bug A bug in the code. Does not apply to documentation, specs, etc. topic:stdlib:serialization

Comments

@Singond

Singond commented Jul 24, 2023

I am experiencing issues when parsing HTML containing non-ASCII characters. For example:

require "xml"

str = "<p>České psaní</p>"
html = XML.parse_html(str)
el = html.xpath("/html/body/p")
if el.is_a?(XML::NodeSet)
	print(el[0].text)
end

The expected output is České psaní, but I get Äeské psaní, which seems as if the correct UTF-8 bytes were re-interpreted as ISO-8859-1 (or similar) encoding and converted into UTF-8 again.
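The suspected mechanism can be illustrated in isolation. The following is a minimal sketch that does not involve libxml2 at all: it takes the UTF-8 bytes of the first character, "Č", and decodes them as ISO-8859-1 using Crystal's `String.new(bytes, encoding)`. It only demonstrates the hypothesis above and makes no claim about what libxml2 does internally.

```crystal
# Illustration only: reinterpret the UTF-8 bytes of "Č" as ISO-8859-1.
original = "Č"
utf8_bytes = original.to_slice # the UTF-8 encoding of "Č" is two bytes

# Decode those same bytes as if they were ISO-8859-1 (one byte per character).
misread = String.new(utf8_bytes, "ISO-8859-1")

puts original.bytes.map { |b| "0x%02X" % b }.join(" ")     # 0xC4 0x8C
puts misread.codepoints.map { |c| "U+%04X" % c }.join(" ") # U+00C4 U+008C
```

U+00C4 is "Ä" and U+008C is an invisible control character, which would be consistent with "Č" showing up as "Ä" in the output.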

The issue appeared all of a sudden in a binary which had previously worked correctly, i.e. without recompilation of my program. Therefore, I suspect an external library is to blame, perhaps libxml2, but I'm posting here because I am not sure, and also because I do not know how to test libxml2 directly.

Can anybody confirm the issue? Is this a Crystal issue or not?

Thanks in advance.

My specs are:

Operating system: Manjaro (Linux 5.15.120-1)
Crystal version:

Crystal 1.8.2 (2023-05-11)

LLVM: 15.0.7
Default target: x86_64-pc-linux-gnu

libxml2 version: 2.11.4-1

@Singond Singond added the kind:bug A bug in the code. Does not apply to documentation, specs, etc. label Jul 24, 2023
@straight-shoota
Member

I can reproduce this behaviour with libxml2 2.11.4, but 2.10.4 still behaves correctly.
libxml2's changelog notes for 2.11:

Refactoring has begun on some buffering and encoding code with the goal of
simplifying this part of the code base and improving error reporting.

Maybe that broke something?

This seems to appear only with htmlReadMemory, not xmlReadMemory.

@straight-shoota
Member

I have filed an upstream issue: https://gitlab.gnome.org/GNOME/libxml2/-/issues/570
Let's see what they have to say.

@straight-shoota
Member

According to the upstream maintainer, this is caused by a bug fix for htmlReadMemory in 2.11 and is the intended behaviour: https://gitlab.gnome.org/GNOME/libxml2/-/issues/570#note_1799059
We're passing a NULL value as the encoding to htmlReadMemory in order to use the default encoding. That was always supposed to be ISO-8859-1, but due to the bug it was UTF-8 instead, which meant our bindings worked correctly. Now that the bug is fixed, our bindings are broken because Crystal strings are UTF-8.
This should be easy to fix by explicitly stating the encoding. We should probably do that for the unaffected methods as well, just to be safe.
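As a rough sketch of what "explicitly stating the encoding" means at the C API level: the snippet below binds libxml2's `htmlReadMemory` and `xmlFreeDoc` directly and passes `"UTF-8"` as the encoding argument instead of NULL. The `lib` name `LibXMLSketch` is hypothetical, and this is not the code from the actual fix; it assumes libxml2 is installed and can be linked as `xml2`.

```crystal
# Hypothetical, minimal binding for illustration; not Crystal's own LibXML.
@[Link("xml2")]
lib LibXMLSketch
  fun htmlReadMemory(buffer : UInt8*, size : LibC::Int, url : UInt8*,
                     encoding : UInt8*, options : LibC::Int) : Void*
  fun xmlFreeDoc(doc : Void*) : Void
end

html = "<p>České psaní</p>"

# The fourth argument names the input encoding explicitly, so the parser
# does not have to fall back to its default.
doc = LibXMLSketch.htmlReadMemory(html, html.bytesize, nil, "UTF-8", 0)

if doc.null?
  abort "parsing failed"
else
  # Release the document allocated by libxml2.
  LibXMLSketch.xmlFreeDoc(doc)
end
```

Crystal strings are always UTF-8, so a fixed encoding name fits any string passed in this way.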

@Singond
Author

Singond commented Jul 25, 2023

Thanks for the prompt resolution. I am confident the issue is solved.
