Wrong handling of UTF-8 text in the XML module #13703

Singond · 2023-07-24T21:56:48Z

I am experiencing issues when parsing HTML containing non-ASCII characters. For example:

require "xml"

str = "<p>České psaní</p>"
html = XML.parse_html(str)
el = html.xpath("/html/body/p")
if el.is_a?(XML::NodeSet)
	print(el[0].text)
end

The expected output is České psaní, but I get ÄeskÃ© psanÃ, which looks as if the correct UTF-8 bytes were reinterpreted as ISO-8859-1 (or a similar encoding) and then converted to UTF-8 again.
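
The suspected double conversion can be reproduced without libxml2. The following is a minimal sketch, not taken from the report, that decodes the UTF-8 bytes of the string as if they were ISO-8859-1. The variable names are illustrative only. Note that two of the resulting characters (U+008C and U+00AD) may be invisible when printed, which would explain why the garbled output looks shorter than 14 characters.

```crystal
# UTF-8 source text, as in the example above.
original = "České psaní"

# Reinterpret the raw UTF-8 bytes as ISO-8859-1 and decode them into a
# new Crystal String. Every byte becomes exactly one character.
garbled = String.new(original.to_slice, "ISO-8859-1")

puts garbled           # mojibake beginning with "Ä"
puts original.bytesize # 14 bytes (three of the 11 characters take 2 bytes)
puts garbled.size      # 14 characters, one per original byte
```

Encoding the garbled string back to ISO-8859-1 yields the original bytes again, which is a quick way to check whether a given piece of text was damaged in this particular way.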

The issue appeared all of a sudden in a binary that had previously worked correctly, i.e. without recompilation of my program. Therefore, I suspect an external library is to blame, perhaps libxml2, but I'm posting here because I am not sure, and also because I do not know how to test libxml2 directly.

Can anybody confirm the issue? Is this a Crystal issue or not?

Thanks in advance.

My specs are:

Operating system: Manjaro (Linux 5.15.120-1)
Crystal version:

Crystal 1.8.2 (2023-05-11)

LLVM: 15.0.7
Default target: x86_64-pc-linux-gnu

libxml2 version: 2.11.4-1


straight-shoota · 2023-07-25T07:47:31Z

I can reproduce this behaviour with libxml2 2.11.4, but 2.10.4 still behaves correctly.
libxml2's changelog notes for 2.11:

Refactoring has begun on some buffering and encoding code with the goal of
simplifying this part of the code base and improving error reporting.

Maybe that broke something?

This seems to appear only with htmlReadMemory, not xmlReadMemory.

straight-shoota · 2023-07-25T08:29:56Z

I have filed an upstream issue: https://gitlab.gnome.org/GNOME/libxml2/-/issues/570
Let's see what they have to say.

straight-shoota · 2023-07-25T12:04:46Z

According to the upstream maintainer, this is caused by a bug fix for htmlReadMemory in 2.11 and is the intended behaviour: https://gitlab.gnome.org/GNOME/libxml2/-/issues/570#note_1799059
We're passing NULL as the encoding to htmlReadMemory in order to use the default encoding. That was always supposed to be ISO-8859-1, but due to the bug it was UTF-8 instead, which meant our bindings worked correctly. Now that the bug is fixed, our bindings are broken because Crystal strings are UTF-8.
This should be easy to fix by explicitly stating the encoding. We should probably do that for the unaffected methods as well, just to be safe.
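
Until a Crystal release with the fix is available, user code might be able to state the encoding inside the document itself. The sketch below is an assumption, not the fix proposed here (which sets the encoding in the bindings): it relies on libxml2's HTML parser honouring a charset declared in a `<meta>` tag, and I have not verified it against every affected libxml2 version.

```crystal
require "xml"

str = "<p>České psaní</p>"

# Assumption: the HTML parser switches to the charset declared in this
# <meta> tag instead of falling back to its ISO-8859-1 default.
declared = %(<meta http-equiv="Content-Type" content="text/html; charset=utf-8">) + str

html = XML.parse_html(declared)

# xpath_node returns the first matching node, or nil if there is none.
el = html.xpath_node("/html/body/p")
puts el.text if el
```

The parser places the `<meta>` element in an implied `<head>`, so the `/html/body/p` path from the original example still applies.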

Singond · 2023-07-25T22:44:51Z

Thanks for the prompt resolution. I am confident the issue is solved.

Singond added the kind:bug label Jul 24, 2023

Blacksmoke16 added the topic:stdlib:serialization label Jul 24, 2023

straight-shoota mentioned this issue Jul 25, 2023

Fix: Set encoding in XML.parse_html explicitly to UTF-8 #13705

Merged

straight-shoota closed this as completed in #13705 Aug 15, 2023

sefidel mentioned this issue Feb 3, 2024

[Bug] RSS: Japanese video description encoding issues iv-org/invidious#4256

Closed
