Even with -utf8, tidy converts the UTF-8 character U+00A0 into a numeric entity #871

Closed
@mcepl

When running tidy -i -m -xml -utf8 to canonicalize an XML file, I get this diff (among many other changes):

@@ -15389,7 +15389,7 @@ xsi:schemaLocation="http://www.bibletechnologies.net/2003/OSIS/namespace z:/osis
             <verse sID="Lev.11.13" osisID="Lev.11.13" />Toto jsou
             ptáci, jichž se budete štítit. Nesmějí se jíst, jsou
             ohavní: 
-            <note>některé živočišné druhy v následujících
+            <note>některé živočišné druhy v&#160;následujících
             výčtech nelze určit s jistotou</note>orel, orlosup,
             mořský orel, 
             <verse eID="Lev.11.13" />

The character after the preposition “v” is a non-breaking space (U+00A0). In my opinion, -utf8 means that both the input and output documents are UTF-8, so tidy should keep its dirty paws off the characters; in particular, it shouldn’t convert perfectly valid UTF-8 characters into numeric entities.

I am using tidy-5.6.0-1.7.x86_64 from the openSUSE package.
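
To make the complaint concrete, here is a minimal Python sketch (not part of tidy) showing that U+00A0 is an ordinary two-byte UTF-8 sequence and that `&#160;` is just a numeric reference to the same character. The helper name `restore_nbsp` is hypothetical; it only illustrates undoing the substitution after tidy has run, and it assumes the file has no CDATA sections or comments where the literal text `&#160;` must be preserved. Tidy's quick reference also lists a `quote-nbsp` option that may govern this behaviour; that has not been verified here against 5.6.0.

```python
import re

NBSP = "\u00a0"  # non-breaking space, the character in question

def restore_nbsp(text: str) -> str:
    """Replace decimal or hex numeric references to U+00A0 with the literal character.

    Hypothetical post-processing helper: a plain textual substitution,
    so it would also rewrite matches inside CDATA sections or comments.
    """
    return re.sub(r"&#(?:0*160|[xX]0*[aA]0);", NBSP, text)

# The two variants of the line from the diff above.
original = "některé živočišné druhy v\u00a0následujících"
tidied = "některé živočišné druhy v&#160;následujících"

# U+00A0 has a perfectly ordinary UTF-8 encoding (bytes C2 A0).
print(NBSP.encode("utf-8"))

# Undoing the entity substitution recovers the original text.
print(restore_nbsp(tidied) == original)
```

Both forms parse to the same XML character data, so the change is harmless to a parser; the objection is that it adds noise to a diff that is meant to show only canonicalization.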
