DocBook reader misses some character entities like &dollar; #7938

Closed
andreasabel opened this issue Feb 24, 2022 · 9 comments

@andreasabel

The DocBook reader does not seem to understand some ampersand codes (named character entities), like &dollar;:

$ pandoc -f docbook -t rst happy.xml 
Invalid XML:
UnresolvedEntityException (fromList ["dollar","ldquo","mdash","percnt","rdquo"])

Input file is: https://github.com/haskell/happy/blob/934763408f8df29180c63d7a2c69be0b84030967/doc/happy.xml
Adding option -s does not help.

MWE is:

<?xml version="1.0"?>
<!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook XML V4.2//EN"
   "http://www.oasis-open.org/docbook/xml/4.2/docbookx.dtd">

<book id="dollar">
  &dollar;
</book>

Pandoc version?
What version of pandoc are you using, on what OS?

$ pandoc --version
pandoc 2.17.1.1
Compiled with pandoc-types 1.22.1, texmath 0.12.4, skylighting 0.12.3,
citeproc 0.6.0.1, ipynb 0.2

$ uname -a
Darwin unknownACBC32A26F85 18.7.0 Darwin Kernel Version 18.7.0: Tue Jun 22 19:37:08 PDT 2021; root:xnu-4903.278.70~1/RELEASE_X86_64 x86_64
@mb21
Collaborator

mb21 commented Feb 24, 2022

I think &dollar; is just an HTML entity code that's not valid in XML?

@tarleb tarleb removed the bug label Feb 24, 2022
@tarleb
Collaborator

tarleb commented Feb 24, 2022

Right, &dollar; is neither a default XML entity, nor is it defined in the DocBook DTD. You can use &#x0024; instead.
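
To illustrate the suggestion above, here is a minimal sketch using Python's standard-library XML parser rather than pandoc. It assumes no DTD is loaded, so it only mirrors the situation described in this thread; the helper name `root_text_or_none` is made up for the example.

```python
import xml.etree.ElementTree as ET


def root_text_or_none(xml_text):
    """Return the root element's text, or None if the parser rejects the input."""
    try:
        return ET.fromstring(xml_text).text
    except ET.ParseError:
        # Raised for entity references the parser has no definition for.
        return None


# Named entity: not one of the five predefined XML entities, and no DTD is read.
named = '<book id="dollar">&dollar;</book>'
# Numeric character reference: always resolvable, no DTD needed.
numeric = '<book id="dollar">&#x0024;</book>'

print(root_text_or_none(named))
print(root_text_or_none(numeric))
```

The numeric form works in any conforming XML parser because character references are resolved without consulting entity declarations.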

@tarleb tarleb closed this as completed Feb 24, 2022
@andreasabel
Author

@tarleb wrote:

nor is it defined in the DocBook DTD.

It is actually defined, in file: https://docbook.org/xml/4.2/ent/iso-num.ent

Is the DTD given in the xml file ignored by pandoc?

"http://www.oasis-open.org/docbook/xml/4.2/docbookx.dtd"

@tarleb tarleb reopened this Feb 24, 2022
@tarleb
Collaborator

tarleb commented Feb 24, 2022

Oh, I missed that. Yes, I think we currently ignore the DTD.

@andreasabel
Author

Ok, thanks @tarleb, for the investigation. Then I suppose the short-term solution is to work around this by search-and-replace...
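
The search-and-replace workaround could look roughly like the sketch below. It is not part of pandoc; the function name and the table are assumptions made for illustration, and the table covers only the five entities from the error message above (the code point for `mdash`, U+2014, comes from the ISO entity sets rather than from this thread). It rewrites each known named entity to a numeric character reference and leaves everything else, including `&amp;` and `&lt;`, untouched.

```python
import re

# Hypothetical table: just the entities reported as unresolved in this issue.
ENTITY_CODEPOINTS = {
    "dollar": 0x0024,
    "percnt": 0x0025,
    "ldquo": 0x201C,
    "rdquo": 0x201D,
    "mdash": 0x2014,
}

# Matches a named entity reference such as &dollar; (not numeric ones like &#36;).
_ENTITY_RE = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def replace_named_entities(xml_text, table=ENTITY_CODEPOINTS):
    """Rewrite known named entities as hexadecimal character references."""

    def substitute(match):
        name = match.group(1)
        if name in table:
            return "&#x{:04X};".format(table[name])
        # Unknown or predefined entities (amp, lt, ...) are left as they are.
        return match.group(0)

    return _ENTITY_RE.sub(substitute, xml_text)


print(replace_named_entities('<book id="dollar">&dollar; &amp; &percnt;</book>'))
```

Note that this is a plain text substitution, so it would also rewrite matches inside CDATA sections or comments; for a file like happy.xml that may or may not matter.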

@jgm
Owner

jgm commented Feb 24, 2022

We can hard-code support for these entities for docbook:

<!-- iso-num.ent (initially distributed with DocBook XML DTD V4.1.1beta1) -->

<!-- Derived from the corresponding ISO 8879 standard entity set
     and the Unicode character mappings provided by Sebastian Rahtz -->

<!ENTITY half	"&#x00BD;"> <!-- VULGAR FRACTION ONE HALF -->
<!ENTITY frac12	"&#x00BD;"> <!-- VULGAR FRACTION ONE HALF -->
<!ENTITY frac14	"&#x00BC;"> <!-- VULGAR FRACTION ONE QUARTER -->
<!ENTITY frac34	"&#x00BE;"> <!-- VULGAR FRACTION THREE QUARTERS -->
<!ENTITY frac18	"&#x215B;"> <!--  -->
<!ENTITY frac38	"&#x215C;"> <!--  -->
<!ENTITY frac58	"&#x215D;"> <!--  -->
<!ENTITY frac78	"&#x215E;"> <!--  -->
<!ENTITY sup1	"&#x00B9;"> <!-- SUPERSCRIPT ONE -->
<!ENTITY sup2	"&#x00B2;"> <!-- SUPERSCRIPT TWO -->
<!ENTITY sup3	"&#x00B3;"> <!-- SUPERSCRIPT THREE -->
<!ENTITY plus	"&#x002B;"> <!-- PLUS SIGN -->
<!ENTITY plusmn	"&#x00B1;"> <!-- PLUS-MINUS SIGN -->
<!ENTITY lt	"&#38;#60;"> <!-- LESS-THAN SIGN -->
<!ENTITY equals	"&#x003D;"> <!-- EQUALS SIGN -->
<!ENTITY gt	"&#x003E;"> <!-- GREATER-THAN SIGN -->
<!ENTITY divide	"&#x00F7;"> <!-- DIVISION SIGN -->
<!ENTITY times	"&#x00D7;"> <!-- MULTIPLICATION SIGN -->
<!ENTITY curren	"&#x00A4;"> <!-- CURRENCY SIGN -->
<!ENTITY pound	"&#x00A3;"> <!-- POUND SIGN -->
<!ENTITY dollar	"&#x0024;"> <!-- DOLLAR SIGN -->
<!ENTITY cent	"&#x00A2;"> <!-- CENT SIGN -->
<!ENTITY yen	"&#x00A5;"> <!-- YEN SIGN -->
<!ENTITY num	"&#x0023;"> <!-- NUMBER SIGN -->
<!ENTITY percnt	"&#x0025;"> <!-- PERCENT SIGN -->
<!ENTITY amp	"&#38;#38;"> <!-- AMPERSAND -->
<!ENTITY ast	"&#x002A;"> <!-- ASTERISK OPERATOR -->
<!ENTITY commat	"&#x0040;"> <!-- COMMERCIAL AT -->
<!ENTITY lsqb	"&#x005B;"> <!-- LEFT SQUARE BRACKET -->
<!ENTITY bsol	"&#x005C;"> <!-- REVERSE SOLIDUS -->
<!ENTITY rsqb	"&#x005D;"> <!-- RIGHT SQUARE BRACKET -->
<!ENTITY lcub	"&#x007B;"> <!-- LEFT CURLY BRACKET -->
<!ENTITY horbar	"&#x2015;"> <!-- HORIZONTAL BAR -->
<!ENTITY verbar	"&#x007C;"> <!-- VERTICAL LINE -->
<!ENTITY rcub	"&#x007D;"> <!-- RIGHT CURLY BRACKET -->
<!ENTITY micro	"&#x00B5;"> <!-- MICRO SIGN -->
<!ENTITY ohm	"&#x2126;"> <!-- OHM SIGN -->
<!ENTITY deg	"&#x00B0;"> <!-- DEGREE SIGN -->
<!ENTITY ordm	"&#x00BA;"> <!-- MASCULINE ORDINAL INDICATOR -->
<!ENTITY ordf	"&#x00AA;"> <!-- FEMININE ORDINAL INDICATOR -->
<!ENTITY sect	"&#x00A7;"> <!-- SECTION SIGN -->
<!ENTITY para	"&#x00B6;"> <!-- PILCROW SIGN -->
<!ENTITY middot	"&#x00B7;"> <!-- MIDDLE DOT -->
<!ENTITY larr	"&#x2190;"> <!-- LEFTWARDS DOUBLE ARROW -->
<!ENTITY rarr	"&#x2192;"> <!-- RIGHTWARDS DOUBLE ARROW -->
<!ENTITY uarr	"&#x2191;"> <!-- UPWARDS ARROW -->
<!ENTITY darr	"&#x2193;"> <!-- DOWNWARDS ARROW -->
<!ENTITY copy	"&#x00A9;"> <!-- COPYRIGHT SIGN -->
<!ENTITY reg	"&#x00AE;"> <!-- REG TRADE MARK SIGN -->
<!ENTITY trade	"&#x2122;"> <!-- TRADE MARK SIGN -->
<!ENTITY brvbar	"&#x00A6;"> <!-- BROKEN BAR -->
<!ENTITY not	"&#x00AC;"> <!-- NOT SIGN -->
<!ENTITY sung	"&#x2669;"> <!--  -->
<!ENTITY excl	"&#x0021;"> <!-- EXCLAMATION MARK -->
<!ENTITY iexcl	"&#x00A1;"> <!-- INVERTED EXCLAMATION MARK -->
<!ENTITY quot	"&#x0022;"> <!-- QUOTATION MARK -->
<!ENTITY apos	"&#x0027;"> <!-- APOSTROPHE -->
<!ENTITY lpar	"&#x0028;"> <!-- LEFT PARENTHESIS -->
<!ENTITY rpar	"&#x0029;"> <!-- RIGHT PARENTHESIS -->
<!ENTITY comma	"&#x002C;"> <!-- COMMA -->
<!ENTITY lowbar	"&#x005F;"> <!-- LOW LINE -->
<!ENTITY hyphen	"&#x002D;"> <!-- HYPHEN-MINUS -->
<!ENTITY period	"&#x002E;"> <!-- FULL STOP -->
<!ENTITY sol	"&#x002F;"> <!-- SOLIDUS -->
<!ENTITY colon	"&#x003A;"> <!-- COLON -->
<!ENTITY semi	"&#x003B;"> <!-- SEMICOLON -->
<!ENTITY quest	"&#x003F;"> <!-- QUESTION MARK -->
<!ENTITY iquest	"&#x00BF;"> <!-- INVERTED QUESTION MARK -->
<!ENTITY laquo	"&#x00AB;"> <!-- LEFT-POINTING DOUBLE ANGLE QUOTATION MARK -->
<!ENTITY raquo	"&#x00BB;"> <!-- RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK -->
<!ENTITY lsquo	"&#x2018;"> <!--  -->
<!ENTITY rsquo	"&#x2019;"> <!-- RIGHT SINGLE QUOTATION MARK -->
<!ENTITY ldquo	"&#x201C;"> <!--  -->
<!ENTITY rdquo	"&#x201D;"> <!-- RIGHT DOUBLE QUOTATION MARK -->
<!ENTITY nbsp	"&#x00A0;"> <!-- NO-BREAK SPACE -->
<!ENTITY shy	"&#x00AD;"> <!-- SOFT HYPHEN -->
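
As a rough illustration of what "hard-coding" such a list amounts to, the sketch below turns declarations of the form shown above into a name-to-character lookup table. It is written in Python for brevity and is not pandoc's implementation; the function name is invented for the example. Declarations that do not use a single hexadecimal reference (such as `lt` and `amp`, which XML predefines anyway) are deliberately skipped.

```python
import re

# Matches declarations like: <!ENTITY dollar "&#x0024;">
_DECL_RE = re.compile(r'<!ENTITY\s+(\w+)\s+"&#x([0-9A-Fa-f]+);">')


def parse_entity_declarations(ent_text):
    """Build a {entity name: character} table from simple ENTITY declarations."""
    return {
        name: chr(int(hexcode, 16))
        for name, hexcode in _DECL_RE.findall(ent_text)
    }


# A few lines copied from the list above.
SAMPLE = """
<!ENTITY dollar "&#x0024;"> <!-- DOLLAR SIGN -->
<!ENTITY lt "&#38;#60;"> <!-- LESS-THAN SIGN -->
<!ENTITY percnt "&#x0025;"> <!-- PERCENT SIGN -->
"""

table = parse_entity_declarations(SAMPLE)
print(table)
```

A reader could consult such a table whenever it meets an entity reference that is not one of the five predefined XML entities.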

@jgm
Owner

jgm commented Feb 24, 2022

Note to self: we need to use the xml-conduit option psDecodeEntities -- possibly adding a new function to T.P.XML.Light.

@tarleb
Collaborator

tarleb commented Feb 24, 2022

From what I understand after reading a bit more on the topic: DocBook 5 only defines the standard XML entities (hence my original statement); DocBook 4 however contains all of these: https://www.w3.org/2003/entities/2007doc/byalpha.html

@jgm jgm closed this as completed in 5375bd1 Feb 24, 2022
@andreasabel
Author

Thank you mucho mucho! I confirm the problem to be fixed for happy.xml.
