Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Parse error for comments containing special characters inside a DOCTYPE declaration #280

Open
okeeblow opened this issue Sep 2, 2021 · 1 comment

Comments

@okeeblow
Copy link
Contributor

okeeblow commented Sep 2, 2021

Here's a corner case I ran into when trying to parse the chemical-mime project's XML:

irb(main):002:0> Ox::VERSION
=> "2.14.5"
irb(main):003:0> Ox::load_file('/home/okeeblow/Works/DistorteD/CHECKING-YOU-OUT/mime/packages/third-party/chemical-mime/chemical-mime-database.xml.in')
(irb):3:in `load_file': invalid format, dectype not terminated at line 1606, column 18 [parse.c:339] (Ox::ParseError)
        from (irb):3:in `<main>'
        from ./bin/repl:11:in `<main>'

That line 1606 is the end of the file, so the parser ran away searching for an end-symbol that never occurred. The trigger is an XML comment inside a <!DOCTYPE declaration when that comment contains one of the characters ", ', [, or <. In my case it was the less-than symbol seen right here on line 81: https://github.com/dleidert/chemical-mime/blob/4fd66e3b3b7d922555d1e25587908b036805c45b/src/chemical-mime-database.xml.in#L81

This does seems to be valid XML syntax according to my reading of the spec: "[comments] may appear within the document type declaration at places allowed by the grammar" https://www.w3.org/TR/REC-xml/#sec-comments

The cause within Ox is the handling of those four special characters in parse.c's and sax.c'sread_delimited (both versions) as called by read_doctype:

  • ox/ext/ox/parse.c

    Lines 341 to 352 in 337b082

    case '"':
    read_delimited(pi, c);
    break;
    case '\'':
    read_delimited(pi, c);
    break;
    case '[':
    read_delimited(pi, ']');
    break;
    case '<':
    read_delimited(pi, '>');
    break;
  • ox/ext/ox/sax.c

    Lines 572 to 583 in 5cc4bff

    case '"':
    c = read_delimited(dr, c);
    break;
    case '\'':
    c = read_delimited(dr, c);
    break;
    case '[':
    c = read_delimited(dr, ']');
    break;
    case '<':
    c = read_delimited(dr, '>');
    break;

I narrowed it down to some very basic test cases. The simplest case of comment-inside-doctype without special characters does parse successfully but leaves the comment as plain text in the Ox::DocType's value:

irb(main):001:0> Ox::load("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<!DOCTYPE hey [\n<!-- this is a comment -->\n] l>")
=> 
#<Ox::Document:0x0000558c20da4380
 @attributes={:version=>"1.0", :encoding=>"UTF-8"},
 @nodes=[#<Ox::DocType:0x0000558c20da4268 @value="hey [\n<!-- this is a comment -->\n] l">]>

The other four all fail:

irb(main):002:0> Ox::load("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<!DOCTYPE hey [\n<!-- this is a comment containing a < character -->\n] l>")
(irb):2:in `load': invalid format, dectype not terminated at line 4, column 8 [parse.c:339] (Ox::ParseError)
irb(main):003:0> Ox::load("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<!DOCTYPE hey [\n<!-- this is a comment containing a \" character -->\n] l>")
(irb):3:in `load': invalid format, dectype not terminated at line 4, column 10 [parse.c:339] (Ox::ParseError)
irb(main):004:0> Ox::load("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<!DOCTYPE hey [\n<!-- this is a comment containing a ' character -->\n] l>")
(irb):4:in `load': invalid format, dectype not terminated at line 4, column 10 [parse.c:339] (Ox::ParseError)
irb(main):005:0> Ox::load("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<!DOCTYPE hey [\n<!-- this is a comment containing a [ character -->\n] l>")
(irb):5:in `load': invalid format, dectype not terminated at line 4, column 8 [parse.c:339] (Ox::ParseError)

I attempted to patch this myself but got a little lost trying to understand the interaction between the PInfo's position pointer and text value, and then very lost with the SAX parser's buffer checkpoint/checkback handling :p

For posterity here is as far as I got:

diff --git a/ext/ox/parse.c b/ext/ox/parse.c
index e9e28b3..0aa10ed 100644
--- a/ext/ox/parse.c
+++ b/ext/ox/parse.c
@@ -330,6 +330,11 @@ read_delimited(PInfo pi, char end) {
        }
     } else {
        while (1) {
+      if (0 == strncmp("<!--", pi->s, 4)) {
+    pi->s += 4;
+    read_comment(pi);
+    break;
+      }
            c = *pi->s++;
            if (end == c) {
                return;

Which fixed only the non-SAX parsing, still resulted in the comment text getting added to the DocType's value, and appended the new Comment to the top-level Document instead of to the DocType:

irb(main):002:0> Ox::load("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<!DOCTYPE hey [\n<!-- this is a comment containing a < character -->\n] l>")
=>                                                                                       
#<Ox::Document:0x0000563eec630d90                                                                                                 
 @attributes={:version=>"1.0", :encoding=>"UTF-8"},                                                                                                    
 @nodes=                                                                                                                          
  [#<Ox::Comment:0x0000563eec630c28 @value="this is a comment containing a < character">,                                         
   #<Ox::DocType:0x0000563eec630bd8 @value="hey [\n<!-- this is a comment containing a < character">]>

For my purposes I don't actually need the contents of any of these doctype comments, just for the rest of the file to parse successfully with my SAX parser. Feel free to consider this one the lowest of low-priority though :)

@ohler55
Copy link
Owner

ohler55 commented Sep 2, 2021

I'll take a look and get it fixed. It certainly is an edge case. I appreciate the detective work too.

okeeblow added a commit to okeeblow/DistorteD that referenced this issue Sep 6, 2021
Preparation for importing the Chemical MIME project's `shared-mime-info` package pending ohler55/ox#280 (totally no rush lol)

Minor re-wording to reflect the fact that not all primary types are IANA-approved.
okeeblow added a commit to okeeblow/DistorteD that referenced this issue Sep 18, 2021
Preparation for importing the Chemical MIME project's `shared-mime-info` package pending ohler55/ox#280 (totally no rush lol)

Minor re-wording to reflect the fact that not all primary types are IANA-approved.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants