Skip to content
This repository has been archived by the owner on Feb 15, 2023. It is now read-only.

Matching on local names instead of (ns, name) pair causes assertion from close_current_cell when dealing with math:td #278

Closed
gsnedders opened this issue Dec 3, 2014 · 5 comments · Fixed by #286

Comments

@gsnedders
Copy link
Contributor

More bugs from afl!

The following input:

<table><th><math><td></tr>

Results in:

get_title: src/parser.c:1497: close_current_cell: Assertion `!has_an_element_in_table_scope(parser, GUMBO_TAG_TH)' failed.

When processing the </tr> tag in the parser, the tree looks as follows:

html:html
  html:head
  html:body
    html:table
      html:tbody
        html:tr
          html:th
            math:math
              math:td

The </tr> starts off being processed "in foreign content" before falling back to the current HTML insertion mode ("in cell"). This then calls "close the cell" where Gumbo fails.

This appears to be the common HTML parser bug of matching on local name alone (it should be true that there is never both an html:td and an html:th in table scope at the asme time). I haven't looked at whether #274 fixes this, but it is certainly the case that html5lib-tests has tests for cases similar to this nowadays (though probably doesn't in the 0.95 version that Gumbo currently uses), so #3 is somewhat relevant here.

@gsnedders
Copy link
Contributor Author

This comes about as the matching is done on the tag field of GumboElement, which is set based on the token name by the tokeniser. Inevitably, this doesn't distinguish based on namespace (as the tokeniser has no such concept).

@nostrademons
Copy link
Contributor

Yeah, I noticed this when fixing up the insertion machinery for #274. It does not fix it, however - the changes would be too invasive, since the signature of (node_)tag_i[ns] would have to change and those're used all over the place. It may be time to bite the bullet and do it, but it probably should be in a different branch...the template tag stuff is already really complicated, it'll become nearly impossible to review with this change too.

I'd probably also want to do another round of profiling & optimization before possibly changing the signatures of these. It's always annoyed me that they work by looping over a varargs list, which my intuition tells me ought to be really slow. The round of profiling I did while at Google indicated that the time was dwarfed by UTF-8 decoding and entity references, but now that both of those have been sped up significantly, I wonder how much the tag matching functions contribute to runtime.

@gsnedders
Copy link
Contributor Author

I wonder if GumboElement.tag should be set based on the (ns, name) pair instead, which avoids that problem? In my mind, the field should categorise elements into known elements (html:td, svg:svg, etc.) rather than based purely upon the actual name of the token. This means none of those functions need to change signature, and hopefully avoids users misusing the tag field (because I'd have written code that fell foul of the same bug as here!). I know this means uncoupling it from how it's purely set based on the tokeniser, but I think that wouldn't break anything from my look before?

@gsnedders
Copy link
Contributor Author

#279 turns out to be the same, except being during resetting the insertion mode instead of while popping elements from the stack.

@kevinhendricks
Copy link
Contributor

If anyone is interested, please see the must recent pull request (number 5?) which I think is meant to address this very issue.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants