Skip to content
This repository has been archived by the owner on Feb 15, 2023. It is now read-only.

Handle CDATA properly #273

Merged
merged 14 commits into from
Feb 17, 2015
Merged

Handle CDATA properly #273

merged 14 commits into from
Feb 17, 2015

Conversation

nostrademons
Copy link
Contributor

This changes the tokenizer so that CDATA sections are properly plumbed through to the parser, so that the parser actually produces GUMBO_NODE_CDATA nodes. Fixes #266. @craigbarnes, would you like to review?

@craigbarnes
Copy link
Contributor

I get one new html5lib test failure with this branch:

======================================================================
FAIL: test_plain-text-unsafe_11 (__main__.Html5libAdapterTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "python/gumbo/html5lib_adapter_test.py", line 155, in test_func
    return self.impl(inner_html, input, expected, errors)
  File "python/gumbo/html5lib_adapter_test.py", line 126, in impl
    error_msg.encode('ascii', 'xmlcharrefreplace') + '\n')
AssertionError: 

Input:
<svg><![CDATA[fillertext]]>

Expected:
<html html>
  <html head>
  <html body>
    <svg svg>
      "&#65533;filler&#65533;text&#65533;"

Received:
<html html>
  <html head>
  <html body>
    <svg svg>

----------------------------------------------------------------------
Ran 1326 tests in 1.291s

FAILED (failures=1)

This is as reported by html5lib_adapter_test.py using html5lib/html5lib-tests@1d3a787. I also checked with the Lua test runner, which fails on the same test.

Not sure which revision of html5lib-tests I should be using for this branch though...

@craigbarnes
Copy link
Contributor

Just tried using html5lib/html5lib-tests@80fc271 (current master) and it's the same there.

Also noticed a new warning, FWIW:

src/error.c:99:3: warning: enumeration value 'GUMBO_TOKEN_CDATA' not handled in switch [-Wswitch]

@nostrademons
Copy link
Contributor Author

I've updated this pull request to fix the failing test case and compiler warning. I'll let it sit for a bit, but plan to merge fairly soon because I'd like to get it in to master before merging @vmg and @kevinhendricks changes in to a v0.10 branch. Hopefully somebody can get around to reviewing it shortly.

@vmg
Copy link
Contributor

vmg commented Feb 17, 2015

Looks good to me!

@aroben
Copy link
Contributor

aroben commented Feb 17, 2015

As far as I can tell from the spec [1], tokenizing a CDATA section just emits the text contained within the section and never produces a "CDATA node". So it doesn't seem like GUMBO_NODE_CDATA should exist at all. Am I missing something?

@kevinhendricks
Copy link
Contributor

@aroben
I think the idea was to allow a serializer to reproduce the tree as closely as possible to the original. Without a CDATA node type, this information would be lost to the tree-walkers/serializers.

On Feb 17, 2015, at 11:24 AM, Adam Roben notifications@github.com wrote:

As far as I can tell from the spec [1], tokenizing a CDATA section just emits the text contained within the section and never produces a "CDATA node". So it doesn't seem like GUMBO_NODE_CDATA should exist at all. Am I missing something?


Reply to this email directly or view it on GitHub.

@aroben
Copy link
Contributor

aroben commented Feb 17, 2015

I see, so GUMBO_NODE_CDATA is like GUMBO_NODE_WHITESPACE: it's really just a text node, but with a tiny bit of extra information.

@kevinhendricks
Copy link
Contributor

@aroben - yes and nicely only if you care to use that extra information. So it can not hurt so to speak.

On Feb 17, 2015, at 11:32 AM, Adam Roben notifications@github.com wrote:

I see, so GUMBO_NODE_CDATA is like GUMBO_NODE_WHITESPACE: it's really just a text node, but with a tiny bit of extra information.


Reply to this email directly or view it on GitHub.

@nostrademons
Copy link
Contributor Author

Discussion was in bug #266. GUMBO_NODE_CDATA already existed in gumbo.h, but was never actually being set. Given gumbo's primary audience of tool writers and the existence of GUMBO_NODE_WHITESPACE already, we felt the extra node type would give the most useful information to client code.

nostrademons added a commit that referenced this pull request Feb 17, 2015
@nostrademons nostrademons merged commit 7a55fdc into google:master Feb 17, 2015
@nostrademons nostrademons deleted the cdata branch February 17, 2015 20:32
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

GUMBO_NODE_CDATA unused?
5 participants