fix XML spec whitespace compliance #1

mickeyl · 2013-09-13T15:36:53Z

To ease CDATA processing, TBXML used a trick: after moving the actual
content 'over' the start of the CDATA section, the remaining characters
to the right were overwritten with whitespace. Later on, TBXML stripped
all the whitespace within(!) the text content, thus removing the
whitespace it had added in the first step.

However, removing significant whitespace (enclosed within the text
portion of a tag) is against the spec. Unfortunately, it is not enough
to remove those portions of code; the CDATA trick has to be adjusted
as well. My approach is to mark the end of the text with a \0 and to
set elementStart appropriately so that searching continues from there.

As a neat side effect, this also slightly improves parsing speed.
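
The approach described above can be illustrated with a minimal sketch in plain C. This is not the code from the patch: `collapse_cdata` and its parameter are hypothetical names, and the sketch handles only the first CDATA section in a mutable buffer. It moves the CDATA content left over the opening marker, terminates the text with a \0 instead of padding with whitespace, and returns the position where scanning would resume (roughly the role elementStart plays in the description).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper, not TBXML's actual implementation.
 *
 * `text` points into a mutable, NUL-terminated buffer. If a
 * "<![CDATA[" ... "]]>" section is found, its content is moved left
 * over the opening marker and the text is terminated with '\0', so
 * no padding whitespace has to be written and stripped again.
 *
 * Returns a pointer to the first byte after "]]>" (where the parser
 * would continue searching), or NULL if no complete CDATA section
 * is present.
 */
static char *collapse_cdata(char *text)
{
    static const char cdata_open[]  = "<![CDATA[";
    static const char cdata_close[] = "]]>";

    assert(text != NULL);

    char *cdata_start = strstr(text, cdata_open);
    if (cdata_start == NULL) {
        return NULL;
    }

    char *content = cdata_start + strlen(cdata_open);
    char *cdata_end = strstr(content, cdata_close);
    if (cdata_end == NULL) {
        return NULL; /* unterminated section: leave the buffer untouched */
    }

    size_t content_len = (size_t)(cdata_end - content);

    /* Source and destination overlap, so memmove is required here. */
    memmove(cdata_start, content, content_len);

    /* Mark the end of the text; whitespace inside it stays intact. */
    cdata_start[content_len] = '\0';

    return cdata_end + strlen(cdata_close);
}
```

Called on a buffer holding `  x <![CDATA[<b> & ]]></a>`, the sketch leaves the text `  x <b> & ` in place, with leading and trailing spaces preserved, and returns a pointer to `</a>`.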

mickeyl · 2014-03-12T15:51:51Z

Thanks for your comment. In order to fix another oddity, I had to simplify the computation of elementStart. I believe this also fixes your issue with multiple back-to-back CDATA sections. I will run a test asap.

mickeyl mentioned this pull request Nov 2, 2013

TBXML Whitespace Handling codebots-ltd/TBXML#20

Open

fix parsing CDATA sections where the content starts with '<'

2ec549a

mickeyl added 4 commits November 7, 2014 12:37

tbxml.h: add missing foundation include

7daa31e

add a way to modify the global assumed encoding

d5d4ee2

Unfortunately, some servers still send ASCII, MacOSRoman, or similar encodings. In such cases, TBXML returns null instead of the correct strings.

initialize global encoding with UTF8

7742e86

gitignore++

ec13280
