fix XML spec whitespace compliance #1

Open
wants to merge 6 commits into
base: master

mickeyl commented Sep 13, 2013

To ease CDATA processing, TBXML used a trick: after moving the actual
content 'over' the start of the CDATA section, it overwrote the
remaining characters to the right with whitespace. Later on, TBXML
stripped all the whitespace within(!) the text content, thus removing
the whitespace it had added in the first step.

However, removing significant whitespace (enclosed within the text
portion of a tag) is against the spec. Unfortunately, it is not enough
to remove the stripping code; the CDATA trick has to be adapted as well.
My approach is to mark the end of the text with a \0 terminator and to
set elementStart appropriately so that the search continues from there.

As a neat side effect, this also slightly improves parsing speed.
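
For readers who have not looked at the diff, here is a rough sketch of the idea in plain C. It is not the code from this PR: `unwrap_text` and its parameters are hypothetical names, and it assumes a mutable, NUL-terminated buffer in which `text` points just past an element's opening tag. CDATA bodies are moved over their `<![CDATA[` markers, the text is terminated with `\0` instead of being padded with whitespace, and the returned pointer plays the role that elementStart plays in the description above.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper, for illustration only (not TBXML's actual code).
 *
 * Unwraps any CDATA sections in the element text in place, terminates the
 * text with '\0', and returns the position just after the '<' that starts
 * the next tag, which is where scanning should continue. Returns NULL if a
 * CDATA section is unterminated or no following tag is found.
 * No whitespace is added or removed.
 */
static char *unwrap_text(char *text)
{
    static const char cdata_open[] = "<![CDATA[";
    static const char cdata_close[] = "]]>";
    const size_t open_len = sizeof cdata_open - 1;
    const size_t close_len = sizeof cdata_close - 1;

    char *read = text;   /* next input character to examine      */
    char *write = text;  /* next output position (write <= read) */

    while (*read != '\0') {
        if (strncmp(read, cdata_open, open_len) == 0) {
            char *body = read + open_len;
            char *end = strstr(body, cdata_close);
            if (end == NULL)
                return NULL;            /* unterminated CDATA section */
            size_t n = (size_t)(end - body);
            memmove(write, body, n);    /* regions may overlap */
            write += n;
            read = end + close_len;     /* skip the "]]>" marker */
        } else if (*read == '<') {
            *write = '\0';              /* end-of-text marker, no padding */
            return read + 1;            /* continue after the '<' */
        } else {
            *write++ = *read++;         /* ordinary text, copied verbatim */
        }
    }
    return NULL;                        /* ran off the buffer: no next tag */
}
```

Called on a buffer holding `  a <![CDATA[ b < c ]]><![CDATA[&]]> d </x>`, this sketch leaves the text `  a  b < c & d ` at the start of the buffer, whitespace intact, and returns a pointer to `/x>`. Back-to-back CDATA sections need no special handling because each loop iteration appends at the current write position.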

mickeyl commented Mar 12, 2014

Thanks for your comment. In order to fix another oddity, I had to simplify the computation of elementStart. I believe this also fixes your issue with multiple back-to-back CDATA sections. I will run a test as soon as possible.

Unfortunately, some servers still send ASCII, MacRoman, or similar encodings;
in such cases, TBXML yields null instead of the correct strings.
2 participants