roxmltree parsing strategy

XML parsing is hard. Everyone knows that. The other problem is that a parsed document can be represented in very different ways:

  • You can preserve comments or ignore them completely or partially.
  • You can represent text data as a separate node or embed it into the element node.
  • You can keep CDATA as a separate node or merge it into the text node.
  • You can preserve the XML declaration or ignore it completely.
  • ... and many more.

This document explains how roxmltree parses and represents an XML document.

XML declaration

The XML declaration is completely ignored, mostly because it doesn't contain any information that is valuable to us.

  • version is expected to be 1.*; otherwise, parsing fails with an error.
  • encoding is irrelevant since we are parsing only valid UTF-8 strings.
  • standalone is ignored because almost no one follows its constraints.
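The version rule above can be sketched as a small check. This is only an illustration: `check_version` is a hypothetical helper, not a roxmltree function, and it assumes that "1.*" means "1." followed by one or more digits, as in the XML grammar for version numbers.

```rust
/// Hypothetical helper (not part of the roxmltree API): accepts any
/// `1.*` version string and rejects everything else.
fn check_version(version: &str) -> Result<(), String> {
    match version.strip_prefix("1.") {
        // Require at least one character after "1." and only ASCII digits.
        Some(minor) if !minor.is_empty() && minor.bytes().all(|b| b.is_ascii_digit()) => Ok(()),
        _ => Err(format!("unsupported XML version: {}", version)),
    }
}
```

With this sketch, `check_version("1.0")` and `check_version("1.1")` succeed, while `check_version("2.0")` returns an error.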

DTD

Only ENTITY declarations are resolved. Everything else is ignored at the moment.

<!DOCTYPE test [
    <!ENTITY a 'text<p/>text'>
]>
<e>&a;</e>

will be parsed into:

<e>text<p/>text</e>

Where p is an element, not text.

Comments

All comments will be preserved.

Processing instructions

All processing instructions will be preserved.

Whitespace

All whitespace inside the root element will be preserved. For example:

<p>
    text
</p>

will be parsed as \n␣␣␣␣text\n.

The same goes for escaped whitespace:

<p>&#x20;&#x20;text&#x20;&#x20;</p>

will be parsed as ␣␣text␣␣.

CDATA

CDATA will be merged into the surrounding text node:

<p>t<![CDATA[e&#x20;]]>&#x20;x<![CDATA[t]]></p>

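will be parsed as te&#x20; xt. Note that the character reference inside the CDATA section is kept verbatim, while the one outside it is resolved.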
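A minimal sketch of this merging, assuming a hypothetical tokenizer that yields text and CDATA chunks separately. The `Chunk` type and both functions are illustrative names, not roxmltree internals, and only numeric character references are handled here.

```rust
/// A piece of element content, as a hypothetical tokenizer might emit it.
enum Chunk<'a> {
    /// Regular character data; references still need to be resolved.
    Text(&'a str),
    /// The content of a `<![CDATA[...]]>` section; always taken verbatim.
    CData(&'a str),
}

/// Resolves numeric character references such as `&#x20;` or `&#32;`.
/// Anything that is not a well-formed numeric reference is copied as-is.
fn resolve_char_refs(text: &str) -> String {
    let mut out = String::with_capacity(text.len());
    let mut rest = text;
    while let Some(start) = rest.find("&#") {
        out.push_str(&rest[..start]);
        let tail = &rest[start + 2..];
        let decoded = tail.find(';').and_then(|end| {
            let body = &tail[..end];
            let code = match body.strip_prefix('x') {
                Some(hex) => u32::from_str_radix(hex, 16).ok()?,
                None => body.parse::<u32>().ok()?,
            };
            Some((char::from_u32(code)?, end))
        });
        match decoded {
            Some((ch, end)) => {
                out.push(ch);
                rest = &tail[end + 1..];
            }
            None => {
                out.push_str("&#");
                rest = tail;
            }
        }
    }
    out.push_str(rest);
    out
}

/// Concatenates all chunks into the content of a single text node.
fn merge_chunks(chunks: &[Chunk<'_>]) -> String {
    let mut text = String::new();
    for chunk in chunks {
        match chunk {
            Chunk::Text(s) => text.push_str(&resolve_char_refs(s)),
            Chunk::CData(s) => text.push_str(s),
        }
    }
    text
}
```

Feeding the chunks of the example above, `Text("t")`, `CData("e&#x20;")`, `Text("&#x20;x")` and `CData("t")`, into `merge_chunks` produces the single string `te&#x20; xt`.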

Text

Text will be unescaped. All entity references will be resolved.

<!DOCTYPE test [
    <!ENTITY b 'Some&#x20;text'>
]>
<p>&b;</p>

will be parsed as Some text.
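The resolution step can be sketched roughly as follows. This is an assumption-laden illustration, not roxmltree's implementation: `expand` and `MAX_DEPTH` are hypothetical names, the entity table is a plain `HashMap`, the depth limit is an arbitrary guard against self-referencing entities, and entity values containing markup (such as the `<p/>` example in the DTD section) are out of scope.

```rust
use std::collections::HashMap;

/// Arbitrary nesting limit for this sketch; guards against entity loops.
const MAX_DEPTH: u8 = 10;

/// Expands character and entity references in plain text.
/// `entities` maps entity names to their declared replacement text.
fn expand(text: &str, entities: &HashMap<&str, &str>, depth: u8) -> Result<String, String> {
    if depth > MAX_DEPTH {
        return Err("entity nesting is too deep".to_string());
    }
    let mut out = String::new();
    let mut rest = text;
    while let Some(start) = rest.find('&') {
        out.push_str(&rest[..start]);
        let tail = &rest[start + 1..];
        let end = tail.find(';').ok_or("unterminated reference")?;
        let name = &tail[..end];
        rest = &tail[end + 1..];
        if let Some(num) = name.strip_prefix('#') {
            // Numeric character reference: decimal or `x`-prefixed hex.
            let code = match num.strip_prefix('x') {
                Some(hex) => u32::from_str_radix(hex, 16),
                None => num.parse::<u32>(),
            }
            .map_err(|_| format!("invalid character reference: &{};", name))?;
            out.push(char::from_u32(code).ok_or("invalid code point")?);
        } else {
            match name {
                // The five predefined XML entities.
                "lt" => out.push('<'),
                "gt" => out.push('>'),
                "amp" => out.push('&'),
                "apos" => out.push('\''),
                "quot" => out.push('"'),
                // A declared entity: expand its replacement text recursively.
                _ => {
                    let value = entities
                        .get(name)
                        .ok_or_else(|| format!("unknown entity: &{};", name))?;
                    out.push_str(&expand(value, entities, depth + 1)?);
                }
            }
        }
    }
    out.push_str(rest);
    Ok(out)
}
```

With `b` mapped to `Some&#x20;text`, calling `expand("&b;", &entities, 0)` yields `Some text`, matching the example above.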

Attribute-Value Normalization

Attribute-value normalization works as described in the XML specification (section 3.3.3).
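As a rough illustration of what the specification asks for, here is a sketch for CDATA-typed attributes, which is how a non-validating parser treats attributes without a declared type. `normalize_attr_value` is a hypothetical name rather than roxmltree's API, the input is assumed to be well-formed, and named entity references are left untouched to keep the sketch short.

```rust
/// Sketch of attribute-value normalization for a CDATA-typed attribute.
/// Literal whitespace becomes a space; characters written as numeric
/// references are kept exactly as referenced.
fn normalize_attr_value(raw: &str) -> String {
    let mut out = String::with_capacity(raw.len());
    let mut chars = raw.chars().peekable();
    while let Some(c) = chars.next() {
        match c {
            // A literal CR LF pair counts as one line break, hence one space.
            '\r' => {
                if chars.peek() == Some(&'\n') {
                    chars.next();
                }
                out.push(' ');
            }
            '\n' | '\t' | ' ' => out.push(' '),
            '&' => {
                // Collect the reference body; `take_while` also consumes ';'.
                let body: String = chars.by_ref().take_while(|&ch| ch != ';').collect();
                let decoded = body.strip_prefix('#').and_then(|num| {
                    let code = match num.strip_prefix('x') {
                        Some(hex) => u32::from_str_radix(hex, 16).ok()?,
                        None => num.parse::<u32>().ok()?,
                    };
                    char::from_u32(code)
                });
                match decoded {
                    Some(ch) => out.push(ch),
                    // Not a numeric reference: copy it through unchanged.
                    None => {
                        out.push('&');
                        out.push_str(&body);
                        out.push(';');
                    }
                }
            }
            _ => out.push(c),
        }
    }
    out
}
```

The key distinction is that a literal tab or newline in the attribute value turns into a space, whereas `&#xA;` produces a real newline in the normalized value.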

Namespace resolution

roxmltree has complete support for XML namespaces.