XML parsing is hard. Everyone knows that. A further problem is that a parsed document can be represented in very different ways:
- You can preserve comments or ignore them completely or partially.
- You can represent text data as a separate node or embed it into the element node.
- You can keep CDATA as a separate node or merge it into the text node.
- You can preserve the XML declaration or ignore it completely.
- ... and many more.
This document explains how roxmltree parses and represents an XML document.
The XML declaration is completely ignored, mostly because it doesn't contain any information that is valuable to us.
- version is expected to be 1.*; otherwise, an error will occur.
- encoding is irrelevant since we parse only valid UTF-8 strings.
- And no one really follows the standalone constraints.
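The version rule can be sketched as a small check. This is only an illustration under assumptions: the helper name `is_supported_version` is hypothetical and is not part of roxmltree's API, and the digit check follows the `'1.' [0-9]+` shape that XML 1.0 uses for version numbers.

```rust
/// Hypothetical helper (not roxmltree's code): accepts any `1.*` version.
fn is_supported_version(version: &str) -> bool {
    match version.strip_prefix("1.") {
        // Require at least one digit after `1.`, and nothing but digits.
        Some(minor) => !minor.is_empty() && minor.bytes().all(|b| b.is_ascii_digit()),
        None => false,
    }
}
```

With this sketch, `1.0` and `1.1` are accepted, while `2.0` would be the case in which a parser reports an error.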
Only ENTITY objects will be resolved. Everything else will be ignored at the moment.
<!DOCTYPE test [
<!ENTITY a 'text<p/>text'>
]>
<e>&a;</e>
will be parsed into:
<e>text<p/>text</e>
Here, p is an element, not text.
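One way to picture why p ends up as an element: the entity value is substituted before the markup is interpreted, so any tags inside it are parsed like ordinary tags. The sketch below models this with plain string replacement. It is a simplified illustration, not how roxmltree is implemented; `expand_entities` is a hypothetical name, and nested entities and reference loops are deliberately not handled.

```rust
/// Hypothetical model: replaces every `&name;` reference with its value.
/// Nested entities and reference loops are not handled in this sketch.
fn expand_entities(input: &str, entities: &[(&str, &str)]) -> String {
    let mut out = input.to_string();
    for (name, value) in entities {
        // After this substitution, markup inside `value` is ordinary markup.
        out = out.replace(&format!("&{};", name), value);
    }
    out
}
```

Calling `expand_entities("<e>&a;</e>", &[("a", "text<p/>text")])` yields `<e>text<p/>text</e>`, which is the document that is then seen by the rest of the parser.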
All comments will be preserved.
All processing instructions will be preserved.
All whitespace inside the root element will be preserved.
<p>
    text
</p>
It will be parsed as \n␣␣␣␣text\n.
The same goes for escaped whitespace:
<p>&#x20;&#x20;text&#x20;&#x20;</p>
It will be parsed as ␣␣text␣␣.
CDATA will be merged into the text node:
<p>t<![CDATA[e ]]> x<![CDATA[t]]></p>
It will be parsed as te  xt.
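The merge can be modeled as follows. This is a minimal sketch, assuming the input is the raw content of an element with no child elements; `merge_cdata` is a hypothetical helper, not a roxmltree function, and a real parser would report an error where this sketch returns `None`.

```rust
/// Hypothetical model: merges CDATA sections into the surrounding text.
/// Returns `None` if a CDATA section is not terminated.
fn merge_cdata(content: &str) -> Option<String> {
    const OPEN: &str = "<![CDATA[";
    const CLOSE: &str = "]]>";
    let mut out = String::new();
    let mut rest = content;
    while let Some(start) = rest.find(OPEN) {
        // Plain text before the section is copied as-is.
        out.push_str(&rest[..start]);
        let section = &rest[start + OPEN.len()..];
        let end = section.find(CLOSE)?;
        // CDATA content is copied verbatim, whitespace included.
        out.push_str(&section[..end]);
        rest = &section[end + CLOSE.len()..];
    }
    // Trailing text after the last section.
    out.push_str(rest);
    Some(out)
}
```

For the element content above, `merge_cdata("t<![CDATA[e ]]> x<![CDATA[t]]>")` returns `Some("te  xt")`: the space inside the first section and the space after it are both kept.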
Text will be unescaped. All entity references will be resolved.
<!DOCTYPE test [
<!ENTITY b 'Some text'>
]>
<p>&b;</p>
It will be parsed as Some text.
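Unescaping can be sketched as a single pass over the text that resolves character references, the five predefined XML entities, and entities declared in the DOCTYPE. The function names and the `HashMap`-based entity table are assumptions made for illustration and do not mirror roxmltree's internals. Entity values that contain markup are out of scope here.

```rust
use std::collections::HashMap;

/// Resolves the part between `&` and `;` (hypothetical helper).
fn resolve_reference(name: &str, entities: &HashMap<&str, &str>) -> Option<String> {
    if let Some(hex) = name.strip_prefix("#x") {
        // Hexadecimal character reference, e.g. `&#x20;`.
        let code = u32::from_str_radix(hex, 16).ok()?;
        return char::from_u32(code).map(String::from);
    }
    if let Some(dec) = name.strip_prefix('#') {
        // Decimal character reference, e.g. `&#32;`.
        let code = dec.parse::<u32>().ok()?;
        return char::from_u32(code).map(String::from);
    }
    let value = match name {
        // The five entities predefined by XML.
        "lt" => "<",
        "gt" => ">",
        "amp" => "&",
        "apos" => "'",
        "quot" => "\"",
        // Anything else must have been declared in the DOCTYPE.
        other => entities.get(other).copied()?,
    };
    Some(value.to_string())
}

/// Unescapes text; returns `None` on a malformed or unknown reference.
fn unescape_text(text: &str, entities: &HashMap<&str, &str>) -> Option<String> {
    let mut out = String::new();
    let mut rest = text;
    while let Some(amp) = rest.find('&') {
        out.push_str(&rest[..amp]);
        let after = &rest[amp + 1..];
        let semi = after.find(';')?;
        out.push_str(&resolve_reference(&after[..semi], entities)?);
        rest = &after[semi + 1..];
    }
    out.push_str(rest);
    Some(out)
}
```

With an entity table that maps `b` to `Some text`, `unescape_text("&b;", &entities)` returns `Some("Some text")`, and an undeclared reference such as `&missing;` returns `None`.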
Attribute-Value Normalization works as explained in the spec.
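For reference, the rule in the XML 1.0 spec (section 3.3.3) for CDATA-typed attributes is roughly: literal whitespace characters become a space, while characters written as character references are appended unchanged. The sketch below covers only that part. `normalize_attribute_value` is a hypothetical name, entity references are rejected to keep it short, and the extra trimming that the spec requires for non-CDATA attribute types is not modeled.

```rust
/// Hypothetical sketch of attribute-value normalization for CDATA-typed
/// attributes. Returns `None` for malformed or unsupported references.
fn normalize_attribute_value(raw: &str) -> Option<String> {
    let mut out = String::new();
    let mut rest = raw;
    while let Some(ch) = rest.chars().next() {
        if ch == '&' {
            let semi = rest.find(';')?;
            let body = &rest[1..semi];
            let code = if let Some(hex) = body.strip_prefix("#x") {
                u32::from_str_radix(hex, 16).ok()?
            } else if let Some(dec) = body.strip_prefix('#') {
                dec.parse::<u32>().ok()?
            } else {
                // Entity references are out of scope for this sketch.
                return None;
            };
            // A referenced character is appended as-is, even if it is whitespace.
            out.push(char::from_u32(code)?);
            rest = &rest[semi + 1..];
        } else if ch == '\r' && rest[1..].starts_with('\n') {
            // Line endings are normalized first, so CRLF counts as one break.
            out.push(' ');
            rest = &rest[2..];
        } else {
            // Literal whitespace becomes a space; other characters are kept.
            out.push(if matches!(ch, ' ' | '\t' | '\n' | '\r') { ' ' } else { ch });
            rest = &rest[ch.len_utf8()..];
        }
    }
    Some(out)
}
```

Under these assumptions, a literal tab or newline inside an attribute value turns into a space, whereas `&#xA;` survives as a real newline character.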
roxmltree has complete support for XML namespaces.