This repository has been archived by the owner on Oct 1, 2018. It is now read-only.

83 Further optimize declaration parsing #87

Open
wants to merge 1 commit into
base: master

marschall
Contributor

After SAAJ-81 there is still some optimization potential left in XML
declaration parsing. XMLDeclarationParser seems to require a
PushbackReader with a buffer size of 4096. This seems excessive to
unread a declaration which is 35 characters in the common case. However,
reducing the buffer size causes tests to fail. Upon closer inspection,
the parsing algorithm seems to be:

  1. read the entire first node
  2. check if the node starts with <?xml; if not, unread the entire node
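
The two steps above can be sketched roughly as follows. This is an illustration only, not the actual XMLDeclarationParser code: the class name `WholeNodeSketch`, the method `readFirstNode`, and the simplification that a node ends at the first `>` are all assumptions made for the example. It shows that the pushback buffer has to hold the entire first node.

```java
import java.io.IOException;
import java.io.PushbackReader;
import java.io.StringReader;

final class WholeNodeSketch {

    // Hypothetical helper. Reads the first node up to and including the
    // first '>' (a simplification) and returns it if it is an XML
    // declaration. Otherwise the whole node is pushed back and null is
    // returned.
    static String readFirstNode(PushbackReader in) throws IOException {
        StringBuilder node = new StringBuilder();
        int c;
        while ((c = in.read()) != -1) {
            node.append((char) c);
            if (c == '>') {
                break;
            }
        }
        String text = node.toString();
        if (text.startsWith("<?xml")) {
            return text;
        }
        // Throws IOException when the node is longer than the pushback buffer.
        in.unread(text.toCharArray());
        return null;
    }

    public static void main(String[] args) throws IOException {
        String xml = "<!-- a comment longer than the buffer --><a/>";
        PushbackReader in = new PushbackReader(new StringReader(xml), 8);
        try {
            readFirstNode(in);
        } catch (IOException e) {
            System.out.println("unread failed: " + e.getMessage());
        }
    }
}
```

With a pushback buffer of 8 characters, the comment in `main` cannot be unread and `PushbackReader.unread` throws an `IOException`.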

This has two issues. The first issue is a functional one. The algorithm
works only if the buffer is large enough to unread the entire node. The
tests contain some XML files where the first node is a large comment
node. This seems to be where the number 4096 comes from. However, the
first node could easily be larger, and then parsing will fail
with an exception. The second issue is related to performance. After
reading 12 characters we can already decide if the node is an XML
declaration. If not, we can stop right there. This reduces the required
size of the buffer in the PushbackReader to 12, more than two orders of
magnitude smaller. Based on this, the algorithm we propose is:

  1. read up to 12 characters or the whole node, whichever happens first
  2. check if the node starts with <?xml; if not, unread what you have
    read
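
A minimal sketch of the proposed steps, under the same assumptions as before: `PrefixCheckSketch` and `readDeclarationPrefix` are hypothetical names, a node is taken to end at the first `>`, and a leading BOM character is not handled. It is not the code in this pull request.

```java
import java.io.IOException;
import java.io.PushbackReader;
import java.io.StringReader;

final class PrefixCheckSketch {

    // The limit named in the description above.
    static final int MAX_PREFIX = 12;

    // Hypothetical helper. Reads at most MAX_PREFIX characters, stopping
    // early at '>' or end of input. Returns the characters consumed if they
    // start with "<?xml"; otherwise pushes them back and returns null.
    static String readDeclarationPrefix(PushbackReader in) throws IOException {
        char[] buf = new char[MAX_PREFIX];
        int n = 0;
        while (n < MAX_PREFIX) {
            int c = in.read();
            if (c == -1) {
                break;
            }
            buf[n++] = (char) c;
            if (c == '>') {
                break;
            }
        }
        String prefix = new String(buf, 0, n);
        if (prefix.startsWith("<?xml")) {
            return prefix;
        }
        // At most MAX_PREFIX characters are unread, however long the node is.
        in.unread(buf, 0, n);
        return null;
    }

    public static void main(String[] args) throws IOException {
        String xml = "<!-- a comment much longer than twelve characters --><a/>";
        PushbackReader in = new PushbackReader(new StringReader(xml), MAX_PREFIX);
        System.out.println(readDeclarationPrefix(in) == null);
    }
}
```

Because no more than 12 characters are ever pushed back, a `PushbackReader` created with a buffer size of 12 is sufficient regardless of the size of the first node.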

The number 12 comes from the fact that <?xml in UTF-16 with a BOM is 12
bytes: a 2-byte BOM plus five 2-byte code units.
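
The byte count can be checked with a few lines of Java. The class and method names are made up for the example; it relies on the JDK's `UTF-16` charset, which writes a big-endian BOM when encoding.

```java
import java.nio.charset.StandardCharsets;

final class DeclarationPrefixSize {

    // Encodes the text as UTF-16; the JDK encoder writes a 2-byte
    // big-endian BOM before the text.
    static int utf16SizeWithBom(String text) {
        return text.getBytes(StandardCharsets.UTF_16).length;
    }

    public static void main(String[] args) {
        System.out.println(utf16SizeWithBom("<?xml")); // prints 12
    }
}
```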

Issue: #83
