Further optimize declaration parsing #83

glassfishrobot · 2015-10-18T13:35:43Z

After #81 there is still some optimization potential left in XML declaration parsing. XMLDeclarationParser seems to require a PushbackReader with a buffer size of 4096. This seems excessive for unreading a declaration that is 35 characters long in the common case. However, reducing the buffer size causes tests to fail. Upon closer inspection, the parsing algorithm seems to be:

1. read the entire first node
2. check if the node starts with <?xml; if not, unread the entire node
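The dependence of these two steps on the buffer size can be reproduced with a minimal sketch. This is not the actual XMLDeclarationParser code: the class and method names are hypothetical, and "the entire first node" is simplified to "everything up to the first `>`". The only library behavior relied on is that `PushbackReader.unread` throws an `IOException` when more characters are pushed back than the buffer can hold.

```java
import java.io.IOException;
import java.io.PushbackReader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical illustration of "read the whole node, then unread it".
public class ReadWholeNodeSketch {

    // Reads up to and including the first '>' (a simplification of "the entire first node").
    static String readFirstNode(PushbackReader in) throws IOException {
        StringBuilder node = new StringBuilder();
        int c;
        while ((c = in.read()) != -1) {
            node.append((char) c);
            if (c == '>') {
                break;
            }
        }
        return node.toString();
    }

    // Returns false when the first node is not a declaration and does not fit
    // into the pushback buffer, so the reader cannot be rewound.
    static boolean canRewind(String xml, int bufferSize) {
        try (PushbackReader in = new PushbackReader(new StringReader(xml), bufferSize)) {
            String node = readFirstNode(in);
            if (node.startsWith("<?xml")) {
                return true; // declaration consumed, nothing to unread
            }
            try {
                in.unread(node.toCharArray());
                return true;
            } catch (IOException overflow) {
                return false; // pushback buffer smaller than the node
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With a leading comment longer than the buffer, `canRewind` returns `false`. With a buffer of 4096 the same input succeeds, but only as long as the first node stays below that size.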

This has two issues. The first issue is a functional one. The algorithm works only if the buffer is large enough to unread the entire node. The tests contain some XML files where the first node is a large comment node. This seems to be where the number 4096 comes from. However, the first node could easily be larger, and then parsing fails with an exception. The second issue is related to performance. After reading 12 characters we can already decide whether the node is an XML declaration. If it is not, we can stop right there. This reduces the required size of the buffer in the PushbackReader to 12, more than two orders of magnitude smaller. Based on this, the algorithm we propose is:

1. read up to 12 characters or the whole node, whichever happens first
2. check if the node starts with <?xml; if not, unread what you have read
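The proposed steps could look roughly like the sketch below. It is an illustration under assumptions, not a patch for XMLDeclarationParser: the class name, `PEEK_LIMIT`, `readDeclaration`, and `split` are all hypothetical, and the declaration is assumed to end at the first `>`.

```java
import java.io.IOException;
import java.io.PushbackReader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical illustration of the bounded look-ahead proposed in this issue.
public class BoundedPeekSketch {

    static final int PEEK_LIMIT = 12; // the limit suggested in this issue

    // Returns the declaration text, or null after restoring everything that was read.
    static String readDeclaration(PushbackReader in) throws IOException {
        char[] head = new char[PEEK_LIMIT];
        int n = 0;
        int c;
        // Step 1: read up to PEEK_LIMIT characters or the whole node, whichever comes first.
        while (n < PEEK_LIMIT && (c = in.read()) != -1) {
            head[n++] = (char) c;
            if (c == '>') {
                break;
            }
        }
        // Step 2: if this is not a declaration, unread only what was read.
        String prefix = new String(head, 0, n);
        if (!prefix.startsWith("<?xml")) {
            in.unread(head, 0, n);
            return null;
        }
        // It is a declaration: consume the rest of it (assumed to end at the first '>').
        StringBuilder decl = new StringBuilder(prefix);
        while (decl.charAt(decl.length() - 1) != '>' && (c = in.read()) != -1) {
            decl.append((char) c);
        }
        return decl.toString();
    }

    // Convenience wrapper for experiments: returns { declaration or null, remaining input }.
    static String[] split(String xml) {
        try (PushbackReader in = new PushbackReader(new StringReader(xml), PEEK_LIMIT)) {
            String decl = readDeclaration(in);
            StringBuilder rest = new StringBuilder();
            int c;
            while ((c = in.read()) != -1) {
                rest.append((char) c);
            }
            return new String[] { decl, rest.toString() };
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Because at most `PEEK_LIMIT` characters are ever pushed back, the size of a leading comment no longer matters. Note that a plain `startsWith("<?xml")` check would also accept a processing instruction such as `<?xml-stylesheet`, so a real implementation would probably need to inspect the character that follows.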

The number 12 comes from the fact that <?xml in UTF-16 with a BOM is 12 bytes.
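The byte count can be checked with a few lines of Java. The class and method names below are hypothetical. The sketch relies on Java's `UTF-16` charset writing a two-byte byte-order mark when encoding. Keep in mind that this figure is in bytes, whereas the buffer of a PushbackReader is sized in chars.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for checking the size of the "<?xml" prefix.
public class DeclarationPrefixSize {

    // Number of bytes needed for s in UTF-16, including the 2-byte BOM that
    // Java's UTF-16 encoder prepends.
    static int utf16WithBomLength(String s) {
        return s.getBytes(StandardCharsets.UTF_16).length;
    }

    public static void main(String[] args) {
        // 2 bytes of BOM + 5 characters * 2 bytes each
        System.out.println(utf16WithBomLength("<?xml")); // prints 12
    }
}
```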

glassfishrobot · 2015-10-18T13:35:43Z

Reported by braghest

glassfishrobot · 2015-10-18T13:35:43Z

Was assigned to gagordon

glassfishrobot · 2017-04-24T13:03:06Z

This issue was imported from java.net JIRA SAAJ-83


glassfishrobot added Type: Bug Priority: Major Component: code ERR: Assignee labels Apr 24, 2017

glassfishrobot self-assigned this Apr 24, 2017

marschall mentioned this issue May 15, 2017

83 Further optimize declaration parsing #87

Open

Tomas-Kraus mentioned this issue Sep 24, 2018

Further optimize declaration parsing eclipse-ee4j/metro-saaj#83

Open
