New XML parser #9573

yurydelendik · 2018-03-16T21:55:20Z

Improves the performance of the RegExp-based SimpleXMLParser.

The code for XMLParserBase was copied from
https://github.com/mozilla/shumway/blob/16451d8836fa85f4b16eeda8b4bda2fa9e2b22b0/src/avm2/natives/xml.ts

Snuffleupagus · 2018-03-16T22:17:57Z

src/display/xml_parser.js

+        case 'amp':
+          return '&';
+        case 'quot':
+          return '\"';


The old code also handled apos, see https://github.com/mozilla/pdf.js/pull/9573/files#diff-3cb21ac6c8ce15118704e14df3dcc9b2L256; is that no longer necessary here?

It's an HTML thing, but we can add it there too.
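
For reference, XML 1.0 predefines five named entities: lt, gt, amp, quot and apos. The following is a minimal sketch of entity resolution that covers all five plus numeric character references. The function name `resolveEntities` and the regular expression are assumptions made for illustration; this is not the code from the pull request.

```javascript
// Hypothetical helper, written for illustration only.
function resolveEntities(s) {
  // Match decimal references (&#65;), hex references (&#x41;) and named entities.
  return s.replace(/&(#x[0-9a-fA-F]+|#[0-9]+|[A-Za-z]+);/g, (all, entity) => {
    if (entity.startsWith('#x')) {
      // Note: String.fromCodePoint throws a RangeError above 0x10FFFF.
      return String.fromCodePoint(parseInt(entity.substring(2), 16));
    }
    if (entity.startsWith('#')) {
      return String.fromCodePoint(parseInt(entity.substring(1), 10));
    }
    switch (entity) {
      case 'lt':
        return '<';
      case 'gt':
        return '>';
      case 'amp':
        return '&';
      case 'quot':
        return '"';
      case 'apos':
        return "'";
    }
    // Unknown named entities are left untouched.
    return all;
  });
}
```

Leaving unknown entities as-is is one possible design choice; a stricter parser could report an error instead.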

timvandermeij · 2018-03-19T21:43:11Z

/botio test

pdfjsbot · 2018-03-19T21:43:12Z

From: Bot.io (Linux m4)

Received

Command cmd_test from @timvandermeij received. Current queue size: 0

Live output at: http://54.67.70.0:8877/4a8eb8977cfd151/output.txt

pdfjsbot · 2018-03-19T21:43:12Z

From: Bot.io (Windows)

Received

Command cmd_test from @timvandermeij received. Current queue size: 0

Live output at: http://54.215.176.217:8877/0638ec9c7e0b33e/output.txt

pdfjsbot · 2018-03-19T22:01:22Z

From: Bot.io (Linux m4)

Success

Full output at http://54.67.70.0:8877/4a8eb8977cfd151/output.txt

Total script time: 18.17 mins

Font tests: Passed
Unit tests: Passed
Regression tests: Passed

pdfjsbot · 2018-03-19T22:07:23Z

From: Bot.io (Windows)

Success

Full output at http://54.215.176.217:8877/0638ec9c7e0b33e/output.txt

Total script time: 24.16 mins

Font tests: Passed
Unit tests: Passed
Regression tests: Passed

timvandermeij · 2018-03-19T22:12:19Z

/botio-linux preview

pdfjsbot · 2018-03-19T22:12:20Z

From: Bot.io (Linux m4)

Received

Command cmd_preview from @timvandermeij received. Current queue size: 0

Live output at: http://54.67.70.0:8877/80445c7126554ab/output.txt

pdfjsbot · 2018-03-19T22:15:02Z

From: Bot.io (Linux m4)

Success

Full output at http://54.67.70.0:8877/80445c7126554ab/output.txt

Total script time: 2.70 mins

Published

timvandermeij

In general this looks good to me (since it already worked in Shumway, I glanced over the details of the XML parsing), but there are some final comments I'd like to see addressed before approving this. Thanks!

timvandermeij · 2018-03-19T22:19:05Z

src/display/xml_parser.js

+
+function isWhitespaceString(s) {
+  for (let i = 0; i < s.length; i++) {
+    const ch = s[i];


Let's just call isWhitespace(s[i]) here so we don't have to copy the logic. Also, let's cache s.length using for (let i = 0, ii = s.length; i < ii; i++) like we do elsewhere in the codebase.
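
A rough sketch of what this suggestion could look like. The signature of `isWhitespace` (taking a single character) and the exact set of whitespace characters are assumptions here, since the full helper is not shown in this thread.

```javascript
// Assumed helper: treats space, newline, carriage return and tab as whitespace.
function isWhitespace(ch) {
  return ch === ' ' || ch === '\n' || ch === '\r' || ch === '\t';
}

// Returns true when every character of `s` is whitespace.
// The length is cached in `ii`, as requested in the review comment.
function isWhitespaceString(s) {
  for (let i = 0, ii = s.length; i < ii; i++) {
    if (!isWhitespace(s[i])) {
      return false;
    }
  }
  return true;
}
```

With this definition an empty string counts as whitespace-only, because the loop body never runs.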

timvandermeij · 2018-03-19T22:19:17Z

src/display/metadata.js

@@ -23,13 +23,15 @@ class Metadata {
    // Ghostscript may produce invalid metadata, so try to repair that first.
    data = this._repair(data);

-    // Convert the string to a DOM `Document`.
+    // Convert the string to a XML document.


Nit: an XML document

timvandermeij · 2018-03-19T22:21:11Z

src/display/xml_parser.js

+    name = s.substring(start, pos);
+    skipWs();
+    while (pos < s.length && s[pos] !== '>' &&
+    s[pos] !== '/' && s[pos] !== '?') {


This should be aligned with pos for readability.

timvandermeij · 2018-03-19T22:22:54Z

src/display/xml_parser.js

+              const doctypeContent =
+                s.substring(j + 8, q + (complexDoctype ? 1 : 0));
+              this.onDoctype(doctypeContent);
+              // XXX pull entities ?


What does this mean? Can we remove it?

timvandermeij · 2018-03-19T22:24:01Z

src/display/xml_parser.js

+      } else {
+        do {
+          /* skipping text items */
+        } while (j++ < s.length && s[j] !== '<');


Do we need a do .. while loop here instead of a regular while loop as used above?
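
To make the question concrete, here is a small sketch comparing the two loop forms in isolation. The function names are hypothetical and the loops are lifted out of the parser, so the surrounding state is an assumption. Both forms skip the character at the starting index (which is already known not to be '<' in this branch) and stop at the next '<'.

```javascript
// The do .. while form from the reviewed snippet.
function skipTextDoWhile(s, j) {
  do {
    /* skipping text items */
  } while (j++ < s.length && s[j] !== '<');
  return j;
}

// A regular while loop with an explicit first step.
function skipTextWhile(s, j) {
  j++; // step past the current character
  while (j < s.length && s[j] !== '<') {
    j++;
  }
  return j;
}
```

When a '<' follows, both functions return its index. When none follows, the two end on different indices, but both are at or past the end of the string, so a caller that only checks for the end of input would treat them the same.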

timvandermeij · 2018-03-19T22:27:08Z

src/display/xml_parser.js

+  }
+}
+
+export class SimpleXMLParser extends XMLParserBase {


Usually we put the exports at the bottom of the file. Can we do that here too for consistency (compare https://github.com/mozilla/pdf.js/pull/9573/files#diff-3cb21ac6c8ce15118704e14df3dcc9b2L405)?

timvandermeij · 2018-03-19T22:29:14Z

src/display/xml_parser.js

+    super();
+    this._currentFragment = null;
+    this._stack = null;
+    this._errorCode = XMLParserErrorCode.NoError;


parseFromString sets this too, so should we just set it to null here, like the other two members?

I just don't want to change the shape of the object, so I'm keeping _errorCode a number. I could add an XMLParserErrorCode.Unknown value?

I see. In that case, let's keep it like how it is now for simplicity. Thanks!
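
The point about keeping `_errorCode` a number can be sketched as follows. The class name `ParserState` and the enumeration values are hypothetical; only the field names and `XMLParserErrorCode.NoError` come from the snippet above. The idea, as an assumption about how JavaScript engines tend to behave, is that a property which always holds the same type is easier to optimize than one that switches between null and a number.

```javascript
// Illustrative values only; the real enumeration may differ.
const XMLParserErrorCode = {
  NoError: 0,
  EndOfDocument: -1,
};

// Hypothetical stand-in for the parser class under review.
class ParserState {
  constructor() {
    this._currentFragment = null;
    this._stack = null;
    // Initialized to a number so the field never changes type later.
    this._errorCode = XMLParserErrorCode.NoError;
  }

  parseFromString(data) {
    this._currentFragment = [];
    this._stack = [];
    this._errorCode = XMLParserErrorCode.NoError;
    // ... actual parsing of `data` would happen here ...
  }
}
```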

timvandermeij · 2018-03-20T22:49:49Z

/botio test

pdfjsbot · 2018-03-20T22:49:50Z

From: Bot.io (Linux m4)

Received

Command cmd_test from @timvandermeij received. Current queue size: 0

Live output at: http://54.67.70.0:8877/bd3c487b19dd6fb/output.txt

pdfjsbot · 2018-03-20T22:49:50Z

From: Bot.io (Windows)

Received

Command cmd_test from @timvandermeij received. Current queue size: 1

Live output at: http://54.215.176.217:8877/81d163b152949eb/output.txt

pdfjsbot · 2018-03-20T23:08:03Z

From: Bot.io (Linux m4)

Success

Full output at http://54.67.70.0:8877/bd3c487b19dd6fb/output.txt

Total script time: 18.20 mins

Font tests: Passed
Unit tests: Passed
Regression tests: Passed

pdfjsbot · 2018-03-20T23:15:17Z

From: Bot.io (Windows)

Success

Full output at http://54.215.176.217:8877/81d163b152949eb/output.txt

Total script time: 24.27 mins

Font tests: Passed
Unit tests: Passed
Regression tests: Passed

Rob--W · 2018-03-22T16:10:57Z

Why is a custom XML parser used instead of the standard DOMParser API ( https://developer.mozilla.org/en-US/docs/Web/API/DOMParser )? All web browsers, even IE 9+ support this API.

Snuffleupagus · 2018-03-24T12:01:37Z

Why is a custom XML parser used instead of the standard DOMParser API ( https://developer.mozilla.org/en-US/docs/Web/API/DOMParser )? All web browsers, even IE 9+ support this API.

Please see issue #8903; and also https://bugzilla.mozilla.org/show_bug.cgi?id=1386676.

New XML parser

yurydelendik force-pushed the xml_parser branch 2 times, most recently from 9fccb49 to ec77328 Compare March 16, 2018 22:08

timvandermeij added the core label Mar 16, 2018

Snuffleupagus reviewed Mar 16, 2018


yurydelendik force-pushed the xml_parser branch from ec77328 to 06452bf Compare March 16, 2018 22:26

timvandermeij reviewed Mar 19, 2018


New XML parser

655c8d3

yurydelendik force-pushed the xml_parser branch from 06452bf to 655c8d3 Compare March 20, 2018 01:52

brendandahl merged commit 24f766b into mozilla:master Mar 22, 2018

movsb pushed a commit to movsb/pdf.js that referenced this pull request Jul 14, 2018

Merge pull request mozilla#9573 from yurydelendik/xml_parser

0325286

New XML parser

Snuffleupagus mentioned this pull request Jul 18, 2018

Prevent Metadata/XML parsing from breaking PDFDocumentProxy.getMetadata when no XML root document is found (issue 8884) #9900

Merged
