[api-minor] Replace `DOMParser` with `SimpleXMLParser` #8912

timvandermeij · 2017-09-15T20:35:45Z

The `DOMParser` is most likely overkill and may be less secure. Moreover, it is not supported in Node.js environments.

This patch replaces the `DOMParser` with a simple XML parser. This should be faster and gives us Node.js support for free. The simple XML parser is a port of the one that existed in the examples folder, with improved regexes to make the parsing work properly.

The unit tests are extended for increased test coverage of the metadata code. The new method `getAll` is provided so the example no longer has to access internal properties of the object.

Fixes #8903.

Snuffleupagus

Looks like a nice simplification!
One question though: Would it maybe make more sense to place the SimpleXMLParser (and related code) in a utility file instead (such as /src/display/dom_utils.js)?

Snuffleupagus · 2017-09-15T20:47:50Z

src/display/metadata.js

+    do {
+      lastLength = nodes.length;
+      data = data.replace(
+        /<([\w\:]+)((?:[\s\w:=]|'[^']*'|"[^"]*")*)(?:\/>|>([\d,]*)<\/[^>]+>)/g,


Rather than creating a regular expression over and over in a loop, would it make sense to just define it once instead?

Done in the new commit, together with the code move to the utility file. Good points, thank you!
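
For illustration, a minimal sketch of hoisting a regular expression out of a loop. The pattern `TAG_PAIR` and the helper `collapseInnermost` are hypothetical and much simpler than the regex in the patch; the point is only that the literal is created once and reused on every iteration.

```javascript
// Hypothetical pattern (not the one from the patch): an element whose
// content consists of digits only.
const TAG_PAIR = /<(\w+)>(\d*)<\/\1>/g;

// Hypothetical helper: repeatedly unwrap the innermost elements.
function collapseInnermost(data) {
  let previous;
  do {
    previous = data;
    // `String.prototype.replace` starts a global regex from index 0 on
    // every call, so reusing the same RegExp object here is safe.
    data = data.replace(TAG_PAIR, '$2');
  } while (data !== previous);
  return data;
}

console.log(collapseInnermost('<a><b>1</b></a>')); // prints "1"
```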

Snuffleupagus · 2017-09-15T21:11:23Z

src/display/metadata.js

+
+      // Convert the string to a DOM `Document`.
+      let parser = new SimpleXMLParser();
+      data = parser.parseFromString(data, 'application/xml');


Nit: It appears that the parseFromString method (of the SimpleXMLParser) only takes one argument, so presumably we no longer need to pass in 'application/xml' unless I'm missing something here!?

D'oh! You're right, since it's now specialized to an XML parser.

timvandermeij · 2017-09-15T21:29:16Z

/botio-linux preview

pdfjsbot · 2017-09-15T21:29:17Z

From: Bot.io (Linux m4)

Received

Command cmd_preview from @timvandermeij received. Current queue size: 0

Live output at: http://54.67.70.0:8877/65ed2537014a760/output.txt

pdfjsbot · 2017-09-15T21:31:37Z

From: Bot.io (Linux m4)

Success

Full output at http://54.67.70.0:8877/65ed2537014a760/output.txt

Total script time: 2.32 mins

Published

timvandermeij · 2017-09-15T21:39:27Z

/botio test

pdfjsbot · 2017-09-15T21:39:28Z

From: Bot.io (Windows)

Received

Command cmd_test from @timvandermeij received. Current queue size: 0

Live output at: http://54.215.176.217:8877/f66b21324b55e0b/output.txt

pdfjsbot · 2017-09-15T21:39:28Z

From: Bot.io (Linux m4)

Received

Command cmd_test from @timvandermeij received. Current queue size: 0

Live output at: http://54.67.70.0:8877/cd1e2a316abe5db/output.txt

pdfjsbot · 2017-09-15T21:56:14Z

From: Bot.io (Linux m4)

Success

Full output at http://54.67.70.0:8877/cd1e2a316abe5db/output.txt

Total script time: 16.77 mins

Font tests: Passed
Unit tests: Passed
Regression tests: Passed

pdfjsbot · 2017-09-15T22:09:21Z

From: Bot.io (Windows)

Success

Full output at http://54.215.176.217:8877/f66b21324b55e0b/output.txt

Total script time: 29.87 mins

Font tests: Passed
Unit tests: Passed
Regression tests: Passed

yurydelendik · 2017-09-18T23:29:20Z

src/display/dom_utils.js

+  parseFromString(data) {
+    let nodes = [];
+
+    data = data.replace(/<\?[\s\S]*?\?>|<!--[\s\S]*?-->/g, '').trim();


Please add a comment before this line: `// Remove all comments and processing instructions.`
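
As a quick check of what this line does, here is a small sketch that applies the same regex to a sample string. The function name `stripCommentsAndInstructions` and the sample input are made up for illustration.

```javascript
// Hypothetical wrapper around the regex quoted in the diff above.
function stripCommentsAndInstructions(xml) {
  // Remove all comments and processing instructions.
  return xml.replace(/<\?[\s\S]*?\?>|<!--[\s\S]*?-->/g, '').trim();
}

const sample = '<?xml version="1.0"?>\n<!-- note -->\n<a>text</a>';
console.log(stripCommentsAndInstructions(sample)); // prints "<a>text</a>"
```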

yurydelendik · 2017-09-18T23:29:25Z

src/display/metadata.js

-    return typeof this.metadata[name] !== 'undefined';
-  },
-};
+  getAll() {


Keep `metadata`; just have the getter call `deprecated()` and `getAll()`.
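
A rough sketch of what such a backwards-compatible getter could look like. The `deprecated` function below is a local stand-in that records a warning, and the `_entries` field name is invented for this example; the project's own helper and internal storage may look different.

```javascript
// Local stand-in for a deprecation helper (assumption, not the real one).
const warnings = [];
function deprecated(details) {
  warnings.push('Deprecated API usage: ' + details);
}

class Metadata {
  constructor(entries) {
    // Hypothetical internal storage, named `_entries` for illustration.
    this._entries = Object.assign(Object.create(null), entries);
  }

  get(name) {
    return this._entries[name] || null;
  }

  has(name) {
    return typeof this._entries[name] !== 'undefined';
  }

  getAll() {
    return this._entries;
  }

  // Old property access keeps working: warn, then delegate to `getAll`.
  get metadata() {
    deprecated('`metadata`, please use `getAll` instead.');
    return this.getAll();
  }
}

const meta = new Metadata({ 'dc:title': 'Example' });
console.log(meta.metadata['dc:title']); // prints "Example"
```

With this shape, existing callers that read `metadata` directly keep working while being nudged towards `getAll`.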

yurydelendik · 2017-09-18T23:30:01Z

src/display/dom_utils.js

+    let nodes = [];
+
+    data = data.replace(/<\?[\s\S]*?\?>|<!--[\s\S]*?-->/g, '').trim();
+    data = data.replace(/>([^<][\s\S]*?)</g, (all, text) => {


Please add a comment before this line: `// Extract all text nodes and replace them with a numeric index into the nodes.`

yurydelendik · 2017-09-18T23:30:32Z

src/display/dom_utils.js

+      }
+      return '>' + length + ',<';
+    });
+    data = data.replace(/<!\[CDATA\[([\s\S]*?)\]\]>/g,


Please add a comment before this line: `// Extract all CDATA nodes.`

yurydelendik · 2017-09-18T23:32:54Z

src/display/dom_utils.js

+      return length + ',';
+    });
+
+    let regex =


Please add a comment before this line: `// As long as there are nodes whose content contains no '<' or '>', replace them with a numeric index into the nodes.`

yurydelendik · 2017-09-18T23:34:17Z

src/display/dom_utils.js

+        return length + ',';
+      });
+    } while (lastLength < nodes.length);
+


Please add a comment here: `// We should have only one root index left, which will be the last one in the nodes.`
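
The comments requested in this review describe an index-placeholder technique. Below is a heavily simplified sketch of that idea, assuming elements without attributes, no self-closing tags, no CDATA, and no entity decoding. The function name `parseSimpleXML` and the node shape `{ name, value, children }` are invented here and are not the patch's `SimpleXMLParser`.

```javascript
// Simplified illustration of the placeholder technique (assumptions above).
function parseSimpleXML(xml) {
  const nodes = [];

  // Remove all comments and processing instructions.
  let data = xml.replace(/<\?[\s\S]*?\?>|<!--[\s\S]*?-->/g, '').trim();

  // Extract all text nodes and replace them with a numeric index.
  data = data.replace(/>([^<]+)</g, (all, text) => {
    if (text.trim().length === 0) {
      return '><'; // Ignore whitespace between elements.
    }
    nodes.push({ name: '#text', value: text, children: [] });
    return '>' + (nodes.length - 1) + ',<';
  });

  // Defined once, outside the loop: an element that contains only indices.
  const element = /<([\w:]+)>([\d,]*)<\/\1>/g;
  let lastLength;
  do {
    lastLength = nodes.length;
    // Replace each such element with the index of a newly created node.
    data = data.replace(element, (all, name, indices) => {
      const children = indices
        .split(',')
        .filter(Boolean)
        .map((index) => nodes[+index]);
      nodes.push({ name, value: null, children });
      return (nodes.length - 1) + ',';
    });
  } while (lastLength < nodes.length);

  // Only the root index should be left; the root is the last node created.
  return nodes.pop();
}

const root = parseSimpleXML(
  '<?xml version="1.0"?><root><title>Hello</title><!-- c -->' +
  '<creator>Tim</creator></root>');
console.log(root.name, root.children.map((child) => child.name));
```

Each pass of the loop collapses the innermost elements, so the loop runs roughly once per nesting level and stops as soon as a pass creates no new node.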

yurydelendik · 2017-09-18T23:37:28Z

src/display/metadata.js

+      let parser = new SimpleXMLParser();
+      data = parser.parseFromString(data);
+    } else if (!(data instanceof Document)) {
+      throw new Error('Metadata: input is not a string or `Document`');


Do we need the `instanceof Document` check? If not, let's replace `if (typeof data === 'string')` with an assert.
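
A sketch of the assert-based variant suggested here, assuming the `Document` branch can indeed be dropped. `assertOrThrow` is a local stand-in for the project's assertion helper, and `normalizeMetadataInput` is a hypothetical function used only to show the shape of the check.

```javascript
// Local stand-in for an assertion helper (assumption, not the real one).
function assertOrThrow(condition, message) {
  if (!condition) {
    throw new Error(message);
  }
}

// Hypothetical entry point that accepts strings only.
function normalizeMetadataInput(data) {
  assertOrThrow(typeof data === 'string', 'Metadata: input is not a string');
  return data.trim();
}

console.log(normalizeMetadataInput('  <a>text</a>  ')); // prints "<a>text</a>"
```

The branch on the input type disappears, and any non-string input fails immediately with a clear message.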

The `DOMParser` is most likely overkill and may be less secure. Moreover, it is not supported in Node.js environments. This patch replaces the `DOMParser` with a simple XML parser. This should be faster and gives us Node.js support for free. The simple XML parser is a port of the one that existed in the examples folder with a small regex fix to make the parsing work correctly. The unit tests are extended for increased test coverage of the metadata code. The new method `getAll` is provided so the example does not have to access internal properties of the object anymore.

yurydelendik · 2017-09-20T21:05:20Z

/botio-linux preview

pdfjsbot · 2017-09-20T21:05:21Z

From: Bot.io (Linux m4)

Received

Command cmd_preview from @yurydelendik received. Current queue size: 0

Live output at: http://54.67.70.0:8877/20f73f6b7062bd2/output.txt

pdfjsbot · 2017-09-20T21:07:39Z

From: Bot.io (Linux m4)

Success

Full output at http://54.67.70.0:8877/20f73f6b7062bd2/output.txt

Total script time: 2.29 mins

Published

yurydelendik

Thank you for the patch.

timvandermeij · 2017-09-20T21:12:06Z

/botio test

pdfjsbot · 2017-09-20T21:12:07Z

From: Bot.io (Linux m4)

Received

Command cmd_test from @timvandermeij received. Current queue size: 0

Live output at: http://54.67.70.0:8877/2a20f49870c32f1/output.txt

pdfjsbot · 2017-09-20T21:12:07Z

From: Bot.io (Windows)

Received

Command cmd_test from @timvandermeij received. Current queue size: 1

Live output at: http://54.215.176.217:8877/5551a66cea49d99/output.txt

pdfjsbot · 2017-09-20T21:28:45Z

From: Bot.io (Linux m4)

Success

Full output at http://54.67.70.0:8877/2a20f49870c32f1/output.txt

Total script time: 16.62 mins

Font tests: Passed
Unit tests: Passed
Regression tests: Passed

pdfjsbot · 2017-09-20T21:42:58Z

From: Bot.io (Windows)

Success

Full output at http://54.215.176.217:8877/5551a66cea49d99/output.txt

Total script time: 29.62 mins

Font tests: Passed
Unit tests: Passed
Regression tests: Passed

timvandermeij added the core label Sep 15, 2017

timvandermeij requested a review from yurydelendik September 15, 2017 20:35

mozilla deleted a comment from pdfjsbot Sep 15, 2017

Snuffleupagus reviewed Sep 15, 2017

timvandermeij force-pushed the xml-parser branch from 9d19e81 to 942b962 Compare September 15, 2017 21:00

Snuffleupagus reviewed Sep 15, 2017

timvandermeij force-pushed the xml-parser branch from 942b962 to 178b2d5 Compare September 15, 2017 21:16

yurydelendik reviewed Sep 18, 2017

yurydelendik changed the title ~~Replace DOMParser with SimpleXMLParser~~ [api-minor] Replace DOMParser with SimpleXMLParser Sep 18, 2017

Convert src/display/metadata.js to ES6 syntax

bc9afdf

timvandermeij force-pushed the xml-parser branch from 178b2d5 to 71f87d4 Compare September 19, 2017 20:20

timvandermeij added 2 commits September 19, 2017 23:09

Enable metadata unit tests for Travis CI and Node.js

2281061

timvandermeij force-pushed the xml-parser branch from 71f87d4 to 2281061 Compare September 19, 2017 21:14

yurydelendik approved these changes Sep 20, 2017

timvandermeij merged commit d7b37ae into mozilla:master Sep 20, 2017

timvandermeij deleted the xml-parser branch September 20, 2017 21:45

yurydelendik mentioned this pull request Oct 2, 2017

Version 1.10 #8986

Merged

Snuffleupagus mentioned this pull request Dec 13, 2017

Handle broken, Ghostscript generated, Metadata that contains HTML character names (bug 1424938) #9271

Merged

movsb pushed a commit to movsb/pdf.js that referenced this pull request Jul 14, 2018

Merge pull request mozilla#8912 from timvandermeij/xml-parser

e09f16e

[api-minor] Replace `DOMParser` with `SimpleXMLParser`
