Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[api-minor] Replace DOMParser with SimpleXMLParser #8912

Merged
merged 3 commits into from
Sep 20, 2017

Conversation

timvandermeij
Copy link
Contributor

@timvandermeij timvandermeij commented Sep 15, 2017

The DOMParser is most likely overkill and may be less secure. Moreover, it is not supported in Node.js environments.

This patch replaces the DOMParser with a simple XML parser. This should be faster and gives us Node.js support for free. The simple XML parser is a port of the one that existed in the examples folder with improved regexes to make the parsing work properly.

The unit tests are extended for increased test coverage of the metadata code. The new method getAll is provided so the example does not have to access internal properties of the object anymore.

Fixes #8903.

@mozilla mozilla deleted a comment from pdfjsbot Sep 15, 2017
@mozilla mozilla deleted a comment from pdfjsbot Sep 15, 2017
@mozilla mozilla deleted a comment from pdfjsbot Sep 15, 2017
@mozilla mozilla deleted a comment from pdfjsbot Sep 15, 2017
Copy link
Collaborator

@Snuffleupagus Snuffleupagus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like a nice simplification!
One question though: Would it maybe make more sense to place the SimpleXMLParser (and related code) in a utility file instead (such as /src/display/dom_utils.js)?

do {
lastLength = nodes.length;
data = data.replace(
/<([\w\:]+)((?:[\s\w:=]|'[^']*'|"[^"]*")*)(?:\/>|>([\d,]*)<\/[^>]+>)/g,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than creating a regular expression over and over in a loop, would it make sense to just define it once instead?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in the new commit, together with the code move to the utility file. Good points, thank you!


// Convert the string to a DOM `Document`.
let parser = new SimpleXMLParser();
data = parser.parseFromString(data, 'application/xml');
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: It appears that the parseFromString method (of the SimpleXMLParser) only takes one argument, so presumably we no longer need to pass in 'application/xml' unless I'm missing something here!?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

D'oh! You're right, since it's now specialized to an XML parser.

@timvandermeij
Copy link
Contributor Author

/botio-linux preview

@pdfjsbot
Copy link

From: Bot.io (Linux m4)


Received

Command cmd_preview from @timvandermeij received. Current queue size: 0

Live output at: http://54.67.70.0:8877/65ed2537014a760/output.txt

@pdfjsbot
Copy link

From: Bot.io (Linux m4)


Success

Full output at http://54.67.70.0:8877/65ed2537014a760/output.txt

Total script time: 2.32 mins

Published

@timvandermeij
Copy link
Contributor Author

/botio test

@pdfjsbot
Copy link

From: Bot.io (Windows)


Received

Command cmd_test from @timvandermeij received. Current queue size: 0

Live output at: http://54.215.176.217:8877/f66b21324b55e0b/output.txt

@pdfjsbot
Copy link

From: Bot.io (Linux m4)


Received

Command cmd_test from @timvandermeij received. Current queue size: 0

Live output at: http://54.67.70.0:8877/cd1e2a316abe5db/output.txt

@pdfjsbot
Copy link

From: Bot.io (Linux m4)


Success

Full output at http://54.67.70.0:8877/cd1e2a316abe5db/output.txt

Total script time: 16.77 mins

  • Font tests: Passed
  • Unit tests: Passed
  • Regression tests: Passed

@pdfjsbot
Copy link

From: Bot.io (Windows)


Success

Full output at http://54.215.176.217:8877/f66b21324b55e0b/output.txt

Total script time: 29.87 mins

  • Font tests: Passed
  • Unit tests: Passed
  • Regression tests: Passed

parseFromString(data) {
let nodes = [];

data = data.replace(/<\?[\s\S]*?\?>|<!--[\s\S]*?-->/g, '').trim();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

comment before: // Remove all comments and processing instructions

return typeof this.metadata[name] !== 'undefined';
},
};
getAll() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

keep 'metadata', just have deprecated() and getAll() calls inside the getter.

let nodes = [];

data = data.replace(/<\?[\s\S]*?\?>|<!--[\s\S]*?-->/g, '').trim();
data = data.replace(/>([^<][\s\S]*?)</g, (all, text) => {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

comment before: // Extract all text nodes and replace them with numeric index in the nodes.

}
return '>' + length + ',<';
});
data = data.replace(/<!\[CDATA\[([\s\S]*?)\]\]>/g,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

// Extract all CDATA nodes

return length + ',';
});

let regex =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

comment before: // Until such nodes without '<' and '>' content are present, replace these with numeric index in the nodes.

return length + ',';
});
} while (lastLength < nodes.length);

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

// We shall have only one root index left, which will be last in the nodes.

let parser = new SimpleXMLParser();
data = parser.parseFromString(data);
} else if (!(data instanceof Document)) {
throw new Error('Metadata: input is not a string or `Document`');
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need instanceof Document check? if not let's replace if (typeof data === 'string' with assert.

@yurydelendik yurydelendik changed the title Replace DOMParser with SimpleXMLParser [api-minor] Replace DOMParser with SimpleXMLParser Sep 18, 2017
The `DOMParser` is most likely overkill and may be less secure.
Moreover, it is not supported in Node.js environments.

This patch replaces the `DOMParser` with a simple XML parser. This
should be faster and gives us Node.js support for free. The simple XML
parser is a port of the one that existed in the examples folder with a
small regex fix to make the parsing work correctly.

The unit tests are extended for increased test coverage of the metadata
code. The new method `getAll` is provided so the example does not have
to access internal properties of the object anymore.
@yurydelendik
Copy link
Contributor

/botio-linux preview

@pdfjsbot
Copy link

From: Bot.io (Linux m4)


Received

Command cmd_preview from @yurydelendik received. Current queue size: 0

Live output at: http://54.67.70.0:8877/20f73f6b7062bd2/output.txt

@pdfjsbot
Copy link

From: Bot.io (Linux m4)


Success

Full output at http://54.67.70.0:8877/20f73f6b7062bd2/output.txt

Total script time: 2.29 mins

Published

Copy link
Contributor

@yurydelendik yurydelendik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the patch.

@timvandermeij
Copy link
Contributor Author

/botio test

@pdfjsbot
Copy link

From: Bot.io (Linux m4)


Received

Command cmd_test from @timvandermeij received. Current queue size: 0

Live output at: http://54.67.70.0:8877/2a20f49870c32f1/output.txt

@pdfjsbot
Copy link

From: Bot.io (Windows)


Received

Command cmd_test from @timvandermeij received. Current queue size: 1

Live output at: http://54.215.176.217:8877/5551a66cea49d99/output.txt

@pdfjsbot
Copy link

From: Bot.io (Linux m4)


Success

Full output at http://54.67.70.0:8877/2a20f49870c32f1/output.txt

Total script time: 16.62 mins

  • Font tests: Passed
  • Unit tests: Passed
  • Regression tests: Passed

@pdfjsbot
Copy link

From: Bot.io (Windows)


Success

Full output at http://54.215.176.217:8877/5551a66cea49d99/output.txt

Total script time: 29.62 mins

  • Font tests: Passed
  • Unit tests: Passed
  • Regression tests: Passed

@timvandermeij timvandermeij merged commit d7b37ae into mozilla:master Sep 20, 2017
@timvandermeij timvandermeij deleted the xml-parser branch September 20, 2017 21:45
@yurydelendik yurydelendik mentioned this pull request Oct 2, 2017
movsb pushed a commit to movsb/pdf.js that referenced this pull request Jul 14, 2018
[api-minor] Replace `DOMParser` with `SimpleXMLParser`
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants