New XML parser #9573

Merged
merged 1 commit into from
Mar 22, 2018

Conversation

yurydelendik
Contributor

@yurydelendik yurydelendik commented Mar 16, 2018

Improves performance of the RegExp-based SimpleXMLParser.

The code for XMLParserBase is copied from
https://github.com/mozilla/shumway/blob/16451d8836fa85f4b16eeda8b4bda2fa9e2b22b0/src/avm2/natives/xml.ts

case 'amp':
return '&';
case 'quot':
return '\"';
Collaborator

The old code also handled apos, see https://github.com/mozilla/pdf.js/pull/9573/files#diff-3cb21ac6c8ce15118704e14df3dcc9b2L256; is that no longer necessary here?

Contributor Author

It's an HTML thing, but we can add it there too.
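
For readers following the entity discussion above, here is a minimal sketch of named-entity resolution that also covers `apos`. It is not the code from this PR: `resolveEntity` and `resolveEntities` are hypothetical names, and the sketch only handles the five entities that XML 1.0 predefines (`lt`, `gt`, `amp`, `quot`, `apos`), leaving anything else untouched.

```javascript
// Hypothetical helper, not the PR's implementation: map one entity name
// to its character, or return the original reference if it is unknown.
function resolveEntity(name) {
  switch (name) {
    case 'lt':
      return '<';
    case 'gt':
      return '>';
    case 'amp':
      return '&';
    case 'quot':
      return '"';
    case 'apos':
      return "'";
  }
  // Unknown entity: keep the reference as written.
  return '&' + name + ';';
}

// Replace every `&name;` reference in a string in a single pass.
// The name may not contain '&', ';' or whitespace, so a bare '&' is ignored.
function resolveEntities(s) {
  return s.replace(/&([^&;\s]+);/g, (all, name) => resolveEntity(name));
}

console.log(resolveEntities('Tom &amp; Jerry&apos;s &quot;show&quot;'));
```

Because the replacement runs in one pass, an escaped reference such as `&amp;lt;` becomes `&lt;` and is not resolved a second time.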

@timvandermeij
Contributor

/botio test

@pdfjsbot

From: Bot.io (Linux m4)


Received

Command cmd_test from @timvandermeij received. Current queue size: 0

Live output at: http://54.67.70.0:8877/4a8eb8977cfd151/output.txt

@pdfjsbot

From: Bot.io (Windows)


Received

Command cmd_test from @timvandermeij received. Current queue size: 0

Live output at: http://54.215.176.217:8877/0638ec9c7e0b33e/output.txt

@pdfjsbot

From: Bot.io (Linux m4)


Success

Full output at http://54.67.70.0:8877/4a8eb8977cfd151/output.txt

Total script time: 18.17 mins

  • Font tests: Passed
  • Unit tests: Passed
  • Regression tests: Passed

@pdfjsbot

From: Bot.io (Windows)


Success

Full output at http://54.215.176.217:8877/0638ec9c7e0b33e/output.txt

Total script time: 24.16 mins

  • Font tests: Passed
  • Unit tests: Passed
  • Regression tests: Passed

@timvandermeij
Contributor

/botio-linux preview

@pdfjsbot

From: Bot.io (Linux m4)


Received

Command cmd_preview from @timvandermeij received. Current queue size: 0

Live output at: http://54.67.70.0:8877/80445c7126554ab/output.txt

@pdfjsbot

From: Bot.io (Linux m4)


Success

Full output at http://54.67.70.0:8877/80445c7126554ab/output.txt

Total script time: 2.70 mins

Published

Contributor

@timvandermeij timvandermeij left a comment

In general this looks good to me (since it already worked in Shumway, I glanced over the details of the XML parsing), but there are some final comments I'd like to see addressed before approving this. Thanks!


function isWhitespaceString(s) {
for (let i = 0; i < s.length; i++) {
const ch = s[i];
Contributor

Let's just call isWhitespace(s[i]) here so we don't have to copy the logic. Also, let's cache s.length using for (let i = 0, ii = s.length; i < ii; i++) like we do elsewhere in the codebase.
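
A minimal sketch of what this suggestion could look like, assuming `isWhitespace` takes a single character and treats space, newline, carriage return, and tab as whitespace (the helper's real definition is not shown in this conversation):

```javascript
// Assumed definition: the actual helper in the PR may differ.
function isWhitespace(ch) {
  return ch === ' ' || ch === '\n' || ch === '\r' || ch === '\t';
}

// Returns true when every character of `s` is whitespace.
// The length is cached in `ii`, following the loop style the review asks for.
function isWhitespaceString(s) {
  for (let i = 0, ii = s.length; i < ii; i++) {
    if (!isWhitespace(s[i])) {
      return false;
    }
  }
  return true;
}

console.log(isWhitespaceString(' \n\t'), isWhitespaceString(' a '));
```

Note that with this shape an empty string counts as a whitespace string, since the loop body never runs.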

@@ -23,13 +23,15 @@ class Metadata {
 // Ghostscript may produce invalid metadata, so try to repair that first.
 data = this._repair(data);

-// Convert the string to a DOM `Document`.
+// Convert the string to a XML document.
Contributor

Nit: an XML document

name = s.substring(start, pos);
skipWs();
while (pos < s.length && s[pos] !== '>' &&
s[pos] !== '/' && s[pos] !== '?') {
Contributor

This should be aligned with pos for readability.

const doctypeContent =
s.substring(j + 8, q + (complexDoctype ? 1 : 0));
this.onDoctype(doctypeContent);
// XXX pull entities ?
Contributor

What does this mean? Can we remove it?

} else {
do {
/* skipping text items */
} while (j++ < s.length && s[j] !== '<');
Contributor

Do we need a do .. while loop here instead of a regular while loop as used above?
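
One way the text-skipping step could be written with a regular `while` loop is sketched below. `skipText` is a hypothetical helper, not code from the PR, and it assumes `j` points at a character already known to be text. One behavioral difference to be aware of: at end of input this version stops at `s.length`, whereas the quoted `do .. while` form increments `j` once more before exiting.

```javascript
// Hypothetical helper: given the index `j` of the first character of a
// text run, return the index of the next '<', or s.length if none exists.
function skipText(s, j) {
  j++; // the character at `j` is known to be text, so step past it
  while (j < s.length && s[j] !== '<') {
    j++;
  }
  return j;
}

console.log(skipText('ab<c>', 0));
```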

}
}

export class SimpleXMLParser extends XMLParserBase {
Contributor

Usually we put the exports at the bottom of the file. Can we do that here too for consistency (compare https://github.com/mozilla/pdf.js/pull/9573/files#diff-3cb21ac6c8ce15118704e14df3dcc9b2L405)?

super();
this._currentFragment = null;
this._stack = null;
this._errorCode = XMLParserErrorCode.NoError;
Contributor

parseFromString sets this too, so should we just set it to null here too, like the other two members?

Contributor Author

I just don't want to change the shape of the object, so I'm keeping _errorCode a number. I can add an XMLParserErrorCode.Unknown value?

Contributor

I see. In that case, let's keep it like how it is now for simplicity. Thanks!
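
The point about object shape can be illustrated with a small sketch. The class name `ParserSketch`, the numeric values, and the `UnterminatedElement` code are assumptions made for illustration; only `NoError`, `_currentFragment`, `_stack`, `_errorCode`, and `parseFromString` appear in the discussion above. The idea is that `_errorCode` is a number from construction onward, so the field never switches between `null` and a number.

```javascript
// Hypothetical error codes: the values and the second entry are
// illustrative, only `NoError` is named in the conversation.
const XMLParserErrorCode = {
  NoError: 0,
  UnterminatedElement: -1,
};

class ParserSketch {
  constructor() {
    this._currentFragment = null;
    this._stack = null;
    // Initialized to a number, so the field keeps one type for its lifetime.
    this._errorCode = XMLParserErrorCode.NoError;
  }

  // Toy stand-in for parsing: flags input whose last '<' is never closed.
  parseFromString(data) {
    this._currentFragment = [];
    this._stack = [];
    this._errorCode = XMLParserErrorCode.NoError;
    if (data.lastIndexOf('<') > data.lastIndexOf('>')) {
      this._errorCode = XMLParserErrorCode.UnterminatedElement;
    }
    return this._errorCode === XMLParserErrorCode.NoError;
  }
}

const parser = new ParserSketch();
console.log(typeof parser._errorCode, parser.parseFromString('<a></a>'));
```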

@timvandermeij
Contributor

/botio test

@pdfjsbot

From: Bot.io (Linux m4)


Received

Command cmd_test from @timvandermeij received. Current queue size: 0

Live output at: http://54.67.70.0:8877/bd3c487b19dd6fb/output.txt

@pdfjsbot

From: Bot.io (Windows)


Received

Command cmd_test from @timvandermeij received. Current queue size: 1

Live output at: http://54.215.176.217:8877/81d163b152949eb/output.txt

@pdfjsbot

From: Bot.io (Linux m4)


Success

Full output at http://54.67.70.0:8877/bd3c487b19dd6fb/output.txt

Total script time: 18.20 mins

  • Font tests: Passed
  • Unit tests: Passed
  • Regression tests: Passed

@pdfjsbot

From: Bot.io (Windows)


Success

Full output at http://54.215.176.217:8877/81d163b152949eb/output.txt

Total script time: 24.27 mins

  • Font tests: Passed
  • Unit tests: Passed
  • Regression tests: Passed

@brendandahl brendandahl merged commit 24f766b into mozilla:master Mar 22, 2018
@Rob--W
Member

Rob--W commented Mar 22, 2018

Why is a custom XML parser used instead of the standard DOMParser API ( https://developer.mozilla.org/en-US/docs/Web/API/DOMParser )? All web browsers, even IE 9+, support this API.

@Snuffleupagus
Collaborator

> Why is a custom XML parser used instead of the standard DOMParser API ( https://developer.mozilla.org/en-US/docs/Web/API/DOMParser )? All web browsers, even IE 9+, support this API.

Please see issue #8903; and also https://bugzilla.mozilla.org/show_bug.cgi?id=1386676.
