Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Attempt to handle corrupt PDF documents that contains path operators inside of text object (issue 10542) #10756

Merged
merged 1 commit into from
May 2, 2019

Conversation

Snuffleupagus
Copy link
Collaborator

First of all, while this simple approach appears to work OK in practice I'm not sure if it's the best way of addressing the problem (assuming that you even want to).
Second of all, while the solution implemented here only requires tracking/checking one new boolean in order for this to work, I'm nonetheless not entirely happy about this since it will add additional overhead (albeit very small) to the parsing of all PDF documents just for a handful of corrupt ones.

Fixes #10542.


@janpe2 As always; thank you for the excellent debugging done in #10542 (comment) :-)

// Note that this will effectively disable the optimization in the
// `else` branch below, but given that this type of corruption is
// *extremely* rare that shouldn't really matter much in practice.
if (parsingText) {
Copy link
Collaborator Author

@Snuffleupagus Snuffleupagus Apr 21, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Before adding the test-case, I did a SKIP_BABEL=true gulp test run with assert(!parsingText); added just above this line. Since there wasn't any test failures, that made me a bit less uneasy about the general approach of this patch.

@pdfjsbot
Copy link

From: Bot.io (Linux m4)


Received

Command cmd_test from @Snuffleupagus received. Current queue size: 0

Live output at: http://54.67.70.0:8877/3f473ad05715977/output.txt

@pdfjsbot
Copy link

From: Bot.io (Windows)


Received

Command cmd_test from @Snuffleupagus received. Current queue size: 0

Live output at: http://54.215.176.217:8877/ff0b86d6ce4f23b/output.txt

@pdfjsbot
Copy link

From: Bot.io (Linux m4)


Success

Full output at http://54.67.70.0:8877/3f473ad05715977/output.txt

Total script time: 17.87 mins

  • Font tests: Passed
  • Unit tests: Passed
  • Regression tests: Passed

@pdfjsbot
Copy link

From: Bot.io (Windows)


Success

Full output at http://54.215.176.217:8877/ff0b86d6ce4f23b/output.txt

Total script time: 25.47 mins

  • Font tests: Passed
  • Unit tests: Passed
  • Regression tests: Passed

@Snuffleupagus
Copy link
Collaborator Author

/botio-linux preview

@pdfjsbot
Copy link

From: Bot.io (Linux m4)


Received

Command cmd_preview from @Snuffleupagus received. Current queue size: 0

Live output at: http://54.67.70.0:8877/1fe04892cdee821/output.txt

@pdfjsbot
Copy link

From: Bot.io (Linux m4)


Success

Full output at http://54.67.70.0:8877/1fe04892cdee821/output.txt

Total script time: 1.85 mins

Published

@pdfjsbot
Copy link

From: Bot.io (Linux m4)


Received

Command cmd_test from @Snuffleupagus received. Current queue size: 0

Live output at: http://54.67.70.0:8877/5e92d24d28488c9/output.txt

@pdfjsbot
Copy link

From: Bot.io (Windows)


Received

Command cmd_test from @Snuffleupagus received. Current queue size: 0

Live output at: http://54.215.176.217:8877/ff74b15b053e323/output.txt

@pdfjsbot
Copy link

From: Bot.io (Linux m4)


Success

Full output at http://54.67.70.0:8877/5e92d24d28488c9/output.txt

Total script time: 17.68 mins

  • Font tests: Passed
  • Unit tests: Passed
  • Regression tests: Passed

@pdfjsbot
Copy link

From: Bot.io (Windows)


Success

Full output at http://54.215.176.217:8877/ff74b15b053e323/output.txt

Total script time: 25.65 mins

  • Font tests: Passed
  • Unit tests: Passed
  • Regression tests: Passed

@Snuffleupagus
Copy link
Collaborator Author

I'm nonetheless not entirely happy about this since it will add additional overhead (albeit very small) to the parsing of all PDF documents just for a handful of corrupt ones.

On the other hand: By adding explicit beginText/endText cases to getOperatorList you avoid having to handle them in the default branch, which probably doesn't hurt given how extremely common those operators are.

@Snuffleupagus
Copy link
Collaborator Author

/botio-linux lint

@pdfjsbot
Copy link

From: Bot.io (Linux m4)


Received

Command cmd_lint from @Snuffleupagus received. Current queue size: 0

Live output at: http://54.67.70.0:8877/fa08773858b4df9/output.txt

@pdfjsbot
Copy link

From: Bot.io (Linux m4)


Success

Full output at http://54.67.70.0:8877/fa08773858b4df9/output.txt

Total script time: 0.98 mins

  • Lint: Passed

Copy link
Contributor

@timvandermeij timvandermeij left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

which probably doesn't hurt given how extremely common those operators are

I would agree, and I'm fine with the approach, but do you perhaps have a small benchmark of the performance before/after for e.g., the Tracemonkey paper or another document that has many beginText/endText operators for verification?

break;
case OPS.endText:
parsingText = false;
args = null;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does args need to be set to null here and above?

Copy link
Collaborator Author

@Snuffleupagus Snuffleupagus Apr 30, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a "better safe than sorry" approach really, since neither of the BT/ET operators should ever have parameters. However, considering the kind of checks that happen in the default branch, just always resetting the args seemed easiest here; see

pdf.js/src/core/evaluator.js

Lines 1180 to 1194 in f87dc42

default:
// Note: Ignore the operator if it has `Dict` arguments, since
// those are non-serializable, otherwise postMessage will throw
// "An object could not be cloned.".
if (args !== null) {
for (i = 0, ii = args.length; i < ii; i++) {
if (args[i] instanceof Dict) {
break;
}
}
if (i < ii) {
warn('getOperatorList - ignoring operator: ' + fn);
continue;
}
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, but now that we're actually handling these cases explicitly we don't enter the default branch for them anymore, so unless I'm missing something here the args are not actually used anywhere.

Copy link
Collaborator Author

@Snuffleupagus Snuffleupagus Apr 30, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[...] so unless I'm missing something here the args are not actually used anywhere.

That's absolutely correct, but I was a bit worried that

pdf.js/src/core/evaluator.js

Lines 2991 to 3004 in f87dc42

if (argsLength !== numArgs) {
var nonProcessedArgs = this.nonProcessedArgs;
while (argsLength > numArgs) {
nonProcessedArgs.push(args.shift());
argsLength--;
}
while (argsLength < numArgs && nonProcessedArgs.length !== 0) {
if (args === null) {
args = [];
}
args.unshift(nonProcessedArgs.pop());
argsLength++;
}
}
could somehow cause a BT/ET operator to end up with an args array.
Given that getOperatorList always calls EvaluatorPreprocessor.read as outlined in

pdf.js/src/core/evaluator.js

Lines 2961 to 2967 in f87dc42

// - |null|. This indicates that the caller needs this function to create
// the array in which any args are stored in. If there are zero args,
// this function will leave |operation.args| as |null| (thus avoiding
// allocations that would occur if we used an empty array to represent
// zero arguments). Otherwise, it will replace |null| with a new array
// containing the arguments. The caller should use this value if it
// cannot reuse an array for each call to read().
that worry seems to be unfounded though, so I'll simply remove those lines again.

Really good point!

@Snuffleupagus
Copy link
Collaborator Author

[...] but do you perhaps have a small benchmark of the performance before/after for e.g., the Tracemonkey paper or another document that has many beginText/endText operators for verification?

I've not bothered to do that, since finding a test-case with a relatively large number of BT/ET operators isn't that easy; if benchmarking is a merge blocker I suppose I'll have to try and find the time to do that.

My point above though is that handling the BT/ET cases explicitly cannot possibly be less efficient than letting them fallback to the default case, and based on the runtime of the reftests on the bots there doesn't appear to be any general upwards trend in runtimes (compared to other recent runs on the bots).

@timvandermeij
Copy link
Contributor

My point above though is that handling the BT/ET cases explicitly cannot possibly be less efficient than letting them fallback to the default case, and based on the runtime of the reftests on the bots there doesn't appear to be any general upwards trend in runtimes (compared to other recent runs on the bots).

Yes, that's a valid point. Let's finish the comment above and then it should be fine by me.

…inside of text object (issue 10542)

First of all, while this simple approach appears to work OK in practice I'm not sure if it's the best way of addressing the problem (assuming that you even want to).
Second of all, while the solution implemented here only requires tracking/checking one new boolean in order for this to work, I'm nonetheless not entirely happy about this since it will add additional overhead (albeit *very* small) to the parsing of path operators in PDF documents just for a handful of *corrupt* ones.
@Snuffleupagus
Copy link
Collaborator Author

/botio test

@pdfjsbot
Copy link

pdfjsbot commented May 2, 2019

From: Bot.io (Linux m4)


Received

Command cmd_test from @Snuffleupagus received. Current queue size: 0

Live output at: http://54.67.70.0:8877/24122574b0bf1cf/output.txt

@pdfjsbot
Copy link

pdfjsbot commented May 2, 2019

From: Bot.io (Windows)


Received

Command cmd_test from @Snuffleupagus received. Current queue size: 0

Live output at: http://54.215.176.217:8877/350f5e4efab6406/output.txt

@pdfjsbot
Copy link

pdfjsbot commented May 2, 2019

From: Bot.io (Linux m4)


Success

Full output at http://54.67.70.0:8877/24122574b0bf1cf/output.txt

Total script time: 17.61 mins

  • Font tests: Passed
  • Unit tests: Passed
  • Regression tests: Passed

@pdfjsbot
Copy link

pdfjsbot commented May 2, 2019

From: Bot.io (Windows)


Failed

Full output at http://54.215.176.217:8877/350f5e4efab6406/output.txt

Total script time: 25.76 mins

  • Font tests: Passed
  • Unit tests: Passed
  • Regression tests: FAILED

Image differences available at: http://54.215.176.217:8877/350f5e4efab6406/reftest-analyzer.html#web=eq.log

@timvandermeij
Copy link
Contributor

/botio makeref

@pdfjsbot
Copy link

pdfjsbot commented May 2, 2019

From: Bot.io (Linux m4)


Received

Command cmd_makeref from @timvandermeij received. Current queue size: 0

Live output at: http://54.67.70.0:8877/d06723d1f67f6cc/output.txt

@pdfjsbot
Copy link

pdfjsbot commented May 2, 2019

From: Bot.io (Windows)


Received

Command cmd_makeref from @timvandermeij received. Current queue size: 0

Live output at: http://54.215.176.217:8877/1fbdeb35853e570/output.txt

@pdfjsbot
Copy link

pdfjsbot commented May 2, 2019

From: Bot.io (Linux m4)


Success

Full output at http://54.67.70.0:8877/d06723d1f67f6cc/output.txt

Total script time: 16.09 mins

  • Lint: Passed
  • Make references: Passed
  • Check references: Passed

@pdfjsbot
Copy link

pdfjsbot commented May 2, 2019

From: Bot.io (Windows)


Success

Full output at http://54.215.176.217:8877/1fbdeb35853e570/output.txt

Total script time: 22.82 mins

  • Lint: Passed
  • Make references: Passed
  • Check references: Passed

@timvandermeij timvandermeij merged commit 155304a into mozilla:master May 2, 2019
@timvandermeij
Copy link
Contributor

Thank you!

@Snuffleupagus Snuffleupagus deleted the issue-10542 branch May 3, 2019 10:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Some hyperlinks and rest of line with the hyperlink is not shown
3 participants