Attempt to handle corrupt PDF documents that contains path operators inside of text object (issue 10542) #10756

Snuffleupagus · 2019-04-21T15:26:53Z

First of all, while this simple approach appears to work OK in practice I'm not sure if it's the best way of addressing the problem (assuming that you even want to).
Second of all, while the solution implemented here only requires tracking/checking one new boolean in order for this to work, I'm nonetheless not entirely happy about this since it will add additional overhead (albeit very small) to the parsing of all PDF documents just for a handful of corrupt ones.

Fixes #10542.

@janpe2 As always; thank you for the excellent debugging done in #10542 (comment) :-)

Snuffleupagus · 2019-04-21T15:45:05Z

src/core/evaluator.js

+        // Note that this will effectively disable the optimization in the
+        // `else` branch below, but given that this type of corruption is
+        // *extremely* rare that shouldn't really matter much in practice.
+        if (parsingText) {


Before adding the test-case, I did a SKIP_BABEL=true gulp test run with assert(!parsingText); added just above this line. Since there wasn't any test failures, that made me a bit less uneasy about the general approach of this patch.

pdfjsbot · 2019-04-21T15:45:32Z

From: Bot.io (Linux m4)

Received

Command cmd_test from @Snuffleupagus received. Current queue size: 0

Live output at: http://54.67.70.0:8877/3f473ad05715977/output.txt

pdfjsbot · 2019-04-21T15:45:32Z

From: Bot.io (Windows)

Received

Command cmd_test from @Snuffleupagus received. Current queue size: 0

Live output at: http://54.215.176.217:8877/ff0b86d6ce4f23b/output.txt

pdfjsbot · 2019-04-21T16:03:25Z

From: Bot.io (Linux m4)

Success

Full output at http://54.67.70.0:8877/3f473ad05715977/output.txt

Total script time: 17.87 mins

Font tests: Passed
Unit tests: Passed
Regression tests: Passed

pdfjsbot · 2019-04-21T16:11:01Z

From: Bot.io (Windows)

Success

Full output at http://54.215.176.217:8877/ff0b86d6ce4f23b/output.txt

Total script time: 25.47 mins

Font tests: Passed
Unit tests: Passed
Regression tests: Passed

Snuffleupagus · 2019-04-23T09:08:23Z

/botio-linux preview

pdfjsbot · 2019-04-23T09:08:25Z

From: Bot.io (Linux m4)

Received

Command cmd_preview from @Snuffleupagus received. Current queue size: 0

Live output at: http://54.67.70.0:8877/1fe04892cdee821/output.txt

pdfjsbot · 2019-04-23T09:10:17Z

From: Bot.io (Linux m4)

Success

Full output at http://54.67.70.0:8877/1fe04892cdee821/output.txt

Total script time: 1.85 mins

Published

Viewer: http://54.67.70.0:8877/1fe04892cdee821/web/viewer.html

pdfjsbot · 2019-04-28T12:11:57Z

From: Bot.io (Linux m4)

Received

Command cmd_test from @Snuffleupagus received. Current queue size: 0

Live output at: http://54.67.70.0:8877/5e92d24d28488c9/output.txt

pdfjsbot · 2019-04-28T12:11:57Z

From: Bot.io (Windows)

Received

Command cmd_test from @Snuffleupagus received. Current queue size: 0

Live output at: http://54.215.176.217:8877/ff74b15b053e323/output.txt

pdfjsbot · 2019-04-28T12:29:39Z

From: Bot.io (Linux m4)

Success

Full output at http://54.67.70.0:8877/5e92d24d28488c9/output.txt

Total script time: 17.68 mins

Font tests: Passed
Unit tests: Passed
Regression tests: Passed

pdfjsbot · 2019-04-28T12:37:38Z

From: Bot.io (Windows)

Success

Full output at http://54.215.176.217:8877/ff74b15b053e323/output.txt

Total script time: 25.65 mins

Font tests: Passed
Unit tests: Passed
Regression tests: Passed

Snuffleupagus · 2019-04-28T17:58:03Z

I'm nonetheless not entirely happy about this since it will add additional overhead (albeit very small) to the parsing of all PDF documents just for a handful of corrupt ones.

On the other hand: By adding explicit beginText/endText cases to getOperatorList you avoid having to handle them in the default branch, which probably doesn't hurt given how extremely common those operators are.

Snuffleupagus · 2019-04-29T11:56:42Z

/botio-linux lint

pdfjsbot · 2019-04-29T11:56:43Z

From: Bot.io (Linux m4)

Received

Command cmd_lint from @Snuffleupagus received. Current queue size: 0

Live output at: http://54.67.70.0:8877/fa08773858b4df9/output.txt

pdfjsbot · 2019-04-29T11:57:42Z

From: Bot.io (Linux m4)

Success

Full output at http://54.67.70.0:8877/fa08773858b4df9/output.txt

Total script time: 0.98 mins

Lint: Passed

timvandermeij

which probably doesn't hurt given how extremely common those operators are

I would agree, and I'm fine with the approach, but do you perhaps have a small benchmark of the performance before/after for e.g., the Tracemonkey paper or another document that has many beginText/endText operators for verification?

timvandermeij · 2019-04-29T21:47:26Z

src/core/evaluator.js

+              break;
+            case OPS.endText:
+              parsingText = false;
+              args = null;


Why does args need to be set to null here and above?

It's a "better safe than sorry" approach really, since neither of the BT/ET operators should ever have parameters. However, considering the kind of checks that happen in the default branch, just always resetting the args seemed easiest here; see

pdf.js/src/core/evaluator.js

Lines 1180 to 1194 in f87dc42

default:

// Note: Ignore the operator if it has `Dict` arguments, since

// those are non-serializable, otherwise postMessage will throw

// "An object could not be cloned.".

if (args !== null) {

for (i = 0, ii = args.length; i < ii; i++) {

if (args[i] instanceof Dict) {

break;

}

}

if (i < ii) {

warn('getOperatorList - ignoring operator: ' + fn);

continue;

}

}

Right, but now that we're actually handling these cases explicitly we don't enter the default branch for them anymore, so unless I'm missing something here the args are not actually used anywhere.

[...] so unless I'm missing something here the args are not actually used anywhere.

That's absolutely correct, but I was a bit worried that

pdf.js/src/core/evaluator.js

Lines 2991 to 3004 in f87dc42

if (argsLength !== numArgs) {

var nonProcessedArgs = this.nonProcessedArgs;

while (argsLength > numArgs) {

nonProcessedArgs.push(args.shift());

argsLength--;

}

while (argsLength < numArgs && nonProcessedArgs.length !== 0) {

if (args === null) {

args = [];

}

args.unshift(nonProcessedArgs.pop());

argsLength++;

}

}

could somehow cause a BT/ET operator to end up with an args array.
Given that getOperatorList always calls EvaluatorPreprocessor.read as outlined in

pdf.js/src/core/evaluator.js

Lines 2961 to 2967 in f87dc42

// - |null|. This indicates that the caller needs this function to create

// the array in which any args are stored in. If there are zero args,

// this function will leave |operation.args| as |null| (thus avoiding

// allocations that would occur if we used an empty array to represent

// zero arguments). Otherwise, it will replace |null| with a new array

// containing the arguments. The caller should use this value if it

// cannot reuse an array for each call to read().

that worry seems to be unfounded though, so I'll simply remove those lines again.

Really good point!

Snuffleupagus · 2019-04-30T07:04:44Z

[...] but do you perhaps have a small benchmark of the performance before/after for e.g., the Tracemonkey paper or another document that has many beginText/endText operators for verification?

I've not bothered to do that, since finding a test-case with a relatively large number of BT/ET operators isn't that easy; if benchmarking is a merge blocker I suppose I'll have to try and find the time to do that.

My point above though is that handling the BT/ET cases explicitly cannot possibly be less efficient than letting them fallback to the default case, and based on the runtime of the reftests on the bots there doesn't appear to be any general upwards trend in runtimes (compared to other recent runs on the bots).

timvandermeij · 2019-04-30T21:16:29Z

My point above though is that handling the BT/ET cases explicitly cannot possibly be less efficient than letting them fallback to the default case, and based on the runtime of the reftests on the bots there doesn't appear to be any general upwards trend in runtimes (compared to other recent runs on the bots).

Yes, that's a valid point. Let's finish the comment above and then it should be fine by me.

…inside of text object (issue 10542) First of all, while this simple approach appears to work OK in practice I'm not sure if it's the best way of addressing the problem (assuming that you even want to). Second of all, while the solution implemented here only requires tracking/checking one new boolean in order for this to work, I'm nonetheless not entirely happy about this since it will add additional overhead (albeit *very* small) to the parsing of path operators in PDF documents just for a handful of *corrupt* ones.

Snuffleupagus · 2019-05-02T08:02:13Z

/botio test

pdfjsbot · 2019-05-02T08:02:15Z

From: Bot.io (Linux m4)

Received

Command cmd_test from @Snuffleupagus received. Current queue size: 0

Live output at: http://54.67.70.0:8877/24122574b0bf1cf/output.txt

pdfjsbot · 2019-05-02T08:02:15Z

From: Bot.io (Windows)

Received

Command cmd_test from @Snuffleupagus received. Current queue size: 0

Live output at: http://54.215.176.217:8877/350f5e4efab6406/output.txt

pdfjsbot · 2019-05-02T08:19:52Z

From: Bot.io (Linux m4)

Success

Full output at http://54.67.70.0:8877/24122574b0bf1cf/output.txt

Total script time: 17.61 mins

Font tests: Passed
Unit tests: Passed
Regression tests: Passed

pdfjsbot · 2019-05-02T08:28:02Z

From: Bot.io (Windows)

Failed

Full output at http://54.215.176.217:8877/350f5e4efab6406/output.txt

Total script time: 25.76 mins

Font tests: Passed
Unit tests: Passed
Regression tests: FAILED

Image differences available at: http://54.215.176.217:8877/350f5e4efab6406/reftest-analyzer.html#web=eq.log

timvandermeij · 2019-05-02T20:04:08Z

/botio makeref

pdfjsbot · 2019-05-02T20:04:09Z

From: Bot.io (Linux m4)

Received

Command cmd_makeref from @timvandermeij received. Current queue size: 0

Live output at: http://54.67.70.0:8877/d06723d1f67f6cc/output.txt

pdfjsbot · 2019-05-02T20:04:10Z

From: Bot.io (Windows)

Received

Command cmd_makeref from @timvandermeij received. Current queue size: 0

Live output at: http://54.215.176.217:8877/1fbdeb35853e570/output.txt

pdfjsbot · 2019-05-02T20:20:16Z

From: Bot.io (Linux m4)

Success

Full output at http://54.67.70.0:8877/d06723d1f67f6cc/output.txt

Total script time: 16.09 mins

Lint: Passed
Make references: Passed
Check references: Passed

pdfjsbot · 2019-05-02T20:27:01Z

From: Bot.io (Windows)

Success

Full output at http://54.215.176.217:8877/1fbdeb35853e570/output.txt

Total script time: 22.82 mins

Lint: Passed
Make references: Passed
Check references: Passed

timvandermeij · 2019-05-02T20:29:31Z

Thank you!

Snuffleupagus commented Apr 21, 2019

View reviewed changes

timvandermeij added the core label Apr 21, 2019

Snuffleupagus force-pushed the issue-10542 branch from 6cd4b36 to fa54a66 Compare April 28, 2019 12:10

timvandermeij reviewed Apr 29, 2019

View reviewed changes

Snuffleupagus force-pushed the issue-10542 branch from fa54a66 to 5335285 Compare April 30, 2019 21:36

timvandermeij approved these changes May 2, 2019

View reviewed changes

timvandermeij merged commit 155304a into mozilla:master May 2, 2019

Snuffleupagus deleted the issue-10542 branch May 3, 2019 10:36

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Attempt to handle corrupt PDF documents that contains path operators inside of text object (issue 10542) #10756

Attempt to handle corrupt PDF documents that contains path operators inside of text object (issue 10542) #10756

Snuffleupagus commented Apr 21, 2019

Snuffleupagus Apr 21, 2019 •

edited

Loading

pdfjsbot commented Apr 21, 2019

pdfjsbot commented Apr 21, 2019

pdfjsbot commented Apr 21, 2019

pdfjsbot commented Apr 21, 2019

Snuffleupagus commented Apr 23, 2019

pdfjsbot commented Apr 23, 2019

pdfjsbot commented Apr 23, 2019

pdfjsbot commented Apr 28, 2019

pdfjsbot commented Apr 28, 2019

pdfjsbot commented Apr 28, 2019

pdfjsbot commented Apr 28, 2019

Snuffleupagus commented Apr 28, 2019

Snuffleupagus commented Apr 29, 2019

pdfjsbot commented Apr 29, 2019

pdfjsbot commented Apr 29, 2019

timvandermeij left a comment

timvandermeij Apr 29, 2019

Snuffleupagus Apr 30, 2019 •

edited

Loading

timvandermeij Apr 30, 2019

Snuffleupagus Apr 30, 2019 •

edited

Loading

Snuffleupagus commented Apr 30, 2019

timvandermeij commented Apr 30, 2019

Snuffleupagus commented May 2, 2019

pdfjsbot commented May 2, 2019

pdfjsbot commented May 2, 2019

pdfjsbot commented May 2, 2019

pdfjsbot commented May 2, 2019

timvandermeij commented May 2, 2019

pdfjsbot commented May 2, 2019

pdfjsbot commented May 2, 2019

pdfjsbot commented May 2, 2019

pdfjsbot commented May 2, 2019

timvandermeij commented May 2, 2019

	default:
	// Note: Ignore the operator if it has `Dict` arguments, since
	// those are non-serializable, otherwise postMessage will throw
	// "An object could not be cloned.".
	if (args !== null) {
	for (i = 0, ii = args.length; i < ii; i++) {
	if (args[i] instanceof Dict) {
	break;
	}
	}
	if (i < ii) {
	warn('getOperatorList - ignoring operator: ' + fn);
	continue;
	}
	}

	if (argsLength !== numArgs) {
	var nonProcessedArgs = this.nonProcessedArgs;
	while (argsLength > numArgs) {
	nonProcessedArgs.push(args.shift());
	argsLength--;
	}
	while (argsLength < numArgs && nonProcessedArgs.length !== 0) {
	if (args === null) {
	args = [];
	}
	args.unshift(nonProcessedArgs.pop());
	argsLength++;
	}
	}

	// - \|null\|. This indicates that the caller needs this function to create
	// the array in which any args are stored in. If there are zero args,
	// this function will leave \|operation.args\| as \|null\| (thus avoiding
	// allocations that would occur if we used an empty array to represent
	// zero arguments). Otherwise, it will replace \|null\| with a new array
	// containing the arguments. The caller should use this value if it
	// cannot reuse an array for each call to read().

Attempt to handle corrupt PDF documents that contains path operators inside of text object (issue 10542) #10756

Attempt to handle corrupt PDF documents that contains path operators inside of text object (issue 10542) #10756

Conversation

Snuffleupagus commented Apr 21, 2019

Snuffleupagus Apr 21, 2019 • edited Loading

Choose a reason for hiding this comment

pdfjsbot commented Apr 21, 2019

From: Bot.io (Linux m4)

Received

pdfjsbot commented Apr 21, 2019

From: Bot.io (Windows)

Received

pdfjsbot commented Apr 21, 2019

From: Bot.io (Linux m4)

Success

pdfjsbot commented Apr 21, 2019

From: Bot.io (Windows)

Success

Snuffleupagus commented Apr 23, 2019

pdfjsbot commented Apr 23, 2019

From: Bot.io (Linux m4)

Received

pdfjsbot commented Apr 23, 2019

From: Bot.io (Linux m4)

Success

Published

pdfjsbot commented Apr 28, 2019

From: Bot.io (Linux m4)

Received

pdfjsbot commented Apr 28, 2019

From: Bot.io (Windows)

Received

pdfjsbot commented Apr 28, 2019

From: Bot.io (Linux m4)

Success

pdfjsbot commented Apr 28, 2019

From: Bot.io (Windows)

Success

Snuffleupagus commented Apr 28, 2019

Snuffleupagus commented Apr 29, 2019

pdfjsbot commented Apr 29, 2019

From: Bot.io (Linux m4)

Received

pdfjsbot commented Apr 29, 2019

From: Bot.io (Linux m4)

Success

timvandermeij left a comment

Choose a reason for hiding this comment

timvandermeij Apr 29, 2019

Choose a reason for hiding this comment

Snuffleupagus Apr 30, 2019 • edited Loading

Choose a reason for hiding this comment

timvandermeij Apr 30, 2019

Choose a reason for hiding this comment

Snuffleupagus Apr 30, 2019 • edited Loading

Choose a reason for hiding this comment

Snuffleupagus commented Apr 30, 2019

timvandermeij commented Apr 30, 2019

Snuffleupagus commented May 2, 2019

pdfjsbot commented May 2, 2019

From: Bot.io (Linux m4)

Received

pdfjsbot commented May 2, 2019

From: Bot.io (Windows)

Received

pdfjsbot commented May 2, 2019

From: Bot.io (Linux m4)

Success

pdfjsbot commented May 2, 2019

From: Bot.io (Windows)

Failed

timvandermeij commented May 2, 2019

pdfjsbot commented May 2, 2019

From: Bot.io (Linux m4)

Received

pdfjsbot commented May 2, 2019

From: Bot.io (Windows)

Received

pdfjsbot commented May 2, 2019

From: Bot.io (Linux m4)

Snuffleupagus Apr 21, 2019 •

edited

Loading

Snuffleupagus Apr 30, 2019 •

edited

Loading

Snuffleupagus Apr 30, 2019 •

edited

Loading