Attempt to handle corrupt PDF documents that contain path operators inside of text object (issue 10542) #10756
// Note that this will effectively disable the optimization in the
// `else` branch below, but given that this type of corruption is
// *extremely* rare that shouldn't really matter much in practice.
if (parsingText) {
Before adding the test-case, I did a `SKIP_BABEL=true gulp test` run with `assert(!parsingText);` added just above this line. Since there weren't any test failures, that made me a bit less uneasy about the general approach of this patch.
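For readers following along, the check described above boils down to verifying that no path operator is seen while a text object is open. A minimal standalone sketch of that idea follows; the `OPS` values, `PATH_OPS` and `findPathOpsInsideText` are hypothetical names invented for illustration and are not taken from the patch or from pdf.js itself.

```javascript
// Hypothetical, simplified operator codes (assumption: the real values live
// in the pdf.js OPS table and differ from these).
const OPS = { beginText: 1, endText: 2, moveTo: 3, lineTo: 4, showText: 5 };
const PATH_OPS = new Set([OPS.moveTo, OPS.lineTo]);

// Walks a flat list of operator codes and returns the indices of all path
// operators that occur between beginText (BT) and endText (ET).
function findPathOpsInsideText(fnArray) {
  let parsingText = false;
  const offending = [];
  for (let i = 0; i < fnArray.length; i++) {
    const fn = fnArray[i];
    switch (fn) {
      case OPS.beginText:
        parsingText = true;
        break;
      case OPS.endText:
        parsingText = false;
        break;
      default:
        if (parsingText && PATH_OPS.has(fn)) {
          offending.push(i);
        }
    }
  }
  return offending;
}

// Well-formed: the path is built before the text object starts.
const ok = findPathOpsInsideText(
  [OPS.moveTo, OPS.lineTo, OPS.beginText, OPS.showText, OPS.endText]);
// Corrupt: a path operator sits inside the text object (index 1).
const bad = findPathOpsInsideText(
  [OPS.beginText, OPS.moveTo, OPS.showText, OPS.endText]);
console.log(ok.length, bad);
```

An empty result for every document in a test corpus corresponds to the `assert(!parsingText)` never firing.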
From: Bot.io (Linux m4): Received. Command cmd_test from @Snuffleupagus received. Current queue size: 0. Live output at: http://54.67.70.0:8877/3f473ad05715977/output.txt
From: Bot.io (Windows): Received. Command cmd_test from @Snuffleupagus received. Current queue size: 0. Live output at: http://54.215.176.217:8877/ff0b86d6ce4f23b/output.txt
From: Bot.io (Linux m4): Success. Full output at http://54.67.70.0:8877/3f473ad05715977/output.txt Total script time: 17.87 mins
From: Bot.io (Windows): Success. Full output at http://54.215.176.217:8877/ff0b86d6ce4f23b/output.txt Total script time: 25.47 mins
/botio-linux preview
From: Bot.io (Linux m4): Received. Command cmd_preview from @Snuffleupagus received. Current queue size: 0. Live output at: http://54.67.70.0:8877/1fe04892cdee821/output.txt
From: Bot.io (Linux m4): Success. Full output at http://54.67.70.0:8877/1fe04892cdee821/output.txt Total script time: 1.85 mins Published
6cd4b36
to
fa54a66
Compare
From: Bot.io (Linux m4): Received. Command cmd_test from @Snuffleupagus received. Current queue size: 0. Live output at: http://54.67.70.0:8877/5e92d24d28488c9/output.txt
From: Bot.io (Windows): Received. Command cmd_test from @Snuffleupagus received. Current queue size: 0. Live output at: http://54.215.176.217:8877/ff74b15b053e323/output.txt
From: Bot.io (Linux m4): Success. Full output at http://54.67.70.0:8877/5e92d24d28488c9/output.txt Total script time: 17.68 mins
From: Bot.io (Windows): Success. Full output at http://54.215.176.217:8877/ff74b15b053e323/output.txt Total script time: 25.65 mins
On the other hand: By adding explicit |
/botio-linux lint
From: Bot.io (Linux m4): Received. Command cmd_lint from @Snuffleupagus received. Current queue size: 0. Live output at: http://54.67.70.0:8877/fa08773858b4df9/output.txt
From: Bot.io (Linux m4): Success. Full output at http://54.67.70.0:8877/fa08773858b4df9/output.txt Total script time: 0.98 mins
> which probably doesn't hurt given how extremely common those operators are
I would agree, and I'm fine with the approach, but do you perhaps have a small benchmark of the performance before/after for e.g., the Tracemonkey paper or another document that has many `beginText`/`endText` operators for verification?
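A rough timing harness for this kind of before/after comparison might look like the sketch below. It is only an illustration under stated assumptions: `runLoop` is a hypothetical stand-in for an operator loop, the input is synthetic rather than a real PDF such as the Tracemonkey paper, and it relies on a global `performance.now()` (available in browsers and recent Node.js versions). It does not measure pdf.js itself.

```javascript
// Hypothetical operator codes and loop; NOT pdf.js code.
const OPS = { beginText: 1, endText: 2, showText: 3 };

// Counts showText operators; when `trackText` is true it additionally
// handles BT/ET explicitly, mimicking the extra bookkeeping under discussion.
function runLoop(fnArray, trackText) {
  let parsingText = false;
  let shown = 0;
  for (const fn of fnArray) {
    if (trackText) {
      if (fn === OPS.beginText) { parsingText = true; continue; }
      if (fn === OPS.endText) { parsingText = false; continue; }
    }
    if (fn === OPS.showText) {
      shown++;
    }
  }
  return shown;
}

// Runs `fn` several times and reports the median wall-clock time in ms.
function timeIt(label, fn, rounds = 20) {
  const samples = [];
  for (let r = 0; r < rounds; r++) {
    const start = performance.now();
    fn();
    samples.push(performance.now() - start);
  }
  samples.sort((a, b) => a - b);
  const median = samples[Math.floor(samples.length / 2)];
  console.log(`${label}: median ${median.toFixed(3)} ms over ${rounds} rounds`);
  return median;
}

// Synthetic input with many BT/ET pairs.
const fnArray = [];
for (let i = 0; i < 100000; i++) {
  fnArray.push(OPS.beginText, OPS.showText, OPS.endText);
}

timeIt("without BT/ET tracking", () => runLoop(fnArray, false));
timeIt("with BT/ET tracking", () => runLoop(fnArray, true));
```

Using the median over several rounds reduces the influence of JIT warm-up and garbage collection, but micro-benchmarks like this remain noisy, so a run against real documents would still be the more convincing evidence.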
src/core/evaluator.js
Outdated
  break;
case OPS.endText:
  parsingText = false;
  args = null;
Why does `args` need to be set to `null` here and above?
It's a "better safe than sorry" approach really, since neither of the `BT`/`ET` operators should ever have parameters. However, considering the kind of checks that happen in the `default` branch, just always resetting the `args` seemed easiest here; see
Lines 1180 to 1194 in f87dc42
default:
  // Note: Ignore the operator if it has `Dict` arguments, since
  // those are non-serializable, otherwise postMessage will throw
  // "An object could not be cloned.".
  if (args !== null) {
    for (i = 0, ii = args.length; i < ii; i++) {
      if (args[i] instanceof Dict) {
        break;
      }
    }
    if (i < ii) {
      warn('getOperatorList - ignoring operator: ' + fn);
      continue;
    }
  }
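The check in the quoted `default` branch can be restated in isolation as follows. This is a sketch only: the `Dict` class here is an empty stand-in for the pdf.js class of the same name, and `hasDictArg` is a hypothetical helper name that does not appear in the quoted code.

```javascript
// Empty stand-in for the pdf.js Dict class (assumption: only the
// `instanceof` relationship matters for this check).
class Dict {}

// Returns true if any argument is a Dict, in which case the quoted code
// warns and skips the operator instead of adding it to the operator list.
function hasDictArg(args) {
  if (args === null) {
    return false; // Zero arguments are represented as null, not [].
  }
  for (let i = 0, ii = args.length; i < ii; i++) {
    if (args[i] instanceof Dict) {
      return true;
    }
  }
  return false;
}

console.log(hasDictArg(null));              // false
console.log(hasDictArg([1, 2, "Tf"]));      // false
console.log(hasDictArg([1, new Dict()]));   // true
```

Resetting `args` to `null` before reaching such a check guarantees the early `false` path, which is the "better safe than sorry" reasoning given above.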
Right, but now that we're actually handling these cases explicitly we don't enter the `default` branch for them anymore, so unless I'm missing something here the `args` are not actually used anywhere.
> [...] so unless I'm missing something here the `args` are not actually used anywhere.
That's absolutely correct, but I was a bit worried that
Lines 2991 to 3004 in f87dc42
if (argsLength !== numArgs) {
  var nonProcessedArgs = this.nonProcessedArgs;
  while (argsLength > numArgs) {
    nonProcessedArgs.push(args.shift());
    argsLength--;
  }
  while (argsLength < numArgs && nonProcessedArgs.length !== 0) {
    if (args === null) {
      args = [];
    }
    args.unshift(nonProcessedArgs.pop());
    argsLength++;
  }
}
`BT`/`ET` operator to end up with an `args` array.
Given that `getOperatorList` always calls `EvaluatorPreprocessor.read` as outlined in Lines 2961 to 2967 in f87dc42
// - |null|. This indicates that the caller needs this function to create
// the array in which any args are stored in. If there are zero args,
// this function will leave |operation.args| as |null| (thus avoiding
// allocations that would occur if we used an empty array to represent
// zero arguments). Otherwise, it will replace |null| with a new array
// containing the arguments. The caller should use this value if it
// cannot reuse an array for each call to read().
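To make the concern about a zero-parameter operator ending up with an `args` array more concrete, here is a standalone sketch of the argument-count adjustment quoted from lines 2991 to 3004. The function name `adjustArgs` and its signature are hypothetical; the loop bodies mirror the quoted snippet, but everything around them (how `args` is collected, how `numArgs` is looked up) is assumed rather than taken from `EvaluatorPreprocessor`.

```javascript
// Standalone restatement of the quoted adjustment; `adjustArgs` is a
// hypothetical name. `args` is either null (no operands) or an array.
function adjustArgs(args, numArgs, nonProcessedArgs) {
  let argsLength = args !== null ? args.length : 0;
  if (argsLength !== numArgs) {
    // Too many operands: move the surplus (from the front) aside.
    while (argsLength > numArgs) {
      nonProcessedArgs.push(args.shift());
      argsLength--;
    }
    // Too few operands: pull previously set-aside operands back in.
    while (argsLength < numArgs && nonProcessedArgs.length !== 0) {
      if (args === null) {
        args = [];
      }
      args.unshift(nonProcessedArgs.pop());
      argsLength++;
    }
  }
  return args;
}

// A zero-parameter operator (like BT/ET) with no operands stays at null.
console.log(adjustArgs(null, 0, []));

// A corrupt stream with one stray operand before a zero-parameter operator:
// the operand is set aside, but `args` is now an empty array, not null.
const stash = [];
console.log(adjustArgs([42], 0, stash), stash);
```

In this sketch, the second call shows how a stray operand leaves behind an empty (non-null) array, which is the kind of situation that explicitly resetting `args` to `null` guards against.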
Really good point!
I've not bothered to do that, since finding a test-case with a relatively large number of My point above though is that handling the |
Yes, that's a valid point. Let's finish the comment above and then it should be fine by me.
…inside of text object (issue 10542)
First of all, while this simple approach appears to work OK in practice I'm not sure if it's the best way of addressing the problem (assuming that you even want to).
Second of all, while the solution implemented here only requires tracking/checking one new boolean in order for this to work, I'm nonetheless not entirely happy about this since it will add additional overhead (albeit *very* small) to the parsing of path operators in PDF documents just for a handful of *corrupt* ones.
fa54a66
to
5335285
Compare
/botio test
From: Bot.io (Linux m4): Received. Command cmd_test from @Snuffleupagus received. Current queue size: 0. Live output at: http://54.67.70.0:8877/24122574b0bf1cf/output.txt
From: Bot.io (Windows): Received. Command cmd_test from @Snuffleupagus received. Current queue size: 0. Live output at: http://54.215.176.217:8877/350f5e4efab6406/output.txt
From: Bot.io (Linux m4): Success. Full output at http://54.67.70.0:8877/24122574b0bf1cf/output.txt Total script time: 17.61 mins
From: Bot.io (Windows): Failed. Full output at http://54.215.176.217:8877/350f5e4efab6406/output.txt Total script time: 25.76 mins
Image differences available at: http://54.215.176.217:8877/350f5e4efab6406/reftest-analyzer.html#web=eq.log
/botio makeref
From: Bot.io (Linux m4): Received. Command cmd_makeref from @timvandermeij received. Current queue size: 0. Live output at: http://54.67.70.0:8877/d06723d1f67f6cc/output.txt
From: Bot.io (Windows): Received. Command cmd_makeref from @timvandermeij received. Current queue size: 0. Live output at: http://54.215.176.217:8877/1fbdeb35853e570/output.txt
From: Bot.io (Linux m4): Success. Full output at http://54.67.70.0:8877/d06723d1f67f6cc/output.txt Total script time: 16.09 mins
From: Bot.io (Windows): Success. Full output at http://54.215.176.217:8877/1fbdeb35853e570/output.txt Total script time: 22.82 mins
Thank you!
First of all, while this simple approach appears to work OK in practice I'm not sure if it's the best way of addressing the problem (assuming that you even want to).
Second of all, while the solution implemented here only requires tracking/checking one new boolean in order for this to work, I'm nonetheless not entirely happy about this since it will add additional overhead (albeit very small) to the parsing of all PDF documents just for a handful of corrupt ones.
Fixes #10542.
@janpe2 As always, thank you for the excellent debugging done in #10542 (comment) :-)