-
Notifications
You must be signed in to change notification settings - Fork 10k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve AcroForm/XFA form type detection #12271
Improve AcroForm/XFA form type detection #12271
Conversation
8ed7bbe
to
9608c5c
Compare
/botio test |
From: Bot.io (Windows)ReceivedCommand cmd_test from @timvandermeij received. Current queue size: 0 Live output at: http://54.215.176.217:8877/bbc340abc77518a/output.txt |
From: Bot.io (Linux m4)ReceivedCommand cmd_test from @timvandermeij received. Current queue size: 0 Live output at: http://54.67.70.0:8877/de57b8f3bc29263/output.txt |
From: Bot.io (Linux m4)FailedFull output at http://54.67.70.0:8877/de57b8f3bc29263/output.txt Total script time: 26.92 mins
Image differences available at: http://54.67.70.0:8877/de57b8f3bc29263/reftest-analyzer.html#web=eq.log |
From: Bot.io (Windows)FailedFull output at http://54.215.176.217:8877/bbc340abc77518a/output.txt Total script time: 30.06 mins
Image differences available at: http://54.215.176.217:8877/bbc340abc77518a/reftest-analyzer.html#web=eq.log |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This definitely looks like a big step in the right direction, given how complicated the AcroForm/XFA situation apparently is; however there's a couple of things I'd suggest changing.
I've not completely reviewed all of this code yet, notably the new unit-tests, but figured I'd wait with that until the patches are updated.
6a1db6e
to
0d6ab07
Compare
Thank you for the review! I have addressed all comments in the new commit series. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've added three more comments, after which I believe that this is good to go :-)
This code now looks much nicer, thank you!
The `Version` entry is part of the catalog, not of the document, so its logic should be placed there instead. The document should look in the catalog to fetch it, and not have knowledge of `catDict`, which is a member internal to the catalog. Moreover, make the version member private on the document instance. It's only used internally and was also never intended to be public. For users it's exposed by the `getMetadata` API endpoint as `PDFFormatVersion`. Finally, clarify how the version from the header and the version from the catalog are treated using a comment.
The `Collection` entry is part of the catalog, not of the document, so its logic should be placed there instead. The document should look in the catalog to fetch it, and not have knowledge of `catDict`, which is a member internal to the catalog. Moreover, remove the collection member from the document instance. It's only used internally and was also never intended to be public. For users it's exposed by the `getMetadata` API endpoint as `IsCollectionPresent`. Moving this out of the `parse` function makes sure that the getter is only executed if the document information is actually requested (potentially making initial parsing a tiny bit faster).
The `AcroForm` entry is part of the catalog, not of the document, so its logic should be placed there instead. The document should look in the catalog to fetch it, and not have knowledge of `catDict`, which is a member internal to the catalog. Moreover, make the AcroForm member private on the document instance. It's only used internally and was also never intended to be public. For users it's exposed by the `getMetadata` API endpoint as `IsAcroFormPresent`. Only a boolean is exposed, so we now also only store the boolean on the document instance. Finally, the annotation code needs access to the full AcroForm dictionary, so it's updated to fetch the data from the catalog instead of the document that now only holds the boolean.
Not only is `catDict` never accessed anymore outside of this file, it should also never happen since it's internal to the catalog. If data from it is needed elsewhere, the catalog should provide a getter for it that can do basic data integrity checks and abstract away any unnecessary details.
Good form type detection is important to get reliable telemetry and to only show the fallback bar if a form cannot be filled out by the user. PDF.js only supports AcroForm data, so XFA data is explicitly unsupported (tracked in issue mozilla#2373). However, the previous form type detection couldn't separate AcroForm and XFA well enough, causing form type telemetry to be incorrect sometimes and the fallback bar to be shown for forms that could in fact be filled out by the user. The solution in this commit is found by studying the specification and the form documents that are available to us. In a nutshell the rules are: - There is XFA data if the `XFA` entry is a non-empty array or stream. - There is AcroForm data if the `Fields` entry is a non-empty array and it doesn't consist of only document signatures. The document signatures part was not handled in the old code, causing a document with only XFA data to also be marked as having AcroForm data. Moreover, the old code didn't check all the data types. Now that AcroForm and XFA can be distinguished, the viewer is configured to only show the fallback bar for documents that only have XFA data. If a document also has AcroForm data, the viewer can use that to render the form. We have not found documents where the XFA data was necessary in that case. Finally, we include unit tests to ensure that all cases are covered and move the form type detection out of the `parse` function so that it's only executed if the document information is actually requested (potentially making initial parsing a tiny bit faster).
….js` Now that the `parse` method is simplified we can inline the `setup` method in the `parse` method since it's only two lines of code. This avoids some indirection.
0d6ab07
to
0f229d5
Compare
/botio unittest |
From: Bot.io (Linux m4)ReceivedCommand cmd_unittest from @timvandermeij received. Current queue size: 0 Live output at: http://54.67.70.0:8877/65b5194c1712e67/output.txt |
From: Bot.io (Windows)ReceivedCommand cmd_unittest from @timvandermeij received. Current queue size: 0 Live output at: http://54.215.176.217:8877/00b777703fc2b0f/output.txt |
From: Bot.io (Linux m4)FailedFull output at http://54.67.70.0:8877/65b5194c1712e67/output.txt Total script time: 3.75 mins
|
From: Bot.io (Windows)SuccessFull output at http://54.215.176.217:8877/00b777703fc2b0f/output.txt Total script time: 4.89 mins
|
/botio makeref |
From: Bot.io (Linux m4)ReceivedCommand cmd_makeref from @timvandermeij received. Current queue size: 0 Live output at: http://54.67.70.0:8877/95d5c41c999ce5e/output.txt |
From: Bot.io (Windows)ReceivedCommand cmd_makeref from @timvandermeij received. Current queue size: 0 Live output at: http://54.215.176.217:8877/7537757be2b9fd0/output.txt |
From: Bot.io (Linux m4)SuccessFull output at http://54.67.70.0:8877/95d5c41c999ce5e/output.txt Total script time: 25.30 mins
|
From: Bot.io (Windows)SuccessFull output at http://54.215.176.217:8877/7537757be2b9fd0/output.txt Total script time: 27.30 mins
|
The commit messages contain more information about the individual changes.
Fixes #12217.
Replaces #12254.
For completeness, here is the list of test PDF files with the form type detection results from before/after this patch (note that we only fallback if only XFA is available):
SigFlags
bit set and nested signature inFields
.Fields
.SigFlags
available, but first bit not set.SigFlags
bit set and multiple signatures inFields
.