Improve AcroForm/XFA form type detection #12271

Merged: 6 commits, Aug 25, 2020

@timvandermeij timvandermeij commented Aug 23, 2020

The commit messages contain more information about the individual changes.

Fixes #12217.
Replaces #12254.

For completeness, here is the list of test PDF files with the form type detection results from before and after this patch (note that we only fall back if only XFA data is available):

| File | Before | After | Notes |
| --- | --- | --- | --- |
| http://www.aloaha.com/wp-content/uploads/2016/07/SampleForm-1.pdf | AcroForm, XFA, fallback | AcroForm, XFA, no fallback | |
| http://www.cic.gc.ca/english/pdf/kits/forms/IMM5257E.PDF | AcroForm, XFA, fallback | XFA, fallback | SigFlags bit set and nested signature in Fields. |
| https://web.archive.org/web/20121105185256if_/http://www.northeastern.edu/hrm/pdfs/resources/benefits/MBTA-pretax-form-July2012.pdf | AcroForm, XFA, fallback | XFA, fallback | Empty Fields. |
| https://web.archive.org/web/20110918100215/http://www.irs.gov/pub/irs-pdf/f1040.pdf | AcroForm, no fallback | AcroForm, no fallback | SigFlags available, but first bit not set. |
| https://github.com/mozilla/pdf.js/files/762326/212241.6.pdf | AcroForm, no fallback | AcroForm, no fallback | SigFlags bit set and multiple signatures in Fields. |

@timvandermeij
Contributor Author

/botio test

@pdfjsbot

From: Bot.io (Windows)


Received

Command cmd_test from @timvandermeij received. Current queue size: 0

Live output at: http://54.215.176.217:8877/bbc340abc77518a/output.txt

@pdfjsbot

From: Bot.io (Linux m4)


Received

Command cmd_test from @timvandermeij received. Current queue size: 0

Live output at: http://54.67.70.0:8877/de57b8f3bc29263/output.txt

@pdfjsbot

From: Bot.io (Linux m4)


Failed

Full output at http://54.67.70.0:8877/de57b8f3bc29263/output.txt

Total script time: 26.92 mins

  • Font tests: Passed
  • Unit tests: FAILED
  • Regression tests: FAILED

Image differences available at: http://54.67.70.0:8877/de57b8f3bc29263/reftest-analyzer.html#web=eq.log

@pdfjsbot

From: Bot.io (Windows)


Failed

Full output at http://54.215.176.217:8877/bbc340abc77518a/output.txt

Total script time: 30.06 mins

  • Font tests: Passed
  • Unit tests: Passed
  • Regression tests: FAILED

Image differences available at: http://54.215.176.217:8877/bbc340abc77518a/reftest-analyzer.html#web=eq.log

Collaborator

@Snuffleupagus Snuffleupagus left a comment

This definitely looks like a big step in the right direction, given how complicated the AcroForm/XFA situation apparently is; however, there are a couple of things I'd suggest changing.

I've not completely reviewed all of this code yet, notably the new unit-tests, but figured I'd wait with that until the patches are updated.

@timvandermeij force-pushed the acroform-type-detection branch 2 times, most recently from 6a1db6e to 0d6ab07 on August 24, 2020 22:08
@timvandermeij
Contributor Author

Thank you for the review! I have addressed all comments in the new commit series.

Collaborator

@Snuffleupagus Snuffleupagus left a comment

I've added three more comments, after which I believe that this is good to go :-)

This code now looks much nicer, thank you!

The `Version` entry is part of the catalog, not of the document, so its
logic should be placed there instead. The document should look in the
catalog to fetch it, and not have knowledge of `catDict`, which is a
member internal to the catalog.

Moreover, make the version member private on the document instance. It's
only used internally and was also never intended to be public. For users
it's exposed by the `getMetadata` API endpoint as `PDFFormatVersion`.

Finally, add a comment that clarifies how the version from the header
and the version from the catalog are treated.

The `Collection` entry is part of the catalog, not of the document, so
its logic should be placed there instead. The document should look in the
catalog to fetch it, and not have knowledge of `catDict`, which is a
member internal to the catalog.

Moreover, remove the collection member from the document instance. It's
only used internally and was also never intended to be public. For users
it's exposed by the `getMetadata` API endpoint as `IsCollectionPresent`.
Moving this out of the `parse` function makes sure that the getter is
only executed if the document information is actually requested
(potentially making initial parsing a tiny bit faster).

The `AcroForm` entry is part of the catalog, not of the document, so its
logic should be placed there instead. The document should look in the
catalog to fetch it, and not have knowledge of `catDict`, which is a
member internal to the catalog.

Moreover, make the AcroForm member private on the document instance. It's
only used internally and was also never intended to be public. For users
it's exposed by the `getMetadata` API endpoint as `IsAcroFormPresent`.
Only a boolean is exposed, so we now also only store the boolean on the
document instance.

Finally, the annotation code needs access to the full AcroForm
dictionary, so it's updated to fetch the data from the catalog instead
of the document that now only holds the boolean.

`catDict` is no longer accessed outside of this file, and it never should
be since it's internal to the catalog. If data from it is needed
elsewhere, the catalog should provide a getter for it that can do basic
data integrity checks and abstract away any unnecessary details.

Good form type detection is important to get reliable telemetry and to
only show the fallback bar if a form cannot be filled out by the user.

PDF.js only supports AcroForm data, so XFA data is explicitly unsupported
(tracked in issue mozilla#2373). However, the previous form type detection
couldn't separate AcroForm and XFA well enough, causing form type
telemetry to be incorrect sometimes and the fallback bar to be shown for
forms that could in fact be filled out by the user.

The solution in this commit was found by studying the specification and
the form documents that are available to us. In a nutshell, the rules are:

- There is XFA data if the `XFA` entry is a non-empty array or stream.
- There is AcroForm data if the `Fields` entry is a non-empty array and
  it doesn't consist of only document signatures.

The document signatures part was not handled in the old code, causing a
document with only XFA data to also be marked as having AcroForm data.
Moreover, the old code didn't check all the data types.

Now that AcroForm and XFA can be distinguished, the viewer is configured
to only show the fallback bar for documents that only have XFA data. If
a document also has AcroForm data, the viewer can use that to render the
form. We have not found documents where the XFA data was necessary in
that case.
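
The two rules and the fallback decision described above can be sketched in a few lines. This is a minimal illustration, not the code from this patch: the plain-object shapes (`xfa`, `fields`, `kids`, `isStream`, `isDocumentSignature`) and all function names are assumptions made up for the example, and how a field is recognized as a document signature is deliberately left out.

```javascript
// Hypothetical sketch; the object shapes and names are illustrative only
// and do not mirror the PDF.js internals.

// XFA data is present if the `XFA` entry is a non-empty array or stream.
// A stream is modelled here as `{ isStream: true, length: <byte count> }`.
function hasXfaData(xfa) {
  if (Array.isArray(xfa)) {
    return xfa.length > 0;
  }
  return Boolean(xfa && xfa.isStream && xfa.length > 0);
}

// Returns true if every field is a document signature. Fields with children
// (`kids`, an assumed property) are checked recursively so that nested
// signatures are taken into account as well.
function hasOnlyDocumentSignatures(fields) {
  return fields.every(field => {
    if (Array.isArray(field.kids) && field.kids.length > 0) {
      return hasOnlyDocumentSignatures(field.kids);
    }
    return field.isDocumentSignature === true;
  });
}

// AcroForm data is present if `Fields` is a non-empty array that doesn't
// consist of only document signatures.
function detectFormInfo(acroForm) {
  if (!acroForm) {
    return { hasAcroForm: false, hasXfa: false };
  }
  const fields = acroForm.fields;
  const hasFields = Array.isArray(fields) && fields.length > 0;
  return {
    hasAcroForm: hasFields && !hasOnlyDocumentSignatures(fields),
    hasXfa: hasXfaData(acroForm.xfa),
  };
}

// The fallback bar is only relevant for documents that only have XFA data.
function shouldShowFallback(formInfo) {
  return formInfo.hasXfa && !formInfo.hasAcroForm;
}
```

With this sketch, a document that has a non-empty `XFA` array and whose `Fields` contain only a nested document signature is reported as XFA-only, so the fallback would be shown; adding one regular field to `Fields` marks it as having AcroForm data and suppresses the fallback.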

Finally, we include unit tests to ensure that all cases are covered and
move the form type detection out of the `parse` function so that it's
only executed if the document information is actually requested
(potentially making initial parsing a tiny bit faster).
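
One way to get the "only executed if requested" behaviour is a getter that computes its value on first access and caches it afterwards. The sketch below is a generic illustration of that idea, not the code from this patch; `DocumentSketch`, `formInfo` and the injected `detect` callback are hypothetical names.

```javascript
// Hypothetical sketch of a lazily evaluated, cached getter.
class DocumentSketch {
  constructor(detect) {
    // `detect` stands in for the (potentially expensive) form type detection.
    this._detect = detect;
  }

  parse() {
    // Initial parsing does no form type detection at all.
  }

  get formInfo() {
    const value = this._detect();
    // Replace the getter with a plain data property on this instance, so
    // that later accesses return the cached value without recomputing it.
    Object.defineProperty(this, "formInfo", {
      value,
      writable: false,
      enumerable: true,
      configurable: true,
    });
    return value;
  }
}
```

Here `parse()` never calls `detect`; the first read of `formInfo` calls it once, and later reads reuse the stored result.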
….js`

Now that the `parse` method is simplified we can inline the `setup`
method in the `parse` method since it's only two lines of code. This
avoids some indirection.
@timvandermeij
Contributor Author

/botio unittest

@pdfjsbot

From: Bot.io (Linux m4)


Received

Command cmd_unittest from @timvandermeij received. Current queue size: 0

Live output at: http://54.67.70.0:8877/65b5194c1712e67/output.txt

@pdfjsbot

From: Bot.io (Windows)


Received

Command cmd_unittest from @timvandermeij received. Current queue size: 0

Live output at: http://54.215.176.217:8877/00b777703fc2b0f/output.txt

@pdfjsbot

From: Bot.io (Linux m4)


Failed

Full output at http://54.67.70.0:8877/65b5194c1712e67/output.txt

Total script time: 3.75 mins

  • Unit Tests: FAILED

@pdfjsbot

From: Bot.io (Windows)


Success

Full output at http://54.215.176.217:8877/00b777703fc2b0f/output.txt

Total script time: 4.89 mins

  • Unit Tests: Passed

@timvandermeij
Contributor Author

/botio makeref

@pdfjsbot

From: Bot.io (Linux m4)


Received

Command cmd_makeref from @timvandermeij received. Current queue size: 0

Live output at: http://54.67.70.0:8877/95d5c41c999ce5e/output.txt

@pdfjsbot

From: Bot.io (Windows)


Received

Command cmd_makeref from @timvandermeij received. Current queue size: 0

Live output at: http://54.215.176.217:8877/7537757be2b9fd0/output.txt

@pdfjsbot

From: Bot.io (Linux m4)


Success

Full output at http://54.67.70.0:8877/95d5c41c999ce5e/output.txt

Total script time: 25.30 mins

  • Lint: Passed
  • Make references: Passed
  • Check references: Passed

@pdfjsbot

From: Bot.io (Windows)


Success

Full output at http://54.215.176.217:8877/7537757be2b9fd0/output.txt

Total script time: 27.30 mins

  • Lint: Passed
  • Make references: Passed
  • Check references: Passed

Successfully merging this pull request may close these issues.

Notification for PDF containing forms with pref enabled still displayed on some files