Improve AcroForm/XFA form type detection #12271

timvandermeij · 2020-08-23T20:22:24Z

The commit messages contain more information about the individual changes.

Fixes #12217.
Replaces #12254.

For completeness, here is the list of test PDF files with the form type detection results from before and after this patch (note that we only fall back if only XFA is available):

| File | Before | After | Notes |
| --- | --- | --- | --- |
| http://www.aloaha.com/wp-content/uploads/2016/07/SampleForm-1.pdf | AcroForm, XFA, fallback | AcroForm, XFA, no fallback | |
| http://www.cic.gc.ca/english/pdf/kits/forms/IMM5257E.PDF | AcroForm, XFA, fallback | XFA, fallback | `SigFlags` bit set and nested signature in `Fields`. |
| https://web.archive.org/web/20121105185256if_/http://www.northeastern.edu/hrm/pdfs/resources/benefits/MBTA-pretax-form-July2012.pdf | AcroForm, XFA, fallback | XFA, fallback | Empty `Fields`. |
| https://web.archive.org/web/20110918100215/http://www.irs.gov/pub/irs-pdf/f1040.pdf | AcroForm, no fallback | AcroForm, no fallback | `SigFlags` available, but first bit not set. |
| https://github.com/mozilla/pdf.js/files/762326/212241.6.pdf | AcroForm, no fallback | AcroForm, no fallback | `SigFlags` bit set and multiple signatures in `Fields`. |
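
The fallback rule from the note above (fall back only when XFA is the sole form type) can be sketched as a small predicate. This is an illustrative sketch rather than the viewer's actual code: `shouldShowFallback` and the `hasAcroForm`/`hasXfa` property names are assumptions made for this example, and the sample rows simply restate the "After" column of the table.

```javascript
// Hypothetical helper (not taken from the PDF.js sources): the fallback bar
// is only needed when XFA data is present and no AcroForm data is available.
function shouldShowFallback({ hasAcroForm, hasXfa }) {
  return hasXfa && !hasAcroForm;
}

// The "After" results from the table, expressed as detection flags.
const afterResults = [
  { file: "SampleForm-1.pdf", hasAcroForm: true, hasXfa: true },
  { file: "IMM5257E.PDF", hasAcroForm: false, hasXfa: true },
  { file: "MBTA-pretax-form-July2012.pdf", hasAcroForm: false, hasXfa: true },
  { file: "f1040.pdf", hasAcroForm: true, hasXfa: false },
  { file: "212241.6.pdf", hasAcroForm: true, hasXfa: false },
];

for (const row of afterResults) {
  console.log(`${row.file}: fallback = ${shouldShowFallback(row)}`);
}
```

With these inputs only the two XFA-only documents produce `fallback = true`, matching the table.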

timvandermeij · 2020-08-23T21:21:58Z

/botio test

pdfjsbot · 2020-08-23T21:22:00Z

From: Bot.io (Windows)

Received

Command cmd_test from @timvandermeij received. Current queue size: 0

Live output at: http://54.215.176.217:8877/bbc340abc77518a/output.txt

pdfjsbot · 2020-08-23T21:22:00Z

From: Bot.io (Linux m4)

Received

Command cmd_test from @timvandermeij received. Current queue size: 0

Live output at: http://54.67.70.0:8877/de57b8f3bc29263/output.txt

pdfjsbot · 2020-08-23T21:48:56Z

From: Bot.io (Linux m4)

Failed

Full output at http://54.67.70.0:8877/de57b8f3bc29263/output.txt

Total script time: 26.92 mins

Font tests: Passed
Unit tests: FAILED
Regression tests: FAILED

Image differences available at: http://54.67.70.0:8877/de57b8f3bc29263/reftest-analyzer.html#web=eq.log

pdfjsbot · 2020-08-23T21:52:04Z

From: Bot.io (Windows)

Failed

Full output at http://54.215.176.217:8877/bbc340abc77518a/output.txt

Total script time: 30.06 mins

Font tests: Passed
Unit tests: Passed
Regression tests: FAILED

Image differences available at: http://54.215.176.217:8877/bbc340abc77518a/reftest-analyzer.html#web=eq.log

Snuffleupagus

This definitely looks like a big step in the right direction, given how complicated the AcroForm/XFA situation apparently is; however there's a couple of things I'd suggest changing.

I've not completely reviewed all of this code yet, notably the new unit-tests, but figured I'd wait with that until the patches are updated.

timvandermeij · 2020-08-24T22:23:42Z

Thank you for the review! I have addressed all comments in the new commit series.

Snuffleupagus

I've added three more comments, after which I believe that this is good to go :-)

This code now looks much nicer, thank you!

src/core/document.js

The `Version` entry is part of the catalog, not of the document, so its logic should be placed there instead. The document should look in the catalog to fetch it, and not have knowledge of `catDict`, which is a member internal to the catalog. Moreover, make the version member private on the document instance. It's only used internally and was also never intended to be public. For users it's exposed by the `getMetadata` API endpoint as `PDFFormatVersion`. Finally, clarify how the version from the header and the version from the catalog are treated using a comment.

The `Collection` entry is part of the catalog, not of the document, so its logic should be placed there instead. The document should look in the catalog to fetch it, and not have knowledge of `catDict`, which is a member internal to the catalog. Moreover, remove the collection member from the document instance. It's only used internally and was also never intended to be public. For users it's exposed by the `getMetadata` API endpoint as `IsCollectionPresent`. Moving this out of the `parse` function makes sure that the getter is only executed if the document information is actually requested (potentially making initial parsing a tiny bit faster).

The `AcroForm` entry is part of the catalog, not of the document, so its logic should be placed there instead. The document should look in the catalog to fetch it, and not have knowledge of `catDict`, which is a member internal to the catalog. Moreover, make the AcroForm member private on the document instance. It's only used internally and was also never intended to be public. For users it's exposed by the `getMetadata` API endpoint as `IsAcroFormPresent`. Only a boolean is exposed, so we now also only store the boolean on the document instance. Finally, the annotation code needs access to the full AcroForm dictionary, so it's updated to fetch the data from the catalog instead of the document that now only holds the boolean.

Not only is `catDict` no longer accessed outside of this file, such access should also never happen, since `catDict` is internal to the catalog. If data from it is needed elsewhere, the catalog should provide a getter for it that can do basic data integrity checks and abstract away any unnecessary details.
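
The pattern described in these commit messages (the catalog owns `catDict` and exposes validated getters, while the document only derives what it needs, lazily) could look roughly like the following. This is a minimal sketch under stated assumptions: a `Map` stands in for a PDF dictionary, the local `shadow` helper is a stand-in for a "compute once, then cache" utility, and the class bodies are illustrative rather than the actual PDF.js code.

```javascript
// Stand-in caching helper (an assumption for this sketch): replaces the
// getter on the instance with a plain read-only value after first access.
function shadow(obj, prop, value) {
  Object.defineProperty(obj, prop, {
    value,
    enumerable: true,
    configurable: true,
    writable: false,
  });
  return value;
}

class Catalog {
  constructor(catDict) {
    // Internal to the catalog; other classes should go through getters.
    this._catDict = catDict;
  }

  get collection() {
    const value = this._catDict.get("Collection");
    // Basic integrity check: only accept a dictionary-like value
    // (a `Map` is the stand-in for a PDF dictionary here).
    const collection = value instanceof Map ? value : null;
    return shadow(this, "collection", collection);
  }
}

class PDFDocument {
  constructor(catalog) {
    this.catalog = catalog;
  }

  // Only evaluated when the document information is actually requested.
  get documentInfo() {
    const docInfo = { IsCollectionPresent: !!this.catalog.collection };
    return shadow(this, "documentInfo", docInfo);
  }
}

const doc = new PDFDocument(new Catalog(new Map([["Collection", new Map()]])));
console.log(doc.documentInfo.IsCollectionPresent);
```

Because the getter result is cached on the instance, repeated reads of `documentInfo` return the same object without touching the catalog again.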

Good form type detection is important to get reliable telemetry and to only show the fallback bar if a form cannot be filled out by the user. PDF.js only supports AcroForm data, so XFA data is explicitly unsupported (tracked in issue mozilla#2373). However, the previous form type detection couldn't separate AcroForm and XFA well enough, causing form type telemetry to be incorrect sometimes and the fallback bar to be shown for forms that could in fact be filled out by the user.

The solution in this commit is found by studying the specification and the form documents that are available to us. In a nutshell the rules are:

- There is XFA data if the `XFA` entry is a non-empty array or stream.
- There is AcroForm data if the `Fields` entry is a non-empty array and it doesn't consist of only document signatures.

The document signatures part was not handled in the old code, causing a document with only XFA data to also be marked as having AcroForm data. Moreover, the old code didn't check all the data types.

Now that AcroForm and XFA can be distinguished, the viewer is configured to only show the fallback bar for documents that only have XFA data. If a document also has AcroForm data, the viewer can use that to render the form. We have not found documents where the XFA data was necessary in that case.

Finally, we include unit tests to ensure that all cases are covered and move the form type detection out of the `parse` function so that it's only executed if the document information is actually requested (potentially making initial parsing a tiny bit faster).
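
The two detection rules from this commit message can be sketched as follows. This is a hedged illustration, not the code from the patch: plain objects and arrays stand in for PDF dictionaries, a `Uint8Array` stands in for a stream, and the names `detectFormInfo` and `hasOnlyDocumentSignatures` are chosen for this example. In particular, treating a "document signature" as a signature field (`FT` of `Sig`) with an all-zero `Rect`, and only running that check when the first `SigFlags` bit is set, are assumptions of this sketch informed by the notes in the table in the description.

```javascript
const MAX_DEPTH = 10; // guard against cyclic or very deep `Kids` trees

// Assumption: a document signature is an invisible signature field,
// i.e. `FT` is "Sig" and `Rect` consists only of zeros.
function hasOnlyDocumentSignatures(fields, depth = 0) {
  return fields.every(field => {
    if (Array.isArray(field.Kids) && field.Kids.length > 0) {
      if (depth + 1 > MAX_DEPTH) {
        return false; // give up and treat the tree as regular form fields
      }
      return hasOnlyDocumentSignatures(field.Kids, depth + 1);
    }
    const isSignature = field.FT === "Sig";
    const isInvisible =
      Array.isArray(field.Rect) && field.Rect.every(value => value === 0);
    return isSignature && isInvisible;
  });
}

function detectFormInfo(acroForm) {
  const info = { hasAcroForm: false, hasXfa: false };
  if (!acroForm) {
    return info;
  }

  // Rule 1: XFA data exists if `XFA` is a non-empty array or stream.
  const xfa = acroForm.XFA;
  info.hasXfa =
    (Array.isArray(xfa) && xfa.length > 0) ||
    (xfa instanceof Uint8Array && xfa.length > 0);

  // Rule 2: AcroForm data exists if `Fields` is a non-empty array that
  // does not consist of only document signatures.
  const fields = acroForm.Fields;
  const hasFields = Array.isArray(fields) && fields.length > 0;
  const sigFlagSet =
    Number.isInteger(acroForm.SigFlags) && (acroForm.SigFlags & 0x1) !== 0;
  info.hasAcroForm =
    hasFields && !(sigFlagSet && hasOnlyDocumentSignatures(fields));
  return info;
}

// XFA data plus a nested document signature only: reported as XFA-only.
console.log(
  detectFormInfo({
    SigFlags: 1,
    XFA: ["template", new Uint8Array([1])],
    Fields: [{ Kids: [{ FT: "Sig", Rect: [0, 0, 0, 0] }] }],
  })
);
```

Keeping the signature check in its own function makes the nested `Kids` case from the table easy to cover without complicating the two top-level rules.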

….js` Now that the `parse` method is simplified we can inline the `setup` method in the `parse` method since it's only two lines of code. This avoids some indirection.

timvandermeij · 2020-08-25T21:43:16Z

/botio unittest

pdfjsbot · 2020-08-25T21:43:18Z

From: Bot.io (Linux m4)

Received

Command cmd_unittest from @timvandermeij received. Current queue size: 0

Live output at: http://54.67.70.0:8877/65b5194c1712e67/output.txt

pdfjsbot · 2020-08-25T21:43:18Z

From: Bot.io (Windows)

Received

Command cmd_unittest from @timvandermeij received. Current queue size: 0

Live output at: http://54.215.176.217:8877/00b777703fc2b0f/output.txt

pdfjsbot · 2020-08-25T21:47:03Z

From: Bot.io (Linux m4)

Failed

Full output at http://54.67.70.0:8877/65b5194c1712e67/output.txt

Total script time: 3.75 mins

Unit Tests: FAILED

pdfjsbot · 2020-08-25T21:48:12Z

From: Bot.io (Windows)

Success

Full output at http://54.215.176.217:8877/00b777703fc2b0f/output.txt

Total script time: 4.89 mins

Unit Tests: Passed

timvandermeij · 2020-08-25T21:50:30Z

/botio makeref

pdfjsbot · 2020-08-25T21:50:31Z

From: Bot.io (Linux m4)

Received

Command cmd_makeref from @timvandermeij received. Current queue size: 0

Live output at: http://54.67.70.0:8877/95d5c41c999ce5e/output.txt

pdfjsbot · 2020-08-25T21:50:31Z

From: Bot.io (Windows)

Received

Command cmd_makeref from @timvandermeij received. Current queue size: 0

Live output at: http://54.215.176.217:8877/7537757be2b9fd0/output.txt

pdfjsbot · 2020-08-25T22:15:50Z

From: Bot.io (Linux m4)

Success

Full output at http://54.67.70.0:8877/95d5c41c999ce5e/output.txt

Total script time: 25.30 mins

Lint: Passed
Make references: Passed
Check references: Passed

pdfjsbot · 2020-08-25T22:17:50Z

From: Bot.io (Windows)

Success

Full output at http://54.215.176.217:8877/7537757be2b9fd0/output.txt

Total script time: 27.30 mins

Lint: Passed
Make references: Passed
Check references: Passed

timvandermeij added core viewer form-acroform form-xfa labels Aug 23, 2020

timvandermeij force-pushed the acroform-type-detection branch 4 times, most recently from 8ed7bbe to 9608c5c on August 23, 2020 21:15

timvandermeij mentioned this pull request Aug 23, 2020

only warn on xfa if an acroform is not also present #12254

Closed

Snuffleupagus requested changes Aug 24, 2020

Snuffleupagus reviewed Aug 24, 2020

timvandermeij force-pushed the acroform-type-detection branch 2 times, most recently from 6a1db6e to 0d6ab07 on August 24, 2020 22:08

Snuffleupagus approved these changes Aug 25, 2020

Snuffleupagus reviewed Aug 25, 2020

timvandermeij added 6 commits August 25, 2020 23:28

Inline the setup method in the parse method in `src/core/document…

0f229d5

….js` Now that the `parse` method is simplified we can inline the `setup` method in the `parse` method since it's only two lines of code. This avoids some indirection.

timvandermeij force-pushed the acroform-type-detection branch from 0d6ab07 to 0f229d5 on August 25, 2020 21:41

timvandermeij merged commit 4ffdbe6 into mozilla:master Aug 25, 2020

timvandermeij deleted the acroform-type-detection branch August 25, 2020 22:18

Snuffleupagus mentioned this pull request Aug 31, 2020

PDF - yellow fallback bar doesn't appear below URL bar on some files/links #12303

Closed

Snuffleupagus mentioned this pull request Oct 16, 2020

Don't store complex data in PDFDocument.formInfo, and replace the fields object with a hasFields boolean instead #12483

Merged
