Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unable to flatten PDF if widget doesn't have P reference #722

Closed
chislin opened this issue Dec 28, 2020 · 5 comments · Fixed by #723
Closed

Unable to flatten PDF if widget doesn't have P reference #722

chislin opened this issue Dec 28, 2020 · 5 comments · Fixed by #723

Comments

@chislin
Copy link

chislin commented Dec 28, 2020

When running form.flatten() on some PDFs, like from this this example, I get this error:

 Error: Failed to find page undefined for element favorite.superhero
    at _loop_1 (/node_modules/pdf-lib/cjs/api/form/PDFForm.js:463:27)
    at PDFForm.flatten (/node_modules/pdf-lib/cjs/api/form/PDFForm.js:492:17)

Here is an example of PDF that will produce this error and where form isn't on 1st page. (Originally posted here)

@btecu
Copy link
Contributor

btecu commented Dec 28, 2020

I've created a draft pull request with a possible solution. The approach was to find the widget reference and then find the page that has that widget reference. Ideally, I would have preferred to update P() to do all this work, so that we don't have to call another method for fallback, however, the document is required to get the pages and I'm not aware if there's a way to access the document without passing it in. Perhaps @Hopding has other ideas or opinions about the apis?

However, when testing the example document, I am getting a different error, possibly unrelated, something about AcroField. @Hopding / @chislin do you have some time to take a look?

@chislin
Copy link
Author

chislin commented Dec 29, 2020

I've created a draft pull request with a possible solution. The approach was to find the widget reference and then find the page that has that widget reference. Ideally, I would have preferred to update P() to do all this work, so that we don't have to call another method for fallback, however, the document is required to get the pages and I'm not aware if there's a way to access the document without passing it in. Perhaps @Hopding has other ideas or opinions about the apis?

However, when testing the example document, I am getting a different error, possibly unrelated, something about AcroField. @Hopding / @chislin do you have some time to take a look?

Looks good to me, hope @Hopding can take a look at this. Thank you!

@Nitesh-BB
Copy link

Is there any way I can Pass P reference for flattening ?

@Hopding
Copy link
Owner

Hopding commented Jan 2, 2021

Dug into the error @btecu ran into on his branch and found the following:

  • This code:
    const formPdfBytes = fs.readFileSync('/Users/user/Desktop/form_on_3rd.pdf');
    const pdfDoc = await PDFDocument.load(formPdfBytes);
    const form = pdfDoc.getForm();
    form.flatten();
    form.acroForm.getFields();
    ...produces the following error:
    Error: Expected instance of PDFName, but got instance of undefined
      at new UnexpectedObjectTypeError (/Users/user/github/pdf-lib/scratchpad/build/src/core/errors.js:38:24)
      at PDFContext.lookup (/Users/user/github/pdf-lib/scratchpad/build/src/core/PDFContext.js:93:15)
      at createPDFAcroTerminal (/Users/user/github/pdf-lib/scratchpad/build/src/core/acroform/utils.js:69:29)
      at Object.exports.createPDFAcroField (/Users/user/github/pdf-lib/scratchpad/build/src/core/acroform/utils.js:38:12)
      at PDFAcroForm.getFields (/Users/user/github/pdf-lib/scratchpad/build/src/core/acroform/PDFAcroForm.js:25:36)
      at /Users/user/github/pdf-lib/scratchpad/build/scratchpad/index.js:18:31
      at step (/Users/user/github/pdf-lib/node_modules/tslib/tslib.js:139:27)
      at Object.next (/Users/user/github/pdf-lib/node_modules/tslib/tslib.js:120:57)
      at fulfilled (/Users/user/github/pdf-lib/node_modules/tslib/tslib.js:110:62)
    
  • An error is thrown whenever form.acroForm.getFields() is called after form.flatten() is invoked.

The cause of the error seems to be twofold:

  • The PDFForm.flatten() method attempts to remove fields from the form after they have been flattened:

    this.acroForm.removeField(field.ref);

    This is good. The problem is that it only attempts to remove the field from the root-level Fields array. However, the Fields array is actually just the root of a tree structure, made up of terminal and non-terminal nodes. The terminal nodes are the actual fields and are always children of non-terminal nodes. Non-terminal nodes can themselves contain more non-terminals (thereby increasing the depth of the tree) or terminals.

    So removing fields from the root-level non-terminal (the Fields entry of PDFAcroForm.dict) is sufficient only for simple cases where the form does not contain a tree structure. More complicated forms (like the example in the README) that contain tree structures will produce this error when attempting to flatten them. Note that a form must contain a tree structure if there is at least one field whose fully qualified name contains a period (e.g. favorite.superhero).

    The solution to this issue is to use the PDFAcroField.getParent() method to find the flattened field's non-terminal parent node. Then just remove the field from its parent's Kids array. If I recall correctly, fields will return undefined from getParent if they are direct descendents of the form's Fields array. See the following for an example of how to account for this:

    'Kids' in entries ? entries.Kids : entries.Fields,

  • Once the above problem is addressed, a second issue will be introduced. The following logic is executed to convert raw PDFDict instances into the appropriate non-terminal/terminal wrapper class:

    const isNonTerminal = isNonTerminalAcroField(dict);
    if (isNonTerminal) return PDFAcroNonTerminal.fromDict(dict, ref);
    return createPDFAcroTerminal(dict, ref);

    It uses the following logic to determine if a node is terminal or non-terminal (not as straightforward as you might hope):

    // According to the PDF spec:
    //
    // > A field's children in the hierarchy may also include widget annotations
    // > that define its appearance on the page. A field that has children that
    // > are fields is called a non-terminal field. A field that does not have
    // > children that are fields is called a terminal field.
    //
    // The spec is not entirely clear about how to determine whether a given
    // dictionary represents an acrofield or a widget annotation. So we will assume
    // that a dictionary is an acrofield if it is a member of the `/Kids` array
    // and it contains a `/T` entry (widgets do not have `/T` entries). This isn't
    // a bullet proof solution, because the `/T` entry is technically defined as
    // optional for acrofields by the PDF spec. But in practice all acrofields seem
    // to have a `/T` entry defined.
    const isNonTerminalAcroField = (dict: PDFDict): boolean => {

    What will happen is that non-terminal nodes will have all of their fields removed from their Kids array. This will then result in the above logic classifying them as terminal nodes. So, in essence, removing all kids from a non-terminal results in it becoming a terminal. This is unfortunate, but I don't think there's really a way to avoid it. The reason this is an actual problem is that the former non-terminal field will be passed to

    const createPDFAcroTerminal = (dict: PDFDict, ref: PDFRef): PDFAcroTerminal => {
    const ftNameOrRef = getInheritableAttribute(dict, PDFName.of('FT'));
    const type = dict.context.lookup(ftNameOrRef, PDFName);
    to create an appropriate wrapper. However, because it used to be a non-terminal, it will not have a FT property. So the line that attempts to look this up will throw an error.

    I see two ways to solve this problem: (1) Don't assert that the result of an FT lookup will be a PDFName, thereby allowing former non-terminals to fallthrough to the generic PDFAcroTerminal return. Or (2) when the final entry in a non-terminal's Kids array is deleted, remove the non-terminal itself from the tree, thereby avoiding the situation entirely. It would be best to do both of these things.

@btecu
Copy link
Contributor

btecu commented Jan 3, 2021

@Hopding thanks for the detailed explanation. The pull request should be ready for review now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants