Unable to flatten PDF if widget doesn't have P reference #722

chislin · 2020-12-28T12:44:28Z

When running form.flatten() on some PDFs, like from this this example, I get this error:

 Error: Failed to find page undefined for element favorite.superhero
    at _loop_1 (/node_modules/pdf-lib/cjs/api/form/PDFForm.js:463:27)
    at PDFForm.flatten (/node_modules/pdf-lib/cjs/api/form/PDFForm.js:492:17)

Here is an example of PDF that will produce this error and where form isn't on 1st page. (Originally posted here)

The text was updated successfully, but these errors were encountered:

btecu · 2020-12-28T15:00:50Z

I've created a draft pull request with a possible solution. The approach was to find the widget reference and then find the page that has that widget reference. Ideally, I would have preferred to update P() to do all this work, so that we don't have to call another method for fallback, however, the document is required to get the pages and I'm not aware if there's a way to access the document without passing it in. Perhaps @Hopding has other ideas or opinions about the apis?

However, when testing the example document, I am getting a different error, possibly unrelated, something about AcroField. @Hopding / @chislin do you have some time to take a look?

chislin · 2020-12-29T15:02:43Z

I've created a draft pull request with a possible solution. The approach was to find the widget reference and then find the page that has that widget reference. Ideally, I would have preferred to update P() to do all this work, so that we don't have to call another method for fallback, however, the document is required to get the pages and I'm not aware if there's a way to access the document without passing it in. Perhaps @Hopding has other ideas or opinions about the apis?

However, when testing the example document, I am getting a different error, possibly unrelated, something about AcroField. @Hopding / @chislin do you have some time to take a look?

Looks good to me, hope @Hopding can take a look at this. Thank you!

Nitesh-BB · 2020-12-31T03:50:45Z

Is there any way I can Pass P reference for flattening ?

Hopding · 2021-01-02T18:18:40Z

Dug into the error @btecu ran into on his branch and found the following:

This code:

const formPdfBytes = fs.readFileSync('/Users/user/Desktop/form_on_3rd.pdf');
const pdfDoc = await PDFDocument.load(formPdfBytes);
const form = pdfDoc.getForm();
form.flatten();
form.acroForm.getFields();

...produces the following error:

Error: Expected instance of PDFName, but got instance of undefined
  at new UnexpectedObjectTypeError (/Users/user/github/pdf-lib/scratchpad/build/src/core/errors.js:38:24)
  at PDFContext.lookup (/Users/user/github/pdf-lib/scratchpad/build/src/core/PDFContext.js:93:15)
  at createPDFAcroTerminal (/Users/user/github/pdf-lib/scratchpad/build/src/core/acroform/utils.js:69:29)
  at Object.exports.createPDFAcroField (/Users/user/github/pdf-lib/scratchpad/build/src/core/acroform/utils.js:38:12)
  at PDFAcroForm.getFields (/Users/user/github/pdf-lib/scratchpad/build/src/core/acroform/PDFAcroForm.js:25:36)
  at /Users/user/github/pdf-lib/scratchpad/build/scratchpad/index.js:18:31
  at step (/Users/user/github/pdf-lib/node_modules/tslib/tslib.js:139:27)
  at Object.next (/Users/user/github/pdf-lib/node_modules/tslib/tslib.js:120:57)
  at fulfilled (/Users/user/github/pdf-lib/node_modules/tslib/tslib.js:110:62)

An error is thrown whenever form.acroForm.getFields() is called after form.flatten() is invoked.

The cause of the error seems to be twofold:

The PDFForm.flatten() method attempts to remove fields from the form after they have been flattened:

pdf-lib/src/api/form/PDFForm.ts

Line 594 in 2b4ee6b

this.acroForm.removeField(field.ref);

This is good. The problem is that it only attempts to remove the field from the root-level Fields array. However, the Fields array is actually just the root of a tree structure, made up of terminal and non-terminal nodes. The terminal nodes are the actual fields and are always children of non-terminal nodes. Non-terminal nodes can themselves contain more non-terminals (thereby increasing the depth of the tree) or terminals.

So removing fields from the root-level non-terminal (the Fields entry of PDFAcroForm.dict) is sufficient only for simple cases where the form does not contain a tree structure. More complicated forms (like the example in the README) that contain tree structures will produce this error when attempting to flatten them. Note that a form must contain a tree structure if there is at least one field whose fully qualified name contains a period (e.g. favorite.superhero).

The solution to this issue is to use the PDFAcroField.getParent() method to find the flattened field's non-terminal parent node. Then just remove the field from its parent's Kids array. If I recall correctly, fields will return undefined from getParent if they are direct descendents of the form's Fields array. See the following for an example of how to account for this:

pdf-lib/src/api/form/PDFForm.ts

Line 789 in 2b4ee6b

'Kids' in entries ? entries.Kids : entries.Fields,

Once the above problem is addressed, a second issue will be introduced. The following logic is executed to convert raw PDFDict instances into the appropriate non-terminal/terminal wrapper class:

pdf-lib/src/core/acroform/utils.ts

Lines 44 to 46 in 2b4ee6b

    
           const isNonTerminal = isNonTerminalAcroField(dict); 
        
           if (isNonTerminal) return PDFAcroNonTerminal.fromDict(dict, ref); 
        
           return createPDFAcroTerminal(dict, ref);

It uses the following logic to determine if a node is terminal or non-terminal (not as straightforward as you might hope):

pdf-lib/src/core/acroform/utils.ts

Lines 51 to 65 in 2b4ee6b

    
           // According to the PDF spec: 
        
           // 
        
           //   > A field's children in the hierarchy may also include widget annotations 
        
           //   > that define its appearance on the page. A field that has children that 
        
           //   > are fields is called a non-terminal field. A field that does not have 
        
           //   > children that are fields is called a terminal field. 
        
           // 
        
           // The spec is not entirely clear about how to determine whether a given 
        
           // dictionary represents an acrofield or a widget annotation. So we will assume 
        
           // that a dictionary is an acrofield if it is a member of the `/Kids` array 
        
           // and it contains a `/T` entry (widgets do not have `/T` entries). This isn't 
        
           // a bullet proof solution, because the `/T` entry is technically defined as 
        
           // optional for acrofields by the PDF spec. But in practice all acrofields seem 
        
           // to have a `/T` entry defined. 
        
           const isNonTerminalAcroField = (dict: PDFDict): boolean => {

What will happen is that non-terminal nodes will have all of their fields removed from their Kids array. This will then result in the above logic classifying them as terminal nodes. So, in essence, removing all kids from a non-terminal results in it becoming a terminal. This is unfortunate, but I don't think there's really a way to avoid it. The reason this is an actual problem is that the former non-terminal field will be passed to

pdf-lib/src/core/acroform/utils.ts

Lines 79 to 81 in 2b4ee6b

    
           const createPDFAcroTerminal = (dict: PDFDict, ref: PDFRef): PDFAcroTerminal => { 
        
             const ftNameOrRef = getInheritableAttribute(dict, PDFName.of('FT')); 
        
             const type = dict.context.lookup(ftNameOrRef, PDFName);

to create an appropriate wrapper. However, because it used to be a non-terminal, it will not have a FT property. So the line that attempts to look this up will throw an error.

I see two ways to solve this problem: (1) Don't assert that the result of an FT lookup will be a PDFName, thereby allowing former non-terminals to fallthrough to the generic PDFAcroTerminal return. Or (2) when the final entry in a non-terminal's Kids array is deleted, remove the non-terminal itself from the tree, thereby avoiding the situation entirely. It would be best to do both of these things.

btecu · 2021-01-03T14:33:27Z

@Hopding thanks for the detailed explanation. The pull request should be ready for review now.

btecu mentioned this issue Dec 28, 2020

Fallback to finding the page using the reference #723

Merged

chislin closed this as completed Jan 14, 2021

renovate bot mentioned this issue Jun 6, 2021

fix(deps): update dependency pdf-lib to v1.17.1 konnectors/attestation#18

Merged

1 task

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Unable to flatten PDF if widget doesn't have P reference #722

Unable to flatten PDF if widget doesn't have P reference #722

chislin commented Dec 28, 2020

btecu commented Dec 28, 2020

chislin commented Dec 29, 2020 •

edited

Loading

Nitesh-BB commented Dec 31, 2020

Hopding commented Jan 2, 2021

btecu commented Jan 3, 2021

Unable to flatten PDF if widget doesn't have P reference #722

Unable to flatten PDF if widget doesn't have P reference #722

Comments

chislin commented Dec 28, 2020

btecu commented Dec 28, 2020

chislin commented Dec 29, 2020 • edited Loading

Nitesh-BB commented Dec 31, 2020

Hopding commented Jan 2, 2021

btecu commented Jan 3, 2021

chislin commented Dec 29, 2020 •

edited

Loading