Add docs on AcroForms with recursive field structure #499

pietermarsman · 2020-09-12T13:03:18Z

Now that the AcroForm howto is merged, I suppose I can consider some other cases.
I waited until the PR was merged because it was becoming difficult for me to stay focused.

The next case is AcroForms with a recursive field structure. I have prepared the code to process those, but I'd like to discuss how to format the output, which in some cases can be... funny. Would it be better to discuss it here, or to prepare a howto + PR and discuss it there?

typhoon71 · 2020-09-15T15:25:48Z

I'll work on this after #497 is done.

typhoon71 · 2021-01-23T09:24:04Z

It seems I won't be able to work on this in the foreseeable future; stuff happened.
One thing I could do is post the code I prepared, which works (but is not commented).

typhoon71 · 2021-03-03T17:33:02Z

Finally found some time!

As promised, I'm posting the code that I planned to use for the docs on recursive AcroForms.

As is, it extracts interactive recursive forms into nested lists of (name, value) tuples and pretty-prints the result.

from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdftypes import resolve1
from pdfminer.psparser import PSLiteral, PSKeyword
from pdfminer.utils import decode_text

import pprint  # only to pretty print output, not needed to decode forms


def decode_value(value):
    if isinstance(value, (PSLiteral, PSKeyword)):
        value = value.name
    if isinstance(value, bytes):
        value = decode_text(value)
    return value


def get_field(field):

    name = field.get('T')
    if name is not None:
        name = decode_text(name)

    forms = resolve1(field.get('Kids'))  # 'Kids' may be an indirect reference
    if forms:
        values = [get_field(resolve1(form)) for form in forms]
        return name, values

    values = field.get('V')
    values = resolve1(values)

    if isinstance(values, list):
        values = [decode_value(v) for v in values]
    else:
        values = decode_value(values)

    return name, values


fp = r'FR_Y-1520151231_f.pdf'  # https://www.ffiec.gov/npw/FinancialReport/FRY15Reports

# keep the file open while the fields are read, and close it afterwards
with open(fp, 'rb') as f:
    parser = PDFParser(f)
    doc = PDFDocument(parser)
    res = resolve1(doc.catalog)

    if 'AcroForm' in res:
        fields = resolve1(resolve1(res['AcroForm'])['Fields'])
        data = [get_field(field=resolve1(field)) for field in fields]

        print()

        pp = pprint.PrettyPrinter(indent=1, width=160)
        pp.pprint(data)

The output still needs to be converted to a dict {key: value}, but there are some special cases to consider before converting the nested pairs; the forms can contain some "funny" cases, e.g. [(None, None), (None, None), (None, None)]
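
One possible way to do that conversion is sketched below. It is only an illustration under stated assumptions: the helper name `pairs_to_dict`, the `<unnamed N>` placeholder key, and the policy of dropping entries that have neither a name nor a value are all hypothetical choices, not something pdfminer provides or that was agreed on in this thread. Duplicate names at the same level simply overwrite each other here.

```python
def pairs_to_dict(pairs):
    """Turn a nested list of (name, value) pairs into nested dicts.

    Assumed policy (illustrative only):
    - a value that is a non-empty list of 2-tuples is treated as child fields;
    - entries with no name and no content are dropped;
    - unnamed entries that do carry a value get a positional placeholder key.
    """
    result = {}
    for index, (name, value) in enumerate(pairs):
        # Child fields come back from get_field() as a list of (name, value) tuples;
        # a multi-valued 'V' is a list of plain decoded values instead.
        is_children = (
            isinstance(value, list)
            and len(value) > 0
            and all(isinstance(item, tuple) and len(item) == 2 for item in value)
        )
        if is_children:
            value = pairs_to_dict(value)

        # Skip the "funny" entries such as (None, None) or unnamed empty groups.
        if name is None and (value is None or value == {}):
            continue

        key = name if name is not None else f"<unnamed {index}>"
        result[key] = value
    return result


# Hand-written sample data in the same shape as the snippet's output.
sample = [
    ("form1", [("name", "Alice"), ("choices", ["a", "b"]), (None, None)]),
    (None, [(None, None), (None, None)]),
    (None, "orphan"),
]
converted = pairs_to_dict(sample)
print(converted)
```

Whether empty entries should be dropped or kept (for example to preserve field order) is exactly the kind of formatting decision that still needs discussion.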

This snippet works but is not commented (which was the reason I started working on it). I'm pretty sure it's easy enough to complete for anyone willing; it's not much different from my previous work.

It also decodes the PDF samples fine (as you can see, this example uses one of them).

Hope it can be of help.

pietermarsman added the type: documentation Related to the documentation label Sep 12, 2020

typhoon71 mentioned this issue Sep 15, 2020

DOCS: some more about interactive forms #497

Open

datatalking mentioned this issue Jul 20, 2022

Type Error during extracting pages in some pdfs #720

Closed

pietermarsman added the status: accepted label Aug 7, 2022
