Add docs on AcroForms with recursive field structure #499

Open
pietermarsman opened this issue Sep 12, 2020 · 3 comments
Labels
status: accepted type: documentation Related to the documentation

Comments

@pietermarsman
Member

pietermarsman commented Sep 12, 2020

Split off from #497, where @typhoon71 wrote:

Now that the AcroForm howto is merged, I suppose I can consider some other cases.
I waited until the PR got merged because it was becoming difficult for me to stay focused.

AcroForms with a recursive field structure. I did prepare the code to process those, but I'd like to talk a bit about how to format the output, which in some cases can be... funny. Would it be better to discuss it here, or to prepare a howto + PR and then discuss it there?

@typhoon71
Contributor

I'll work on this after #497 is done.

@typhoon71
Contributor

It seems I won't be able to work on this in the foreseeable future, stuff happened.
One thing I could do is post the code I prepared, which was working (but is not commented).

@typhoon71
Contributor

Finally found some time!

As promised, I'm posting the code that I planned to use for the docs on recursive AcroForms.

As is, it extracts interactive recursive forms into nested lists of (name, value) tuples and pretty-prints the result.

from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdftypes import resolve1
from pdfminer.psparser import PSLiteral, PSKeyword
from pdfminer.utils import decode_text

import pprint  # only to pretty print output, not needed to decode forms


def decode_value(value):
    if isinstance(value, (PSLiteral, PSKeyword)):
        value = value.name
    if isinstance(value, bytes):
        value = decode_text(value)
    return value


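def get_field(field):
    # Partial field name (/T); it can be missing, which gives a None name.
    name = field.get('T')
    if name is not None:
        name = decode_text(name)

    # A field with /Kids has children: recurse into each of them.
    # /Kids may itself be an indirect reference, so resolve it first.
    forms = resolve1(field.get('Kids'))
    if forms:
        values = [get_field(resolve1(form)) for form in forms]
        return name, values

    # No children: read the field value (/V), resolving indirect references.
    values = resolve1(field.get('V'))

    if isinstance(values, list):
        values = [decode_value(v) for v in values]
    else:
        values = decode_value(values)

    return name, values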


fp = r'FR_Y-1520151231_f.pdf'  # https://www.ffiec.gov/npw/FinancialReport/FRY15Reports

# Keep the file open while reading: pdfminer resolves objects lazily.
with open(fp, 'rb') as pdf_file:
    parser = PDFParser(pdf_file)
    doc = PDFDocument(parser)
    res = resolve1(doc.catalog)

    if 'AcroForm' in res:
        fields = resolve1(resolve1(res['AcroForm'])['Fields'])
        data = [get_field(field=resolve1(f)) for f in fields]

        pp = pprint.PrettyPrinter(indent=1, width=160)
        pp.pprint(data)

The output still needs to be converted to a dict {key: value}. There are some special cases to consider before converting the nested pairs, because the forms can contain some "funny" entries, e.g.: [(None, None), (None, None), (None, None)]
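A minimal sketch of that conversion, in case it helps whoever picks this up. It is not part of the snippet above: the helper names (is_pair_list, pairs_to_dict, flatten) and the sample data are hypothetical, and the handling of the special cases is only one possible choice. It assumes that entries with neither a name nor a value can be dropped, and that joining partial names with a period is an acceptable way to build a flat key.

```python
def is_pair_list(value):
    """Return True for a non-empty list made only of 2-tuples (child fields)."""
    return (
        isinstance(value, list)
        and len(value) > 0
        and all(isinstance(item, tuple) and len(item) == 2 for item in value)
    )


def pairs_to_dict(pairs):
    """Convert nested lists of (name, value) tuples into nested dicts."""
    result = {}
    for index, (name, value) in enumerate(pairs):
        if name is None and value is None:
            # Assumption: an entry with no name and no value carries no data.
            continue
        if name is None:
            # Assumption: keep unnamed values under a placeholder key.
            name = "<unnamed {}>".format(index)
        if is_pair_list(value):
            value = pairs_to_dict(value)
        # Note: a repeated name among siblings overwrites the earlier entry.
        result[name] = value
    return result


def flatten(nested, prefix=""):
    """Flatten nested dicts into one dict keyed by period-joined names."""
    flat = {}
    for name, value in nested.items():
        full_name = prefix + "." + name if prefix else name
        if isinstance(value, dict):
            flat.update(flatten(value, full_name))
        else:
            flat[full_name] = value
    return flat


# Hypothetical data in the same shape as the output of get_field().
sample = [
    ("name", "Alice"),
    ("address", [("city", "Rome"), ("zip", None)]),
    ("choice", [(None, None), (None, None)]),
    ("colors", ["red", "blue"]),
]

nested = pairs_to_dict(sample)
flat = flatten(nested)
```

With this choice, a parent whose children are all (None, None) becomes an empty dict in the nested form and disappears from the flat form, so it may be worth reading the parent's own value in that case instead.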

This snippet works, but it's not commented (adding comments was the reason I started working on it). I'm pretty sure it's easy enough to complete for anyone willing, since it's not much different from my previous work.

And it decodes the pdf samples fine (as you can see this sample uses one of them).

Hope it can be of help.
