Add docs on AcroForms with recursive field structure #499
Comments
I'll work on this after #497 is done.
It seems I won't be able to work on this in the foreseeable future; stuff happened.
Finally found some time! As promised, I'm posting the code that I planned to use for the docs on recursive AcroForms. As is, it extracts interactive recursive forms into nested lists of (name, value) tuples and pretty-prints the result.

    from pdfminer.pdfparser import PDFParser
    from pdfminer.pdfdocument import PDFDocument
    from pdfminer.pdftypes import resolve1
    from pdfminer.psparser import PSLiteral, PSKeyword
    from pdfminer.utils import decode_text
    import pprint  # only to pretty print output, not needed to decode forms


    def decode_value(value):
        if isinstance(value, (PSLiteral, PSKeyword)):
            value = value.name
        if isinstance(value, bytes):
            value = decode_text(value)
        return value


    def get_field(field):
        name = field.get('T')
        if name is not None:
            name = decode_text(name)
        forms = resolve1(field.get('Kids'))
        if forms:
            values = [get_field(resolve1(form)) for form in forms]
            return name, values
        values = field.get('V')
        values = resolve1(values)
        if isinstance(values, list):
            values = [decode_value(v) for v in values]
        else:
            values = decode_value(values)
        return name, values


    fp = r'FR_Y-1520151231_f.pdf'  # https://www.ffiec.gov/npw/FinancialReport/FRY15Reports
    parser = PDFParser(open(fp, 'rb'))
    doc = PDFDocument(parser)
    res = resolve1(doc.catalog)
    if 'AcroForm' in res:
        data = [get_field(field=resolve1(f)) for f in resolve1(res['AcroForm'])['Fields']]
        print()
        pp = pprint.PrettyPrinter(indent=1, width=160)
        pp.pprint(data)

The output still needs to be converted to a dict {key: value}, as there are some special cases to consider before converting the nested pairs; the forms can have some "funny" cases, i.e.:

This snippet works, but it's not commented (commenting it was the reason I started working on it). I'm pretty sure it's easy enough to complete for anyone willing, as it's not much different from my previous work. It also decodes the PDF samples fine (as you can see, this example uses one of them). Hope it can be of help.
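One way the remaining conversion to a dict could look is sketched below. It is not part of the posted code: `flatten_fields` is a hypothetical helper, and joining parent and child names with a dot is an assumption, as is the handling of the two awkward cases it covers (an entry with no name is filed under its parent's name, and when a name repeats, the first non-empty value wins). It only needs the nested lists of (name, value) tuples described above, not pdfminer itself.

```python
def is_kids_list(value):
    """True when value looks like a list of (name, value) pairs, i.e. child fields."""
    return (
        isinstance(value, list)
        and bool(value)
        and all(isinstance(v, tuple) and len(v) == 2 for v in value)
    )


def flatten_fields(pairs, prefix=""):
    """Flatten nested (name, value) pairs into {dotted_name: value} (hypothetical helper)."""
    flat = {}
    for name, value in pairs:
        # Assumption: an entry without a name is filed under its parent's name.
        if name is None:
            full_name = prefix
        elif prefix:
            full_name = f"{prefix}.{name}"
        else:
            full_name = name

        if is_kids_list(value):
            # A list of pairs means this entry has children: recurse into them.
            children = flatten_fields(value, full_name)
        else:
            # Anything else (string, None, list of plain values) is a leaf value.
            children = {full_name: value}

        for key, val in children.items():
            # Assumption: on a repeated name, keep the first value that is not None.
            if key not in flat or flat[key] is None:
                flat[key] = val
    return flat


# Small hand-made sample in the same shape as the extracted data.
sample = [
    ("form", [
        ("page1", [("name", "Alice"), ("choices", ["a", "b"])]),
        ("radio", [(None, None), (None, "Yes")]),
    ]),
    ("top", None),
]
flat_sample = flatten_fields(sample)
```

Whether a dotted key or some other separator is the right choice depends on the field names in the PDF, since a name could in principle contain the separator itself.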
Split off from #497 by @typhoon71
Now that the AcroForm howto is merged, I suppose I can consider some other cases.
I waited until the PR was merged because it was becoming difficult for me to stay focused.
AcroForms with a recursive field structure: I did prepare the code to process those, but I'd like to talk a bit about how to format the output, which in some cases can be... funny. Would it be better to discuss it here, or to prepare a howto + PR and then discuss it there?
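As a starting point for that discussion, one candidate output format is a nested dict that mirrors the field hierarchy. The sketch below is only an illustration: `to_nested_dict` is a hypothetical name, and it assumes the input is nested lists of (name, value) tuples, with a list of such tuples standing for child fields. It deliberately does nothing clever about entries that share a name or have no name at all; later entries simply overwrite earlier ones, which is exactly the kind of case a final format would have to decide on.

```python
def to_nested_dict(pairs):
    """Turn nested (name, value) pairs into nested dicts (hypothetical helper)."""
    nested = {}
    for name, value in pairs:
        has_children = (
            isinstance(value, list)
            and bool(value)
            and all(isinstance(v, tuple) and len(v) == 2 for v in value)
        )
        if has_children:
            # Child fields become a dict of their own, keeping the hierarchy visible.
            value = to_nested_dict(value)
        # Note: a repeated (or missing) name overwrites the earlier entry here.
        nested[name] = value
    return nested


# Hand-made sample, not taken from a real PDF.
nested_sample = to_nested_dict([
    ("form", [("name", "Alice"), ("address", [("city", "Rome")])]),
    ("agree", "Yes"),
])
```

Compared with a flat dict, this keeps parent and child fields grouped, at the cost of callers having to walk the levels to reach a value.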