pdfresearch

Search for and extract text from PDF files, writing the output to CSV:

usage: pdfresearch.py [-h] [-c CODEC]
                      [--generate-usermodule GENERATE_USRMODULE] [-o OUTPUT]
                      [--password PASSWORD] [--pagenos F [F ...]]
                      [--maxpages MAXPAGES] [--extract-text]
                      [input] [user_module]
pdfresearch.py: error: positional arguments required [input] [user_module]

First, generate an example user module:

$ python pdfresearch.py --generate-usermodule mysearch.py

Then extract some text from the PDF with:

$ python pdfresearch.py <your pdf file>.pdf --extract-text --maxpages 4

Now edit mysearch.py and adapt it to your needs, using the text extracted in the previous step to write the regexes. You can test your regular expressions with an online tool such as http://pythex.org:

import re

from research import Search

# search is a list of Search objects
search = [
    Search('DEMOGRAFICO',  # this is the label of this search

           # List of regexes. Capture groups (the parts of a regular expression
           # enclosed in parentheses) determine which values are extracted;
           # you can test your patterns at http://pythex.org
           # This regex matches the text "Apreciable <name>:\n\n¡Bienvenido!" and captures the name of the subject
           [r'Apreciable\s(.*)\:\n\n\¡Bienvenido\!'],

           flags=re.MULTILINE,  # optional: flags to pass to re.search

           # optional: store a found value in a global store; any later Search instance
           # can retrieve it with self.context[key]. The key can also be a lambda expression returning the key, for example:
           # store_actions = { lambda grps : grps[0][1] : lambda grps: grps[0][0]}
           store_actions={'last_matched_name': lambda grps: grps[0][0]},

           # optional: specify how to build the output CSV row;
           # here the label, filename, page, the first captured value and an empty column are used
           output_map=lambda self: (
               self.label, self.context['file'], self.context['page'], self.groups[0][0], '')
           ),
    # another example
    Search('POLIZA',
           [r'NUMERO\sDE\sPOLIZA\n([A-Z0-9]+)\n', r'SEGURO\sDE\sHOSPITALIZACIÓN',
            r'\n\n(.*)\n\w{4}\d{6}(?:[\w\d]{3}|\n)'],
           flags=re.MULTILINE,
           output_map=lambda self: (
               self.label, self.context['file'], self.context['page'], self.groups[2][0], self.groups[0][0])
           ),
]
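
Before running the full search, it can help to check a pattern locally against a snippet of the extracted text. The sketch below uses only the standard `re` module and does not import `research`. The helper `collect_groups` and the sample text are hypothetical, and the "one tuple of groups per regex" layout is an assumption inferred from the `self.groups[i][j]` indexing used in `output_map` above, not a confirmed description of how `Search` works internally.

```python
import re


def collect_groups(patterns, text, flags=0):
    """Return one tuple of captured groups per pattern (empty tuple if no match).

    Hypothetical helper: it mimics the groups[i][j] indexing seen in output_map,
    where i selects the regex and j selects the capture group.
    """
    result = []
    for pattern in patterns:
        match = re.search(pattern, text, flags)
        result.append(match.groups() if match else ())
    return result


# Made-up sample standing in for text produced by --extract-text
sample = "Apreciable JUAN PEREZ:\n\n¡Bienvenido!\n"

# Same pattern as in the DEMOGRAFICO search
patterns = [r'Apreciable\s(.*)\:\n\n\¡Bienvenido\!']

groups = collect_groups(patterns, sample, flags=re.MULTILINE)
print(groups[0][0])  # first capture group of the first regex
```

If a pattern does not match, its entry is an empty tuple, which makes it easy to see which regex needs adjusting before it goes into mysearch.py.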

Then run the search:

$ python pdfresearch.py <your pdf file>.pdf mysearch

(Note: pass user_module without the .py extension, e.g. mysearch rather than mysearch.py.)

Example output:

"P1","my.pdf","0","XXXXX0000543308","9920"
"P2","my.pdf","1","XXXXX0000543308",""
"P3","my.pdf","2","XXXXX0000543308",""
"P4","my.pdf","3","XXXXX0000543308",""
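
The rows are ordinary quoted CSV, so they can be post-processed with Python's standard `csv` module. This is a minimal sketch: the column names in `FIELDS` are hypothetical, chosen to mirror the tuple built by `output_map`, and the sample rows are copied from the example output above.

```python
import csv
import io

# Two rows copied from the example output
sample_output = '''"P1","my.pdf","0","XXXXX0000543308","9920"
"P2","my.pdf","1","XXXXX0000543308",""
'''

# Hypothetical column names; the tool itself writes no header row in the example
FIELDS = ["label", "file", "page", "value_1", "value_2"]


def read_rows(stream):
    """Parse headerless CSV rows into dictionaries keyed by FIELDS."""
    return list(csv.DictReader(stream, fieldnames=FIELDS))


rows = read_rows(io.StringIO(sample_output))
for row in rows:
    print(row["label"], row["page"], row["value_1"])
```

To read a real result file, pass an open file object instead of the `io.StringIO` wrapper, for example `open("out.csv", newline="")`, where `out.csv` is a placeholder name.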
