In theory, running WindowsSetup.bat sets everything up on Windows.
On Linux or Mac, read WindowsSetup.bat as a list of the dependencies that need to be installed.
This program has only been run successfully on Linux. This is due to the pdf2image dependency not working on Windows.
Call python3 extractText.py
Once the GUI is up, select the template file, the output CSV file (which will be overwritten), and the PDF file(s) to be scanned, then select run.
The template file consists of |-separated fields, where:
The first field is the column header.
After that there are three options:
- There are no other fields if nothing is to be inserted for a specific column.
- There can be a second field with arbitrary text if the column is always to be filled with the same text.
- There can be four additional fields that specify the bounding box from which the code extracts text from the PDFs:
  - The first additional field is the top-left x coordinate of the bounding box.
  - The second additional field is the top-left y coordinate of the bounding box.
  - The third additional field is the bottom-right x coordinate of the bounding box.
  - The fourth additional field is the bottom-right y coordinate of the bounding box.
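
The three cases above can be sketched as a small parser. This is only an illustration, not the code in extractText.py: the names `TemplateColumn` and `parse_template_line` are hypothetical, and it assumes one column per line, numeric coordinates, and that whitespace around each field can be stripped.

```python
from typing import NamedTuple, Optional, Tuple


class TemplateColumn(NamedTuple):
    """One column of the output CSV (hypothetical structure)."""
    header: str
    constant: Optional[str] = None  # fixed text for every row, if any
    box: Optional[Tuple[float, float, float, float]] = None  # (x1, y1, x2, y2)


def parse_template_line(line: str) -> TemplateColumn:
    """Parse one |-separated template line into a TemplateColumn.

    Assumes one column per line and numeric coordinates.
    """
    fields = [field.strip() for field in line.rstrip("\n").split("|")]
    header, extra = fields[0], fields[1:]

    if not extra:
        # Header only: nothing is inserted for this column.
        return TemplateColumn(header)
    if len(extra) == 1:
        # Header plus one field: the same text in every row.
        return TemplateColumn(header, constant=extra[0])
    if len(extra) == 4:
        # Header plus four fields: top-left x, y and bottom-right x, y.
        x1, y1, x2, y2 = (float(value) for value in extra)
        return TemplateColumn(header, box=(x1, y1, x2, y2))
    raise ValueError(
        f"expected 0, 1 or 4 fields after the header, got {len(extra)}"
    )
```

Under these assumptions, a line such as `Invoice Number|10|20|200|40` would describe a column filled from the box with top-left corner (10, 20) and bottom-right corner (200, 40), `Source|scanned` a column that always contains `scanned`, and `Notes` a column that is left empty.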
Needs to be able to run on Windows (possibly by changing the pdf2image library to a different library?)
Needs to be able to create a template with a GUI
Needs to be able to detect and correct for skewing in the PDFs