This code repository is for a document parser app that can read data from PDF files of receipts or invoices and extract specific details from them, e.g., invoice date, invoice amount.
By default, it will extract the following items (if available):
- DATE: The date when the invoice was issued,
- ITEM: The purchased item listed in the invoice,
- AMOUNT: The invoice amount, and
- VENDOR: The name of the company that issued the invoice.
You can easily change these parameters either directly on the web app (before uploading documents) or in the source code.
Once you have a copy of this codebase on your computer, create a Python virtual environment using the following command:
python -m venv .venv --prompt doc-parser
Once the Python virtual environment is created, activate it and install all dependencies from requirements.txt:
source .venv/bin/activate
pip install -r requirements.txt
Once all dependencies are installed, you can launch the app using the following command:
streamlit run src/app.py
In a few seconds, the app will launch in your browser. If that doesn't happen automatically, copy the URL printed in the terminal output and open it in your browser.
This app calls the OpenAI API. You will need to get an API key from OpenAI and store it locally in the .env file.
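As a rough sketch of how the key could be read at startup, the snippet below parses a .env file using only the standard library. The helper name load_env and the variable name OPENAI_API_KEY are assumptions made for illustration; the app itself may load the key differently (for example, through a dotenv package).

```python
import os
from pathlib import Path


def load_env(path=".env"):
    """Read KEY=VALUE pairs from a .env file into os.environ (hypothetical helper)."""
    env_file = Path(path)
    if not env_file.exists():
        return {}
    loaded = {}
    for raw_line in env_file.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and anything that is not KEY=VALUE.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key = key.strip()
        # Drop optional surrounding quotes around the value.
        value = value.strip().strip("'\"")
        loaded[key] = value
        # Values already set in the real environment take precedence.
        os.environ.setdefault(key, value)
    return loaded
```

With a .env file containing a line such as OPENAI_API_KEY=your-key-here (assumed variable name), calling load_env() before the first API request makes the key available through os.environ.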
Once the app is launched in a browser, you will see the following list of default parameters:
These are the elements that the app will try to extract from the uploaded documents. You can change these elements if you would like anything different, e.g. invoice number.
You can then upload PDF documents by either clicking on the Browse files button or by dragging and dropping files directly. Please be aware of the upload size limit.
Once the files are uploaded, you will get results in a few minutes. Here's a sample result from three receipts:
You can download the results as a CSV file by clicking on the Click to Download button.
Each uploaded PDF document is first converted into an image (using pypdfium2). This is because it's easier to extract text from images than from PDF documents.
Then, each line of raw (and messy!) text is extracted from these images (using pytesseract).
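The two steps above can be sketched roughly as follows. This is not the repository's actual code: the function names, the 300 DPI default, and the line-cleaning step are assumptions, and the rendering calls assume pypdfium2 version 4 or later. The third-party imports sit inside the functions so that the pure-Python helper can be used without the OCR stack installed.

```python
def pdf_to_images(pdf_path, dpi=300):
    """Render every page of a PDF to a PIL image (assumes pypdfium2 >= 4)."""
    import pypdfium2 as pdfium  # third-party, imported lazily

    pdf = pdfium.PdfDocument(pdf_path)
    images = []
    for index in range(len(pdf)):
        page = pdf[index]
        # PDF units are 1/72 inch, so the render scale is dpi / 72.
        bitmap = page.render(scale=dpi / 72)
        images.append(bitmap.to_pil())
    return images


def clean_lines(raw_text):
    """Split raw OCR output into stripped, non-empty lines (hypothetical helper)."""
    return [line.strip() for line in raw_text.splitlines() if line.strip()]


def images_to_text(images):
    """Run OCR on each image and join the cleaned lines into one string."""
    import pytesseract  # third-party, also needs the Tesseract binary installed

    lines = []
    for image in images:
        lines.extend(clean_lines(pytesseract.image_to_string(image)))
    return "\n".join(lines)
```

A call such as images_to_text(pdf_to_images("receipt.pdf")) would then produce the raw text for one document, where receipt.pdf is a placeholder file name.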
This raw text is then sent to GPT-3.5 via the OpenAI API with the following prompt:
Here, content is all the extracted text and data_elements are the default parameters discussed above.
The GPT-3.5 model parses through the text and extracts the requested data elements (as long as they are available). The JSON results are then converted into a pandas dataframe and displayed on the app UI.
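The conversion from JSON to a dataframe could look something like the sketch below. It assumes the model replies with either a single JSON object or a list of objects keyed by the data element names; the actual response shape and the function names here are not confirmed by this repository. Elements the model could not find are filled with None so that every row has the same columns.

```python
import json

# Default data elements listed earlier in this README.
DEFAULT_ELEMENTS = ("DATE", "ITEM", "AMOUNT", "VENDOR")


def response_to_rows(response_text, data_elements=DEFAULT_ELEMENTS):
    """Parse a JSON reply into a list of dicts with one key per data element."""
    parsed = json.loads(response_text)
    # Accept a single object or a list of objects (both are assumed shapes).
    records = parsed if isinstance(parsed, list) else [parsed]
    rows = []
    for record in records:
        # Keep only the requested elements; missing ones become None.
        rows.append({name: record.get(name) for name in data_elements})
    return rows


def rows_to_dataframe(rows):
    """Build a pandas DataFrame from the parsed rows."""
    import pandas as pd  # third-party, imported lazily

    return pd.DataFrame(rows)
```

Filtering on data_elements also means that any extra keys the model adds to its reply are dropped before the results reach the UI.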
Please note that the app uses gpt-3.5-turbo-0613 from OpenAI.
Of course, this app is far from perfect. Here are some improvements that can enhance the functionality/utility of this app:
- Format all dates and dollar amounts so that they are consistent.
- Enable the user to make changes to the results that are displayed on the UI before exporting. Currently, the user can make changes to the results but they are not persisted to the exported dataset.
- Include some error handling. Currently, there are no proper safeguards against invalid files or when the requested elements are not found in the uploaded files.
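As a starting point for the first and third improvements, here is a minimal standard-library sketch. The list of accepted date formats is an assumption (it treats ambiguous dates such as 07/01/2023 as month-first), and all function names are hypothetical. Values that cannot be parsed come back as None instead of raising, which also covers elements that were not found in a document.

```python
from datetime import datetime
from decimal import Decimal, InvalidOperation

# Date layouts to try, in order. This list is an assumption, not app behavior.
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%m/%d/%y", "%B %d, %Y", "%b %d, %Y")


def normalize_date(value):
    """Return the date as YYYY-MM-DD, or None if no known format matches."""
    if value is None:
        return None
    text = str(value).strip()
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    return None


def normalize_amount(value):
    """Return a dollar amount as a two-decimal string, or None if unparseable."""
    if value is None:
        return None
    cleaned = str(value).replace("$", "").replace(",", "").strip()
    try:
        return str(Decimal(cleaned).quantize(Decimal("0.01")))
    except InvalidOperation:
        return None


def looks_like_pdf(data):
    """Cheap upload check: PDF files begin with the bytes %PDF-."""
    return data[:5] == b"%PDF-"
```

The looks_like_pdf check only inspects the file header, so it rejects obviously wrong uploads but does not guarantee that the file can be rendered.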
And finally, my heartfelt thanks to AIJason for this wonderful video tutorial.