Skip to content

Latest commit

 

History

History
121 lines (72 loc) · 4.71 KB

README.md

File metadata and controls

121 lines (72 loc) · 4.71 KB

📄 Document Parser for Receipts and Invoices

This code repository is for a document parser app that can read data from PDF files of receipts or invoices and extract specific details from them, e.g., invoice date, invoice amount.

Show more details

By default, it will extract the following items (if available):

  • DATE: The date when the invoice was issued,
  • ITEM: The purchased item listed in the invoice,
  • AMOUNT: The invoice amount, and
  • VENDOR: The name of the company that issued the invoice.

You can easily change these parameters either directly on the web app (before uploading documents) or in the source code.

Instructions to Launch the App 🚀

Show instructions

Once you make a copy of this codebase on your computer, activate a Python virtual environment using the following command:

python -m venv .venv --prompt doc-parser

Once the Python virtual environment is created, activate it and install all dependencies from requirements.txt.

source .venv/bin/activate

pip install -r requirements.txt

Once all dependencies are installed, you can launch the app using the following command:

streamlit run src/app.py

In a few seconds the app will be lanuched in your browser. If that doesn't happen automatically, you can copy the URL that's printed in the output.

Secrets 🔑

Show config settings

This app makes a call to the OpenAI API. You will need to get the API key from OpenAI and store it locally in the .env file.

API Keys

How to Use the App 🤔

Show instructions

Once the app is launched in a browser, you will see the following list of default parameters:

Default Elements

These are the elements that the app will try to extract from the uploaded documents. You can change these elements if you would like anything different, e.g. invoice number.

You can then upload PDF documents by either clicking on the Browse files button or by draggin and dropping files directly. Please be aware of the size limitation.

Upload Documents

Once the files are uploaded, you will get results in a few minutes. Here's a sample result from three receipts:

Sample Results

You can download the results as CSV file by clicking on the Click to Download button.

How It Works ⚙️

Show details

Each uploaded PDF document first gets converted into an image (by using pypdfium2). This is because it's easier to extract text from images rather than from PDF documents.

Then from these images, each line of raw (and messy!) text is extracted (by using pytesseract).

This raw text is then sent to GPT-3.5 via the OpenAI API with the following prompt:

Prompt Template

Where content is all the extracted text and data_elements are the default parameters discussed above.

The GPT-3.5 model parses through the text and extracts the requested data elements (as long as they are available). The JSON results are then converted into a pandas dataframe and displayed on the app UI.

Please note that the app uses gpt-3.5-turbo-0613 from OpenAI.

Potential Improvements 💡

Show details

Of course, this app is far from perfect. Here are some improvements that can enhance the functionality/utility of this app:

  1. Format all dates and dollar amounts so that they are consistent.
  2. Enable the user to make changes to the results that are displayed on the UI before exporting. Currently, the user can make changes to the results but they are not persisted to the exported dataset.
  3. Include some error handling. Currently, there are no proper safeguards against invalid files or when the requested elements are not found in the uploaded files.

Sources 🔎

Show credits

LLM Chain Documentation

pypdfium2 Introduction

pytesseracts PyPI page

And finally, my hearfelt thanks to this wonderful video tutorial by AIJason.