This project was developed following the Codebasics Python course. If you're interested, you can find the course here.
Health insurance companies need to adhere to government regulations when processing claims. This involves handling images of patient details and prescriptions sent by hospitals or individual doctors to extract relevant information. Currently, many insurance companies outsource this task to firms like "Mr. X Data Analytics," where employees manually transcribe the information from scanned images.
Mr. X Data Analytics uses software that displays scanned images on one side and requires manual data entry on the other. This process is prone to errors and becomes unmanageable during high-volume periods, such as pandemics. Additionally, insurance companies demand that the extracted data be provided within 24 hours, prompting Mr. X Data Analytics to seek an automated solution.
To address these challenges, we developed a program to automatically extract data from images. While automation cannot entirely replace human oversight, this solution significantly reduces the manual effort required. A human will still review the extracted data to ensure accuracy before submission.
We use Python, the pytesseract library for OCR, and regular expressions for data processing to achieve this.
- Python
- Object-Oriented Programming (OOP)
- pdf2image module
- OpenCV
- pytesseract
- Regular Expressions
- pytest
- Postman
- FastAPI
We use the pdf2image library to convert PDF documents to images for processing.
Initial attempts to extract data from raw images were unsuccessful due to formatting issues, resulting in inaccurate data extraction.
Dr John Smith, M.D
2 Non-Important Street,
New York, Phone (000)-111-2222
Name: Maria Sharapova Date: 5/11/2022
Address: 9 tennis court, new Russia, DC
Prednisone 20 mg
Lialda 2.4 gram
3 days,
or 1 month
To improve accuracy, we preprocess the images using the OpenCV library. We first applied normal thresholding, which failed in areas with shadows or noise, leading to data loss.
We then adopted adaptive thresholding, which divides the image into sub-images and applies different threshold values to each, yielding much better results.
Dr John Smith, M.D
2 Non-Important Street,
New York, Phone (000)-111-2222
Name: Marta Sharapova Date: 5/11/2022
Address: 9 tennis court, new Russia, DC
Prednisone 20 mg
Lialda 2.4 gram
Directions:
Prednisone, Taper 5 mg every 3 days,
Finish in 2.5 weeks
Lialda - take 2 pill everyday for 1 month
For all these above trials, used Jupyter Notebooks and developed small bits of functionality, which can be used later while designing the class.
The code follows Object-Oriented Programming principles for extracting data from medical documents.
We used regular expressions to match and extract patterns from the medical documents. Patterns were tested and refined using,
We employed test-driven development using the pytest module, writing test cases for each method and verifying functionality during development.
The project server is hosted using FastAPI, which offers several advantages:
- Built-in data validation
- Automatically generated documentation
- High performance
Since this is a backend project, we used Postman to test the server's HTTP responses.
This backend functionality can be integrated into the Mr.X Analytics existing software and data can be extracted automatically. The extracted data may have some errors, the person who is performing the work has to correct it and submit the response
- Mr.X Analytics can save at least of 30 secs for each document. It is small amount of time when looking for one document, but cumulatively it can save a tremendous amount of time which can help the company to complete more documents within the given time and make more profit
- The company doesn't have to hire extra people in the season time. As it is a combination of automation and manual the error will be very much low.