At Heron, we’re using AI to automate document processing workflows in financial services and beyond. Each day, we handle over 100,000 documents that need to be quickly identified and categorised before we can kick off the automations.
This repository provides a basic endpoint for classifying files by their filenames. However, the current classifier has limitations when it comes to handling poorly named files, processing larger volumes, and adapting to new industries effectively.
Your task: improve this classifier by adding features and optimisations to handle (1) poorly named files, (2) scaling to new industries, and (3) processing larger volumes of documents.
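One possible starting point for poorly named files is fuzzy matching on filename tokens. The sketch below uses only the standard library's `difflib`. The label set, the `"unknown"` fallback, and the 0.75 threshold are assumptions made for illustration, not part of the existing classifier.

```python
import re
from difflib import SequenceMatcher

# Hypothetical label set, chosen for illustration only.
LABELS = ["drivers_licence", "bank_statement", "invoice"]


def tokenise(filename: str) -> list[str]:
    """Lowercase the name, drop the extension, and split on non-letters."""
    stem = filename.rsplit(".", 1)[0].lower()
    return [tok for tok in re.split(r"[^a-z]+", stem) if tok]


def label_score(label: str, tokens: list[str]) -> float:
    """Average, over the label's words, of the best fuzzy match among tokens."""
    words = label.split("_")
    best = [max(SequenceMatcher(None, w, t).ratio() for t in tokens) for w in words]
    return sum(best) / len(best)


def classify_filename(filename: str, threshold: float = 0.75) -> str:
    """Return the closest label, or "unknown" if nothing clears the threshold."""
    tokens = tokenise(filename)
    if not tokens:  # e.g. a purely numeric name such as "12345.pdf"
        return "unknown"
    score, label = max((label_score(lbl, tokens), lbl) for lbl in LABELS)
    return label if score >= threshold else "unknown"
```

With these assumptions, a misspelt name such as `bank_statment_2023_final.pdf` still resolves to `bank_statement`, while an uninformative name such as `scan0001.pdf` falls through to `"unknown"` and would need content-based classification.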
This is a real-world challenge that allows you to demonstrate your approach to building innovative and scalable AI solutions. We’re excited to see what you come up with! Feel free to take it in any direction you like, but we suggest considering the following questions:
- What limitations in the current classifier are stopping it from scaling?
- How might you extend the classifier with additional technologies, capabilities, or features?
- How can you ensure the classifier is robust and reliable in a production environment?
- How can you deploy the classifier to make it accessible to other services and users?
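On the robustness question, one option is to avoid trusting the file extension and inspect the leading bytes of the upload instead. The following is a minimal sketch under that assumption. The function names `sniff_file_type` and `refine_zip` are hypothetical, and the signature table covers only a handful of formats.

```python
import io
import zipfile

# Leading byte signatures for a few common formats.
SIGNATURES = [
    (b"%PDF-", "pdf"),
    (b"\x89PNG\r\n\x1a\n", "png"),
    (b"\xff\xd8\xff", "jpeg"),
    (b"PK\x03\x04", "zip"),  # also the container for .docx and .xlsx
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "ole2"),  # legacy .doc / .xls
]


def refine_zip(data: bytes) -> str:
    """Distinguish Word and Excel packages from plain zip archives."""
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as archive:
            names = set(archive.namelist())
    except zipfile.BadZipFile:
        return "zip"  # has the zip signature but is truncated or corrupt
    if "word/document.xml" in names:
        return "docx"
    if "xl/workbook.xml" in names:
        return "xlsx"
    return "zip"


def sniff_file_type(data: bytes) -> str:
    """Guess the file type from its content rather than its name."""
    for magic, kind in SIGNATURES:
        if data.startswith(magic):
            return refine_zip(data) if kind == "zip" else kind
    return "unknown"
```

A check like this could run before classification so that unsupported or mislabelled uploads are rejected early with a clear error.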
We encourage you to be creative! Feel free to use any libraries, tools, services, models or frameworks of your choice. Some possible ideas:
- Train a classifier to categorise files based on their text content
- Generate synthetic data to train the classifier on documents from different industries
- Detect file type and handle other file formats (e.g., Word, Excel)
- Set up a CI/CD pipeline for automatic testing and deployment
- Refactor the codebase to make it more maintainable and scalable
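To make the first two ideas concrete, here is a small sketch that generates synthetic documents from templates and trains a multinomial Naive Bayes classifier on them, using only the standard library. The templates, labels, and class name are invented for illustration. A real solution would more likely use an established ML library and far richer training data.

```python
import math
import random
import re
from collections import Counter, defaultdict

# Hypothetical templates, invented for illustration.
TEMPLATES = {
    "invoice": [
        "invoice number {n} amount due {n} payment terms net {n} days",
        "bill to customer total {n} vat {n} invoice date",
    ],
    "bank_statement": [
        "account number {n} opening balance {n} closing balance {n}",
        "statement period transactions debit {n} credit {n} balance",
    ],
    "drivers_licence": [
        "driving licence number {n} date of birth expiry date class b",
        "licence holder name address issuing authority valid until",
    ],
}


def tokenise(text: str) -> list[str]:
    """Lowercase and keep alphabetic words only."""
    return re.findall(r"[a-z]+", text.lower())


def make_synthetic(per_label: int, seed: int = 0) -> list[tuple[str, str]]:
    """Fill templates with random numbers to produce (text, label) pairs."""
    rng = random.Random(seed)
    samples = []
    for label, templates in TEMPLATES.items():
        for _ in range(per_label):
            text = rng.choice(templates).replace("{n}", str(rng.randint(1, 9999)))
            samples.append((text, label))
    return samples


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, samples):
        self.word_counts = defaultdict(Counter)
        self.doc_counts = Counter()
        for text, label in samples:
            self.doc_counts[label] += 1
            self.word_counts[label].update(tokenise(text))
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, text: str) -> str:
        total_docs = sum(self.doc_counts.values())
        scores = {}
        for label, n_docs in self.doc_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for word in tokenise(text):
                if word in self.vocab:  # words never seen in training are skipped
                    score += math.log((counts[word] + 1) / denom)
            scores[label] = score
        return max(scores, key=scores.get)
```

Usage would look like `NaiveBayes().fit(make_synthetic(20)).predict(extracted_text)`, where `extracted_text` is whatever text you manage to pull out of the uploaded file. Adding a new industry then amounts to adding templates (or real samples) for its document types and retraining.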
Your submission will be assessed against the following criteria:
- Functionality: Does the classifier work as expected?
- Scalability: Can the classifier scale to new industries and higher volumes?
- Maintainability: Is the codebase well-structured and easy to maintain?
- Creativity: Are there any innovative or creative solutions to the problem?
- Testing: Are there tests to validate the service's functionality?
- Deployment: Is the classifier ready for deployment in a production environment?
- Clone the repository:
    git clone <repository_url>
    cd heron_classifier
- Install dependencies:
    python -m venv venv
    source venv/bin/activate
    pip install -r requirements.txt
- Run the Flask app:
    python -m src.app
- Test the classifier using a tool like curl:
    curl -X POST -F 'file=@path_to_pdf.pdf' http://127.0.0.1:5000/classify_file
- Run tests:
    pytest
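If you prefer to exercise the endpoint from Python rather than curl, for example to send a batch of files, the request can be sketched with the standard library alone. The helper names `build_multipart` and `classify_file` are hypothetical. The URL and the `file` form field are taken from the curl command, and because the shape of the response is not specified here, the sketch returns the raw response body.

```python
import os
import urllib.request
import uuid


def build_multipart(field: str, filename: str, content: bytes) -> tuple[bytes, str]:
    """Encode one file as a multipart/form-data body; return (body, content type)."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode()
    tail = f"\r\n--{boundary}--\r\n".encode()
    return head + content + tail, f"multipart/form-data; boundary={boundary}"


def classify_file(path: str, url: str = "http://127.0.0.1:5000/classify_file") -> str:
    """POST a file to the running Flask app and return the raw response text."""
    with open(path, "rb") as fh:
        body, content_type = build_multipart("file", os.path.basename(path), fh.read())
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": content_type}, method="POST"
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.read().decode()
```

Calling `classify_file("path_to_pdf.pdf")` with the app running locally should be equivalent to the curl command above.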
Please aim to spend 3 hours on this challenge.
Once completed, submit your solution by sharing a link to your forked repository. Please also provide a brief write-up of your ideas, approach, and any instructions needed to run your solution.