Heron Coding Challenge - File Classifier

Overview

At Heron, we’re using AI to automate document processing workflows in financial services and beyond. Each day, we handle over 100,000 documents that need to be quickly identified and categorised before we can kick off the automations.

This repository provides a basic endpoint for classifying files by their filenames. However, the current classifier has limitations when it comes to handling poorly named files, processing larger volumes, and adapting to new industries effectively.
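To make the "poorly named files" limitation concrete, here is a minimal sketch of filename matching that tolerates typos by using fuzzy token matching from the standard library. The labels, keyword lists, and the `classify_filename` helper are all hypothetical and are not taken from the repository's code; treat this as one possible starting point rather than the intended solution.

```python
import difflib
import re

# Hypothetical labels and keywords, for illustration only.
KEYWORDS = {
    "bank_statement": ["bank", "statement"],
    "invoice": ["invoice"],
    "drivers_licence": ["drivers", "licence", "license"],
}


def tokenize(filename):
    """Lower-case the filename stem and split it on anything that is not a letter."""
    stem = filename.rsplit(".", 1)[0].lower()
    return [token for token in re.split(r"[^a-z]+", stem) if token]


def classify_filename(filename, cutoff=0.8):
    """Return the label whose keywords fuzzily match the most filename tokens."""
    tokens = tokenize(filename)
    best_label, best_score = "unknown", 0
    for label, keywords in KEYWORDS.items():
        # Count tokens that are close enough to at least one keyword.
        score = sum(
            1
            for token in tokens
            if difflib.get_close_matches(token, keywords, n=1, cutoff=cutoff)
        )
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```

Fuzzy matching only helps when the filename still carries some signal. A file called `IMG_0001.jpg` falls through to `unknown`, which is where content-based classification would need to take over.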

Your task: improve this classifier by adding features and optimisations to handle (1) poorly named files, (2) scaling to new industries, and (3) processing larger volumes of documents.

This is a real-world challenge that allows you to demonstrate your approach to building innovative and scalable AI solutions. We’re excited to see what you come up with! Feel free to take it in any direction you like, but we suggest:

Part 1: Enhancing the Classifier

What limitations in the current classifier are stopping it from scaling?
How might you extend the classifier with additional technologies, capabilities, or features?

Part 2: Productionising the Classifier

How can you ensure the classifier is robust and reliable in a production environment?
How can you deploy the classifier to make it accessible to other services and users?
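One small piece of production robustness is rejecting bad uploads before they reach the classifier. The sketch below assumes a hypothetical allow-list of extensions and a hypothetical size limit; neither value comes from the repository, and `validate_upload` is an illustrative name.

```python
import os

# Hypothetical limits, for illustration only; tune them for the real service.
ALLOWED_EXTENSIONS = {".pdf", ".png", ".jpg"}
MAX_SIZE_BYTES = 10 * 1024 * 1024  # 10 MiB


def validate_upload(filename, size_bytes):
    """Return (ok, reason) so the endpoint can answer with a clear error message."""
    if not filename:
        return False, "missing filename"
    extension = os.path.splitext(filename)[1].lower()
    if extension not in ALLOWED_EXTENSIONS:
        return False, f"unsupported extension: {extension or '(none)'}"
    if size_bytes == 0:
        return False, "empty file"
    if size_bytes > MAX_SIZE_BYTES:
        return False, "file too large"
    return True, "ok"
```

Returning a reason alongside the boolean keeps the check easy to unit test and lets the caller map each failure to an appropriate HTTP status.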

We encourage you to be creative! Feel free to use any libraries, tools, services, models or frameworks of your choice.

Possible Ideas / Suggestions

Train a classifier to categorise files based on their text content
Generate synthetic data to train the classifier on documents from different industries
Detect file type and handle other file formats (e.g., Word, Excel)
Set up a CI/CD pipeline for automatic testing and deployment
Refactor the codebase to make it more maintainable and scalable
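As an illustration of the file type detection idea above, the following sketch inspects a file's leading bytes instead of trusting its extension. The signatures are the widely documented magic numbers for each format; the `detect_file_type` helper and its return labels are hypothetical names chosen for this example.

```python
# Well-known leading byte signatures ("magic numbers") mapped to coarse type labels.
MAGIC_SIGNATURES = [
    (b"%PDF-", "pdf"),
    (b"\x89PNG\r\n\x1a\n", "png"),
    (b"\xff\xd8\xff", "jpeg"),
    # Modern Word and Excel files (.docx, .xlsx) are ZIP containers.
    (b"PK\x03\x04", "zip"),
    # Legacy Office files (.doc, .xls) use the OLE2 compound file format.
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "ole2"),
]


def detect_file_type(data):
    """Return a coarse type label based on the first bytes of the file."""
    for signature, label in MAGIC_SIGNATURES:
        if data.startswith(signature):
            return label
    return "unknown"
```

A `zip` result alone does not distinguish `.docx` from `.xlsx`; telling them apart would require looking inside the archive, which is left out of this sketch.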

Marking Criteria

Functionality: Does the classifier work as expected?
Scalability: Can the classifier scale to new industries and higher volumes?
Maintainability: Is the codebase well-structured and easy to maintain?
Creativity: Are there any innovative or creative solutions to the problem?
Testing: Are there tests to validate the service's functionality?
Deployment: Is the classifier ready for deployment in a production environment?

Getting Started

Clone the repository:

```
git clone <repository_url>
cd heron_classifier
```

Install dependencies:

```
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
```

Run the Flask app:
```
python -m src.app
```

Test the classifier using a tool like curl:

```
curl -X POST -F 'file=@path_to_pdf.pdf' http://127.0.0.1:5000/classify_file
```
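The same request can be sent from Python using only the standard library. This is a minimal sketch that assumes the endpoint and the `file` form field shown in the curl command; `build_multipart` and `classify_file` are hypothetical helper names, and the response is returned as raw text because its format is not described here.

```python
import os
import urllib.request
import uuid


def build_multipart(field_name, filename, content, content_type="application/octet-stream"):
    """Encode one file as a multipart/form-data body; return (body, Content-Type header)."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field_name}"; filename="{filename}"\r\n'
        f"Content-Type: {content_type}\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + content + tail, f"multipart/form-data; boundary={boundary}"


def classify_file(path, url="http://127.0.0.1:5000/classify_file"):
    """POST the file at `path` to the classifier and return the raw response text."""
    with open(path, "rb") as handle:
        body, header = build_multipart("file", os.path.basename(path), handle.read())
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": header}, method="POST"
    )
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

With the Flask app running locally, `print(classify_file("path_to_pdf.pdf"))` would print whatever the endpoint returns.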

Run tests:
```
pytest
```

Submission

Please aim to spend 3 hours on this challenge.

Once completed, submit your solution by sharing a link to your forked repository. Please also provide a brief write-up of your ideas, approach, and any instructions needed to run your solution.
