Skip to content

Heron Coding Challenge - File Classifier

Notifications You must be signed in to change notification settings

heron-data/join-the-siege

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

2 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Heron Coding Challenge - File Classifier

Overview

At Heron, we’re using AI to automate document processing workflows in financial services and beyond. Each day, we handle over 100,000 documents that need to be quickly identified and categorised before we can kick off the automations.

This repository provides a basic endpoint for classifying files by their filenames. However, the current classifier has limitations when it comes to handling poorly named files, processing larger volumes, and adapting to new industries effectively.

Your task: improve this classifier by adding features and optimisations to handle (1) poorly named files, (2) scaling to new industries, and (3) processing larger volumes of documents.

This is a real-world challenge that allows you to demonstrate your approach to building innovative and scalable AI solutions. We’re excited to see what you come up with! Feel free to take it in any direction you like, but we suggest:

Part 1: Enhancing the Classifier

  • What are the limitations in the current classifier that's stopping it from scaling?
  • How might you extend the classifier with additional technologies, capabilities, or features?

Part 2: Productionising the Classifier

  • How can you ensure the classifier is robust and reliable in a production environment?
  • How can you deploy the classifier to make it accessible to other services and users?

We encourage you to be creative! Feel free to use any libraries, tools, services, models or frameworks of your choice

Possible Ideas / Suggestions

  • Train a classifier to categorize files based on the text content of a file
  • Generate synthetic data to train the classifier on documents from different industries
  • Detect file type and handle other file formats (e.g., Word, Excel)
  • Set up a CI/CD pipeline for automatic testing and deployment
  • Refactor the codebase to make it more maintainable and scalable

Marking Criteria

  • Functionality: Does the classifier work as expected?
  • Scalability: Can the classifier scale to new industries and higher volumes?
  • Maintainability: Is the codebase well-structured and easy to maintain?
  • Creativity: Are there any innovative or creative solutions to the problem?
  • Testing: Are there tests to validate the service's functionality?
  • Deployment: Is the classifier ready for deployment in a production environment?

Getting Started

  1. Clone the repository:

    git clone <repository_url>
    cd heron_classifier
  2. Install dependencies:

    python -m venv venv
    source venv/bin/activate
    pip install -r requirements.txt
  3. Run the Flask app:

    python -m src.app
  4. Test the classifier using a tool like curl:

    curl -X POST -F 'file=@path_to_pdf.pdf' http://127.0.0.1:5000/classify_file
  5. Run tests:

     pytest

Submission

Please aim to spend 3 hours on this challenge.

Once completed, submit your solution by sharing a link to your forked repository. Please also provide a brief write-up of your ideas, approach, and any instructions needed to run your solution.

About

Heron Coding Challenge - File Classifier

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages