
FastAPI Classifier on AWS with PostgreSQL

Setup | Parsing Hierarchical Data | Profiling Results | Accuracy | Cost Analysis | Conclusion

Setup

This guide outlines the steps required to set up an AWS environment to run the FastAPI classifier application with a PostgreSQL database using Elastic Beanstalk and RDS.

Prerequisites

  • AWS account
  • AWS CLI installed and configured
  • Elastic Beanstalk CLI (EB CLI) installed
  • AWS IAM user with necessary permissions

Steps

  1. brew install awsebcli or pip install awsebcli
  2. eb init and follow the prompts
  3. eb create and follow the prompts
  4. eb console to open the console in a browser
  5. Create an RDS instance in the same VPC as the Elastic Beanstalk environment
  6. Enable the pgvector extension on the RDS instance by connecting to the database and running the following command (a scripted alternative is sketched after this list):
    CREATE EXTENSION vector;
  7. Add the following environment variables to the Elastic Beanstalk environment:
    OPENAI_API_KEY=<your_openai_api_key>
  8. eb deploy to deploy the application
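
If you'd rather script step 6 than open a psql session, here is a minimal sketch using psycopg2 (assuming psycopg2-binary is installed and the RDS instance is reachable from your machine; all connection values are placeholders):

import psycopg2

# Placeholder connection details; substitute your RDS endpoint and credentials
conn = psycopg2.connect(
    host="<your_db_hostname>",
    port=5432,
    dbname="<your_db_name>",
    user="<your_db_username>",
    password="<your_db_password>",
)
conn.autocommit = True
with conn.cursor() as cur:
    # IF NOT EXISTS makes the script safe to re-run
    cur.execute("CREATE EXTENSION IF NOT EXISTS vector;")
conn.close()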

Parsing Hierarchical Data

Setup environment

The classifier expects a uniform data format in order to provide accurate predictions. There is no mechanism for automatically inserting files into the database on deployment, so you'll need to connect to the remote database from your local machine. The following steps outline how to parse hierarchical data into a format suitable for the classifier:

  1. Create a .env file in the root directory of the project and add the following environment variables (a sketch showing how they're read follows this list):
    RDS_DB_NAME=<your_db_name>
    RDS_USERNAME=<your_db_username>
    RDS_PASSWORD=<your_db_password>
    RDS_HOSTNAME=<your_db_hostname>
    RDS_PORT=<your_db_port>
    OPENAI_API_KEY=<your_openai_api_key>
  2. Create a python virtual environment and install the required dependencies:
    python -m venv venv
    source venv/bin/activate
    pip install -r requirements.txt
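
As a quick check that the variables are picked up, here is a minimal sketch of building a PostgreSQL connection URL from them (assuming python-dotenv is among the installed dependencies; the DATABASE_URL name is illustrative):

import os

from dotenv import load_dotenv

load_dotenv()  # reads the .env file in the current directory

# Illustrative: assemble a standard PostgreSQL connection URL from the .env values
DATABASE_URL = (
    f"postgresql://{os.environ['RDS_USERNAME']}:{os.environ['RDS_PASSWORD']}"
    f"@{os.environ['RDS_HOSTNAME']}:{os.environ['RDS_PORT']}/{os.environ['RDS_DB_NAME']}"
)
print(DATABASE_URL)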

Parsing a file

  1. Note the hierarchy type and file structure, and determine if one of the existing transformers can be used (see file_parser/transformers).
  2. If it can, prefix the file name with the hierarchy in all caps, for example, "UN_SPSC_1.csv". If no existing transformer can parse the file, you'll need to create one (see Create a transformer below).
  3. Run the file parser using the following command:
    python -m file_parser <file_path>
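
For example, to parse a UN SPSC file named as above (the path is illustrative):

python -m file_parser ./UN_SPSC_1.csv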

Create a transformer

The BaseTransformer class was created with flexibility and future expansion in mind, so it should be easy to extend. To create a new transformer, follow these steps:

  1. Create a new file in file_parser/transformers named after your hierarchy.
  2. Create a new transformer class that inherits from BaseTransformer.
  3. Add the hierarchy and transformer names to the class attributes __hierarchy__ and __transformer__.
    • The __hierarchy__ attribute is used to associate the transformer with one of the hierarchies defined in the database.
    • The __transformer__ attribute is used to identify the transformer in the file parser.
  4. Implement the parse method to read the file and transform the data into a list of dictionaries. There are three required fields:
    • name: The unique identifier for the hierarchy item.
    • parent_name: The unique identifier for the parent item.
    • desc: A description of the item used for classification.
  5. Call the try_import method with the list of dictionaries to import the data into the database.
  6. Add your hierarchy to the HierarchyType enum in api/schemas.py.

Example Transformer

# file_parser/transformers/YOUR_TRANSFORMER.py

import csv
from file_parser.base import BaseTransformer


class YOUR_TRANSFORMER(BaseTransformer):
    """
    YOUR_TRANSFORMER Transformer

    part_id -> name
    parent_id -> parent_name
    eccn_desc -> desc
    """

    __hierarchy__ = "YOUR_HIERARCHY"
    __transformer__ = "YOUR_TRANSFORMER"

    def parse(self):
        """Parse a file"""
        if not super().parse():
            return

        rows = []
        with open(self.file_path, "r", encoding="utf-8") as file:
            reader = csv.DictReader(file)
            for row in reader:
                # Guard against missing columns so .strip() never runs on None
                parsed_row = {
                    "name": (row.get("part_id") or "").strip(),
                    "parent_name": (row.get("parent_id") or "").strip(),
                    "desc": (row.get("eccn_desc") or "").strip(),
                }
                rows.append(parsed_row)

        if not rows:
            print("No rows to import")
            return

        # Import the parsed rows into the database
        self.try_import(rows)

Profiling Results

Setup

  • DB: AWS RDS PostgreSQL db.t3.xlarge (peaked at 22% CPU, so a db.t3.small would probably suffice)
  • EB: Python 3.9 running on 64-bit Amazon Linux 2023/4.0.12
  • EC2: Seven t3.micro instances running behind a load balancer (each capable of 7-12 RPS, so five instances would cover 50 RPS)
  • Classifier: OpenAI text-embedding-3-large

Usage

  • DB: Peaked at 22% CPU usage
  • EB: Peaked at 15% average CPU usage across all instances
  • Classifier: Peaked at 49.5 RPS (99% of OpenAI's tier 1 limit)

Results

[Figures: AWS EB health; request statistics 1; request statistics 2]

Proof of RPS > 10 per worker

[Figure: peak RPS per worker]

Observations

After increasing the number of users from 20 to 21, the RPS began to drop off, likely because the classifier hit OpenAI's tier 1 rate limit. CPU usage on the EB instances remained low, so OpenAI is the likely bottleneck. The RDS instance handled the load without issue. The next step would be to upgrade the OpenAI plan to raise the RPS limit, then keep increasing users until CPU usage maxes out on either the EB instances or the RDS instance.

Accuracy

Chart 1

Description

A histogram of 6,000 demo items randomly sampled from a Taxonomy dataset, showing (in blue) the distribution of cosine distances between the correct leaf node in the hierarchy (y) and the leaf node identified by the classifier (ŷ), and (in orange) the distribution of cosine distances between the correct leaf node (y) and a randomly selected leaf node (ŷ_rand). 6,000 classifications were made in total, and 6,000 random leaf nodes were selected.

[Figure: Chart 1 (log scale)]
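
For reference, the cosine distance plotted here is 1 minus cosine similarity, the same metric pgvector's <=> operator computes. A minimal NumPy sketch (the function name is illustrative):

import numpy as np

def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    """1 - cosine similarity between two embedding vectors."""
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))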

Observations

  • The median cosine distance between y and ŷ is ..., indicating that the classifier identifies the correct leaf node in the hierarchy with high accuracy.
  • The median cosine distance between y and ŷ_rand is ..., indicating that randomly selected leaf nodes fall far from the correct leaf node.

Chart 2

Description

A histogram showing (in blue) the distribution of cosine distances between the correct leaf node in the hierarchy (y) and the generated embedding of the input item (v).

[Figure: Chart 2 (log scale)]

Cost Analysis

OpenAI Costs

10,778 requests were made, putting the total OpenAI cost at about $0.20. The exact figure can't be computed since per-request token counts aren't known, but the following calculation gives a good estimate:

5,536,446 tokens used for the day ÷ 39,403 requests made for the day × 10,778 requests in the test ≈ 1,514,398 tokens used in the test

text-embedding-3-large price: $0.13/million tokens

Total cost: 1,514,398 / 1,000,000 × $0.13 ≈ $0.197 (about $0.20)

AWS Costs

The total AWS cost for the day was $1.09. The majority of the cost was due to the RDS instance, which was $0.94.

[Figure: AWS cost breakdown]

Total Cost

The total cost for the day was $1.29, or about $0.00012 per request. This is an overestimate: the classifier ran for only a fraction of the day, while the AWS costs cover the entire day.
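
The same estimate, spelled out in a few lines of Python (all numbers are taken from the figures above):

# Estimate test token usage from the day's totals, then derive cost per request
tokens_day = 5_536_446
requests_day = 39_403
requests_test = 10_778

tokens_test = tokens_day / requests_day * requests_test  # ~1,514,398 tokens
openai_cost = tokens_test / 1_000_000 * 0.13             # ~$0.197
total_cost = openai_cost + 1.09                          # plus AWS for the day, ~$1.29

print(total_cost / requests_test)                        # ~$0.00012 per request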

Conclusion

Initial product requirements:

Database

  • All hierarchies shall be stored identically in a Postgres database
  • Relevant database hierarchies shall be retrieved from the database at the time of request, adhering to the principles of minimal retrieval

Application

  • Shall return the top 5 most likely classifications and their corresponding probabilities (an illustrative response follows this list)
  • P95 response times shall be < 1 sec
  • Concurrency per worker shall be > 10 simultaneous requests
  • Cost per request shall be < $0.01
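
For illustration only, the shape of a response meeting the first requirement, written as a Python literal (field names are hypothetical; the actual schema is defined in api/schemas.py):

response = {  # hypothetical shape; real schema in api/schemas.py
    "results": [
        {"name": "...", "probability": 0.42},
        {"name": "...", "probability": 0.21},
        {"name": "...", "probability": 0.15},
        {"name": "...", "probability": 0.12},
        {"name": "...", "probability": 0.10},
    ]
}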

Hierarchies

  • US Product Tax Codes (US_PTC)
  • US Export Control Classification Numbers (US_ECCN)
  • UN Commodity Codes (UN_SPSC)
  • EU Dual Use Codes (EU_ECCN)

