Add initial Docker container

This commit adds an initial Whisper Docker container, along with a program, run.py, that pulls job files from a "todo" AWS SQS queue and media from an S3 bucket, writes the Whisper output back to the bucket, and places a "done" message in a second queue. See README.md for details.
sul-dlss · Sep 24, 2024 · 5098187
1 parent dcd2d05
Showing 12 changed files with 489 additions and 2 deletions.
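
The workflow in the commit message (take a job from the "todo" queue, fetch its media from S3, write the output back, announce it on the "done" queue) can be sketched roughly as follows. This is a minimal illustration, not the run.py added by this commit: `process_one_job`, the injected `transcribe` callable, the `.txt` output name, and the `media/<job id>/` key layout are all assumptions. The `sqs` and `s3` arguments are expected to behave like boto3 clients.

```python
import json
from pathlib import Path


def process_one_job(sqs, s3, transcribe, todo_url, done_url, bucket, workdir):
    """Handle at most one job; return the job dict, or None if the queue was empty.

    `sqs` and `s3` should behave like boto3 SQS and S3 clients.
    `transcribe` is a hypothetical callable: media path -> transcript text.
    """
    resp = sqs.receive_message(
        QueueUrl=todo_url, MaxNumberOfMessages=1, WaitTimeSeconds=10
    )
    messages = resp.get("Messages", [])
    if not messages:
        return None

    message = messages[0]
    job = json.loads(message["Body"])
    workdir = Path(workdir)

    for name in job["media"]:
        local_media = workdir / name
        # Assumed key layout: media is stored under media/<job id>/ in the bucket.
        s3.download_file(bucket, f"media/{job['id']}/{name}", str(local_media))

        transcript = workdir / f"{Path(name).stem}.txt"
        transcript.write_text(transcribe(str(local_media)))
        # The README lists results under out/<job id>/.
        s3.upload_file(str(transcript), bucket, f"out/{job['id']}/{transcript.name}")

    sqs.send_message(QueueUrl=done_url, MessageBody=json.dumps(job))
    # Delete only after the work succeeded, so a failed job becomes visible again.
    sqs.delete_message(QueueUrl=todo_url, ReceiptHandle=message["ReceiptHandle"])
    return job
```

Passing the clients and the transcription function in as arguments keeps the sketch testable without AWS access or a GPU; in a real worker they would presumably be `boto3.client("sqs")`, `boto3.client("s3")`, and a wrapper around Whisper.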
diff --git a/.gitignore b/.gitignore
@@ -0,0 +1,4 @@
+.venv
+.env
+__pycache__/
+whisper_models
diff --git a/Dockerfile b/Dockerfile
@@ -0,0 +1,26 @@
+FROM ubuntu:22.04
+
+ENV AWS_ACCESS_KEY_ID=$AWS_ACCESS_KEY_ID
+ENV AWS_SECRET_ACCESS_KEY=$AWS_SECRET_ACCESS_KEY
+ENV AWS_REGION=$AWS_REGION
+ENV AWS_ROLE_ARN=$AWS_ROLE_ARN
+ENV SPEECH_TO_TEXT_S3_BUCKET=$SPEECH_TO_TEXT_S3_BUCKET
+
+RUN apt-get update && DEBIAN_FRONTEND=noninteractive apt-get install -y \
+    sudo \
+    python3.11 \
+    python3-distutils \
+    python3-pip \
+    ffmpeg
+
+WORKDIR /app
+
+ADD ./whisper_models whisper_models
+ADD ./requirements.txt requirements.txt
+
+RUN python3.11 -m pip install --upgrade pip
+RUN python3.11 -m pip install -r requirements.txt
+
+ADD ./run.py run.py
+
+CMD ["python3.11", "run.py"]
diff --git a/README.md b/README.md
@@ -1,3 +1,122 @@
-# SUL Speech to Text Tools
+# speech-to-text
 
-For now, this is a placeholder repo where we can ticket some things that don't yet have a natural home (and which may end up living in this repo after prototyping/implementation, e.g. a definition for the Docker container we run on cloud based GPU instances, supporting tools and docs, etc).
+This repository contains a Docker configuration for performing serverless speech-to-text processing with Whisper, using Amazon SQS queues and an S3 bucket to coordinate work.
+
+## Build
+
+To build the container you first need to download the PyTorch models that Whisper uses. This is about 13GB of data and can take some time. The models are baked into the container image so that it doesn't need to fetch them every time the container runs. If you know you only need one model size, edit the `whisper_models/urls.txt` file to list only that model before running the `wget` command.
+
+```shell
+wget --directory-prefix whisper_models --input-file whisper_models/urls.txt
+```
+
+Then you can build the image:
+
+```shell
+docker build --tag sul-speech-to-text .
+```
+
+## Configure AWS
+
+Create two queues, one for new jobs, and one for completed jobs:
+
+```shell
+aws sqs create-queue --queue-name sul-speech-to-text-todo
+aws sqs create-queue --queue-name sul-speech-to-text-done
+```
+
+Create a bucket:
+
+```shell
+aws s3 mb s3://sul-speech-to-text
+```
+
+Configure `.env` with your AWS credentials so the Docker container can find them:
+
+```shell
+cp env-example .env
+vi .env
+```
+
+## Create a Job
+
+Typically common-accessioning robots will initiate new work by:
+
+1. minting a new job ID
+2. copying the media file to the S3 bucket
+3. putting a job in the TODO queue.
+
+For testing you can simulate these things by running:
+
+```shell
+python3 run.py create
+```
+
+## Run
+
+Now you can run the container and have it pick up the job you placed into the queue:
+
+```shell
+docker run --env-file .env sul-speech-to-text
+```
+
+Wait for the results to appear:
+
+```shell
+aws s3 ls s3://sul-speech-to-text/out/${JOB_ID}/
+```
+
+## The Job File
+
+The job file is a JSON object that contains information about how to run Whisper. At a minimum it contains the job ID and the media files to process, in which case the service defaults are used:
+
+```json
+{
+  "id": "8EB51B59-BDFF-4507-B1AA-0DE91ACA388F",
+  "druid": "gy983cn1444",
+  "media": [
+    "8EB51B59-BDFF-4507-B1AA-0DE91ACA388F.mp4"
+  ]
+}
+```
+
+You can also pass in options for Whisper:
+
+```json
+{
+  "id": "8EB51B59-BDFF-4507-B1AA-0DE91ACA388F",
+  "druid": "gy983cn1444",
+  "media": [
+    "8EB51B59-BDFF-4507-B1AA-0DE91ACA388F.mp4"
+  ],
+  "options": {
+    "model": "large",
+    "max_line_count": 80,
+    "beam_size": 10
+  }
+}
+```
+
+## Testing
+
+To run the tests:
+
+Create a virtual environment, and activate it:
+
+```shell
+python -m venv .venv
+source .venv/bin/activate
+```
+
+Install the dependencies:
+
+```shell
+pip install -r requirements.txt
+pip install -r requirements-dev.txt
+```
+
+Run the tests, which will also build and run the Docker container:
+
+```shell
+pytest
+```
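
The job file format described in the README above could be produced and checked with a small helper like the one below. This is only a sketch: `build_job` and `parse_job` are hypothetical names, and treating `id` and `media` as the required keys is an assumption drawn from the README's "minimally it contains" wording, not from run.py.

```python
import json
import uuid


def build_job(druid, media, options=None, job_id=None):
    """Return a job dict shaped like the README examples."""
    if not media:
        raise ValueError("a job needs at least one media file")
    job = {
        # Mint an upper-case UUID when no ID is supplied, matching the examples.
        "id": job_id or str(uuid.uuid4()).upper(),
        "druid": druid,
        "media": list(media),
    }
    if options:
        # Whisper options such as "model" or "beam_size" are optional.
        job["options"] = dict(options)
    return job


def parse_job(text):
    """Parse job file JSON and check the fields assumed to be required."""
    job = json.loads(text)
    missing = [key for key in ("id", "media") if key not in job]
    if missing:
        raise ValueError(f"job file is missing: {', '.join(missing)}")
    return job


if __name__ == "__main__":
    job = build_job("gy983cn1444", ["example.mp4"], options={"model": "large"})
    print(json.dumps(job, indent=2))
```

The printed JSON is the kind of message body that would be placed on the TODO queue once the media file has been copied to the bucket.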
diff --git a/env-example b/env-example
@@ -0,0 +1,5 @@
+AWS_ACCESS_KEY_ID=CHANGE_ME
+AWS_SECRET_ACCESS_KEY=CHANGE_ME
+AWS_REGION=us-west-2
+AWS_ROLE_ARN=arn:aws:iam::418214828013:role/DevelopersRole
+SPEECH_TO_TEXT_S3_BUCKET="sul-speech-to-text-dev-YOUR-USERNAME"
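
Because `env-example` ships with placeholder values, a startup check along these lines could catch an unedited `.env` before any AWS call is made. It is a sketch under assumptions: the `unconfigured` helper is hypothetical and is not part of this commit, and treating `CHANGE_ME` and `YOUR-USERNAME` as placeholder markers is inferred from the file above.

```python
import os

# Variable names taken from env-example.
REQUIRED = (
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
    "AWS_REGION",
    "AWS_ROLE_ARN",
    "SPEECH_TO_TEXT_S3_BUCKET",
)

# Placeholder markers that appear in env-example.
PLACEHOLDERS = ("CHANGE_ME", "YOUR-USERNAME")


def unconfigured(environ=None):
    """Return the required variables that are unset, empty, or still placeholders."""
    environ = os.environ if environ is None else environ
    problems = []
    for name in REQUIRED:
        value = environ.get(name, "")
        if not value or any(marker in value for marker in PLACEHOLDERS):
            problems.append(name)
    return problems


if __name__ == "__main__":
    missing = unconfigured()
    if missing:
        raise SystemExit(f"Please set these variables in .env: {', '.join(missing)}")
```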
diff --git a/pytest.ini b/pytest.ini
@@ -0,0 +1,2 @@
+[pytest]
+pythonpath = .
diff --git a/requirements-dev.txt b/requirements-dev.txt
@@ -0,0 +1,2 @@
+pytest
+python-dotenv
diff --git a/requirements.txt b/requirements.txt
@@ -0,0 +1,3 @@
+boto3
+openai-whisper
+python-dotenv[cli]