AWS Sample: Textract and Comprehend Integration via Step Functions and the CDK

This repository is a sample guide to building a serverless document processing application that can make intelligent flow-control decisions after classifying the input document type.

A video presentation of the architecture and a demo is available in my AWS Innovate AI/ML Edition 2021 talk.

Getting Started

These prerequisites should be installed first:

Install and configure Apache Maven
Install and configure the AWS Cloud Development Kit
Install and configure git
Install the JDK

Once those are installed, clone this repository:

git clone https://github.com/aws-samples/amazon-textract-comprehend-stepfunctions-example

Then import the project into your IDE and review the sample code. There are two modules, cdk and lambda. The cdk module contains the application and infrastructure CDK code: it builds and deploys the serverless Lambda functions and defines the Step Functions flow.

The lambda module contains the application runtime code as Lambda functions. These functions show examples of extracting a single page from a PDF and calling Textract synchronously, classifying the page's content using a Comprehend custom classifier, and calling Textract asynchronously with an Amazon SNS notification on completion. The initial flow is triggered by an upload to S3, which starts a Step Functions execution.
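To make the flow-control idea concrete, here is a minimal sketch of a routing decision taken after classification. It is not the sample's actual code: the class name, the labels, the step names, and the confidence threshold are all assumptions chosen for illustration.

```java
import java.util.Map;

// Hypothetical routing step, not taken from the sample repository.
final class DocumentRouter {

    // Possible next steps in the flow (illustrative names).
    enum NextStep { FULL_TEXTRACT, MANUAL_REVIEW, SKIP }

    // Minimum classifier confidence before acting automatically (assumed value).
    static final double MIN_CONFIDENCE = 0.80;

    // Assumed mapping from classifier label to the step that handles it.
    static final Map<String, NextStep> ROUTES = Map.of(
            "INVOICE", NextStep.FULL_TEXTRACT,
            "APPLICATION", NextStep.FULL_TEXTRACT,
            "NEWSLETTER", NextStep.SKIP);

    // Low-confidence or unknown labels fall back to manual review.
    static NextStep route(String label, double confidence) {
        if (label == null || confidence < MIN_CONFIDENCE) {
            return NextStep.MANUAL_REVIEW;
        }
        return ROUTES.getOrDefault(label, NextStep.MANUAL_REVIEW);
    }

    public static void main(String[] args) {
        System.out.println(route("INVOICE", 0.97));
        System.out.println(route("INVOICE", 0.41));
        System.out.println(route("UNKNOWN", 0.99));
    }
}
```

In a Step Functions flow, a decision like this would more typically live in a Choice state that inspects the classifier output; the Java version only illustrates the logic.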

Train a sample classifier

Before deploying the sample, you will need to train a Comprehend classifier. A very simple reference example is located in /samples/trainer.csv. (It contains the plain text of three sample documents, one per row; in practice you will train with many more samples per class.) To train the classifier:

Visit the Comprehend console (switch to the region you wish to use)
Click Train classifier
Give it a name and check the other details (the defaults are fine to start - use a Multi-class classifier)
Specify the S3 location of the training file (upload it first)
Click Train classifier
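If you want to build your own training file, the sketch below shows one way to produce rows for it. It assumes Comprehend's multi-class CSV layout of a label in the first column and the document text in the second, with no header row; check the current Comprehend documentation before relying on it. The class name, labels, and sample text are hypothetical, and the restriction on label characters is this sketch's own conservative choice.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Illustrative helper, not part of the sample repository.
final class TrainingCsv {

    // Characters this sketch refuses in a label, so the label never needs quoting.
    private static final Pattern UNSAFE_LABEL = Pattern.compile("[,\"\\s]");

    // Quote a field CSV-style: wrap in double quotes and double any embedded quotes.
    static String quote(String field) {
        return "\"" + field.replace("\"", "\"\"") + "\"";
    }

    // Build one training row: the label first, then the document text on a single line.
    static String row(String label, String documentText) {
        if (label.isEmpty() || UNSAFE_LABEL.matcher(label).find()) {
            throw new IllegalArgumentException("Unsupported label: " + label);
        }
        // Collapse line breaks and runs of whitespace so each document stays on one row.
        String oneLine = documentText.replaceAll("\\s+", " ").trim();
        return label + "," + quote(oneLine);
    }

    public static void main(String[] args) {
        // Hypothetical labels and text; real training data needs many rows per label.
        List<String> rows = new ArrayList<>();
        rows.add(row("INVOICE", "Invoice 001\nTotal due: \"$42.00\""));
        rows.add(row("APPLICATION", "Application form\nName: Jane Doe"));
        rows.forEach(System.out::println);
    }
}
```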

Once it's trained (this takes a few minutes), start an endpoint by clicking Create endpoint from the classifier's console page. (Batch-mode classification is also available and does not require a running endpoint, but it is not suitable for real-time processing.) Note the ARN of the classifier.
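Since the ARN is pasted into a command later, a quick sanity check can catch copy-and-paste mistakes. The sketch below splits an ARN using the general arn:partition:service:region:account-id:resource layout and reports whether it looks like a Comprehend endpoint. The class is hypothetical, the resource type string document-classifier-endpoint is an assumption about Comprehend's ARN naming, and the account ID in the usage example is a placeholder.

```java
// Illustrative helper, not part of the sample repository.
final class ArnInfo {
    final String service;
    final String region;
    final String resourceType;
    final String resourceName;

    private ArnInfo(String service, String region, String resourceType, String resourceName) {
        this.service = service;
        this.region = region;
        this.resourceType = resourceType;
        this.resourceName = resourceName;
    }

    // Parse an ARN of the form arn:partition:service:region:account-id:type/name.
    static ArnInfo parse(String arn) {
        if (arn == null) {
            throw new IllegalArgumentException("ARN is null");
        }
        String[] parts = arn.split(":", 6);
        if (parts.length != 6 || !"arn".equals(parts[0])) {
            throw new IllegalArgumentException("Not an ARN: " + arn);
        }
        String resource = parts[5];
        int slash = resource.indexOf('/');
        if (slash <= 0 || slash == resource.length() - 1) {
            throw new IllegalArgumentException("Expected a type/name resource: " + resource);
        }
        return new ArnInfo(parts[2], parts[3],
                resource.substring(0, slash), resource.substring(slash + 1));
    }

    // True when the ARN looks like a Comprehend classifier endpoint (assumed type string).
    boolean isComprehendEndpoint() {
        return "comprehend".equals(service)
                && "document-classifier-endpoint".equals(resourceType);
    }

    public static void main(String[] args) {
        ArnInfo info = parse(
                "arn:aws:comprehend:us-east-1:123456789012:document-classifier-endpoint/demo");
        System.out.println(info.resourceType + " in " + info.region);
    }
}
```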

Build the project and build the CDK

From the project root directory, run mvn install to compile the project and create the deployable Lambda artefacts. This downloads all of the project dependencies first, so the initial run may take a few minutes. Re-run this command each time you make an application change.

Once that is successful, you will need to bootstrap the CDK environment:

cdk bootstrap --profile YOUR_AWS_CLI_PROFILE_HERE

Make sure to update the command with your AWS CLI profile. (The Comprehend ARN is supplied in the deploy command further down.)

Permissions

The CDK application needs some permissions to be able to create the stack. Visit the IAM console to set these up. You can run the CDK deploy step iteratively and correct any missing permissions. I defined a Group with these policies and assigned it to a User restricted by a permissions boundary (so that it can only act on resources it created itself). You may restrict these further as appropriate.

AmazonComprehendServicePolicy-InnovateClassifier
AWSLambdaFullAccess
IAMFullAccess
AmazonS3FullAccess
AmazonTextractFullAccess
AmazonSNSFullAccess
AWSStepFunctionsFullAccess
AWSCloudFormationFullAccess

Note that these are only build-time permissions, required when running the CDK. The run-time permissions of the application are configured (and more restricted) in the CDK stack.

Deploy the application

Once that's done, run the command:

cdk --context ComprehendArn=YOUR_COMPREHEND_ARN_HERE --profile YOUR_AWS_CLI_PROFILE_HERE deploy

Once the deployment completes, you'll see output like:

textract-comprehend-sample: deploying...                                                                                                  
[0%] start: Publishing 12b0d822082799daf4b7651c9ab809985e721f2575d572943da86f53a055f29b:current                                           
[100%] success: Published 12b0d822082799daf4b7651c9ab809985e721f2575d572943da86f53a055f29b:current                                        
textract-comprehend-sample: creating CloudFormation changeset...                                                                          
  0/11 | 12:53:56 pm | UPDATE_IN_PROGRESS   | AWS::CloudFormation::Stack       | textract-comprehend-sample User Initiated                
 0/11 Currently in progress: textract-comprehend-sample                                                                                   
  2/11 | 12:54:35 pm | UPDATE_IN_PROGRESS   | AWS::Lambda::Function            | FirstPageFunction (FirstPageFunctionBEE9EA11)            
  2/11 | 12:54:35 pm | UPDATE_IN_PROGRESS   | AWS::Lambda::Function            | StartTextractFunction (StartTextractFunctionCD0185FF)    
  2/11 | 12:54:36 pm | UPDATE_COMPLETE      | AWS::Lambda::Function            | FirstPageFunction (FirstPageFunctionBEE9EA11)            
  2/11 | 12:54:36 pm | UPDATE_COMPLETE      | AWS::Lambda::Function            | StartTextractFunction (StartTextractFunctionCD0185FF)    
  3/11 | 12:54:39 pm | UPDATE_IN_PROGRESS   | AWS::StepFunctions::StateMachine | UploadFlow (UploadFlow6C932FD7)                          
  3/11 | 12:54:41 pm | UPDATE_COMPLETE      | AWS::StepFunctions::StateMachine | UploadFlow (UploadFlow6C932FD7)                          
  5/11 | 12:54:45 pm | UPDATE_IN_PROGRESS   | AWS::Lambda::Function            | S3UploadListener (S3UploadListener4E242122)              
  5/11 | 12:54:45 pm | UPDATE_IN_PROGRESS   | AWS::Lambda::Function            | TextractCompletion (TextractCompletion5AE7AEDD)          
  5/11 | 12:54:46 pm | UPDATE_COMPLETE      | AWS::Lambda::Function            | S3UploadListener (S3UploadListener4E242122)              
  5/11 | 12:54:47 pm | UPDATE_COMPLETE      | AWS::Lambda::Function            | TextractCompletion (TextractCompletion5AE7AEDD)          
  6/11 | 12:54:49 pm | UPDATE_COMPLETE_CLEA | AWS::CloudFormation::Stack       | textract-comprehend-sample                               
  6/11 | 12:54:50 pm | UPDATE_COMPLETE      | AWS::CloudFormation::Stack       | textract-comprehend-sample                               
                                                                                                                                          
 ✅  textract-comprehend-sample

Running the samples

After deployment, visit the deployed stack in the CloudFormation console. Check the different tabs for details on the stack including the different resources created.

When you upload a PDF (examples are in the samples directory) to the source S3 bucket that was created, a Step Functions flow will start if everything was set up correctly. View the execution to see the flow of the document and the Lambda logs.
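As a rough picture of what happens between the upload and the start of the flow, the sketch below derives a Step Functions execution name and a JSON input from an uploaded object's bucket and key. It is not the sample's listener code: the class name and the bucket and key field names are assumptions. The name is kept to 80 characters and to letters, digits, hyphens, and underscores, which is a conservative reading of the Step Functions naming rules.

```java
// Illustrative helper, not part of the sample repository.
final class ExecutionInput {

    // Step Functions execution names are limited to 80 characters.
    static final int MAX_NAME_LENGTH = 80;

    // Derive a name from the object key plus a timestamp suffix to keep it unique.
    static String executionName(String objectKey, long timestampMillis) {
        // Replace anything outside a conservative safe set with a hyphen.
        String safe = objectKey.replaceAll("[^A-Za-z0-9_-]", "-");
        String suffix = "-" + timestampMillis;
        int room = MAX_NAME_LENGTH - suffix.length();
        if (safe.length() > room) {
            safe = safe.substring(0, room);
        }
        return safe + suffix;
    }

    // Minimal escaping for backslashes and quotes; real code should use a JSON library.
    static String jsonEscape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    // Build the execution input; the field names are hypothetical.
    static String inputJson(String bucket, String key) {
        return "{\"bucket\":\"" + jsonEscape(bucket) + "\",\"key\":\"" + jsonEscape(key) + "\"}";
    }

    public static void main(String[] args) {
        String key = "invoices/march 2021.pdf";
        System.out.println(executionName(key, System.currentTimeMillis()));
        System.out.println(inputJson("my-source-bucket", key));
    }
}
```

The listener function would then pass the name and input to the Step Functions StartExecution API; that call is left out here because it depends on the AWS SDK.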

Congratulations! You have successfully created a scalable, serverless application stack using the CDK to intelligently process documents on demand.

Resources and pricing

Note that for as long as you have the stack deployed, charges may apply to your account. You should delete the resources (using cdk destroy) when you are done with the sample. You will need to empty the buckets before deletion, and you will also need to delete the Comprehend endpoint.
