AWS Sample: Textract and Comprehend Integration via Step Functions and the CDK

This repository is a sample guide to building a serverless document processing application that can make intelligent flow-control decisions after classifying the input document type.

A video presentation of the architecture and a demo is available in my AWS Innovate AI/ML Edition 2021 talk.

Getting Started

These prerequisites should be installed first:

  1. Install and configure Apache Maven
  2. Install and configure the AWS Cloud Development Kit
  3. Install and configure git
  4. Install the JDK

Once those are installed, clone this repository:

git clone https://github.com/aws-samples/amazon-textract-comprehend-stepfunctions-example

Then import the project into your IDE and review the sample code. There are two modules, cdk and lambda. The cdk module contains the application and infrastructure CDK code. It is responsible for building and deploying the serverless Lambda functions, and it defines the Step Functions flow.

The lambda module contains the application runtime code as Lambda functions. These functions show examples of extracting a single page from a PDF and calling Textract synchronously, classifying its content using a Comprehend custom classifier, and making an asynchronous Textract call with an Amazon SNS notification on completion. The initial flow is triggered by an upload to S3, which starts a Step Functions execution.
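The flow-control idea (classify first, then pick the next step) can be sketched in plain Java as follows. This is only an illustration under stated assumptions, not the repository's code: the label names (INVOICE, CONTRACT), the 0.80 confidence threshold, and the Route values are all hypothetical.

```java
import java.util.Map;

/** Hypothetical sketch: choose the next processing step from a classifier result. */
final class DocumentRouter {

    /** Possible next steps; these names are illustrative only. */
    enum Route { FULL_TEXT_EXTRACTION, FORM_EXTRACTION, MANUAL_REVIEW }

    // Assumed mapping from classifier label to next step (labels are made up).
    private static final Map<String, Route> ROUTES = Map.of(
            "CONTRACT", Route.FULL_TEXT_EXTRACTION,
            "INVOICE", Route.FORM_EXTRACTION);

    // Assumed minimum confidence before the predicted label is trusted.
    private static final double MIN_CONFIDENCE = 0.80;

    static Route route(String label, double confidence) {
        if (label == null || confidence < MIN_CONFIDENCE) {
            return Route.MANUAL_REVIEW; // low confidence: do not guess
        }
        // Labels without a mapping also fall back to manual review.
        return ROUTES.getOrDefault(label, Route.MANUAL_REVIEW);
    }

    public static void main(String[] args) {
        System.out.println(route("INVOICE", 0.93)); // prints FORM_EXTRACTION
        System.out.println(route("INVOICE", 0.41)); // prints MANUAL_REVIEW
        System.out.println(route("RECEIPT", 0.99)); // prints MANUAL_REVIEW
    }
}
```

Keeping a fallback route for unknown labels and low-confidence results means a misclassified document is held for review rather than sent down the wrong branch.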

Train a sample classifier

Before deploying the sample, you will need to train a Comprehend classifier. A very simple reference example is located in /samples/trainer.csv. (It contains the plain text of three sample documents, one per row; in practice you will train with many more samples for each class.) To train the classifier:

  1. Visit the Comprehend console (switch to the region you wish to use)
  2. Click Train classifier
  3. Give it a name and check the other details (the defaults are fine to start - use a Multi-class classifier)
  4. Specify the S3 location of the training file (upload it first)
  5. Click Train classifier
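If you want to build a larger training file of your own, the rows can be generated with a few lines of Java. The sketch below assumes the usual two-column layout for Comprehend multi-class training files (class label first, then the document text, one document per row, no header row). The TrainingCsv class and the labels used in it are hypothetical and are not part of this repository.

```java
/** Hypothetical helper that formats rows for a multi-class training CSV. */
final class TrainingCsv {

    /** Formats one training row as: label,"document text". */
    static String row(String label, String text) {
        // Collapse line breaks so each document stays on a single row.
        String oneLine = text.replaceAll("\\s*[\\r\\n]+\\s*", " ").trim();
        // CSV quoting: wrap the text in double quotes and double any embedded quotes.
        String quoted = "\"" + oneLine.replace("\"", "\"\"") + "\"";
        return label + "," + quoted;
    }

    public static void main(String[] args) {
        // Made-up labels and text, for illustration only.
        System.out.println(row("INVOICE", "Invoice 42\nTotal due: 100.00"));
        System.out.println(row("LETTER", "Dear Sir,\nThank you for your \"prompt\" reply."));
    }
}
```

Quoting the text column matters because document text routinely contains commas, which would otherwise be read as extra columns.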

Once it's trained (it will take a few minutes), start an endpoint by clicking Create endpoint from the classifier's console page. (You can also use batch mode classification, which does not require a running endpoint but is not suitable for real-time processing.) Note the ARN of the classifier.

Build the project and bootstrap the CDK

From the project root directory, run the command: mvn install to compile the project and create the deployable Lambda artefacts. This will download all of the project dependencies first, so the initial run may take a few minutes. You should re-run this command each time you make an application change.

Once that is successful, you will need to bootstrap the CDK environment:

cdk bootstrap --profile YOUR_AWS_CLI_PROFILE_HERE

Make sure to update the command with your AWS CLI profile. (The Comprehend ARN is supplied later, in the deploy command.)

Permissions

The CDK application needs some permissions to be able to create the stack. Visit the IAM console to set these up. You can run the CDK deploy step iteratively and correct any missing permissions. I defined a group with these policies and assigned it to a user restricted by a permissions boundary on the account (so that it can only manage resources it created itself). You may restrict these further as appropriate.

  1. AmazonComprehendServicePolicy-InnovateClassifier
  2. AWSLambdaFullAccess
  3. IAMFullAccess
  4. AmazonS3FullAccess
  5. AmazonTextractFullAccess
  6. AmazonSNSFullAccess
  7. AWSStepFunctionsFullAccess
  8. AWSCloudFormationFullAccess

Note that these are only build-time permissions, required when running the CDK. The run-time permissions of the application are configured (and more restricted) in the CDK stack.

Deploy the application

Once that's done, run the command:

cdk --context ComprehendArn=YOUR_COMPREHEND_ARN_HERE --profile YOUR_AWS_CLI_PROFILE_HERE deploy
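The --context flag passes ComprehendArn to the CDK app as a context value, and a CDK app typically reads such values with tryGetContext, which returns null when the key is absent. A hedged sketch of failing fast on a missing or malformed value is shown below. It uses only the JDK; the ComprehendArnContext class is hypothetical (it is not part of this sample or of the CDK), and the ARN in main is a made-up example.

```java
/** Hypothetical validation for the ComprehendArn context value. */
final class ComprehendArnContext {

    /** Returns the ARN if it has the shape of a Comprehend ARN, otherwise throws. */
    static String require(Object contextValue) {
        if (!(contextValue instanceof String)) {
            throw new IllegalArgumentException(
                    "Missing context value; pass --context ComprehendArn=<arn>");
        }
        String arn = ((String) contextValue).trim();
        // General ARN shape: arn:partition:service:region:account-id:resource
        String[] parts = arn.split(":", 6);
        if (parts.length != 6 || !parts[0].equals("arn") || !parts[2].equals("comprehend")) {
            throw new IllegalArgumentException("Not a Comprehend ARN: " + arn);
        }
        return arn;
    }

    /** Extracts the region segment, e.g. to compare with the deployment region. */
    static String region(String arn) {
        return require(arn).split(":", 6)[3];
    }

    public static void main(String[] args) {
        // Made-up example ARN; substitute the one noted from the Comprehend console.
        String arn = "arn:aws:comprehend:us-east-1:123456789012:document-classifier/demo";
        System.out.println(region(arn)); // prints us-east-1
    }
}
```

A check like this catches the common mistake of leaving the YOUR_COMPREHEND_ARN_HERE placeholder in the command before any resources are created.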

Once the deployment completes, you'll see output like:

textract-comprehend-sample: deploying...                                                                                                  
[0%] start: Publishing 12b0d822082799daf4b7651c9ab809985e721f2575d572943da86f53a055f29b:current                                           
[100%] success: Published 12b0d822082799daf4b7651c9ab809985e721f2575d572943da86f53a055f29b:current                                        
textract-comprehend-sample: creating CloudFormation changeset...                                                                          
  0/11 | 12:53:56 pm | UPDATE_IN_PROGRESS   | AWS::CloudFormation::Stack       | textract-comprehend-sample User Initiated                
 0/11 Currently in progress: textract-comprehend-sample                                                                                   
  2/11 | 12:54:35 pm | UPDATE_IN_PROGRESS   | AWS::Lambda::Function            | FirstPageFunction (FirstPageFunctionBEE9EA11)            
  2/11 | 12:54:35 pm | UPDATE_IN_PROGRESS   | AWS::Lambda::Function            | StartTextractFunction (StartTextractFunctionCD0185FF)    
  2/11 | 12:54:36 pm | UPDATE_COMPLETE      | AWS::Lambda::Function            | FirstPageFunction (FirstPageFunctionBEE9EA11)            
  2/11 | 12:54:36 pm | UPDATE_COMPLETE      | AWS::Lambda::Function            | StartTextractFunction (StartTextractFunctionCD0185FF)    
  3/11 | 12:54:39 pm | UPDATE_IN_PROGRESS   | AWS::StepFunctions::StateMachine | UploadFlow (UploadFlow6C932FD7)                          
  3/11 | 12:54:41 pm | UPDATE_COMPLETE      | AWS::StepFunctions::StateMachine | UploadFlow (UploadFlow6C932FD7)                          
  5/11 | 12:54:45 pm | UPDATE_IN_PROGRESS   | AWS::Lambda::Function            | S3UploadListener (S3UploadListener4E242122)              
  5/11 | 12:54:45 pm | UPDATE_IN_PROGRESS   | AWS::Lambda::Function            | TextractCompletion (TextractCompletion5AE7AEDD)          
  5/11 | 12:54:46 pm | UPDATE_COMPLETE      | AWS::Lambda::Function            | S3UploadListener (S3UploadListener4E242122)              
  5/11 | 12:54:47 pm | UPDATE_COMPLETE      | AWS::Lambda::Function            | TextractCompletion (TextractCompletion5AE7AEDD)          
  6/11 | 12:54:49 pm | UPDATE_COMPLETE_CLEA | AWS::CloudFormation::Stack       | textract-comprehend-sample                               
  6/11 | 12:54:50 pm | UPDATE_COMPLETE      | AWS::CloudFormation::Stack       | textract-comprehend-sample                               
                                                                                                                                          
 ✅  textract-comprehend-sample                                                                                                            

Running the samples

After deployment, visit the deployed stack in the CloudFormation console. Check the different tabs for details on the stack including the different resources created.

When you upload a PDF (examples are in the samples directory) to the source S3 bucket that was created, a Step Functions flow will be started if everything was set up correctly. View the execution to see the flow of the document and the Lambda logs.
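One detail worth knowing when testing uploads: object keys in S3 event notifications are URL-encoded (a space arrives as +), so a listener generally needs to decode the key before using it to fetch the object. A minimal JDK-only sketch is below (Java 10 or later). The UploadKeys class is hypothetical and is not taken from the sample's S3UploadListener, and the PDF extension check is an assumption about what a listener might want to filter on.

```java
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

/** Hypothetical helpers for object keys received in S3 event notifications. */
final class UploadKeys {

    /** Decodes an object key as it appears in an S3 event notification. */
    static String decode(String eventKey) {
        // URLDecoder turns '+' into a space and expands %XX escapes.
        return URLDecoder.decode(eventKey, StandardCharsets.UTF_8);
    }

    /** True when the decoded key ends in .pdf (case-insensitive). */
    static boolean isPdf(String eventKey) {
        return decode(eventKey).toLowerCase(Locale.ROOT).endsWith(".pdf");
    }

    public static void main(String[] args) {
        System.out.println(decode("my+scan%281%29.pdf")); // prints my scan(1).pdf
        System.out.println(isPdf("notes.txt"));           // prints false
    }
}
```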

Congratulations! You have successfully created a scalable, serverless application stack using the CDK to intelligently process documents on demand.

Resources and pricing

Note that for as long as you have the stack deployed, charges may apply to your account. You should delete the resources (using cdk destroy) when you are done with the sample. You will need to empty the buckets before deletion, and you will also need to delete the Comprehend endpoint.
