This repository contains a sample library and code examples that show how to use Amazon Textract to extract text from documents and generate searchable PDF documents.
To generate a searchable PDF, we use Amazon Textract to extract text from a document and then add the extracted text as a layer over the image in the PDF document. Amazon Textract detects and analyzes text in input documents and returns information about detected items such as pages, words, lines, form data (key-value pairs), tables, and selection elements. It also provides bounding box information, an axis-aligned coarse representation of the location of each recognized item on the document page. We use the detected text and its bounding box to place the text at the matching position on the PDF page.
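The placement step described above can be illustrated with a minimal sketch. Amazon Textract reports each bounding box as Left, Top, Width, and Height ratios of the page size, measured from the top-left corner, while PDF user space by default has its origin at the bottom-left with y growing upward. The class and method names below (`BoundingBoxMapper`, `NormalizedBox`, `PdfRect`, `toPdfRect`) are hypothetical and are not taken from the PDFDocument library, which may handle this conversion differently.

```java
/** Hypothetical helper: maps a Textract-style normalized bounding box to PDF points. */
final class BoundingBoxMapper {

    /** Box expressed as ratios of page width/height, origin at the top-left corner. */
    static final class NormalizedBox {
        final double left, top, width, height;

        NormalizedBox(double left, double top, double width, double height) {
            this.left = left;
            this.top = top;
            this.width = width;
            this.height = height;
        }
    }

    /** Rectangle in PDF points, origin at the bottom-left corner of the page. */
    static final class PdfRect {
        final double x, y, width, height;

        PdfRect(double x, double y, double width, double height) {
            this.x = x;
            this.y = y;
            this.width = width;
            this.height = height;
        }
    }

    /** Scales the ratios to page units and flips the vertical axis. */
    static PdfRect toPdfRect(NormalizedBox box, double pageWidth, double pageHeight) {
        double x = box.left * pageWidth;
        double width = box.width * pageWidth;
        double height = box.height * pageHeight;
        // Textract's top grows downward; PDF's y grows upward.
        // The result's y is the bottom edge of the box in PDF space.
        double y = pageHeight - (box.top * pageHeight) - height;
        return new PdfRect(x, y, width, height);
    }

    public static void main(String[] args) {
        // Example page size: US Letter, 612 x 792 points.
        PdfRect r = toPdfRect(new NormalizedBox(0.1, 0.2, 0.5, 0.05), 612, 792);
        System.out.printf("x=%.1f y=%.1f w=%.1f h=%.1f%n", r.x, r.y, r.width, r.height);
    }
}
```

In a sketch like this, the resulting rectangle would be the position at which an invisible text run is drawn over the page image, so that selecting text in the viewer lines up with the visible words.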
SampleInput.pdf is an example input document where the text is locked inside the image. SampleOutput.pdf is an example of a searchable PDF document where you can select and copy text and search within the document.
The PDFDocument library wraps all the logic needed to generate a searchable PDF document from Amazon Textract output. It uses the open source Java library Apache PDFBox to create the PDF document, but similar PDF processing libraries are available in other programming languages.
...
// Extract text using Amazon Textract
List<TextLine> lines = extractText(imageBytes);

// Create a new PDF document
PDFDocument pdfDocument = new PDFDocument();

// Add a page with the image and its text layer to the PDF document
pdfDocument.addPage(image, imageType, lines);

// Save the PDF to local disk
try (OutputStream outputStream = new FileOutputStream(outputDocumentName)) {
    pdfDocument.save(outputStream);
    pdfDocument.close();
}
The sample project has five examples:
- Create a searchable PDF from an image on a local drive
- Create a searchable PDF from a PDF on a local drive
- Create a searchable PDF from an image in an Amazon S3 bucket
- Create a searchable PDF from a PDF in an Amazon S3 bucket
- Create a searchable PDF from a PDF in an Amazon S3 bucket by appending to the input document
- Set up an AWS account and the AWS CLI by following Getting Started with Amazon Textract.
- Download and unzip the sample project.
- Install Apache Maven if it is not already installed.
- In the project directory run "mvn package".
- Run "java -cp target/searchable-pdf-1.0.jar Demo" to start the Java project with Demo as the main class.
By default, only the first example (creating a searchable PDF from an image on a local drive) is enabled. Uncomment the relevant lines in Demo to run the other examples.
- Download and unzip the sample project.
- Install Apache Maven if it is not already installed.
- In the project directory run "mvn package".
The build creates a .jar file at project-dir/target/searchable-pdf-1.0.jar, using the information in pom.xml to do the necessary transforms. This is a standalone .jar (a .zip file) that includes all the dependencies. It is the deployment package that you upload to AWS Lambda to create a Lambda function. DemoLambda has all the code needed to read S3 events and take action based on the type of input document.
- Create an Amazon S3 bucket.
- Create a folder “documents” in the Amazon S3 bucket.
- Create an AWS Lambda function with the Java 8 runtime and an IAM role that has read and write permissions to the S3 bucket you created earlier.
- Configure the IAM role to have permissions to call Amazon Textract.
- Set the handler to "DemoLambda::handleRequest".
- Increase the timeout to 5 minutes.
- Upload the jar file you built earlier.
- Add a trigger to the Lambda function so that it runs when an object is uploaded to the “documents” folder in your Amazon S3 bucket.
Make sure that you set the trigger for the “documents” folder only. If you add a trigger for the whole bucket, the Lambda function runs every time an output PDF document is generated, which results in an endless cycle of invocations.
- Upload an image (jpeg, png) or PDF document to the “documents” folder in your Amazon S3 bucket.
In a few seconds you should see a searchable PDF document generated in the S3 bucket.
These steps show a simple Amazon S3 and Lambda integration. In production you should consider a scalable architecture similar to this reference architecture.
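The trigger cycle warned about in the steps above can also be guarded against in code. The following is a minimal sketch, assuming that input objects live under a "documents/" key prefix and that output documents are written outside it; the class `S3KeyRouter`, its `InputType` enum, and the `classify` method are hypothetical names for illustration and are not the actual DemoLambda implementation.

```java
import java.util.Locale;

/** Hypothetical helper: decides how an S3 object key should be handled. */
final class S3KeyRouter {

    enum InputType { IMAGE, PDF, SKIP }

    /**
     * Returns the kind of processing a key needs.
     * Keys outside inputPrefix, or with unsupported extensions, are skipped,
     * so generated output never triggers another round of processing.
     */
    static InputType classify(String key, String inputPrefix) {
        if (key == null || !key.startsWith(inputPrefix)) {
            return InputType.SKIP;
        }
        // Compare extensions case-insensitively (e.g. "scan.PNG").
        String lower = key.toLowerCase(Locale.ROOT);
        if (lower.endsWith(".pdf")) {
            return InputType.PDF;
        }
        if (lower.endsWith(".jpg") || lower.endsWith(".jpeg") || lower.endsWith(".png")) {
            return InputType.IMAGE;
        }
        return InputType.SKIP;
    }

    public static void main(String[] args) {
        System.out.println(classify("documents/invoice.pdf", "documents/"));
        System.out.println(classify("output/invoice-searchable.pdf", "documents/"));
    }
}
```

A check like this complements the prefix filter on the trigger rather than replacing it: the filter avoids unnecessary invocations, while the in-code check protects against a misconfigured trigger.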
- As you run these samples, they call Amazon Textract APIs in your AWS account. You are charged for all API calls made as part of the analysis.
- Large scale document processing with Amazon Textract - Reference Architecture
- Amazon Textract code samples
- Batch processing tool
- JSON response parser
This library is licensed under the MIT-0 License. See the LICENSE file.