
Workshop | Lab 0 | Lab 1 | Lab 2

LAB 3 - Asynchronous - Index documents and entities in Elasticsearch

Amazon Elasticsearch Service is a managed version of Elasticsearch, the well-known search engine built on the Lucene library. It can index billions of documents and offers near real-time search over them. In this lab, we will use it to store the content of our scanned documents and the associated entities.

Elasticsearch & Kibana

We first need to set up an Elasticsearch domain (a cluster) and secure the Kibana console with Cognito. The following CloudFormation template will set up everything for you. Just type your email address (use a valid address you can access) and a name for the domain when prompted.

Region | Button
us-east-1 | Launch stack in us-east-1
eu-west-1 | Launch stack in eu-west-1
ap-southeast-1 | Launch stack in ap-southeast-1

In the last step, you will need to check several checkboxes to allow the creation of IAM resources:

Capabilities

It may take a few minutes to deploy everything (you can look through the rest of the lab in the meantime, but you will need the resources to be ready to complete it). In the CloudFormation console, in the Outputs tab, you should see the following. Keep this information in a safe place for later use (copy/paste it into a text document or keep the browser tab open). You should also receive an email with a password to access the Kibana interface.

CloudFormation outputs for Elasticsearch and Role
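If you prefer to retrieve these outputs programmatically, here is a minimal sketch using boto3; the stack name es-stack is an assumption, use the name you gave your stack:

import boto3

# Hypothetical stack name: replace with the name you gave your stack.
cloudformation = boto3.client("cloudformation")
stack = cloudformation.describe_stacks(StackName="es-stack")["Stacks"][0]

# Print every output (Kibana URL, domain endpoint, role ARN, ...).
for output in stack["Outputs"]:
    print(output["OutputKey"], "=", output["OutputValue"])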

More details on Cognito authentication for Kibana here.

Architecture

Asynchronous Architecture

In this lab, we will focus on step 9, in which we index the data in Elasticsearch. See labs 1 and 2 for the previous steps.

Dependencies for the Lambda function

As the function will interact with Elasticsearch, we need to provide some libraries. We'll do that using a layer. In Lambda, click on your documentAnalysis function, then click Layers and Add a layer:

Layer

In the new window, select the "ElasticLibs" layer, click Add and don't forget to Save the Lambda.

We'll also need to provide the URL of the Elasticsearch domain. Scroll down to Environment variables and add the following variable (key: ELASTIC_SEARCH_HOST, value: the URL you got from CloudFormation):

Environment
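As an alternative to the console, both changes (the layer and the environment variable) can be scripted with boto3. This is only a sketch: the layer ARN and host URL below are placeholders, and note that update_function_configuration replaces the whole set of environment variables:

import boto3

lambda_client = boto3.client("lambda")

# Placeholder values: use the ElasticLibs layer ARN from your account and the
# domain URL from the CloudFormation outputs.
lambda_client.update_function_configuration(
    FunctionName="documentAnalysis",
    Layers=["arn:aws:lambda:us-east-1:111111111111:layer:ElasticLibs:1"],
    Environment={"Variables": {
        "ELASTIC_SEARCH_HOST": "https://search-apollodocumentsearch-xxxxxxxx.us-east-1.es.amazonaws.com/"
    }}
)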

Permissions

The function needs permissions to access Elasticsearch. As mentioned above, the domain is currently protected with Cognito. Go to the Elasticsearch Service console, select your domain, then click Modify access policy:

Elasticsearch console

In the policy editor, we will add permissions (es:ESHttpPost) for the Lambda execution role. Add the following block of JSON to the existing one (within the Statement array):

,
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::111111111111:role/textract-index-stack-LambdaExecutionRole-12A34B56D78E"
      },
      "Action": "es:ESHttpPost",
      "Resource": "arn:aws:es:us-east-1:111111111111:domain/apollodocumentsearch/*"
    }

a. Replace the AWS principal ARN value with the execution role ARN of your Lambda function. You can find it in your Lambda function by clicking the View the TextractApolloWorkshopStack-... link:

Lambda execution role

b. Replace "111111111111" with your account ID (you can see it in the JSON block already present).

c. Replace "apollodocumentsearch" with the name of the Elasticsearch domain created in the stack (see the CloudFormation outputs).

In the end, you should have something like the following (with your own values); do not copy/paste this block as-is:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:sts::111111111111:assumed-role/es-stack-CognitoAuthorizedRole-1AB2CD3EF4GH/CognitoIdentityCredentials"
      },
      "Action": "es:*",
      "Resource": "arn:aws:es:us-east-1:111111111111:domain/apollodocumentsearch/*"
    },
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::111111111111:role/textract-index-stack-LambdaExecutionRole-12A34B56D78E"
      },
      "Action": "es:ESHttpPost",
      "Resource": "arn:aws:es:us-east-1:111111111111:domain/apollodocumentsearch/*"
    }
  ]
}

Click Submit at the bottom right of the page and wait a few seconds for the change to take effect (the domain status needs to be "Active" again).
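If you want to check the status without refreshing the console, here is a minimal boto3 sketch (the domain name below is the example one; use your own):

import boto3

es_client = boto3.client("es")

# "apollodocumentsearch" is the example domain name: use your own.
domain = es_client.describe_elasticsearch_domain(DomainName="apollodocumentsearch")

# Processing is True while the new access policy is being applied.
print("Still processing:", domain["DomainStatus"]["Processing"])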

Update the documentAnalysis code

Back in your documentAnalysis Lambda function, in the inline code editor, click File, New file, and paste the following code:

import boto3
import os

import requests
from requests_aws4auth import AWS4Auth

# Sign requests with the Lambda execution role's credentials (SigV4).
region = os.environ["AWS_REGION"]
service = "es"
credentials = boto3.Session().get_credentials()
awsauth = AWS4Auth(credentials.access_key,
                   credentials.secret_key,
                   region,
                   service,
                   session_token=credentials.token)

# Build the index URL from the domain endpoint passed as an environment variable.
elastic_search_host = os.environ["ELASTIC_SEARCH_HOST"]
index = "documents"
doc_type = "doc"
headers = {"Content-Type": "application/json"}
elastic_url = elastic_search_host.rstrip("/") + "/" + index + "/" + doc_type


class DocumentIndexer():
    def index(self, document):
        """ Index the full document (pages and entities) in Elasticsearch """

        response = requests.post(elastic_url,
                                 auth=awsauth,
                                 json=document,
                                 headers=headers)
        response.raise_for_status()

        es_response = response.json()
        return es_response["_id"]

A few things to notice:

  • This class is dedicated to indexing the analyzed document. We will use it in the main Lambda function.

  • We could also use the Elasticsearch Python client, but since we only do an HTTP POST, we keep it simple and use Python's Requests library (see the sketch after this list).

  • We use Signature Version 4 signing (AWS4Auth) to add the Authorization header to the HTTP request.

  • We retrieve the environment variable containing the Elasticsearch domain URL (os.environ["ELASTIC_SEARCH_HOST"]) and build the URL of the index.

  • The rest is pretty straightforward: we do an HTTP POST with the appropriate parameters: URL, authorization, headers, and the document itself.
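For reference, here is what the same indexing call could look like with the elasticsearch-py client instead of Requests. This is a sketch, not part of the lab; it assumes the awsauth and elastic_search_host variables defined in document_indexer.py above:

from elasticsearch import Elasticsearch, RequestsHttpConnection

es = Elasticsearch(
    hosts=[elastic_search_host],
    http_auth=awsauth,
    use_ssl=True,
    verify_certs=True,
    connection_class=RequestsHttpConnection
)

def index_document(document):
    # Same operation as DocumentIndexer.index, through the client library.
    response = es.index(index="documents", doc_type="doc", body=document)
    return response["_id"]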

Once you're comfortable with the code, click File, Save, and use document_indexer.py as the filename.

In the lambda_function.py file, add the following code at the top:

from document_indexer import DocumentIndexer
document_indexer = DocumentIndexer()

And add the following at the end of the lambda_handler function:

    doc = {
        "bucket": message['DocumentLocation']['S3Bucket'],
        "document": message['DocumentLocation']['S3ObjectName'],
        "size": len(list(pages.values())),
        "jobId": jobId,
        "pages": list(pages.values()),
        "entities": entities
    }

    print(doc)

    docId = document_indexer.index(doc)

    return {
        "jobId": jobId,
        "docId": docId
    }

Here we build the document that will be indexed: a JSON object containing information about the document itself (bucket and object key), the extracted text (pages) and the entities found by Comprehend. Finally, we index it.
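For illustration only, an indexed document could look like the following; every value here is made up, and the exact shape of the entities depends on what you stored in lab 2:

doc = {
    "bucket": "textract-apollo-workshop-documents",  # made-up bucket name
    "document": "apollo11-flight-plan.pdf",          # made-up object key
    "size": 2,                                       # number of pages
    "jobId": "1234567890abcdef",                     # made-up Textract job id
    "pages": ["APOLLO 11 FLIGHT PLAN ...", "..."],   # extracted text, one entry per page
    "entities": ["Apollo 11", "July 16, 1969"]       # entities found by Comprehend
}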

Hit Save in the top right corner of the screen and then click Test. Observe the result in CloudWatch logs.

Then open the URL of Kibana (provided in the CloudFormation outputs). You will need the password received by email to log in (the final dot in the email is not part of the password). Your username is your email address. After the first login to Kibana, you will be asked to change your password.

Click on Discover in the upper left; you will be asked to create an index pattern (type "documents", then go to Next step and validate):

Kibana index pattern

If you go back to Discover in the upper left, you should be able to see the content of the document you've just pushed to S3, plus the different entities and some metadata:

Kibana

You can also use the search bar and its query language to search for something:

Kibana search
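For example, assuming the document structure indexed above, a Lucene query such as entities:"Apollo" or pages:moon narrows the results to documents whose entities or pages contain those terms.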

You can also upload one of the documents and see the same result.

Congratulations! The full process is done:

  • The content has been extracted by Amazon Textract,
  • Amazon Comprehend extracted the entities,
  • And Amazon Elasticsearch Service indexed it

Once your data is indexed in Elasticsearch, you can build any kind of application that searches it.

Exploring further options

In this workshop, we mainly worked with three services (Amazon Textract, Amazon Comprehend and Amazon Elasticsearch Service), but you could leverage other services to add more features.

Cleanup

Clean up your resources