This is an example of building a Named Entity Recognizer (NER) pipeline that:
- fetches data from an Amazon Kinesis stream
- processes text fields with Stanford CoreNLP and extracts entities
- stores the results in a DynamoDB table
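The three stages above can be pictured as a small fetch, extract, store loop. The following is only a sketch with in-memory stand-ins: NerPipelineSketch, NewsRecord, EntitySink, and run are hypothetical names, not classes from this project, and the extractor here is a placeholder rather than a CoreNLP call.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical three-stage skeleton: fetch -> extract -> store. */
public class NerPipelineSketch {

    /** One input record; the fields mirror the sample JSON format. */
    public static final class NewsRecord {
        public final String uuid;
        public final String title;
        public final String text;

        public NewsRecord(String uuid, String title, String text) {
            this.uuid = uuid;
            this.title = title;
            this.text = text;
        }
    }

    /** Stand-in for the storage stage (a DynamoDB table in the real pipeline). */
    public interface EntitySink {
        void store(String uuid, List<String> entities);
    }

    /** Runs every record through the extractor and hands the result to the sink. */
    public static void run(Iterable<NewsRecord> source,
                           Function<String, List<String>> extractor,
                           EntitySink sink) {
        for (NewsRecord r : source) {
            // Assumption: both text fields (title and text) are processed.
            List<String> entities = new ArrayList<>(extractor.apply(r.title));
            entities.addAll(extractor.apply(r.text));
            sink.store(r.uuid, entities);
        }
    }

    public static void main(String[] args) {
        // In-memory stand-ins; a real run would read from Kinesis,
        // call CoreNLP, and write to DynamoDB.
        List<NewsRecord> source =
                List.of(new NewsRecord("id-1", "news title", "news text"));
        Function<String, List<String>> placeholder = s -> List.of(); // finds nothing
        Map<String, List<String>> table = new LinkedHashMap<>();
        run(source, placeholder, table::put);
        System.out.println(table);
    }
}
```

Keeping the extractor and the sink behind small interfaces makes it possible to exercise the flow locally before any AWS resources exist.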
- To build, run:
mvn package
- Maven will generate a zip file, kinesis-ner-{version}-SNAPSHOT-package.zip, in the target/ folder. The zip file contains all dependencies except the Stanford NLP model jar, since it is too large.
(Note: To include the NLP models in the package, remove the provided scope for the CoreNLP models in pom.xml before you run mvn package. Then you can skip step 3.)
- Download the CoreNLP model.
- Unzip the package from step 2.
- Make sure your machine has permission to create, read, and write Kinesis streams and DynamoDB tables.
- Make sure all jars are in your classpath and run:
java -Xmx1536m -cp {class_path} com.chyikwei.app.KinesisNerApplication
(Note: the process will use about 1 GB of RAM.)
- Put some data into the stream. The sample format is JSON with uuid, title, and text fields. Example:
{
"uuid": "04947df8-0e9e-4471-a2f9-9af509fb5801",
"title": "news title",
"text": "news text"
}
- Check the entities extracted by CoreNLP. They will be stored in the DynamoDB ddb-news-entities table.
- Clean up AWS resources (Kinesis stream, DynamoDB tables) after testing. (The settings for the stream and table names are here)
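For the step that puts data into the stream, the sample record can also be produced in code. The following is a minimal sketch using only the Java standard library: SampleRecord, escape, and toJson are hypothetical helper names, and UTF-8 is an assumed encoding for the payload. Sending the bytes to Kinesis requires the AWS SDK or CLI and is not shown.

```java
import java.nio.charset.StandardCharsets;
import java.util.UUID;

/** Builds a sample record in the uuid/title/text JSON format. */
public class SampleRecord {

    /** Escapes the characters that must be escaped inside a JSON string. */
    public static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                case '\t': sb.append("\\t"); break;
                default:
                    if (c < 0x20) {
                        // Remaining control characters use the 4-digit hex form.
                        sb.append(String.format("\\u%04x", (int) c));
                    } else {
                        sb.append(c);
                    }
            }
        }
        return sb.toString();
    }

    /** Returns a one-line JSON object with uuid, title, and text fields. */
    public static String toJson(String uuid, String title, String text) {
        return "{\"uuid\": \"" + escape(uuid)
                + "\", \"title\": \"" + escape(title)
                + "\", \"text\": \"" + escape(text) + "\"}";
    }

    public static void main(String[] args) {
        String json = toJson(UUID.randomUUID().toString(), "news title", "news text");
        // Assumption: the record payload is the UTF-8 bytes of the JSON text.
        byte[] payload = json.getBytes(StandardCharsets.UTF_8);
        System.out.println(json + " (" + payload.length + " bytes)");
    }
}
```

Generating a fresh uuid per record keeps test records distinguishable when checking the results table afterwards.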