Using AWS Comprehend as a source of models #233
-
First of all, thank you so much for open-sourcing this library. I am planning to leverage it and tried to invoke it from AWS Lambda. As called out in the README, when using the Stanford NLP models the package size goes north of 400 MB, crossing the limit permitted by AWS Lambda. I tried to use AWS Comprehend as suggested. Below is how my setup looks.
I still get a RuntimeException saying the model files are missing; the stack trace is below. Could you please help resolve this issue?
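For reference, here is a minimal sketch of what calling Comprehend directly through the AWS SDK for Java v2 looks like (illustrative only; the class name and region are placeholders, not my actual code):

import software.amazon.awssdk.regions.Region;
import software.amazon.awssdk.services.comprehend.ComprehendClient;
import software.amazon.awssdk.services.comprehend.model.DetectSyntaxRequest;
import software.amazon.awssdk.services.comprehend.model.DetectSyntaxResponse;
import software.amazon.awssdk.services.comprehend.model.SyntaxLanguageCode;

public class ComprehendSyntaxDemo {

    public static void main(String[] args) {
        // Comprehend does the tagging server-side, so no model files
        // need to ship inside the Lambda deployment package.
        try (ComprehendClient client = ComprehendClient.builder()
                .region(Region.US_EAST_1) // placeholder region
                .build()) {

            DetectSyntaxResponse response = client.detectSyntax(
                    DetectSyntaxRequest.builder()
                            .text("The quick brown fox jumps over the lazy dog.")
                            .languageCode(SyntaxLanguageCode.EN)
                            .build());

            // Print each token with its detected part of speech.
            response.syntaxTokens().forEach(token ->
                    System.out.println(token.text() + " -> "
                            + token.partOfSpeech().tagAsString()));
        }
    }
}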
-
@AdityaReddyY In retrospect, I don't think this library should support AWS Comprehend, since that would add cost. We can keep using the CoreNLP library if we remove all of the bloat. For example, I removed unused dependencies and used the shade plugin to strip the unused model packages, like this:
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 https://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>com.example</groupId>
<artifactId>demo</artifactId>
<version>0.0.1-SNAPSHOT</version>
<properties>
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
<maven.compiler.source>11</maven.compiler.source>
<maven.compiler.target>11</maven.compiler.target>
<maven.compiler.release>11</maven.compiler.release>
</properties>
<dependencies>
<dependency>
<groupId>io.whelk.flesch.kincaid</groupId>
<artifactId>whelk-flesch-kincaid</artifactId>
<version>0.1.6</version>
</dependency>
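<!-- CoreNLP itself, with all transitive dependencies excluded via the
     *:* wildcard; anything it still needs at runtime (e.g. protobuf
     below) is re-added explicitly. -->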
<dependency>
<groupId>edu.stanford.nlp</groupId>
<artifactId>stanford-corenlp</artifactId>
<version>4.3.2</version>
<exclusions>
<exclusion>
<groupId>*</groupId>
<artifactId>*</artifactId>
</exclusion>
</exclusions>
</dependency>
<dependency>
<groupId>edu.stanford.nlp</groupId>
<artifactId>stanford-corenlp</artifactId>
<version>4.3.2</version>
<classifier>models</classifier>
<exclusions>
<exclusion>
<groupId>*</groupId>
<artifactId>*</artifactId>
</exclusion>
</exclusions>
</dependency>
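<!-- Re-added explicitly: CoreNLP still uses protobuf at runtime, and the
     wildcard exclusions above dropped it. -->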
<dependency>
<groupId>com.google.protobuf</groupId>
<artifactId>protobuf-java</artifactId>
<version>3.11.4</version>
</dependency>
</dependencies>
<build>
<plugins>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-shade-plugin</artifactId>
<version>2.4.2</version>
<executions>
<execution>
<phase>package</phase>
<goals>
<goal>shade</goal>
</goals>
</execution>
</executions>
<configuration>
<transformers>
<transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
<mainClass>com.example.demo.Example</mainClass>
</transformer>
</transformers>
<filters>
<filter>
<artifact>*:*</artifact>
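<!-- Strip the model files for annotators that readability scoring never
     touches (coref, NER, parsers, sentiment, SUTime, ...); note the
     pos-tagger models are deliberately not excluded. -->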
<excludes>
<exclude>edu/stanford/nlp/models/coref/*</exclude>
<exclude>edu/stanford/nlp/models/coref/fastneural/*</exclude>
<exclude>edu/stanford/nlp/models/coref/neural/*</exclude>
<exclude>edu/stanford/nlp/models/coref/statistical/*</exclude>
<exclude>edu/stanford/nlp/models/ner/*</exclude>
<exclude>edu/stanford/nlp/models/dcoref/*</exclude>
<exclude>edu/stanford/nlp/models/gender/*</exclude>
<exclude>edu/stanford/nlp/models/kbp/english/*</exclude>
<exclude>edu/stanford/nlp/models/kbp/english/gazetteers/*</exclude>
<exclude>edu/stanford/nlp/models/kbp/english/semgrex/*</exclude>
<exclude>edu/stanford/nlp/models/kbp/english/tokensregex/*</exclude>
<exclude>edu/stanford/nlp/models/lexparser/*</exclude>
<exclude>edu/stanford/nlp/models/naturalli/*</exclude>
<exclude>edu/stanford/nlp/models/naturalli/affinities/*</exclude>
<exclude>edu/stanford/nlp/models/parser/nndep/*</exclude>
<exclude>edu/stanford/nlp/models/quoteattribution/*</exclude>
<exclude>edu/stanford/nlp/models/sentiment/*</exclude>
<exclude>edu/stanford/nlp/models/supervised_relation_extractor/*</exclude>
<exclude>edu/stanford/nlp/models/sutime/*</exclude>
<exclude>edu/stanford/nlp/models/truecase/*</exclude>
<exclude>edu/stanford/nlp/models/ud/*</exclude>
<exclude>edu/stanford/nlp/models/upos/*</exclude>
</excludes>
</filter>
</filters>
</configuration>
</plugin>
</plugins>
</build>
</project>
This resulted in a jar of roughly 16 MB, well within the 250 MB (unzipped) upper limit for AWS Lambda deployment packages. Hope this helps!
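For completeness, the com.example.demo.Example main class referenced in the shade configuration can be as small as the sketch below (assuming whelk-flesch-kincaid's ReadabilityCalculator entry points; the sample text is arbitrary):

package com.example.demo;

import io.whelk.flesch.kincaid.ReadabilityCalculator;

public class Example {

    public static void main(String[] args) {
        String content = "The quick brown fox jumps over the lazy dog. "
                + "It never once looked back at the farmer's field.";

        // Both calls drive CoreNLP under the hood to split sentences and
        // tokens, which is why the pos-tagger models stay in the jar.
        double ease = ReadabilityCalculator.calculateReadingEase(content);
        double grade = ReadabilityCalculator.calculateGradeLevel(content);

        System.out.printf("Reading ease: %.2f, grade level: %.2f%n", ease, grade);
    }
}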