Examples showing how to use the Readr REST API to connect with Readr in the cloud. The API makes it possible to push large document collections into the Readr cloud, where you can explore the collections and create annotations as well as extraction patterns. Annotations and patterns can later be downloaded using the REST API.
readr-connect has been tested on Mac OS X with Scala 2.10.4 and sbt 0.13.
To run the examples, you must first set the user, password, and ns fields in conf/application.conf. The ns field (namespace) is automatically generated when you create an account on readr.com.
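For reference, a minimal conf/application.conf might look like the sketch below. The layout is an assumption; only the three field names (user, password, ns) come from this README, so check the shipped config file for the actual structure.

    # Hypothetical layout; only the field names are given in this README.
    user = "you@example.com"        # your readr.com account
    password = "your-password"
    ns = "your-namespace"           # generated when you create an account on readr.com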
Start by creating a project
sbt "runMain example.CreateProject1"
source
and adding a document
sbt "runMain example.AddDocument1"
source
These two examples actually do slightly more than that: the first creates a project with a set of default annotation layers, and the second precomputes these annotations when storing the document. The annotations include tokenization, lemmatization, and more. For finer-grained control over annotations, see source and source.
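Under the hood, these examples talk to the REST API over HTTP. As a purely hypothetical sketch of such a call (the endpoint path, payload shape, and authentication scheme below are assumptions, not the actual Readr API; the real requests live in the linked sources), pushing a document could look roughly like this:

    import java.net.{HttpURLConnection, URL}
    import java.nio.charset.StandardCharsets
    import javax.xml.bind.DatatypeConverter

    object PushDocumentSketch {
      def main(args: Array[String]): Unit = {
        // Hypothetical endpoint; the real paths are defined in the example sources.
        val url = new URL("https://readr.com/api/your-namespace/projects/demo/documents")
        val conn = url.openConnection().asInstanceOf[HttpURLConnection]
        conn.setRequestMethod("POST")
        // HTTP basic auth built from the user/password fields in conf/application.conf
        // (the auth scheme is an assumption).
        val credentials = DatatypeConverter.printBase64Binary(
          "user:password".getBytes(StandardCharsets.UTF_8))
        conn.setRequestProperty("Authorization", "Basic " + credentials)
        conn.setRequestProperty("Content-Type", "application/json")
        conn.setDoOutput(true)
        // Hypothetical payload shape.
        val body = """{"name": "doc1", "text": "Readr annotates documents."}"""
        conn.getOutputStream.write(body.getBytes(StandardCharsets.UTF_8))
        println("HTTP " + conn.getResponseCode)
      }
    }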
Next, we will create a semantic frame with an extraction pattern
sbt "runMain example.CreateFrameWithPattern"
source
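To give a feel for what such a frame pairs together, here is an illustrative stand-in (these case classes and the pattern syntax are inventions for illustration, not the readr-connect model; see the linked source for the real definition):

    // Illustrative only: a semantic frame pairs a relation name with slot
    // roles and a surface pattern that fills those slots from text.
    object FrameSketch {
      case class Slot(role: String)
      case class Frame(name: String, slots: Seq[Slot], pattern: String)

      val acquisition = Frame(
        name = "Acquisition",
        slots = Seq(Slot("buyer"), Slot("acquired")),
        pattern = "{buyer} acquired {acquired}"   // hypothetical pattern syntax
      )

      def main(args: Array[String]): Unit = println(acquisition)
    }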
and fetch the matches for our extraction pattern
sbt "runMain example.FetchPatternMatches"
source
At this point, you can also validate a few of the generated matches using the web interface, or create additional annotations. These can then be retrieved using
sbt "runMain example.FetchPatternAnnotations1"
source
The previous example writes these annotations to the screen, but of course we can also store them in a file (source) and later push them back into the cloud (source). While these examples handle a single given frame, we can also fetch and write back all frames, patterns, and annotations at once, as shown in examples source and source.
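A minimal sketch of that store-and-push-back round trip, using only the standard library; fetchAnnotations and pushAnnotations are hypothetical stand-ins for the REST calls made in the linked sources:

    import java.nio.file.{Files, Paths}
    import java.nio.charset.StandardCharsets

    object AnnotationRoundTripSketch {
      // Hypothetical stand-ins for the REST calls made by the examples.
      def fetchAnnotations(): String = """[{"frame": "...", "match": "..."}]"""
      def pushAnnotations(json: String): Unit = println(s"pushing ${json.length} bytes")

      def main(args: Array[String]): Unit = {
        // Store the fetched annotations in a local file...
        val json = fetchAnnotations()
        Files.write(Paths.get("annotations.json"), json.getBytes(StandardCharsets.UTF_8))
        // ...and later read them back and push them into the cloud.
        val restored = new String(
          Files.readAllBytes(Paths.get("annotations.json")), StandardCharsets.UTF_8)
        pushAnnotations(restored)
      }
    }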
For large corpora, we recommend doing all preprocessing locally (or on another cluster) and then pushing the results to the Readr cloud. For processing we use Apache Spark; you must have Spark installed if you would like to process and push new datasets to Readr. Fetch Spark at https://spark.apache.org/downloads.html. We have tested our system on Spark 1.0.2.
Start by converting your documents into the readr format.
sbt "runMain example.large.CreateSource"
source
Then, run your processing on Apache Spark. The readr-spark project contains more information on how this is done.
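As a rough illustration of what a Spark 1.0.x driver program looks like (the tokenization step and the paths are placeholders; the actual pipeline is in readr-spark):

    import org.apache.spark.{SparkConf, SparkContext}

    object PreprocessSketch {
      def main(args: Array[String]): Unit = {
        // Spark 1.0.x API: SparkConf + SparkContext (no SparkSession yet).
        val conf = new SparkConf().setAppName("readr-preprocess").setMaster("local[*]")
        val sc = new SparkContext(conf)
        // Placeholder pipeline: one raw document per line, whitespace tokenization,
        // tokens joined with tabs as a stand-in for real annotation output.
        val docs = sc.textFile("data/raw-docs.txt")
        val tokenized = docs.map(_.split("\\s+").mkString("\t"))
        tokenized.saveAsTextFile("data/preprocessed")
        sc.stop()
      }
    }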
Finally, you can push the results to Readr cloud.
sbt "runMain example.large.CreateDB"
source
Spark uses Kryo for efficient serialization and deserialization of objects. We can also fetch and write back frames in Kryo format. This makes it easy to generate a large number of frames, for example based on an existing resource; a sketch of the round trip follows the commands below.
sbt "runMain example.large.FetchFrames"
source
sbt "runMain example.large.PutFrames"
source
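As a sketch of the Kryo round trip mentioned above (the Frame class here is a placeholder; the real frame classes ship with readr-connect):

    import com.esotericsoftware.kryo.Kryo
    import com.esotericsoftware.kryo.io.{Input, Output}
    import java.io.{FileInputStream, FileOutputStream}

    // Placeholder frame class; Kryo's default instantiation needs a
    // no-arg constructor, so plain vars are used instead of a case class.
    class Frame {
      var name: String = _
      var pattern: String = _
    }

    object KryoFramesSketch {
      def main(args: Array[String]): Unit = {
        val kryo = new Kryo()
        kryo.register(classOf[Frame])
        val frame = new Frame
        frame.name = "Acquisition"
        frame.pattern = "{buyer} acquired {acquired}"
        // Write the frame to disk in Kryo's binary format...
        val out = new Output(new FileOutputStream("frames.kryo"))
        kryo.writeObject(out, frame)
        out.close()
        // ...and read it back before pushing it to the cloud.
        val in = new Input(new FileInputStream("frames.kryo"))
        val restored = kryo.readObject(in, classOf[Frame])
        in.close()
        println(restored.name + ": " + restored.pattern)
      }
    }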