Examples showing how to use the Readr REST API to connect with Readr in the cloud. The API makes it possible to push large document collections into the Readr cloud, where you can explore the collections and create annotations as well as extraction patterns. Annotations and patterns can later be downloaded using the REST API.
readr-connect has been tested on Mac OS X with Scala 2.10.4 and sbt 0.13.
To run the examples, you must first set the user, password, and ns fields in conf/application.conf. The ns field (namespace) is automatically generated when you create an account on readr.com.
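For reference, a minimal conf/application.conf might look like the sketch below. The layout is an assumption; only the three field names (user, password, ns) come from this README, so check the shipped config file for the actual structure.

    # Hypothetical layout; only the field names are given in this README.
    user = "you@example.com"        # your readr.com account
    password = "your-password"
    ns = "your-namespace"           # generated when you create an account on readr.com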
Start by creating a project
sbt "runMain example.CreateProject1"
source
and adding a document
sbt "runMain example.AddDocument1"
source
These two examples actually do slightly more than that: the first creates a project with a set of default annotation layers, and the second precomputes these annotations when storing the document. The annotations include tokenization, lemmatization, and more. For finer-grained control over annotations, see source and source.
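Under the hood, these examples talk to the REST API over HTTP. As a purely hypothetical sketch of such a call (the endpoint path, payload shape, and authentication scheme below are assumptions, not the actual Readr API; the real requests live in the linked sources), pushing a document could look roughly like this:

    import java.net.{HttpURLConnection, URL}
    import java.nio.charset.StandardCharsets
    import javax.xml.bind.DatatypeConverter

    object PushDocumentSketch {
      def main(args: Array[String]): Unit = {
        // Hypothetical endpoint; the real paths are defined in the example sources.
        val url = new URL("https://readr.com/api/your-namespace/projects/demo/documents")
        val conn = url.openConnection().asInstanceOf[HttpURLConnection]
        conn.setRequestMethod("POST")
        // HTTP basic auth built from the user/password fields in conf/application.conf
        // (the auth scheme is an assumption).
        val credentials = DatatypeConverter.printBase64Binary(
          "user:password".getBytes(StandardCharsets.UTF_8))
        conn.setRequestProperty("Authorization", "Basic " + credentials)
        conn.setRequestProperty("Content-Type", "application/json")
        conn.setDoOutput(true)
        // Hypothetical payload shape.
        val body = """{"name": "doc1", "text": "Readr annotates documents."}"""
        conn.getOutputStream.write(body.getBytes(StandardCharsets.UTF_8))
        println("HTTP " + conn.getResponseCode)
      }
    }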
Next, we will create a semantic frame with an extraction pattern
sbt "runMain example.CreateFrameWithPattern"
source
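To give a feel for what such a frame pairs together, here is an illustrative stand-in (these case classes and the pattern syntax are inventions for illustration, not the readr-connect model; see the linked source for the real definition):

    // Illustrative only: a semantic frame pairs a relation name with slot
    // roles and a surface pattern that fills those slots from text.
    object FrameSketch {
      case class Slot(role: String)
      case class Frame(name: String, slots: Seq[Slot], pattern: String)

      val acquisition = Frame(
        name = "Acquisition",
        slots = Seq(Slot("buyer"), Slot("acquired")),
        pattern = "{buyer} acquired {acquired}"   // hypothetical pattern syntax
      )

      def main(args: Array[String]): Unit = println(acquisition)
    }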
and fetch the matches for our extraction pattern
sbt "runMain example.FetchPatternMatches"
source
At this point, you can also validate a few of the generated matches using the web interface, or create additional annotations. These can then be retrieved using
sbt "runMain example.FetchPatternAnnotations1"
source
The previous example writes these annotations to the screen, but of course we can also store them in a file (source) and later push them back into the cloud (source). While these examples handle a single given frame, we can also fetch and write back all frames, patterns, and annotations at once, as shown in examples source and source.
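A minimal sketch of that store-and-push-back round trip, using only the standard library; fetchAnnotations and pushAnnotations are hypothetical stand-ins for the REST calls made in the linked sources:

    import java.nio.file.{Files, Paths}
    import java.nio.charset.StandardCharsets

    object AnnotationRoundTripSketch {
      // Hypothetical stand-ins for the REST calls made by the examples.
      def fetchAnnotations(): String = """[{"frame": "...", "match": "..."}]"""
      def pushAnnotations(json: String): Unit = println(s"pushing ${json.length} bytes")

      def main(args: Array[String]): Unit = {
        // Store the fetched annotations in a local file...
        val json = fetchAnnotations()
        Files.write(Paths.get("annotations.json"), json.getBytes(StandardCharsets.UTF_8))
        // ...and later read them back and push them into the cloud.
        val restored = new String(
          Files.readAllBytes(Paths.get("annotations.json")), StandardCharsets.UTF_8)
        pushAnnotations(restored)
      }
    }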
For large corpora, we recommend doing all preprocessing locally (or on another cluster) and then pushing the results to the Readr cloud. For processing we use Apache Spark; you must have Spark installed if you would like to process and push new datasets to Readr. Fetch Spark at https://spark.apache.org/downloads.html. We have tested our system on Spark 1.0.2.
Start by converting your documents into the readr format.
sbt "runMain example.large.CreateSource"
source
Then, run your processing on Apache Spark. The readr-spark project contains more information on how this is done.
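As a rough illustration of what a Spark 1.0.x driver program looks like (the tokenization step and the paths are placeholders; the actual pipeline is in readr-spark):

    import org.apache.spark.{SparkConf, SparkContext}

    object PreprocessSketch {
      def main(args: Array[String]): Unit = {
        // Spark 1.0.x API: SparkConf + SparkContext (no SparkSession yet).
        val conf = new SparkConf().setAppName("readr-preprocess").setMaster("local[*]")
        val sc = new SparkContext(conf)
        // Placeholder pipeline: one raw document per line, whitespace tokenization,
        // tokens joined with tabs as a stand-in for real annotation output.
        val docs = sc.textFile("data/raw-docs.txt")
        val tokenized = docs.map(_.split("\\s+").mkString("\t"))
        tokenized.saveAsTextFile("data/preprocessed")
        sc.stop()
      }
    }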
Finally, you can push the results to Readr cloud.
sbt "runMain example.large.CreateDB"
source
Spark uses Kryo for efficient serialization and deserialization of objects. We can also fetch and write back frames in Kryo format. This makes it easy to generate a large number of frames, for example based on an existing resource; a sketch of the round trip follows the commands below.
sbt "runMain example.large.FetchFrames"
source
sbt "runMain example.large.PutFrames"
source
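As a sketch of the Kryo round trip mentioned above (the Frame class here is a placeholder; the real frame classes ship with readr-connect):

    import com.esotericsoftware.kryo.Kryo
    import com.esotericsoftware.kryo.io.{Input, Output}
    import java.io.{FileInputStream, FileOutputStream}

    // Placeholder frame class; Kryo's default instantiation needs a
    // no-arg constructor, so plain vars are used instead of a case class.
    class Frame {
      var name: String = _
      var pattern: String = _
    }

    object KryoFramesSketch {
      def main(args: Array[String]): Unit = {
        val kryo = new Kryo()
        kryo.register(classOf[Frame])
        val frame = new Frame
        frame.name = "Acquisition"
        frame.pattern = "{buyer} acquired {acquired}"
        // Write the frame to disk in Kryo's binary format...
        val out = new Output(new FileOutputStream("frames.kryo"))
        kryo.writeObject(out, frame)
        out.close()
        // ...and read it back before pushing it to the cloud.
        val in = new Input(new FileInputStream("frames.kryo"))
        val restored = kryo.readObject(in, classOf[Frame])
        in.close()
        println(restored.name + ": " + restored.pattern)
      }
    }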