This repo provides a Dockerfile and Ansible provisioning script to build and run a Stanford CoreNLP server process with a single ZMQ broker front-end that proxies incoming requests to one or more back-end Scala workers. This setup is designed to parallelize and scale reasonably well, though at a certain point it's worth asking whether batch-processing large documents by transmitting them over the network makes much sense.
Running the server involves three steps:
First, clone the repo and build the Docker image:
```bash
git clone https://github.com/twneale/corenlp-zmq/
cd corenlp-zmq
docker build -t corenlp .
```
Next, install Supervisor if it's not already present on your system. On Debian/Ubuntu you can `apt-get install supervisor`; on RHEL/CentOS you can `yum install python-setuptools`, then `easy_install supervisor`.
Finally, start a supervisor process with the config file provided in the repo:
```bash
# First create a log directory
mkdir log

# To start a supervisor process in the foreground:
supervisord -n -c supervisor/supervisor.conf

# To start a supervisor daemon in the background:
supervisord -c supervisor/supervisor.conf
```
That's it! You can now send JSON requests of the form shown below to port 5559 on the host OS and receive the CoreNLP output XML, or a Java traceback if an error occurs. Note that the Scala server's sbt build first has to bootstrap itself and download several jar files, including the huge CoreNLP jar, so several minutes will pass before the server starts and can respond to requests. If you want to skip that process next time, you can run `docker ps` to get the container id, then `docker commit [id]` to save the container once the bootstrapping is finished.
```json
{"annotators": "tokenize,ssplit,pos,lemma,parse", "text": "I have a ham radio."}
```
The supervisor config file instructs supervisor to start two processes. The first is a Python request broker that listens on port 5559 and round-robin proxies incoming ZMQ requests to any connected worker processes.
```bash
docker run -i -t -p 5559:5559 --name broker corenlp \
    /corenlp/virt/bin/python /corenlp/python/broker.py serve \
    --frontend-port=5559 --backend-port=5560
```
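For orientation, a round-robin ZMQ broker of this kind boils down to very little code. The sketch below is illustrative rather than a copy of the repo's broker.py; the port defaults are taken from the command above, and everything else is an assumption:

```python
# Illustrative sketch of a round-robin ZMQ broker -- NOT the repo's broker.py.
import zmq

def serve(frontend_port=5559, backend_port=5560):
    context = zmq.Context()

    # Clients (REQ sockets) connect to the ROUTER front-end.
    frontend = context.socket(zmq.ROUTER)
    frontend.bind("tcp://*:%d" % frontend_port)

    # Workers connect to the DEALER back-end; a DEALER distributes
    # outgoing messages round-robin across whatever peers are connected.
    backend = context.socket(zmq.DEALER)
    backend.bind("tcp://*:%d" % backend_port)

    # Shuttle messages between the two sockets until interrupted.
    zmq.proxy(frontend, backend)

if __name__ == "__main__":
    serve()
```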
The second starts one or more Scala worker processes, each of which loads the CoreNLP Java jar and registers itself with the Python request broker. On receiving a request, the Scala process builds an appropriate edu.stanford.nlp.pipeline.StanfordCoreNLP object (and caches it, because they're expensive to construct) and runs the provided text through it, returning the response as a JSON object.
```bash
docker run -i -t --link broker:broker corenlp /bin/bash -c 'cd /corenlp/scala && sbt run'
```
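To make the "registers itself" part concrete, here is a minimal Python stand-in for a worker. This is purely illustrative: the real worker is written in Scala and runs the text through CoreNLP, and the `broker` hostname here assumes the `--link broker:broker` alias from the command above:

```python
# Minimal Python stand-in for a worker, for illustration only; the real
# worker is Scala and invokes CoreNLP.
import zmq

def work(backend_addr="tcp://broker:5560"):
    context = zmq.Context()

    # Connecting a REP socket to the broker's back-end is all the
    # "registration" there is; the broker's DEALER socket will begin
    # routing requests here as soon as the connection is up.
    socket = context.socket(zmq.REP)
    socket.connect(backend_addr)

    while True:
        request = socket.recv_json()
        # A real worker builds (and caches) a StanfordCoreNLP pipeline from
        # request["annotators"], annotates request["text"], and returns the
        # result; this stub just reports that it has no pipeline.
        socket.send_json({"error": "stub worker, no CoreNLP pipeline loaded"})

if __name__ == "__main__":
    work()
```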
You can also run these commands manually in the shell to test things out without involving Supervisor.
To send some text through the server, you can run the example Python client script, provided you have ZeroMQ, the ZeroMQ dev headers, and pyzmq installed:
```python
import zmq


def client():
    context = zmq.Context()
    socket = context.socket(zmq.REQ)
    socket.connect("tcp://localhost:5559")

    def send(string):
        obj = dict(annotators='tokenize,ssplit,pos,lemma,parse', text=string)
        socket.send_json(obj)
        message = socket.recv_json()
        return message

    # Drop into a debugger so you can call send(...) interactively.
    import pdb; pdb.set_trace()


if __name__ == "__main__":
    client()
```
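Because the script ends in a pdb breakpoint, you test interactively: at the `(Pdb)` prompt, call the inner helper, e.g. `send("I have a ham radio.")`, and inspect the returned message.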
To increase the number of Scala worker processes, simply edit the `numprocs` setting in supervisor/conf.d/worker.conf, then tell Supervisor to pick up the change (e.g. with `supervisorctl update`). This setup provides a bona fide parallelized CoreNLP processing tool, unlike other packages available, which may, for example, provide a networked interface to a single subprocess that communicates with CoreNLP via the shell. This package enables you to scale up the number of workers as needed, and could easily be upgraded to a cluster by pointing Scala workers on different hosts at the same Python front-end.
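For reference, the relevant stanza might look something like the sketch below; the `numprocs` value and process naming are illustrative, and the repo's actual worker.conf may differ:

```ini
; Hypothetical sketch of supervisor/conf.d/worker.conf -- check the repo's
; actual file. Supervisor requires process_name to include %(process_num)s
; whenever numprocs > 1.
[program:worker]
command=docker run -i -t --link broker:broker corenlp /bin/bash -c 'cd /corenlp/scala && sbt run'
numprocs=4
process_name=%(program_name)s_%(process_num)02d
```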