Metaslurper is a command-line tool that harvests digital object properties and metadata from one or more source services, normalizes them, and uploads them to a sink service. It supports:
- Efficient streaming of large numbers of entities from any number of source services
- Throttling
- Multi-threaded harvesting
- Incremental harvesting
- Harvest count limits
Support for new source and sink services is straightforward to implement.
Metaslurper generally passes along whatever key-value entity metadata the source services make available to the sink service without modifying it. The sink service decides what to do with these disparate elements: which ones to keep, how to map them, etc. This enables the harvester to be written in a generalized way and run with little configuration (other than needing to know the URLs and authentication info for the various endpoints) and in conjunction with pretty much any metadata mapping process.
```
                              command-line invocation
                        (for development & demonstration)
                                         |
                                         V
------------------                ---------------                ----------------
|                |                |             |   invocation   |              |
|                |    queries     |             | <------------- |              |
|                | <------------- |             |                |              |
| source service |                | metaslurper |     content    | sink service |
|                |                |             | -------------> |              |
|                |    content     |             |                |              |
|                | -------------> |             | status updates |              |
|                |                |             | -------------> |              |
------------------                ---------------                ----------------
```
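To make the pass-through behavior concrete, here is a purely hypothetical sketch of the kind of key-value elements a harvested entity might carry. The element names and the plain `Map` representation are illustrative only, not Metaslurper's actual entity model:

```java
import java.util.Map;

// Hypothetical illustration only: each source exposes its own element
// names, and Metaslurper forwards them to the sink unchanged.
class PassThroughExample {
    public static void main(String[] args) {
        Map<String, String> elements = Map.of(
                "dc:title",   "An Example Object",
                "dc:creator", "Doe, Jane",
                "idb:doi",    "10.0000/example"); // source-specific element
        // The sink decides which elements to keep and how to map them.
        elements.forEach((name, value) ->
                System.out.println(name + " = " + value));
    }
}
```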
Sink services are modular, too. Currently, two are available: a test sink (`test_sink`) and Metaslurp.
The only requirements are a JDK (see `docker/Dockerfile` for the required version) and Maven. CPU and memory requirements are minimal.
Docker is required for deployment to AWS ECR. See "AWS ECS Notes" below.
```
$ mvn clean package -DskipTests
```
Service configuration is sourced from the environment. The following variables are available:
- Source services
    - Illinois Data Bank
        - `SERVICE_SOURCE_IDB_KEY`
        - `SERVICE_SOURCE_IDB_ENDPOINT`
    - Illinois Digital Library
        - `SERVICE_SOURCE_DLS_KEY`
        - `SERVICE_SOURCE_DLS_ENDPOINT`
        - `SERVICE_SOURCE_DLS_USERNAME`
        - `SERVICE_SOURCE_DLS_SECRET`
    - Illinois Digital Newspaper Collections
        - `SERVICE_SOURCE_IDNC_KEY`
        - `SERVICE_SOURCE_IDNC_ENDPOINT`
        - `SERVICE_SOURCE_IDNC_HARVEST_SCRIPT_URI`
    - IDEALS
        - `SERVICE_SOURCE_IDEALS_KEY`
        - `SERVICE_SOURCE_IDEALS_ENDPOINT`
    - Medusa Book Tracker
        - `SERVICE_SOURCE_BOOK_TRACKER_KEY`
        - `SERVICE_SOURCE_BOOK_TRACKER_ENDPOINT`
- Sink services
    - Metaslurp
        - `SERVICE_SINK_METASLURP_KEY`
        - `SERVICE_SINK_METASLURP_ENDPOINT`
        - `SERVICE_SINK_METASLURP_USERNAME`
        - `SERVICE_SINK_METASLURP_SECRET`
        - `SERVICE_SINK_METASLURP_HARVEST_KEY` (if not set, a new harvest will be initiated)
        - `SERVICE_SINK_METASLURP_INDEX` (if not set, the default index is used)
Invoke with no arguments to print a list of available arguments:
```
java -jar target/metaslurper-VERSION.jar
```
Example kitchen-sink invocation:
```
java -jar target/metaslurper-VERSION.jar \
    -source test_source \
    -sink test_sink \
    -log_level info \
    -max_entities 50 \
    -threads 2 \
    -throttle 100 \
    -incremental 1535380169
```
Change `test_source` to an unrecognized string to print a list of available service keys.
To run it in Docker instead, a wrapper script is provided:

```
docker-run.sh <environment> <source service key> <sink service key>
```
`mvn test` runs the tests, but you will need to set all of the environment variables listed above first. You could create a `test.sh` that does that, or you could put the variables in a `test.env` file and run the tests using `docker compose up --build`.
- Add a class that implements `e.i.l.m.service.SourceService` (a sketch follows this list)
- Add it to the return value of `e.i.l.m.service.ServiceFactory.allSourceServices()`
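As a rough sketch of what such a class might involve (`SimplifiedSourceService` below is a hypothetical stand-in; consult the real `e.i.l.m.service.SourceService` interface for the actual contract):

```java
import java.util.stream.Stream;

// SimplifiedSourceService is a hypothetical stand-in for
// e.i.l.m.service.SourceService, whose real contract has more methods.
interface SimplifiedSourceService {
    String getKey();                  // key used with the -source argument
    Stream<String> harvestEntities(); // stand-in for the real entity iterator
}

class MySourceService implements SimplifiedSourceService {
    @Override
    public String getKey() {
        return "my_source";
    }

    @Override
    public Stream<String> harvestEntities() {
        // A real implementation would page through the source service's
        // HTTP API and stream out one normalized entity per record.
        return Stream.of("record-1", "record-2");
    }
}
```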
The service will probably require a couple of new configuration keys (a.k.a. environment variables). In AWS, there are two ways to make these available:
- Add them to the ECS task definition. If using Metaslurp as a sink, the value of its `METASLURPER_ECS_TASK_DEFINITION` environment variable must then be changed to this new version, if it is not already using `latest`.
- Pass them into the task invocation via the ECS API.
- Add a class that implements `e.i.l.m.service.SinkService` (again, a sketch follows this list)
- Add it to the return value of `e.i.l.m.service.ServiceFactory.allSinkServices()`
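A correspondingly rough sketch, with `SimplifiedSinkService` again a hypothetical stand-in for the real `e.i.l.m.service.SinkService` contract:

```java
// SimplifiedSinkService is a hypothetical stand-in for
// e.i.l.m.service.SinkService.
interface SimplifiedSinkService {
    String getKey();            // key used with the -sink argument
    void ingest(String entity); // stand-in for the real ingest method
}

class MySinkService implements SimplifiedSinkService {
    @Override
    public String getKey() {
        return "my_sink";
    }

    @Override
    public void ingest(String entity) {
        // A real implementation would POST the entity to the sink's HTTP
        // endpoint and send status updates as the harvest progresses.
        System.out.println("ingested " + entity);
    }
}
```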
- A logger is available via `LoggerFactory.getLogger(Class)`.
- Configuration should be obtained from `e.i.l.m.config.Configuration.getInstance()` rather than `System.getenv()` (see the sketch below).
- Services are free to use any HTTP client. Most services use OkHttp, which is bundled in.
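Putting the first two points together, a service's setup might look something like this. The `getString()` accessor is an assumption (check the actual `Configuration` class for its API), and the full package name is inferred from the `e.i.l.m.` abbreviation:

```java
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import edu.illinois.library.metaslurper.config.Configuration;

class MyService {

    private static final Logger LOGGER =
            LoggerFactory.getLogger(MyService.class);

    void configure() {
        // Read settings through the shared Configuration instance rather
        // than System.getenv(). getString() is a hypothetical accessor
        // name; check the actual Configuration class.
        String endpoint = Configuration.getInstance()
                .getString("SERVICE_SOURCE_DLS_ENDPOINT");
        LOGGER.info("Using endpoint: {}", endpoint);
    }
}
```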
The general procedure for deploying to ECS is:
- Install Docker
- Create an ECR repository, an ECS Fargate cluster, and an ECS task definition
    - The task definition must define all of the environment variables in the "Configuration" section (above)
    - The task definition also must specify a read-write filesystem (for `/tmp` usage)
    - (At UIUC, all of this is terraformed in demo and production.)
- Install the `aws` command-line tool
- `cp ecr-push.sh.sample ecr-push.sh` and edit as necessary
- Run `ecr-push.sh`
At this point the container is available and tasks are ready to run. One way to run them is with the `aws` command-line tool, for which a convenient wrapper script has been written:

```
ecs-run-task.sh <environment> <source service key> <sink service key>
```
But they can also be invoked via the ECS API or web UI.