These Core Libraries are intended for rapidly building Smart Cache implementations in Java. They provide a lightweight functional API for defining data processing pipelines, as well as more admin/operations oriented APIs for invoking and running Smart Cache pipelines.
This landing page provides an overview of the Core Concepts and Components within these libraries and links to more detailed documentation on each area. If you'd like to understand more about the design of these libraries, that is detailed in a separate document.
These libraries are built around a number of core concepts which are used together to build up data processing pipelines. We provide brief overviews of these concepts here with links to more detailed documentation, where you will also find documentation for the concrete implementations of these concepts along with example usage.
An `EventSource` provides access to a source of `Event`s and is strongly typed. So, for example, an `EventSource<String, Graph>` provides a source of events where each event has a `String` key and an RDF `Graph` value.
You can check whether events are available, poll for the next available event, and close the event source when finished with it.
See the Event Source documentation for more details.
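As a rough illustration of how an event source might be consumed, the following sketch polls for events until none remain. Note that the interfaces and method names used here (`isExhausted()`, `poll()`, `close()`) are assumptions made purely for illustration, not necessarily the library's actual API; consult the Event Source documentation for the real interface.

```java
import java.time.Duration;

public class EventSourceSketch {

    // Hypothetical stand-in for a strongly typed EventSource<TKey, TValue>
    interface EventSource<TKey, TValue> extends AutoCloseable {
        boolean isExhausted();
        Event<TKey, TValue> poll(Duration timeout);
        @Override void close();
    }

    // Hypothetical stand-in for an Event with a key and a value
    interface Event<TKey, TValue> {
        TKey key();
        TValue value();
    }

    static <TKey, TValue> void consumeAll(EventSource<TKey, TValue> source) {
        // Poll for events until the source reports it is exhausted, then close it
        try (source) {
            while (!source.isExhausted()) {
                Event<TKey, TValue> event = source.poll(Duration.ofSeconds(5));
                if (event != null) {
                    System.out.println("Received event with key " + event.key());
                }
            }
        }
    }
}
```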
Extending on the concept of event sourcing above, a `ComponentEventSource` allows any application component to be a source of `ComponentEvent`s and is strongly typed. The first application of this is for components to emit metric events, i.e. events with one or more associated metrics emanating from their processing activities.
So, for example, a search API may implement `ComponentEventSource<SearchEvent>`, signifying that it produces events related to search. Such events might contain information about the term searched, the identity of the person or principal performing the search, and the time taken for ElasticSearch to carry out the search (i.e. a `DurationMetric`).
Refer to the Observability documentation for further information.
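The sketch below shows the general idea of a component emitting strongly typed metric events. The interfaces, the listener-registration model, and the `SearchEvent` shape are all assumptions for illustration only and are not the library's actual API.

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

public class ComponentEventSketch {

    interface ComponentEvent { }

    // Hypothetical event describing a completed search, carrying a duration metric
    record SearchEvent(String term, String principal, Duration searchDuration) implements ComponentEvent { }

    // Hypothetical ComponentEventSource: interested parties register to receive a component's events
    interface ComponentEventSource<T extends ComponentEvent> {
        void addListener(Consumer<T> listener);
    }

    // A search API that emits a SearchEvent for each search it performs
    static class SearchApi implements ComponentEventSource<SearchEvent> {
        private final List<Consumer<SearchEvent>> listeners = new ArrayList<>();

        @Override
        public void addListener(Consumer<SearchEvent> listener) {
            this.listeners.add(listener);
        }

        public void search(String term, String principal) {
            long start = System.nanoTime();
            // ... perform the actual search against the underlying index ...
            Duration elapsed = Duration.ofNanos(System.nanoTime() - start);
            SearchEvent event = new SearchEvent(term, principal, elapsed);
            this.listeners.forEach(l -> l.accept(event));
        }
    }
}
```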
A `Sink` is a functional interface to which items can be sent for processing. It is intended for defining simple processing pipelines that implement data processing or transformation.
By simple pipelines we mean those that represent linear, non-branching sequences of processing steps. See the Related Work section of the Design overview for a discussion of alternative processing frameworks.
Like an Event Source, this is a strongly typed interface, so you might have a `Sink<Graph>` for a sink that takes in RDF Graphs.
See the Sink documentation for more details.
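As a small sketch of the concept, the following defines a sink as a lambda that logs the size of each RDF Graph it receives. The single `send()` method is an assumption for illustration (see the Sink documentation for the library's actual interface), and Apache Jena is assumed as the RDF graph implementation.

```java
import org.apache.jena.graph.Graph;
import org.apache.jena.sparql.graph.GraphFactory;

public class SinkSketch {

    // Hypothetical stand-in for the strongly typed Sink concept
    @FunctionalInterface
    interface Sink<T> {
        void send(T item);
    }

    public static void main(String[] args) {
        // Because it is a functional interface, a Sink can be defined with a lambda;
        // this one simply logs the number of triples in each Graph it receives
        Sink<Graph> graphSizeLogger = graph -> System.out.println("Graph has " + graph.size() + " triples");

        graphSizeLogger.send(GraphFactory.createDefaultGraph());
    }
}
```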
An `Entity` is an entity-centric view over data. An individual event from an `EventSource` may batch up data about several entities, e.g. people, places and things, whereas a data processing pipeline may only be concerned with a subset of those entities. The `Entity` class provides a minimal representation of an entity: it has a mandatory URI that identifies the entity, and may additionally have a set of namespace prefixes and/or some `EntityData`.
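To make the shape of that representation concrete, here is a rough sketch assuming a simple record structure; the field names and types are illustrative only and do not reflect the actual `Entity` class.

```java
import java.util.Map;

public class EntitySketch {

    // Hypothetical stand-ins for EntityData and Entity
    record EntityData(Map<String, Object> properties) { }

    record Entity(String uri, Map<String, String> prefixes, EntityData data) { }

    public static void main(String[] args) {
        Entity person = new Entity(
                "http://example.org/people#alice",          // mandatory URI identifying the entity
                Map.of("ex", "http://example.org/people#"), // optional namespace prefixes
                new EntityData(Map.of("name", "Alice")));   // optional entity data
        System.out.println(person.uri());
    }
}
```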
A `Projector` takes in an input and produces zero or more outputs to a provided `Sink`.
Again this is strongly typed, so you might have a `Projector<Graph, Entity>` that projects from RDF Graphs to Entities.
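The following sketch illustrates the idea of projecting an input into zero or more outputs that are passed to a sink. The `project()` and `send()` method names are assumptions made for illustration rather than the library's actual signatures.

```java
public class ProjectorSketch {

    @FunctionalInterface
    interface Sink<T> {
        void send(T item);
    }

    // Hypothetical Projector: takes an input and emits zero or more outputs to a Sink
    @FunctionalInterface
    interface Projector<TInput, TOutput> {
        void project(TInput input, Sink<TOutput> sink);
    }

    public static void main(String[] args) {
        // A trivial projector that splits a comma separated string into individual items
        Projector<String, String> splitter = (input, sink) -> {
            for (String item : input.split(",")) {
                sink.send(item.trim());
            }
        };

        splitter.project("people, places, things", System.out::println);
    }
}
```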
A `ProjectorDriver` automates the connection of the various core concepts into a runnable application. It takes in an `EventSource`, a `Projector` and a `Sink`, and automates polling events from the event source, passing them through the projector and on to the `Sink`. It also takes a number of additional parameters that control aspects of this behaviour; see the Projection documentation for more details.
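A simplified sketch of the loop such a driver automates is shown below: poll the event source, project each event, and push the results to the destination sink. All interfaces and method names here are assumptions for illustration only.

```java
import java.time.Duration;

public class ProjectorDriverSketch {

    interface Event<TValue> { TValue value(); }

    interface EventSource<TValue> extends AutoCloseable {
        boolean isExhausted();
        Event<TValue> poll(Duration timeout);
        @Override void close();
    }

    @FunctionalInterface
    interface Sink<T> { void send(T item); }

    @FunctionalInterface
    interface Projector<TInput, TOutput> { void project(TInput input, Sink<TOutput> sink); }

    static <TInput, TOutput> void drive(EventSource<TInput> source,
                                        Projector<TInput, TOutput> projector,
                                        Sink<TOutput> destination) {
        // Keep polling until the source is exhausted, pushing each projected output onwards
        try (source) {
            while (!source.isExhausted()) {
                Event<TInput> event = source.poll(Duration.ofSeconds(5));
                if (event != null) {
                    projector.project(event.value(), destination);
                }
            }
        }
    }
}
```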
There are also some other components included in this repository; while these are more focused on specific classes of applications, they are still sufficiently general to be included here.
The `observability-core` module provides some helper utilities for integrating Open Telemetry based metrics into applications built with these libraries. Various classes throughout the other libraries use the Open Telemetry API to declare and track metrics. Please see the Observability documentation for details of how to make these metrics accessible and how to add additional metrics to your applications.
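As a small example of declaring a custom metric with the standard Open Telemetry API (the meter and counter names below are illustrative; see the Observability documentation for how metrics are actually exported):

```java
import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.metrics.LongCounter;
import io.opentelemetry.api.metrics.Meter;

public class MetricsSketch {

    // Obtain a Meter from the globally configured Open Telemetry instance
    private static final Meter METER = GlobalOpenTelemetry.getMeter("my-application");

    // Declare a counter metric for events processed by the pipeline
    private static final LongCounter EVENTS_PROCESSED = METER.counterBuilder("events.processed")
            .setDescription("Total number of events processed by the pipeline")
            .setUnit("1")
            .build();

    public static void recordEventProcessed() {
        EVENTS_PROCESSED.add(1);
    }
}
```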
The `live-reporter` module provides the `LiveReporter` that enables Java based applications to report their status heartbeats to Telicent Live, our platform's monitoring dashboard. Please see the Live Reporter documentation for more details on this module.
The `configurator` module provides the `Configurator` API, a lightweight abstraction for obtaining application configuration. In particular, it makes it easy to inject configuration in multiple ways, which is especially useful for unit and integration tests.
Please see the Configurator documentation for more details.
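The sketch below illustrates the general idea behind such an abstraction: code looks up values by key without caring whether they come from the environment, a properties file, or an in-memory map supplied by a test. This is an illustration of the concept only, not the `Configurator` API itself.

```java
import java.util.Map;

public class ConfigSketch {

    // Hypothetical source of configuration values keyed by name
    @FunctionalInterface
    interface ConfigSource {
        String get(String key);
    }

    public static void main(String[] args) {
        // In production, configuration might come from environment variables...
        ConfigSource env = System::getenv;
        System.out.println("From environment: " + env.get("HOME"));

        // ...while a unit test can inject values directly without touching the environment
        Map<String, String> testValues = Map.of("BOOTSTRAP_SERVERS", "localhost:9092");
        ConfigSource test = testValues::get;
        System.out.println("From test config: " + test.get("BOOTSTRAP_SERVERS"));
    }
}
```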
The CLI provides an API for creating CLI entry points to data processing pipelines. It defines a common `SmartCacheCommand` from which CLI commands can be derived, along with various abstract implementations of this class that provide prefabricated entry points for common pipelines.
Please see the CLI documentation for more details.
The JAX-RS Base Server provides a basic starting point for new JAX-RS based server applications including lots of common machinery around error handling, authentication and configuration.
Please see JAX-RS Base Server documentation for more details.
Using the various concepts provided by these libraries we can build new data processing pipelines relatively easily since many of the components we need are reusable and configurable. For example, you might conceive of a pipeline like the following:
The above example is implemented in the ElasticSearch Smart Cache repository where everything except the final JSON to Elastic piece is provided from this repository.
Note that the actual pipeline as implemented has additional sinks in it versus the above rough design sketch. We use a Throughput Sink for logging throughput metrics, a Filter Sink for filtering out irrelevant entities, and a Duplicate Suppression sink to avoid repeatedly indexing identical entity representations.
The entire pipeline ends up being:
- Event Source
- Entity Centric Projector
- Throughput Metric Reporting
- Filter Irrelevant Entities
- Convert Entities into JSON Document Structure
- Duplicate Document Suppression
- Bulk Indexing to ElasticSearch
All of which is automated via a `ProjectorDriver`.
Every piece of the pipeline uses functionality, or interfaces, from these Core Libraries to build the overall pipeline. You can find more detailed documentation on this pipeline in that repository.
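To make the shape of such a linear chain concrete, here is a rough sketch of how sink stages can be composed by wrapping one sink in another. The sink names, the `send()` method, and the chaining helpers are hypothetical and do not reflect the actual ElasticSearch Smart Cache code.

```java
import java.util.function.Function;
import java.util.function.Predicate;

public class PipelineSketch {

    @FunctionalInterface
    interface Sink<T> { void send(T item); }

    // A filtering stage: only items matching the predicate are passed to the next sink
    static <T> Sink<T> filter(Predicate<T> keep, Sink<T> next) {
        return item -> { if (keep.test(item)) next.send(item); };
    }

    // A converting stage: each item is transformed before being passed to the next sink
    static <T, R> Sink<T> map(Function<T, R> converter, Sink<R> next) {
        return item -> next.send(converter.apply(item));
    }

    public static void main(String[] args) {
        // Terminal stage: stands in for bulk indexing to ElasticSearch
        Sink<String> index = doc -> System.out.println("Indexing: " + doc);

        // Chain the stages: filter irrelevant items, convert to a JSON-ish document, then index
        Sink<String> convert = map(s -> "{\"value\": \"" + s + "\"}", index);
        Sink<String> pipeline = filter(s -> !s.isBlank(), convert);

        pipeline.send("example entity");
    }
}
```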