Data Model

Brian L. Troutwine edited this page Oct 3, 2017 · 3 revisions

There are two stories to cernan's data model, one to do with durability of data and the other with aggregation.

Durability

Cernan works very hard to store and process every piece of information you send it and, in doing so, to never overwhelm your system. This is born of our frustration with other telemetry systems that fail during crisis periods on account of high telemetry load. That is, should your application begin to frantically emit telemetry about its failing state, cernan must be able to ingest that telemetry and ship it outward.

Cernan's main line of effort in this regard is a disk-based queueing system that allows individual sources and sinks to communicate with one another. Each telemetry point that comes into the system is parsed and serialized to disk. These serialized points are read from disk only when a sink is capable of processing them. This limits cernan's exposure to restart-related data loss and puts a hard cap on cernan's online allocations.
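The idea of a disk-backed queue can be illustrated with a small sketch. This is not cernan's actual queue: `DiskQueue`, `push` and `pop` are hypothetical names, the length-prefixed file layout is an assumption made for the example, and only the Rust standard library is used. The point it shows is that the writer holds nothing in memory once a record is on disk, and the reader pulls a record only when it asks for one.

```rust
use std::fs::{File, OpenOptions};
use std::io::{self, Read, Seek, SeekFrom, Write};
use std::path::Path;

/// Hypothetical disk-backed FIFO (illustration only, not cernan's implementation).
/// Records are stored as a 4-byte little-endian length followed by the payload.
pub struct DiskQueue {
    writer: File,
    reader: File,
    read_offset: u64,
}

impl DiskQueue {
    /// Open (or create) the backing file. Reading starts at the beginning of
    /// the file, so records written before a restart are seen again.
    pub fn open(path: &Path) -> io::Result<DiskQueue> {
        let writer = OpenOptions::new().create(true).append(true).open(path)?;
        let reader = File::open(path)?;
        Ok(DiskQueue { writer, reader, read_offset: 0 })
    }

    /// Source side: append one serialized point to disk.
    pub fn push(&mut self, payload: &[u8]) -> io::Result<()> {
        let len = payload.len() as u32;
        self.writer.write_all(&len.to_le_bytes())?;
        self.writer.write_all(payload)?;
        self.writer.flush()
    }

    /// Sink side: read the next point, or `None` if the reader has caught up.
    pub fn pop(&mut self) -> io::Result<Option<Vec<u8>>> {
        let end = self.reader.metadata()?.len();
        if self.read_offset + 4 > end {
            return Ok(None);
        }
        self.reader.seek(SeekFrom::Start(self.read_offset))?;
        let mut len_buf = [0u8; 4];
        self.reader.read_exact(&mut len_buf)?;
        let len = u32::from_le_bytes(len_buf) as u64;
        // A partially written record is treated as "not yet available".
        if self.read_offset + 4 + len > end {
            return Ok(None);
        }
        let mut payload = vec![0u8; len as usize];
        self.reader.read_exact(&mut payload)?;
        self.read_offset += 4 + len;
        Ok(Some(payload))
    }
}
```

In this sketch memory use on the read side is bounded by the size of a single record, which is the property the paragraph above describes as a cap on online allocations. A real queue would also need to truncate or rotate consumed data, which is omitted here.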

Aggregation

Cernan's timestamps are accurate to the second. Every point of telemetry that cernan ingests is timestamped either on receipt, as with log lines and statsd, or by parsing the payload, as with graphite. Cernan sinks that opt into the use of the buckets structure aggregate points according to their "AggregationMethod". Telemetry is binned by timestamp. Bin widths are configurable per sink and default to one second. The AggregationMethods are:

  • SUM :: A sum of samples in a time bin. This can be interpreted as a per-bin counter.
  • SET :: Preserves the last sample set into the Telemetry stream per time window.
  • SUMMARIZE :: Produces a quantile summary of the input samples per time window.
  • HISTOGRAM :: Produces a binned histogram summary of the input samples per time window.

The SUM and SET aggregations are defined in cernan itself. The SUMMARIZE aggregation is backed by quantile's CKMS and HISTOGRAM by Histogram. Sinks are free to choose how to interpret aggregations. Please see sink documentation for full details.
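To make binning concrete, here is a minimal sketch of the SUM and SET behaviors over second-resolution timestamps. The names `bin_start` and `aggregate`, the two-variant enum and the use of a `BTreeMap` are assumptions for illustration and are not taken from cernan's source. SUMMARIZE and HISTOGRAM are left out because they depend on external data structures.

```rust
use std::collections::BTreeMap;

/// Illustrative subset of the aggregation methods described above.
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum AggregationMethod {
    Sum,
    Set,
}

/// Map a timestamp (seconds) to the start of the bin that contains it.
pub fn bin_start(timestamp: i64, bin_width: i64) -> i64 {
    assert!(bin_width > 0, "bin width must be positive");
    timestamp - timestamp.rem_euclid(bin_width)
}

/// Aggregate `(timestamp, value)` samples of one telemetry stream into bins.
/// Samples are processed in slice order, so for `Set` the last sample wins.
pub fn aggregate(
    method: AggregationMethod,
    bin_width: i64,
    samples: &[(i64, f64)],
) -> BTreeMap<i64, f64> {
    let mut bins = BTreeMap::new();
    for &(timestamp, value) in samples {
        let key = bin_start(timestamp, bin_width);
        match method {
            // SUM: accumulate every sample that lands in the bin.
            AggregationMethod::Sum => *bins.entry(key).or_insert(0.0) += value,
            // SET: keep only the most recent sample in the bin.
            AggregationMethod::Set => {
                bins.insert(key, value);
            }
        }
    }
    bins
}
```

With a bin width of one second, the samples `(100, 1.0)` and `(100, 2.0)` fall into the same bin: SUM yields 3.0 for that bin and SET yields 2.0.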

Telemetry may also be considered 'persisted' or 'ephemeral'. A telemetry stream that is persisted will roll over as the bin window moves forward in time even if no additional points are added to the stream. A telemetry stream that is ephemeral will exist only so long as more points are ingested by cernan. Sinks are free to interpret persistence as they wish. Please see sink documentation for full details.
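One possible reading of this distinction is sketched below. Since sinks are free to interpret persistence, this is only one interpretation: the `Persistence` and `Stream` types and the `roll_over` function are hypothetical, and carrying the last value forward unchanged is an assumption of the example rather than documented cernan behavior.

```rust
use std::collections::BTreeMap;

#[derive(Clone, Copy, Debug, PartialEq)]
pub enum Persistence {
    Persisted,
    Ephemeral,
}

/// Hypothetical per-stream state held for one bin.
#[derive(Clone, Debug, PartialEq)]
pub struct Stream {
    pub value: f64,
    pub persistence: Persistence,
}

/// Build the next bin from the previous bin plus any newly ingested points.
/// Persisted streams are carried forward even with no new points (assumption:
/// the last value is kept as-is); ephemeral streams survive only if a new
/// point arrives for them.
pub fn roll_over(
    previous: &BTreeMap<String, Stream>,
    incoming: &[(&str, f64, Persistence)],
) -> BTreeMap<String, Stream> {
    let mut next = BTreeMap::new();
    for (name, stream) in previous {
        if stream.persistence == Persistence::Persisted {
            next.insert(name.clone(), stream.clone());
        }
    }
    // New points overwrite anything carried forward for the same stream.
    for &(name, value, persistence) in incoming {
        next.insert(name.to_string(), Stream { value, persistence });
    }
    next
}
```

Under this reading, a persisted gauge keeps reporting across quiet bins, while an ephemeral stream disappears from the output as soon as its source stops emitting.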

Cernan's internal aggregation model mirrors that laid out in cernan's native protocol, defined here. Please see the Sources documentation to learn how cernan's source protocols map onto this model.