A lightweight, high-performance Java library to measure correctly the behavior of critical components in production.
Ultrabrew Metrics is a high-performance instrumentation library designed for use in large-scale JVM applications. It provides rich features, such as metrics with dynamic dimensions (or tags), support for managing multiple reporters, and an emphasis on accuracy across large deployments.
Existing metrics libraries such as Dropwizard Metrics previously served us well. Unfortunately, those libraries are starting to show their age. As a result, we saw the need to write a new library designed primarily for scale and to support essential features such as dynamic dimensions.
To better understand the concepts and terms used by this library, please see CONCEPTS.md.
- Support dynamic dimensions (tag keys and values) at the time of measurement.
- Reduce GC pressure by minimizing the number of objects created by the library.
- Support accurate aggregation in reporters via monoids.
- Minimize number of dependencies.
- Decouple instrumentation from reporting in the following ways:
- adding a new reporter or modifying an existing reporter does not require changing instrumentation code;
- each reporter aggregates measurements independently; and
- multiple reporters may report at different intervals.
- (TODO) Support raw event emission for external service consumption.
- E.g., sending UDP packets to an external service similar to statsd, which could aggregate the data before sending it to an actual time series store, or sending raw events directly to an alerting service or a time-series database.
- (TODO) Support better cumulative or global-percentile approximation across multiple servers or deployments by using structures such as Data Sketches and T-Digests.
- The metrics library must allow millions of transactions per second in a single JVM process with very little overhead. The largest known application currently handles 4M+ RPS with 40+ threads writing to the metrics library in a single JVM.
- Each service transaction may cause dozens (10+) of metric measurements.
- Each metric may have dozens (10+) of tag dimensions, each with hundreds (100+) of tag values and a few (5+) fields. The combined time-series cardinality in a JVM can be more than 1,000,000.
As mentioned above, we aspire to improve the accuracy of measurements at large scale. In the past, we have used libraries that support Average as an aggregation function (or field) to be emitted from each server. When looking at these metrics across a large deployment, we tend to further aggregate them, leading to incorrect results (sum of averages, average of averages, etc.). Most people do this without realizing the mistake, which is very easy to make.
In order to avoid this problem, we have taken a stance to NOT track averages and instead focus on fields that can be further aggregated like Sum, Count, Min, Max, etc. Those who wish to obtain average values can implement weighted-average functions at the reporting layer based on Sum and Count fields.
For example, when tracking a latency, the library would emit:

- `api.request.latency.sum`
- `api.request.latency.count`

When querying the data for multiple hosts, sum all of `api.request.latency.sum` and sum all of `api.request.latency.count`, then compute `sum(api.request.latency.sum) / sum(api.request.latency.count)`.
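As a worked example with hypothetical numbers: if host A reports sum = 900 ms over count = 3 requests and host B reports sum = 100 ms over count = 1 request, the correct overall average latency is (900 + 100) / (3 + 1) = 250 ms. Averaging the per-host averages instead gives (300 + 100) / 2 = 200 ms, which understates the true value; this is exactly the mistake that reporting Sum and Count avoids.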
We have borrowed heavily from practices commonly employed when building latency-critical applications, including techniques often seen in HFT libraries. Here are some of the ways in which we squeeze the most performance out of the JVM:
- Avoid synchronization by using Java Atomic classes and low-level operations from Java's Unsafe API. Additionally, the data fields (arrays) are 64-byte aligned to match the L1/L2 cache line size, which avoids false sharing between threads without resorting to explicit locks (a simplified sketch follows this list).
- Use primitives whenever possible to avoid excessive object creation and GC pressure. While this may seem obvious, we often find engineers using objects where primitives would suffice.
- We have replaced Java's `HashMap`s, which tend to be object-based, with linear probing tables backed by primitive (`long`) arrays.
- The core library does not create threads. Instead, writes are done on the caller's thread, and reporters manage their own threads for reading and publishing. This eliminates the need for a queue between the caller and the core library.
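The following is a minimal, illustrative sketch of this style of lock-free aggregation. It is not the library's actual implementation: it uses `AtomicLongArray` instead of the Unsafe API, and it omits the cache-line padding and the linear probing table. It only shows how count, sum, min and max can be updated on the caller's thread without locks.

```java
import java.util.concurrent.atomic.AtomicLongArray;

// Hypothetical sketch class, not part of the library.
public class LockFreeAggregatorSketch {

  // One group of four fields: [count, sum, min, max].
  // In the library there is roughly one such group per distinct tag set,
  // located via a linear probing table over primitive arrays.
  private final AtomicLongArray fields = new AtomicLongArray(4);

  public LockFreeAggregatorSketch() {
    fields.set(2, Long.MAX_VALUE); // min starts at the identity element
    fields.set(3, Long.MIN_VALUE); // max starts at the identity element
  }

  // Record a measurement on the caller's thread, with no locks and no new objects.
  public void record(final long value) {
    fields.incrementAndGet(0);   // count
    fields.addAndGet(1, value);  // sum
    casMin(2, value);            // min via compare-and-set loop
    casMax(3, value);            // max via compare-and-set loop
  }

  private void casMin(final int index, final long value) {
    long current;
    do {
      current = fields.get(index);
      if (value >= current) {
        return;
      }
    } while (!fields.compareAndSet(index, current, value));
  }

  private void casMax(final int index, final long value) {
    long current;
    do {
      current = fields.get(index);
      if (value <= current) {
        return;
      }
    } while (!fields.compareAndSet(index, current, value));
  }
}
```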
In order to use the Ultrabrew Metrics library, add a dependency on the reporters you want to use to your Java project. All reporters included in this repository are published to the bintray.com Maven repository, where the core project libraries are found as well.
Gradle:

repositories {
mavenCentral()
}
dependencies {
compile group: 'io.ultrabrew.metrics', name: 'metrics-{your reporter}', version: '0.9.0'
}
Maven:

<dependencies>
<dependency>
<groupId>io.ultrabrew.metrics</groupId>
<artifactId>metrics-{your reporter}</artifactId>
<version>0.9.0</version>
</dependency>
</dependencies>
There are two distinct and independent phases in using the library: instrumentation and reporting. The goal is to instrument the code once and only modify the reporting code, with no or very minimal changes to the instrumentation.
A metric registry is a collection of metrics, to which a reporter may subscribe. Each metric is always associated with only a single metric registry, but reporters may subscribe to multiple metric registries. Generally you only need one metric registry, although you may choose to use more if you need to organize your metrics into particular reporting groups or subscribe to them with different reporters.
Note: All metrics have a unique identifier. You are not allowed to have multiple different types of metrics for the same identifier. Furthermore, if you attach a reporter to multiple metric registries, the reporter will aggregate all metrics with the same identifier. In general, it is best to ensure that identifiers you use for metrics are globally unique.
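To illustrate the note above, here is a minimal sketch of a single reporter subscribed to two registries (the identifier `errors` and the use of SLF4JReporter are purely for illustration). Because both counters share the same identifier, the reporter aggregates their measurements together.

```java
MetricRegistry registryA = new MetricRegistry();
MetricRegistry registryB = new MetricRegistry();

// One reporter subscribed to both registries.
SLF4JReporter reporter = SLF4JReporter.builder().withName("metrics").build();
registryA.addReporter(reporter);
registryB.addReporter(reporter);

// Both counters use the identifier "errors", so the reporter aggregates them as one metric.
Counter errorsA = registryA.counter("errors");
Counter errorsB = registryB.counter("errors");
errorsA.inc("host", "web01");
errorsB.inc("host", "web01");
```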
The currently supported metric types are as follows:

- `Counter`: increments or decrements a 64-bit integer value.
- `Gauge`: measures a 64-bit integer value at a given time.
- `GaugeDouble`: measures a double-precision floating point value at a given time.
- `Timer`: measures the elapsed time between two events and acts as a counter for these events.
Reporters are responsible for using the best aggregation mechanism and the proper monoid data fields, based on the metric type and the monitoring or alerting system they report to. This includes possible mean, local minimum and maximum values, standard deviations, quantiles, and others.
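As a schematic illustration of why such fields compose cleanly (these are not the library's internal types), sum, count, min and max form a monoid: two aggregation windows can be merged exactly, which is what lets each reporter aggregate independently and report at its own interval without losing accuracy.

```java
// Illustrative only: shows why sum/count/min/max can be merged exactly,
// whereas an average cannot be recovered from two pre-averaged windows.
final class Window {
  long count;
  long sum;
  long min = Long.MAX_VALUE;
  long max = Long.MIN_VALUE;

  // Associative merge with an identity element (the empty window): a monoid.
  static Window merge(final Window a, final Window b) {
    final Window out = new Window();
    out.count = a.count + b.count;
    out.sum = a.sum + b.sum;
    out.min = Math.min(a.min, b.min);
    out.max = Math.max(a.max, b.max);
    return out;
  }
}
```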
An example of how to create a metric registry:
MetricRegistry metricRegistry = new MetricRegistry();
An example of how to use a Counter to measure a simple count with dynamic dimensions:
public class TestResource {
  private static final String TAG_HOST = "host";
  private static final String TAG_CLIENT = "client";

  private final Counter errorCounter;
  private final String hostName;

  public TestResource(final MetricRegistry metricRegistry,
                      final String hostName) {
    errorCounter = metricRegistry.counter("errors");
    this.hostName = hostName;
  }

  public void handleError(final String clientId) {
    errorCounter.inc(TAG_CLIENT, clientId, TAG_HOST, hostName);
    // .. do something ..
  }
}
An example of how to use a Gauge to measure a long value at a given time. GaugeDouble works similarly, but for double-precision floating point values.
public class TestResource {
  private final Gauge cacheSizeGauge;
  private final String[] tagList;
  private final Map<String, String> cache;

  public TestResource(final MetricRegistry metricRegistry, final String hostName) {
    cacheSizeGauge = metricRegistry.gauge("cacheSize");
    cache = new java.util.HashMap<>();
    tagList = new String[] { "host", hostName };
  }

  public void doSomething() {
    cacheSizeGauge.set(cache.size(), tagList); // this example uses only static tags
  }
}
An example of how to use a Timer to measure execution time and request count with dynamic and static dimensions:
public class TestResource {
  private static final String TAG_HOST = "host";
  private static final String TAG_CLIENT = "client";
  private static final String TAG_STATUS = "status";

  private final Timer requestTimer;
  private final String hostName;

  public TestResource(final MetricRegistry metricRegistry,
                      final String hostName) {
    requestTimer = metricRegistry.timer("requests");
    this.hostName = hostName;
  }

  public void handleRequest(final String clientId) {
    final long startTime = requestTimer.start();
    int statusCode = 200; // set to the actual status while handling the request
    // .. handle request ..

    // Note: no separate counter for requests per second is needed, as the count is already included
    requestTimer.stop(startTime, TAG_CLIENT, clientId, TAG_HOST, hostName, TAG_STATUS,
        String.valueOf(statusCode));
  }
}
A reporter subscribes to one or more metric registries and consumes the measurement events. It may forward the events to an external aggregator and/or send raw events to an alerting service or a time series database. The metrics library currently comes with the following reporters:
- `InfluxDBReporter`: reports to the InfluxDB time series database. More information here.
- `OpenTSDBReporter`: reports to the OpenTSDB time series database. More information here.
- `SLF4JReporter`: reports to an SLF4J Logger with the given name to log the aggregated values of the metrics. NOTE: This reporter IS NOT intended to be used in production environments, and is only provided for debugging purposes.
An example of how to attach an SLF4JReporter to the metric registry and configure it to use the SLF4J Logger named `metrics`:
SLF4JReporter reporter = SLF4JReporter.builder().withName("metrics").build();
metricRegistry.addReporter(reporter);
In the current implementation, clients must define the distribution buckets and associate them in the reporter with the name of the metric to be histogrammed.
There are two types of distribution buckets available:

- `DistributionBucket`: represented by a primitive `long` array.
- `DoubleValuedDistributionBucket`: represented by a primitive `double` array.
`DistributionBucket` is used to represent the distribution of an integer value, for example time spent in nanoseconds or the size of a messaging queue.
For a given latency distribution array in nanoseconds, [0, 10_000_000, 100_000_000, 500_000_000, 1_000_000_000], the buckets would be:

- [0, 10_000_000) for 0 <= value < 10_000_000
- [10_000_000, 100_000_000) for 10_000_000 <= value < 100_000_000
- [100_000_000, 500_000_000) for 100_000_000 <= value < 500_000_000
- [500_000_000, 1_000_000_000) for 500_000_000 <= value < 1_000_000_000
- overflow for values >= 1_000_000_000
- underflow for values < 0
String metricId = "latency";
DistributionBucket distributionBucket =
    new DistributionBucket(new long[] {0, 10_000_000, 100_000_000, 500_000_000, 1_000_000_000});
SLF4JReporter reporter =
    SLF4JReporter.builder().withName("metrics")
        .addHistogram(metricId, distributionBucket) // add histogram for metric with id "latency"
        .build();
metricRegistry.addReporter(reporter); // subscribe the reporter to the registry

String[] tagset = new String[] {"method", "GET", "resource", "metrics", "status", "200"};
Timer timer = metricRegistry.timer(metricId); // creates a timer metric with id "latency"
long start = timer.start();
// doSomething();
timer.stop(start, tagset); // records the latency and its distribution in nanoseconds
`DoubleValuedDistributionBucket` is used to represent the distribution of a double-precision floating point value, for example an ad auction price.
For a given distribution array [0.0, 0.25, 0.5, 1.0, 5.0, 10.0], the buckets would be:
- [0.0, 0.25) for 0.0 <= value < 0.25
- [0.25, 0.5) for 0.25 <= value < 0.5
- [0.5, 1.0) for 0.5 <= value < 1.0
- [1.0, 5.0) for 1.0 <= value < 5.0
- [5.0, 10.0) for 5.0 <= value < 10.0
- overflow for values >= 10.0
- underflow for values < 0.0
String metricId = "auction_price";
DoubleValuedDistributionBucket distributionBucket =
    new DoubleValuedDistributionBucket(new double[] {0.0, 0.25, 0.5, 1.0, 5.0, 10.0});
SLF4JReporter reporter =
    SLF4JReporter.builder().withName("metrics")
        .addHistogram(metricId, distributionBucket) // add histogram for metric with id "auction_price"
        .build();
metricRegistry.addReporter(reporter); // subscribe the reporter to the registry

String[] tagset = new String[] {"experiment", "exp1"};
GaugeDouble auctionPrice = metricRegistry.gaugeDouble(metricId); // creates a double-valued gauge with id "auction_price"
auctionPrice.set(getAuctionPrice(), tagset); // records the auction price and its distribution
Please refer to the Contributing.md file for information about how to get involved. We welcome issues, questions, and pull requests.
- Mika Mannermaa @mmannerm
- Smruti Ranjan Sahoo @smrutilal2
- Ilpo Ruotsalainen @lonemeow
- Chris Larsen @manolama
- Arun Gupta @arungupta
This project is licensed under the terms of the Apache 2.0 open source license. Please refer to LICENSE for the full terms.