This document describes the metrics and tags emitted by gostatsd, including their types, tags, and interpretation. All internal metrics are snapshotted after a flush, then queued internally for sending in the next flush. As a result, internal metrics lag regular metrics by one flush interval. See below for notes on how channels are monitored.
Metric types:
type | description |
---|---|
gauge (flush) | A value sent as a gauge with the value reset / calculated / sampled every flush interval |
gauge (time) | A single duration measured in milliseconds and sent as a gauge |
gauge (cumulative) | An internal counter sent as a gauge with the value never resetting |
counter | An internal counter, reset on flush |
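
Because a gauge (cumulative) never resets, a consumer that wants a per-interval figure has to subtract successive readings. The following is a minimal sketch of that subtraction, not part of gostatsd: the `deltas` helper and the sample readings are hypothetical, and it assumes that a reading lower than its predecessor means the process restarted and the internal counter began again from zero.

```go
package main

import "fmt"

// deltas converts successive readings of a cumulative gauge into
// per-interval increases. A reading lower than its predecessor is assumed
// to be a process restart, so the new reading itself is used as the increase.
func deltas(readings []uint64) []uint64 {
	if len(readings) < 2 {
		return nil
	}
	out := make([]uint64, 0, len(readings)-1)
	for i := 1; i < len(readings); i++ {
		prev, cur := readings[i-1], readings[i]
		if cur < prev {
			out = append(out, cur) // assumed restart: counter began again at zero
			continue
		}
		out = append(out, cur-prev)
	}
	return out
}

func main() {
	// Hypothetical values of a cumulative gauge, one reading per flush.
	fmt.Println(deltas([]uint64{10, 25, 25, 40, 5})) // prints [15 0 15 5]
}
```

A counter, by contrast, is already reset on flush, so each reported value can be used directly as the per-interval figure.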
Metrics:
Name | type | tags | description |
---|---|---|---|
aggregator.metrics_received | gauge (flush) | aggregator_id | The number of datapoints received during the flush interval |
aggregator.aggregation_time | gauge (time) | aggregator_id | The time taken (in ms) to aggregate all counter and timer datapoints in this flush interval |
aggregator.process_time | gauge (time) | aggregator_id | The time taken to process all synchronous flush actions |
aggregator.reset_time | gauge (time) | aggregator_id | The time taken to reset the aggregator after flush |
parser.bad_lines_seen | gauge (cumulative) | | The number of unparseable lines |
parser.events_received | gauge (cumulative) | | The number of events parsed |
parser.metrics_received | gauge (cumulative) | | The number of metrics parsed |
receiver.datagrams_received | gauge (cumulative) | | The number of datagrams received |
receiver.avg_datagrams_in_batch | gauge (flush) | | The average number of datagrams per batch (up to receive-batch-size). This can be used to tweak receive-batch-size if necessary to reduce memory usage. |
channel.avg | gauge (flush) | channel | The average of all samples in the flush interval |
channel.min | gauge (flush) | channel | The minimum sample seen |
channel.max | gauge (flush) | channel | The maximum sample seen |
channel.last | gauge (flush) | channel | The last sample seen |
channel.capacity | gauge (flush) | channel | The capacity of the channel |
channel.samples | gauge (flush) | channel | The number of samples seen (guaranteed to be at least 1) |
internal_dropped | gauge (cumulative) | | The number of internal metrics which have been dropped |
heartbeat | gauge (flush) | version, commit | The value 1, tagged by the version (git tag) and short commit hash |
flusher.total_time | gauge (time) | | Time taken to flush all metrics to all backends for the flush interval |
backend.created | gauge (cumulative) | backend | Lifetime number of metric batches generated by the backend |
backend.retried | gauge (cumulative) | backend | Lifetime number of metric batches retried by the backend |
backend.dropped | gauge (cumulative) | backend | Lifetime number of metric batches dropped by the backend (DATALOSS!) |
backend.sent | gauge (cumulative) | backend | Lifetime number of metric batches successfully transmitted |
cloudprovider.aws.describeinstancecount | gauge (cumulative) | | The cumulative number of times DescribeInstancesPages has been called |
cloudprovider.aws.describeinstanceinstances | gauge (cumulative) | | The cumulative number of instances which have been fed in to DescribeInstancesPages |
cloudprovider.aws.describeinstancepages | gauge (cumulative) | | The cumulative number of pages from DescribeInstancesPages |
cloudprovider.aws.describeinstanceerrors | gauge (cumulative) | | The cumulative number of errors seen from DescribeInstancesPages |
cloudprovider.aws.describeinstancefound | gauge (cumulative) | | The cumulative number of instances successfully found via DescribeInstances |
cloudprovider.cache_positive | gauge (flush) | | The absolute number of positive entries in the cache |
cloudprovider.cache_negative | gauge (flush) | | The absolute number of negative entries in the cache |
cloudprovider.cache_refresh_positive | gauge (cumulative) | | The cumulative number of positive refreshes |
cloudprovider.cache_refresh_negative | gauge (cumulative) | | The cumulative number of refreshes which had an error refreshing and used old data |
cloudprovider.cache_hit | gauge (cumulative) | | The cumulative number of cache hits (host was in the cache) |
cloudprovider.cache_late_hit | gauge (cumulative) | | The cumulative number of late cache hits (host was not in the cache, but had a lookup in progress which completed) |
cloudprovider.cache_miss | gauge (cumulative) | | The cumulative number of cache misses |
cloudprovider.hosts_queued | gauge (flush) | type | The absolute number of hosts waiting to be looked up |
cloudprovider.items_queued | gauge (flush) | type | The absolute number of metrics or events waiting for a host lookup to complete |
http.forwarder.invalid | counter | | The number of failures to prepare a batch of metrics to forward |
http.forwarder.created | counter | | The number of batches prepared for forwarding |
http.forwarder.sent | counter | | The number of batches successfully forwarded |
http.forwarder.retried | counter | | The number of retries sending a batch |
http.forwarder.dropped | counter | | The number of batches dropped due to inability to forward upstream |
http.incoming | counter | server-name, result, failure | The number of batches forwarded to the server, and the results of processing them |
http.incoming.metrics | counter | server-name | The number of metrics received over http |
Tag | Description |
---|---|
aggregator_id | The index of an aggregator; the number of aggregators corresponds to the --max-workers flag |
channel | The name of an internal channel |
version | The git tag of the build |
commit | The short git commit of the build |
backend | The backend sending a particular metric |
type | Either metric or event |
result | Success to indicate a batch of metrics was successfully processed, failure to indicate a batch of metrics was not processed (with an additional failure tag for why) |
failure | The reason a batch of metrics was not processed |
server-name | The name of an http-server as specified in the config file |
A number of channels are tracked internally; they emit metrics under the channel.* space. They all have a channel tag, and may have the additional tags specified below. Channels are sampled at a regular interval. After a flush, basic stats about the sampled data are sent (internal metrics lag regular metrics by one flush interval) and the samples are reset.
Channel name | Additional tags | Description |
---|---|---|
dispatch_aggregator | aggregator_id | Channel to dispatch metrics to a specific aggregator. |
backend_events_sem | | Semaphore limiting the number of events in flight at once. Corresponds to the --max-concurrent-events flag. |
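
The relationship between channel samples and the channel.* metrics can be sketched as follows. This is an illustration only, not gostatsd's implementation: the `channelStats` type and `summarize` function are hypothetical names, and the sketch assumes each sample is the channel's current length (`len(ch)`) and that the capacity is `cap(ch)`.

```go
package main

import "fmt"

// channelStats holds one value per channel.* metric described above.
type channelStats struct {
	Avg      float64 // channel.avg
	Min      int     // channel.min
	Max      int     // channel.max
	Last     int     // channel.last
	Capacity int     // channel.capacity
	Samples  int     // channel.samples
}

// summarize reduces the samples collected during one flush interval.
// It returns false when there are no samples, since the stats are only
// meaningful with at least one sample.
func summarize(samples []int, capacity int) (channelStats, bool) {
	if len(samples) == 0 {
		return channelStats{}, false
	}
	s := channelStats{
		Min:      samples[0],
		Max:      samples[0],
		Capacity: capacity,
		Samples:  len(samples),
	}
	sum := 0
	for _, v := range samples {
		sum += v
		if v < s.Min {
			s.Min = v
		}
		if v > s.Max {
			s.Max = v
		}
		s.Last = v // the final iteration leaves the last sample here
	}
	s.Avg = float64(sum) / float64(len(samples))
	return s, true
}

func main() {
	ch := make(chan int, 8)
	var samples []int
	// Fill the channel step by step and sample its length each time
	// (assumption: a sample is the number of queued items).
	for i := 0; i < 3; i++ {
		ch <- i
		samples = append(samples, len(ch))
	}
	stats, ok := summarize(samples, cap(ch))
	fmt.Printf("%+v %v\n", stats, ok)
}
```

In a real sampler the slice of samples would be cleared after each flush, matching the reset described above.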
- If both --internal-namespace and --namespace are specified, and metrics are dispatched internally, the resulting metric will be namespace.internal_namespace.metric.
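
The naming rule above amounts to joining the parts with dots. The sketch below is hypothetical: `internalName` is not a gostatsd function, the namespace values are made up, and skipping empty parts is an assumption about what happens when only one of the flags is set.

```go
package main

import (
	"fmt"
	"strings"
)

// internalName builds the final metric name as
// namespace.internal_namespace.metric, skipping any part that is empty
// (the skipping is an assumption made for this sketch).
func internalName(namespace, internalNamespace, metric string) string {
	parts := make([]string, 0, 3)
	for _, p := range []string{namespace, internalNamespace, metric} {
		if p != "" {
			parts = append(parts, p)
		}
	}
	return strings.Join(parts, ".")
}

func main() {
	// Hypothetical values for --namespace and --internal-namespace.
	fmt.Println(internalName("prod", "statsd", "heartbeat")) // prints prod.statsd.heartbeat
}
```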