Description
We are interested in a mechanism to control unbounded growth of metrics. While we generally follow best practices around limiting cardinality, this is still problematic for extremely long-lived processes. For instance, it's common to record the binary version of something in a metric, but with hundreds of rollouts over days or months, the number of time series can explode if the metrics collection is never restarted.
We would like some way to control this in our application.
Currently, there are .clear() and .remove() methods. These are good building blocks, but I am not sure they are sufficient on their own.
remove() is challenging on its own because we have no way to enumerate the full set of labels stored in the metric at any point. In theory you could use EncodeMetric::encode and parse the results, but that is quite hacky.
clear() is also challenging because it is all or nothing.
Ideally, I think we would have some interface like:
family.retain_if(|(labelset, metric)| {
    Instant::now().duration_since(metric.last_write()) < Duration::from_secs(3600)
})
(remove any metrics not modified for an hour)
This would require a method on the family, but also maybe some changes on the metric type as well to make this easier to encode.
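To make the shape of the request concrete, here is a minimal, self-contained sketch of the idea. It does not use the library's types: `Family`, `TrackedCounter`, `last_write()`, `retain_if()` and `remove_idle()` are all hypothetical stand-ins written for illustration, and the real `Family` stores and locks its metrics differently. The sketch only shows the two pieces described above: a metric that records its last write time, and a family-level method that drops entries a predicate rejects.

```rust
use std::collections::HashMap;
use std::hash::Hash;
use std::sync::Mutex;
use std::time::{Duration, Instant};

/// Hypothetical counter that remembers when it was last written.
pub struct TrackedCounter {
    value: u64,
    last_write: Instant,
}

impl TrackedCounter {
    fn new() -> Self {
        Self { value: 0, last_write: Instant::now() }
    }

    pub fn inc(&mut self) {
        self.value += 1;
        // Record the write time so stale series can be found later.
        self.last_write = Instant::now();
    }

    pub fn get(&self) -> u64 {
        self.value
    }

    pub fn last_write(&self) -> Instant {
        self.last_write
    }
}

/// Toy stand-in for a metric family: label set -> metric.
pub struct Family<L> {
    metrics: Mutex<HashMap<L, TrackedCounter>>,
}

impl<L: Eq + Hash> Family<L> {
    pub fn new() -> Self {
        Self { metrics: Mutex::new(HashMap::new()) }
    }

    /// Increment the counter for `labels`, creating it on first use.
    pub fn inc(&self, labels: L) {
        self.metrics
            .lock()
            .unwrap()
            .entry(labels)
            .or_insert_with(TrackedCounter::new)
            .inc();
    }

    /// Keep only the entries for which `keep` returns true.
    pub fn retain_if<F>(&self, mut keep: F)
    where
        F: FnMut(&L, &TrackedCounter) -> bool,
    {
        self.metrics
            .lock()
            .unwrap()
            .retain(|labels, metric| keep(labels, &*metric));
    }

    /// Convenience wrapper: drop metrics not written within `max_idle`.
    pub fn remove_idle(&self, max_idle: Duration) {
        self.retain_if(|_, metric| metric.last_write().elapsed() < max_idle);
    }

    pub fn len(&self) -> usize {
        self.metrics.lock().unwrap().len()
    }
}
```

With something like this, an application could call `family.remove_idle(Duration::from_secs(3600))` from a periodic task to drop series that have not been modified for an hour, or pass any other predicate over the label set and metric to `retain_if`.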
In #196 I have put up a small draft of what this could look like, but I am very open to alternatives.