Metrics SDK: define standard circuit breakers for uncollected, infinite cardinality recordings #1891
I'm trying to address these topics in #1966, leaving them as guidance/suggestions rather than spec requirements.
An initial set of cardinality defenses was recently implemented in Java. In summary, each view has a maximum number of metric streams. Currently, this maximum is hardcoded to 2000, which was chosen rather arbitrarily. If it is exceeded, we log a warning and drop subsequent recordings. At the same time, we aggressively forget metric streams: any metric stream that doesn't have recordings during a collection is forgotten. This applies to both cumulative / delta and sync / async cases.
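For illustration, here is a minimal Go sketch of the general pattern described above: a hard cap on the number of distinct streams per view, with a one-time warning and dropped recordings on overflow, plus a hook for forgetting unused streams after collection. All names and the 2000 default are illustrative; this is not the actual Java (or any other) SDK implementation.

```go
package main

import (
	"fmt"
	"log"
)

// streamLimiter caps the number of distinct attribute sets ("streams")
// tracked for a single view. Names here are hypothetical, not SDK API.
type streamLimiter struct {
	maxStreams int
	streams    map[string]int64 // attribute-set key -> accumulated value
	warned     bool
}

func newStreamLimiter(maxStreams int) *streamLimiter {
	return &streamLimiter{maxStreams: maxStreams, streams: make(map[string]int64)}
}

// Record adds a measurement, dropping it (with a one-time warning) if the
// attribute set would push the view past its stream limit.
func (l *streamLimiter) Record(attrKey string, value int64) {
	if _, ok := l.streams[attrKey]; !ok && len(l.streams) >= l.maxStreams {
		if !l.warned {
			log.Printf("view exceeded %d streams; dropping new attribute sets", l.maxStreams)
			l.warned = true
		}
		return
	}
	l.streams[attrKey] += value
}

// A "forget unused streams" step would run after each collection, deleting
// streams that saw no recordings since the previous collection (not shown).

func main() {
	l := newStreamLimiter(2000)
	l.Record(`{"path":"/users"}`, 1)
	fmt.Println(len(l.streams)) // 1
}
```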
@jack-berg -- @MrAlias asked for guidance on this topic in open-telemetry/opentelemetry-go#3006 (comment). I am aware of incompleteness in our specification around the appearance or disappearance of metrics from one collection to the next, depending on async/sync and depending on temporality. I came up with my own answers to this set of questions for my (i.e., Lightstep's) experimental SDK, and one of the deciding factors came from trying to pin down the behavior of synchronous gauge instruments -- while prototyping for #2318.

At the highest level, my approach to these questions is "when cumulative temporality, do what a Prometheus user expects" and "when delta temporality, do what a Statsd user expects".

To be clear, I'm interested in specifying how an SDK should handle the gradual accumulation of memory due to cardinality across multiple collection cycles. I am less interested in specifying how the SDK should handle sudden accumulation of memory in a single collection cycle (in this situation, I will advocate the use of sampling, but it's a separate topic).

By the above paragraph's logic, the solution for high cardinality in a delta-temporality exporter is to shorten the collection interval, because the SDK is allowed to forget all memory following every collection. The problem referred to in open-telemetry/opentelemetry-go#3006 is for cumulative exporters, and I suggest we solicit input from the Prometheus team (cc @RichiH @gouthamve). The primary "tool" we have to address this cardinality problem is the use of
The third question is the most difficult. When a process starts from scratch, it is understandable that we use the process start time as the start time when a metric first appears. But -- what does that mean? If a process starts, one hour passes, and then a Counter is incremented for the first time, at that moment we can say that the series has a lifetime rate of 1/hour. One hour later, we decide to drop that series. After another hour, the series has another count. Its lifetime rate is 2 events per 3 hours.

Note that if our goal is to eliminate memory, we can't exactly "keep" the timestamp when an individual series was forgotten. By this logic, we can't distinguish series that have been reset from series that have never reported. We can no longer use the process start time for "new" series, because we don't know what is new and what was previously used. When reporting a new-or-previously-reused metric, any start time after the most recent data point's timestamp will be correct.

One solution is to remember the last time any cumulative series was forgotten, call this

Another solution is to use

As you can see, there is some loss introduced by allowing series to be forgotten. We can no longer rely on the process start time equaling the start time of every metric it produces. This leads to follow-on questions -- in particular #1273.
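As an illustration of the first option (remembering a single timestamp for the most recent time any cumulative series was forgotten, and using it as the start time for series that appear or reappear afterwards), here is a minimal Go sketch. The names (`cumulativeState`, `lastForgotten`, etc.) are hypothetical and exist only for this sketch.

```go
package main

import (
	"fmt"
	"time"
)

// cumulativeState tracks start times for live series plus one shared
// timestamp for the most recent time any series was forgotten.
type cumulativeState struct {
	seriesStart   map[string]time.Time // series key -> start time
	processStart  time.Time
	lastForgotten time.Time // zero until some series has been forgotten
}

// startTimeFor returns the start time to report for a series. Before anything
// has been forgotten, a new series can safely use the process start time;
// afterwards we can no longer tell "new" from "previously forgotten", so we
// fall back to the shared last-forgotten timestamp.
func (s *cumulativeState) startTimeFor(key string) time.Time {
	if t, ok := s.seriesStart[key]; ok {
		return t
	}
	start := s.processStart
	if !s.lastForgotten.IsZero() {
		start = s.lastForgotten
	}
	s.seriesStart[key] = start
	return start
}

// forget drops a series and records when it happened.
func (s *cumulativeState) forget(key string, now time.Time) {
	delete(s.seriesStart, key)
	s.lastForgotten = now
}

func main() {
	s := &cumulativeState{
		seriesStart:  map[string]time.Time{},
		processStart: time.Now().Add(-3 * time.Hour),
	}
	fmt.Println(s.startTimeFor("http.requests{path=/a}")) // process start
	s.forget("http.requests{path=/a}", time.Now().Add(-time.Hour))
	fmt.Println(s.startTimeFor("http.requests{path=/a}")) // last-forgotten time
}
```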
The Java SDK changed its behavior after I posted that "we aggressively forget metric streams - any metric stream that doesn't have recordings during a collection is forgotten". We still purge series after collection when the temporality is delta, but for cumulative metrics we retain all series and have an upper limit of 2000 timeseries per instrument. If the timeseries limit is exceeded, we log a warning and drop recordings.
What about using the timestamp of the last collection as the start time?
For push-based exporters, this works. For pull-based exporters, it also works in the common case where there is only one scraper pulling data from the SDK. For pull-based exporters with more than one scraper, each scraper would have to provide its last-scrape timestamp in the request to do this well. If the exporter simply tracks the last-export timestamp by any scraper, you might end up in a situation where the last-collection time is not meaningful.

In a Prometheus environment, where rate calculations are typically done over timeseries vectors rather than point start times, I don't think you will see a meaningful difference -- although this is a source of ambiguity when counters reset. The purpose of the start timestamp is to avoid ambiguity over counter resets, so that rate calculations can be made easily. If the last-collection timestamp is used as the start timestamp for new series, you may end up with OTLP data points with extreme-value instantaneous rates.
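As a concrete (hypothetical) illustration of that last point: suppose a counter quietly accumulated 3600 increments over the past hour and is then reported as a "new" series whose start time is set to the last collection, 10 seconds before its timestamp. A consumer computing a point-wise rate would see an extreme value:

```go
package main

import "fmt"

func main() {
	const value = 3600.0        // increments accumulated over roughly an hour
	const actualWindow = 3600.0 // seconds since the series really started
	const reportedWindow = 10.0 // seconds since the last collection, used as the start time

	fmt.Printf("true rate:     %.1f/s\n", value/actualWindow)   // 1.0/s
	fmt.Printf("apparent rate: %.1f/s\n", value/reportedWindow) // 360.0/s
}
```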
@legendecas made an insightful remark here, open-telemetry/opentelemetry-js#2997 (comment), noting that temporality is irrelevant to this problem. Cumulative/Synchronous and Delta/Asynchronous configurations both have the same problem. In both cases, a user may stop reporting a set of attributes and then possibly resume reporting the same set at some point in the distant future. Assuming we define a circuit-breaker mechanism, such as a configuration to "expunge" instrument/attributes that have become stale (without loss of information => addresses long-term risk) or to drop instrument/attributes that are presently overflowing (with loss of information => addresses short-term risk), we have an independent problem of how to convey that to the OTLP or PRW consumer. I propose to leave this issue open as a request to define the circuit-breaker mechanism; examples include:
I will file a new issue to discuss how the SDK should "safely" expunge these series. |
It was pointed out in this discussion that Prometheus allows users to define their own mechanism to purge their own data.
Might be worth specifying a similar mechanism instead of a prescriptive drop policy. That way the person most informed about the data, the user, can make the decision.
We discussed this in the Prometheus working group today because we feel it is important for OTel and Prometheus to align on how we handle cardinality explosions. We focused on these configuration options documented for the Prometheus server, particularly
We are interested in describing what should happen for Push and Pull exporters when a cardinality limit is reached. We know that when these limits are reached, the entire scrape is treated as failed (Pull). We also have standard gRPC/HTTP message-size limits that, however configured, can cause an entire batch of metrics to be "treated as failed" without further specification (as we have not specified any way of splitting an OTLP metrics payload). Therefore, we are interested in allowing the user to configure limits at a finer grain to avoid the scenario where all metrics are lost. Cardinality limits could be applied at:
That said, these configuration settings should be considered alongside other means of controlling cardinality. Currently, we have these other approaches:
Speaking as a user, I've already experienced the limit described above when pushing OTLP payloads over gRPC that exceed the configured message-size limit. I would like isolation to be built in so that a single instrumentation library cannot cause the loss of all metrics. I believe the instrumentation scope is the natural place to apply limits such as total byte size and data point count. When an instrumentation scope exceeds its limits, the following should happen:

a. The entire batch of data from that scope fails. Presumably this will be observable in the future, in the sense that we want to have metrics about the SDK's successes and failures, and those metrics can include the instrumentation scope name/version/attributes.

These recommendations are in line with Prometheus in that Views should be applied before comparing against limits and "entire failure" is the consequence when limits are exceeded.
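As a rough sketch of that idea, a per-scope check before export might look like the following. The types, field names, and limit values are hypothetical; they are not part of OTLP or any SDK API.

```go
package main

import (
	"fmt"
	"log"
)

// scopeBatch is a hypothetical stand-in for the metrics produced by one
// instrumentation scope in a single collection.
type scopeBatch struct {
	ScopeName  string
	DataPoints int
	Bytes      int
}

// scopeLimits holds hypothetical per-scope limits.
type scopeLimits struct {
	MaxDataPoints int
	MaxBytes      int
}

// enforce drops whole scope batches that exceed their limits, so that one
// misbehaving instrumentation library does not fail the entire export.
func enforce(batches []scopeBatch, limits scopeLimits) []scopeBatch {
	kept := batches[:0]
	for _, b := range batches {
		if b.DataPoints > limits.MaxDataPoints || b.Bytes > limits.MaxBytes {
			log.Printf("dropping scope %q: %d points, %d bytes exceed limits",
				b.ScopeName, b.DataPoints, b.Bytes)
			continue // the whole scope batch is treated as failed
		}
		kept = append(kept, b)
	}
	return kept
}

func main() {
	out := enforce([]scopeBatch{
		{ScopeName: "io.example.db", DataPoints: 120, Bytes: 40_000},
		{ScopeName: "io.example.http", DataPoints: 900_000, Bytes: 300_000_000},
	}, scopeLimits{MaxDataPoints: 10_000, MaxBytes: 4_000_000})
	fmt.Println(len(out)) // 1
}
```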
IMO it's imperative to have an explicit way to delete a time series, and I've opened an issue to discuss this: #3062 (I searched for an existing issue and couldn't find one; not sure GitHub search is that good). In that issue, I gave a real-world example of why it is needed.
For sync instruments, IMO you should never drop at the instrument level; instead, leave it to exporters to determine whether they can support the cardinality produced.
At least for Prometheus, this is unnecessary, as staleness tracking is done by the scraper, not by the client. The client simply stops exposing the timeseries in question on the scrape endpoint.
I agree with @jack-berg above. It should be the time of the last collection (i.e. the last time the callback is invoked). At that point in time, a callback was invoked and no value was returned. If there are multiple scrapers, the start time is still correct in this case. Suppose we had scrapes for a particular label set:

- t0: "A" scrapes -> no value
- t1: "B" scrapes -> no value
- t2: "A" scrapes -> value X
- t3: "B" scrapes -> value Y

"A" would receive (at the t2 scrape): value X with start time t1.

"B" would receive (at the t3 scrape): value Y with start time t1.

I believe this is correct, as the value X was implicitly accumulated between t1 and t2, and Y was implicitly accumulated between t1 and t3. While a start time of t0 (instead of t1) for the "A" scraper might also be correct (if we were tracking start time per-scraper), t1 is also correct.

For Prometheus, the built-in standard sync instruments (e.g. gauge) don't support Delete() (at least not in Go). You can use a raw MetricVec to delete label values, but that isn't very common in my experience; it is documented as a building block for building your own synchronous instruments. More common is to implement a custom collector (e.g. cadvisor), which is roughly the equivalent of a callback, as it allows the user to put the current value of metrics into a channel (in Go); see the sketch below.

For users who need a mechanism to report metrics which have bounded cardinality in a particular collection round, but unbounded cardinality over time (e.g. something like kube-state-metrics, which reports a metric for each pod in a Kubernetes cluster, but which is likely to have about the same number of pods in the cluster at any point in time), I would advocate that they use asynchronous instruments to intentionally "forget" label sets that are no longer relevant, and that the SDK treat the non-reporting of a label set as a signal that it should no longer export or cache information about it.
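For reference, a minimal sketch of that custom-collector pattern with the Prometheus Go client (client_golang) might look like this; each Collect call exposes only the label sets that currently exist, so stale label sets simply disappear. The metric name, the listPods helper, and the pod values are placeholders.

```go
package main

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// podCollector reports one gauge series per currently-known pod. Label sets
// that disappear from listPods simply stop being exposed; no Delete() call
// is needed and nothing accumulates across collections.
type podCollector struct {
	desc *prometheus.Desc
}

func newPodCollector() *podCollector {
	return &podCollector{
		desc: prometheus.NewDesc(
			"example_pod_info", // hypothetical metric name
			"One series per pod currently known to this process.",
			[]string{"pod"}, nil,
		),
	}
}

func (c *podCollector) Describe(ch chan<- *prometheus.Desc) { ch <- c.desc }

func (c *podCollector) Collect(ch chan<- prometheus.Metric) {
	// In a real collector this would come from an informer cache or similar.
	for _, pod := range listPods() {
		ch <- prometheus.MustNewConstMetric(c.desc, prometheus.GaugeValue, 1, pod)
	}
}

// listPods is a stand-in for whatever source of truth the application has.
func listPods() []string { return []string{"pod-a", "pod-b"} }

func main() {
	reg := prometheus.NewRegistry()
	reg.MustRegister(newPodCollector())
	http.Handle("/metrics", promhttp.HandlerFor(reg, promhttp.HandlerOpts{}))
	_ = http.ListenAndServe(":2112", nil)
}
```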
What can they use to report a histogram, as it doesn't have an asynchronous version?
They wouldn't be able to use the approach above today, and would need what you are proposing.
I don't want to delve too deeply into k8s internals, but the kube client already maintains the set of watched objects in memory for us. It would likely be simpler to use an async instrument based on listing objects from the client's cache. |
Fixes #1891. **EDIT: Updated to specify cardinality limits at the View/Instrument level with a Reader-level default. Updated to use a hard limit.**

## Changes

Adds optional support for a maximum cardinality limit. The recommended default is 2000, based on this comment by @jack-berg: #1891 (comment). The Prometheus-WG SIG discussed this on Nov 9, 2022 and reached this recommended solution to the problem outlined in #1891. The consequence of exceeding these limits is in line with current Prometheus server behavior, which drops targets that misbehave. The discussion was summarized here: #1891 (comment)
It is possible (perhaps even likely) that instrumentation will generate metrics with cardinality explosions. There may be backends that can handle this, so it's not necessarily an error in the instrumentation. However, it will often be a mistake in the instrumentation, and the SDK needs to include some sort of mechanism so that the application will not die with a memory leak in this case. This could especially happen with a pull-based exporter if nothing is pulling the data from it, for example.
The SDK specification should define the standard strategy for this circuit breaker. There are several possible options for solving this, and it would be good if all languages implemented the same default strategy.
Some options:
a) Count the cardinality and start dropping recordings for instruments that exceed a limit.
b) Watch for specific dimensions that are causing the explosion and start automatically dropping those dimensions (see the sketch after this list).
c) Something more clever.
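For illustration of option (b), here is a minimal Go sketch that tracks how many distinct values each attribute key has produced and, past a hypothetical threshold, drops that key from subsequent recordings. All names and the threshold are illustrative, not a specified SDK behavior.

```go
package main

import "fmt"

// dimensionGuard watches how many distinct values each attribute key has
// produced and, past a threshold, drops that key from further recordings.
type dimensionGuard struct {
	maxValuesPerKey int
	seen            map[string]map[string]struct{} // key -> set of observed values
	dropped         map[string]bool                // keys identified as exploding
}

func newDimensionGuard(maxValuesPerKey int) *dimensionGuard {
	return &dimensionGuard{
		maxValuesPerKey: maxValuesPerKey,
		seen:            map[string]map[string]struct{}{},
		dropped:         map[string]bool{},
	}
}

// Filter returns the attributes with exploding dimensions removed.
func (g *dimensionGuard) Filter(attrs map[string]string) map[string]string {
	out := map[string]string{}
	for k, v := range attrs {
		if g.dropped[k] {
			continue
		}
		if g.seen[k] == nil {
			g.seen[k] = map[string]struct{}{}
		}
		g.seen[k][v] = struct{}{}
		if len(g.seen[k]) > g.maxValuesPerKey {
			g.dropped[k] = true // this dimension is causing the explosion
			continue
		}
		out[k] = v
	}
	return out
}

func main() {
	g := newDimensionGuard(2)
	for _, id := range []string{"u1", "u2", "u3"} {
		fmt.Println(g.Filter(map[string]string{"route": "/users", "user_id": id}))
	}
	// Once user_id exceeds 2 distinct values, it is dropped from recordings.
}
```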