InfluxDB Counter value is off since upgrading to 1.7.x #2787
I just tried 1.8.0-M3; same behavior as 1.7.x.
@cykl Thanks for the report. For a reproducer, you can take a look at our samples, e.g.: Another thing that might be worth mentioning is that in 1.7 we introduced V2 support for InfluxDB. Update: another quick question: why do you increment your counters with 0?
We use InfluxDB v1 (1.8.x). Incrementing by 0 is indeed unnecessary and likely a slip-up; the meter registry reports a meter's value as soon as it is created, so no 0 increment is needed. We only care about all meters being created at startup rather than popping up the first time an error occurs. Unfortunately, the hard part in providing a reproducer is being able to reproduce the issue locally; writing a test case will be the easy part. I'm trying to strip down the application while preserving the issue, but have failed so far. I will keep trying. I just reviewed the latest changes made to micrometer-registry-influx but didn't spot anything that could be relevant to what I'm observing.
@jonatan-ivanov I think I am on the right track. Being unable to create a minimal test case, I started digging into the time-series data and observed what look like missing samples. Say I have two replicas of an application: some steps have two samples and some have only one. Timestamps from the two replicas are also often very close, like a millisecond apart. Going down the rabbit hole I found e6e6896 Align publication time with StepValue rollover (#1218). This commit first shipped in 1.7.0, which matches my observations while bisecting over releases.
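The rollover arithmetic behind that commit can be sketched as follows (a simplified stand-in, not Micrometer's actual code): because the delay to the next publication is derived from wall-clock time modulo the step size, every JVM configured with the same step publishes on wall-clock step boundaries, no matter when it started.

```java
public class StepAlignment {
    /**
     * Delay (ms) until the next step boundary. The boundary is computed from
     * wall-clock time modulo the step, so all JVMs sharing a step size land
     * on multiples of the step, regardless of their individual start times.
     */
    static long delayUntilNextStep(long wallTimeMs, long stepMs) {
        return stepMs - (wallTimeMs % stepMs);
    }

    public static void main(String[] args) {
        long step = 60_000L; // 1-minute step (illustrative)
        // Two replicas started at arbitrary, different wall-clock times:
        long replicaA = 1_700_000_012_345L;
        long replicaB = 1_700_000_047_890L;
        long alignedA = replicaA + delayUntilNextStep(replicaA, step);
        long alignedB = replicaB + delayUntilNextStep(replicaB, step);
        // Both publication times are exact multiples of the step:
        System.out.println(alignedA); // 1700000040000
        System.out.println(alignedB); // 1700000100000
    }
}
```

Two replicas whose clocks are in sync (e.g. via NTP) will therefore often hit the very same boundary instant.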
In InfluxDB, a point represents a single data record, and within a series each point is uniquely identified by its timestamp: writing a second point to the same series with the same timestamp replaces the conflicting field values of the existing point.
Before 1.7.0, each replica published its points depending on its start time. Given a small number of replicas, it was quite unlikely that two replicas would push a point with the same timestamp into a series. The odds are now much higher; in practice it happens very often. This explains why not all our time-series are impacted.
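As an illustration (the measurement and tag names below are hypothetical), once publication times are aligned the two replicas' line-protocol writes differ only in the field value, so the later write silently wins:

```java
public class LineProtocolCollision {
    // Build an InfluxDB line-protocol entry: measurement,tags fields timestamp
    static String point(String measurement, String tags, double value, long ts) {
        return measurement + "," + tags + " value=" + value + " " + ts;
    }

    public static void main(String[] args) {
        long alignedTs = 1_700_000_040_000L; // both replicas publish on the boundary
        String a = point("transform_success", "model=foo", 5.0, alignedTs);
        String b = point("transform_success", "model=foo", 4.0, alignedTs);
        // Same measurement, same tag set, same timestamp: InfluxDB keeps only
        // the last value written, so one replica's sample disappears.
        System.out.println(a);
        System.out.println(b);
    }
}
```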
Does it make sense? Do you see an obvious workaround and long-term fix?
It sounds like the issue is that the time series between multiple instances of the same application are not unique. If you add a tag to uniquely identify each instance, would that resolve the issue for you?
It would. However, it would also create new issues due to cardinality explosion, which is the major limiting factor of time-series databases. It would be great not to have to deal with such a new Micrometer-specific limitation. At the very least, it should be clearly documented until we figure out a fix. Could you confirm my understanding that all replicas will now push their points at the exact same time? My gut feeling is that we should try to break this coordination between JVMs. Couldn't we align step & publication within a MeterRegistry without aligning steps across all JVM instances? If not, maybe we could do some ugly hack like setting the precision to ns and using the nanos part as a unique identifier.
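The nanosecond hack mentioned above could look like this sketch (entirely hypothetical, not anything Micrometer does): keep the millisecond-aligned timestamp, but fold a per-process random identifier into the sub-millisecond nanosecond digits so two replicas never write the exact same timestamp.

```java
import java.util.concurrent.ThreadLocalRandom;

public class NanoDisambiguation {
    /**
     * Convert an aligned millisecond timestamp to nanoseconds and fold a
     * per-instance identifier into the sub-millisecond part (0..999_999 ns).
     * The millisecond-resolution timestamp is preserved, but points from
     * different instances no longer collide in a shared series.
     */
    static long disambiguate(long alignedMs, long instanceNanos) {
        return alignedMs * 1_000_000L + (instanceNanos % 1_000_000L);
    }

    public static void main(String[] args) {
        long alignedMs = 1_700_000_040_000L;
        long idA = ThreadLocalRandom.current().nextLong(1_000_000);
        long idB = ThreadLocalRandom.current().nextLong(1_000_000);
        long tsA = disambiguate(alignedMs, idA);
        long tsB = disambiguate(alignedMs, idB);
        // Both still round down to the same millisecond...
        System.out.println(tsA / 1_000_000L == alignedMs); // true
        // ...but differ whenever the per-instance identifiers differ.
        System.out.println(tsA != tsB || idA == idB);      // true
    }
}
```

The obvious weakness is the birthday-style collision risk between random identifiers, which is why it is described as an ugly hack rather than a fix.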
Tagging metrics to identify time series from unique instances is common practice and not Micrometer-specific or new. InfluxDB is essentially enforcing this by not allowing multiple publishes at the same time for the same time series, which should come from only one instance. Depending on your environment, a hostname or a platform-provided instance ID tag suffices. Without this, how would you track a time series from the same instance? This isn't generally an area to save on cardinality.
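In Micrometer the suggested tag can be applied as a common tag, e.g. registry.config().commonTags("instance", hostname). The stdlib-only sketch below (measurement and tag names hypothetical) just shows why this resolves the collision: the instance tag makes each replica's series key distinct.

```java
public class InstanceTag {
    // Series key = measurement plus its tag set (simplified model of InfluxDB).
    static String seriesKey(String measurement, String... tags) {
        return measurement + "," + String.join(",", tags);
    }

    public static void main(String[] args) {
        // Without an instance tag, both replicas write into one shared series:
        String shared = seriesKey("transform_success", "model=foo");
        // With a per-instance tag, the series keys no longer collide, so
        // same-timestamp writes land in different series:
        String a = seriesKey("transform_success", "instance=host-a", "model=foo");
        String b = seriesKey("transform_success", "instance=host-b", "model=foo");
        System.out.println(shared);
        System.out.println(a.equals(b)); // false
    }
}
```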
Yes, that appears to be the case now.
I don't fully agree with this response, but I get your point and understand the reasoning behind it. Since I don't think our approach is nonsensical either, let me try to explain what we were doing and why.
Indeed, it's a common practice, and some other monitoring systems like Prometheus even enforce it. Others don't, and some even encourage you not to do so because of cost (i.e. per-time-series billing) or scalability issues.
I would not phrase it like that. Unlike Prometheus, InfluxDB isn't a monitoring system but a TSDB. InfluxDB concepts are well explained here. It doesn't care what a series represents, which labels should or shouldn't be present, or how multiple publishers should cooperate to publish into the same time-series.
Some use cases just don't care about per-instance series, e.g. service-level aggregates. Obviously, you can compute service-level aggregates from instance time series, but this can have a non-negligible cost. The most prominent example is likely CloudWatch: you cannot compute service-level aggregates from instance time series, and publishing per-instance time series is often too expensive. As Micrometer uses client-side aggregation for InfluxDB counters, I might just want to drop the instance tag and publish service-level time series.
I would be more nuanced. Cardinality is usually the main scalability limit and cost factor. Let's say a specific metric has a label with a few thousand distinct values. Publishing service-level time-series might be practical and cost-efficient, while publishing unneeded instance-level time-series would be too costly, or would just kill the TSDB if anything goes wrong (e.g. frequent instance restarts and a few dozen instances). Reducing the cost of the solution by several orders of magnitude by smartly spending my cardinality budget on things that matter to me, rather than on useless tags, sounds more like a good trade-off than being a cheapskate to me.
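To put rough numbers on that trade-off (all figures illustrative, not from the report): a metric with one high-cardinality label published per instance multiplies the series count by the number of replicas.

```java
public class CardinalityBudget {
    public static void main(String[] args) {
        long labelValues = 2_000; // illustrative: distinct values of one label
        long instances = 30;      // illustrative: number of replicas
        // Service-level publishing (drop the instance tag):
        long serviceLevel = labelValues;
        // Instance-level publishing (keep a unique instance tag):
        long instanceLevel = labelValues * instances;
        System.out.println(serviceLevel);  // 2000 series
        System.out.println(instanceLevel); // 60000 series
    }
}
```

With per-series billing or a TSDB near its series limit, that 30x factor is the whole argument.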
If you would like us to look at this issue, please provide the requested information. If the information is not provided within the next 7 days this issue will be closed. |
Closing due to lack of requested feedback. If you would like us to look at this issue, please provide the requested information and we will re-open. |
Describe the bug
Since an application adopted Micrometer 1.7.x, the value of a counter is frequently off by a significant factor (almost half of the expected value). However, most counters seem fine.
I'm still reviewing the application and trying to create a reproducer, but I strongly suspect the application is correct and that the counter is properly incremented (some downstream counters & observations confirm that the counter has been / should have been incremented).
Environment
To Reproduce
I'm working on a reproducer and will update the issue ASAP. Meanwhile I can provide related code and would appreciate an outside perspective on the correctness of the code.
The counter is incremented via transformMetrics.countSuccess( modelName );. By other means, I can confirm that countSuccess is steadily invoked.
With 1.6.x the counter is steadily incremented as expected (using upstream / downstream counters and other means I can confirm that 9 ops/m is expected).
With 1.7.x the counter rate is sometimes 9 ops/m but frequently less than that (again, I can confirm that 9 ops/m is expected).
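For context on why a lost publication shows up as a lower rate: Micrometer's InfluxDB counters are step counters, meaning increments accumulate during the current step and the value published is the previous step's total. A heavily simplified stdlib-only sketch (not Micrometer's actual StepValue, which among other things zeroes the value when multiple steps elapse without activity):

```java
import java.util.concurrent.atomic.AtomicLong;

public class StepCounterSketch {
    private final long stepMs;
    private final AtomicLong current = new AtomicLong();
    private volatile long previous;
    private long stepStart;

    StepCounterSketch(long stepMs, long nowMs) {
        this.stepMs = stepMs;
        this.stepStart = nowMs;
    }

    void increment(long nowMs) {
        roll(nowMs);
        current.incrementAndGet();
    }

    /** Value published for the last completed step. */
    long poll(long nowMs) {
        roll(nowMs);
        return previous;
    }

    // Simplification: assumes at most one step boundary is crossed per call.
    private synchronized void roll(long nowMs) {
        if (nowMs - stepStart >= stepMs) {
            previous = current.getAndSet(0);
            stepStart += stepMs * ((nowMs - stepStart) / stepMs);
        }
    }
}
```

Each step's total is reported exactly once, so if that one point is overwritten by another replica's write at the same timestamp, those increments are gone for good; that matches a rate that is "sometimes 9 ops/m but frequently less".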
Expected behavior
Micrometer 1.7.x should behave like 1.6.x.
Additional context
I apologize for the lengthy and messy report. I have not yet figured out what is going on and can't provide a clean reproducer yet. I'm wondering if any change between the latest 1.6.x and the earliest 1.7.x might explain such an issue.
Most other counters seem perfectly fine 🤷