
InfluxDB Counter value is off since upgrading to 1.7.x #2787

Closed
cykl opened this issue Sep 16, 2021 · 11 comments
Labels
bug, closed-as-inactive, registry: influx

Comments

cykl commented Sep 16, 2021

Describe the bug

Since an application adopted Micrometer 1.7.x, the value of a counter is frequently off by a significant factor (almost half of the expected value). However, most counters seem fine.

I'm still reviewing the application and trying to create a reproducer, but I strongly suspect the application is correct and that the counter is properly incremented (some downstream counters & observations confirm that the counter has been / should have been incremented).

Environment

  • Micrometer version: 1.7.x (tested 1.7.0 and 1.7.4). Works fine on 1.6.x.
  • Micrometer registry: InfluxDB
  • OS: Linux
  • Java version: 11.0.12+7

To Reproduce

I'm working on a reproducer and will update the issue ASAP. Meanwhile I can provide related code and would appreciate an outside perspective on the correctness of the code.

  1. Micrometer is configured with an InfluxDB registry and some MeterFilters (e.g. rewriting the "[app]" prefix, adding some common tags, etc.); a sketch of such a filter setup follows the helper class below.
  2. On the happy path, the application invokes transformMetrics.countSuccess( modelName ). By other means, I can confirm that countSuccess is steadily invoked.
  3. TransformMetrics is a small helper to ensure that all counters are properly initialized from an enum.
    static final class TransformMetrics {

        TransformMetrics( ) {
            initializeCounters( );
        }

        // Eagerly register every success/failure counter so all series exist from startup.
        private void initializeCounters( ) {
            for( FooModel fooModel : FooModel.values( ) ) {
                var modelName = fooModel.modelName( );
                getCounter( modelName, EventOutcome.SUCCESS, null ).increment( 0 );
                for( EventOutcomeReason reason : EventOutcomeReason.values( ) ) {
                    getCounter( modelName, EventOutcome.FAILURE, reason ).increment( 0 );
                }
            }
        }

        // Looks up (or creates) the counter for a given device model, outcome and optional reason.
        private Counter getCounter(
            String deviceModel, EventOutcome eventOutcome, @Nullable EventOutcomeReason reason ) {

            var tags = Tags.of( XxxTags.NO_APPLICATION_REPLICA_ID ).and(
                    "device-model", deviceModel,
                    "outcome", eventOutcome.toString( ) );

            var reasonTag = reason != null ? reason.toString( ) : "";
            tags = tags.and( "reason", reasonTag );

            return Metrics.counter( "[app]." + EVENT_ACTION, tags );
        }

        void countError( EventOutcomeReason eventOutcomeReason, String deviceModel ) {
            getCounter( deviceModel, EventOutcome.FAILURE, eventOutcomeReason ).increment( );
        }

        void countSuccess( String deviceModel ) {
            getCounter( deviceModel, EventOutcome.SUCCESS, null ).increment( );
        }
    }
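For context, here is a minimal sketch of the kind of MeterFilter setup mentioned in step 1. The filter implementation, the "my-app" name, and the global-registry wiring are assumptions for illustration, not the actual application code:

    import io.micrometer.core.instrument.Meter;
    import io.micrometer.core.instrument.Metrics;
    import io.micrometer.core.instrument.Tags;
    import io.micrometer.core.instrument.config.MeterFilter;

    final class MetricsConfig {

        static void configureFilters( ) {
            Metrics.globalRegistry.config( )
                    // Common tags added to every meter (key and value are placeholders).
                    .meterFilter( MeterFilter.commonTags( Tags.of( "application", "my-app" ) ) )
                    // Rewrite the "[app]." placeholder prefix into the real application name.
                    .meterFilter( new MeterFilter( ) {
                        @Override
                        public Meter.Id map( Meter.Id id ) {
                            return id.getName( ).startsWith( "[app]." )
                                    ? id.withName( "my-app." + id.getName( ).substring( "[app].".length( ) ) )
                                    : id;
                        }
                    } );
        }
    }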

With 1.6.x, the counter is steadily incremented as expected (using upstream / downstream counters and other means, I can confirm that 9 ops/m is expected):

[screenshot: counter rate graph on 1.6.x, steady at 9 ops/m]

With 1.7.x, the counter rate is sometimes 9 ops/m but frequently less than that (again, I can confirm that 9 ops/m is expected):

[screenshot: counter rate graph on 1.7.x, frequently below 9 ops/m]

Expected behavior

Micrometer 1.7.x should behave like 1.6.x.

Additional context

I apologize for the lengthy and messy report. I have not yet figured out what is going on and can't provide a clean reproducer. I'm wondering if any change between the latest 1.6.x and the earliest 1.7.x might explain such an issue.

Most other counters seem perfectly fine 🤷

cykl added the bug label Sep 16, 2021

cykl commented Sep 16, 2021

I just tried 1.8.0-M3, same behavior as 1.7.x.

jonatan-ivanov (Member) commented Sep 16, 2021

@cykl Thanks for the report. For a reproducer, you can take a look at our samples, e.g. CounterSample (you can modify it to make it less random and increment the counter every second).
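For illustration, a stripped-down reproducer along those lines might look like the following sketch (the all-defaults InfluxConfig, the meter name, and the one-second loop are assumptions to adapt to your setup):

    import io.micrometer.core.instrument.Clock;
    import io.micrometer.core.instrument.Counter;
    import io.micrometer.influx.InfluxConfig;
    import io.micrometer.influx.InfluxMeterRegistry;

    public class CounterReproducer {

        public static void main( String[] args ) throws InterruptedException {
            InfluxConfig config = new InfluxConfig( ) {
                @Override
                public String get( String key ) {
                    return null; // accept the defaults (local InfluxDB v1, default database)
                }
            };
            InfluxMeterRegistry registry = new InfluxMeterRegistry( config, Clock.SYSTEM );
            Counter counter = registry.counter( "reproducer.counter" );

            // Increment once per second; the published rate should show a steady 60 ops/min.
            while( true ) {
                counter.increment( );
                Thread.sleep( 1000 );
            }
        }
    }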

Another thing that might be worth mentioning is that in 1.7 we introduced support for InfluxDB V2.
Could you please check if you are using V1 or V2?

Update: Another quick question: why do you increment your counters with 0?

jonatan-ivanov added the waiting for feedback label Sep 16, 2021
shakuzen added the registry: influx label Sep 17, 2021

cykl commented Sep 17, 2021

@jonatan-ivanov

We use InfluxDB v1 (1.8.x).

Incrementing by 0 is indeed unnecessary and likely a slip-up. The meter registry will report a meter's value as soon as it is created, so there is no need to perform a 0 increment. We only care about all meters being created at startup rather than popping up the first time an error occurs.
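A minimal sketch of that kind of eager registration (the meter name and tag are illustrative, not the real ones):

    import io.micrometer.core.instrument.Counter;
    import io.micrometer.core.instrument.Metrics;

    final class CounterInit {
        // Registering the counter is enough for the registry to report it; no increment( 0 ) needed.
        static final Counter SUCCESS = Counter.builder( "app.event" )
                .tag( "outcome", "SUCCESS" )
                .register( Metrics.globalRegistry );
    }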

Unfortunately, the hard part of providing a reproducer is being able to reproduce the issue locally; writing a test case will be the easy part. I'm trying to strip down the application while preserving the issue, but have failed so far. I will keep trying.

I just reviewed the latest changes made to micrometer-registry-influx but didn't spot anything that could be relevant to what I'm observing.


cykl commented Sep 17, 2021

@jonatan-ivanov I think I am on the right track.

Being unable to create a minimal test case, I started digging into the time-series data and observed what looks like missing samples. Let's say I have two replicas of an application: some steps have two samples and some steps have only one. Timestamps from the two replicas are also often very close, like a millisecond apart.

Going down the rabbit hole, I found e6e6896 "Align publication time with StepValue rollover" (#1218). This commit first landed in 1.7.0, which matches my observations while bisecting over releases.

git tag --contains e6e68964626809998047036aed980823d6ed991f
v1.7.0
v1.7.0-M1
v1.7.0-RC1
v1.7.1
v1.7.2
v1.7.3
v1.7.4
v1.8.0-M1
v1.8.0-M2
v1.8.0-M3

In InfluxDB, a point represents a single data record. Each point:

  • has a measurement, a tag set, a field key, a field value, and a timestamp;
  • is uniquely identified by its series and timestamp.
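To make the uniqueness rule concrete, here is an illustrative pair of writes in InfluxDB line protocol (measurement, tag, field names and values are made up): because both share the same measurement, tag set, and timestamp, InfluxDB keeps only the most recent point instead of two samples.

    # written by replica A
    event_action,device-model=foo,outcome=SUCCESS count=41 1631800800000
    # written by replica B with the same series and timestamp: it silently replaces replica A's point
    event_action,device-model=foo,outcome=SUCCESS count=38 1631800800000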

Before 1.7.0, each replica published its points at times depending on its start time. Given a small number of replicas, it was quite unlikely that two replicas would push a point with the same timestamp into a series. The odds are now much higher; in practice it happens very often.

This explains why not all of our time series are impacted. The XxxTags.NO_APPLICATION_REPLICA_ID tag you can see in my initial report triggers a MeterFilter that removes the replica-id tag we usually put on time series to avoid creating too many of them. Meters without this tag are not impacted, as two replicas never write into the same series.

Notes:

  • Micrometer sets the precision to ms on POST /write. Given the intended coordination between MeterRegistry instances, it seems quite likely that two registries write into the same series with the same timestamp at millisecond precision.
  • This change also likely puts much more stress on time-series databases, as all registries now write at the same time, leading to a very bursty workload.

Does this make sense? Do you see an obvious workaround and a long-term fix?

shakuzen (Member) commented:

It sounds like the issue is that the time series between multiple instances of the same application are not unique. If you add a tag to uniquely identify each instance, would that resolve the issue for you?
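A minimal sketch of that suggestion, assuming the instance identifier can be taken from an environment variable (the tag name, variable, and fallback value are placeholders):

    import io.micrometer.core.instrument.Metrics;
    import io.micrometer.core.instrument.Tags;
    import io.micrometer.core.instrument.config.MeterFilter;

    final class InstanceTagConfig {

        static void configure( ) {
            // Placeholder: take a per-instance identifier from the environment.
            String instanceId = System.getenv( ).getOrDefault( "HOSTNAME", "replica-0" );
            // Each replica then writes its own series, so step-aligned publishes no longer collide.
            Metrics.globalRegistry.config( )
                    .meterFilter( MeterFilter.commonTags( Tags.of( "instance", instanceId ) ) );
        }
    }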


cykl commented Sep 21, 2021

It would.

However, it would also create new issues due to cardinality explosion, which is the major limiting factor of time-series databases. It would be great not to have to deal with such a new Micrometer-specific limitation. At the very least, it should be clearly documented until we figure out a fix.

Could you confirm my understanding that all replicas will now push their points at the exact same time?

My gut feeling is that we should try to break this coordination between JVMs. Couldn't we align step & publication within a MeterRegistry without aligning steps across all JVM instances?

If not, maybe we could do some ugly hack like setting the precision to ns and using the nanos part as a unique identifier.

shakuzen (Member) commented:

Tagging metrics to identify time series from unique instances is common practice for metrics and not micrometer specific or new. InfluxDB is essentially enforcing this by not allowing multiple publishes at the same time for the same time series that should come from only one instance. Depending on your environment, a hostname or a platform-provided instance ID tag suffices. Without this, how would you track a time series from the same instance? This isn't generally an area to save on cardinality.

Could you confirm my understanding that all replicas will now push their points at the exact same time?

Yes, that appears to be the case now.
The distributed coordination on publication time is a side effect we probably overlooked in merging #1218. Thank you for the investigation on that. While we want consistency within a process (or registry), we don't want coordination across processes on publishing, as you suggested.
That said, the publication happening at a different millisecond across instances should not be relied upon either.


cykl commented Oct 27, 2021

I don't fully agree with this response, but I get your point and understand the reasoning behind it. Since I don't think our approach is nonsensical, let me try to explain what we were doing and why.

Tagging metrics to identify time series from unique instances is common practice for metrics and not micrometer specific or new.

Indeed, it's a common practice, and some other monitoring systems like Prometheus even enforce it. Others don't, and even encourage you not to do so because of cost (i.e. per-time-series billing) or scalability issues.

InfluxDB is essentially enforcing this by not allowing multiple publishes at the same time for the same time series that should come from only one instance.

I would not phrase it like that. Unlike Prometheus, InfluxDB isn't a monitoring system but a TSDB. InfluxDB concepts are well explained here. It doesn't care what a series represents, which labels should or shouldn't be present, or how multiple publishers should cooperate to publish into the same time series.

Depending on your environment, a hostname or a platform-provided instance ID tag suffices. Without this, how would you track a time series from the same instance?

Some use cases just don't care, e.g. service-level aggregates. Obviously, you can compute service-level aggregates from instance time series, but this might have a non-negligible cost. The most prominent example is likely CloudWatch: there you cannot compute service-level aggregates from instance time series, and publishing per-instance time series is often too expensive.

As Micrometer uses client-side aggregation for InfluxDB counters, I might just want to drop the instance tag and publish service-level time series.

This isn't generally an area to save on cardinality.

I would be more nuanced. Cardinality is usually the main scalability limit and cost factor.

Let's say a specific metric has a label with a few thousand distinct values. Publishing service-level time series might be practical and cost-efficient, while publishing unneeded instance-level time series would be too costly, or could just kill the TSDB if anything goes wrong (e.g. frequent instance restarts and a few dozen instances).
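For illustration with made-up numbers: a label with 3,000 distinct values published from 30 instances produces roughly 90,000 time series, versus 3,000 series if the instance tag is dropped and only service-level series are kept.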

Reducing the cost of the solution by several orders of magnitude by smartly spending my cardinality budget on things that matter to me, rather than on useless tags, sounds more like a good trade-off than being a cheapskate to me.

marcingrzejszczak (Contributor) commented:

Can we assume that @shakuzen's answer solves this issue? Or should we do more, in your opinion, @cykl?

github-actions bot commented Jan 2, 2024

If you would like us to look at this issue, please provide the requested information. If the information is not provided within the next 7 days this issue will be closed.

github-actions bot commented:

Closing due to lack of requested feedback. If you would like us to look at this issue, please provide the requested information and we will re-open.

github-actions bot closed this as not planned Jan 10, 2024
jonatan-ivanov removed the waiting for feedback and feedback-reminder labels Feb 7, 2024