Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PushMeterRegistry doesn't align publish() with step boundaries #1218

Closed
john-karp opened this issue Feb 7, 2019 · 2 comments · Fixed by #2393
Closed

PushMeterRegistry doesn't align publish() with step boundaries #1218

john-karp opened this issue Feb 7, 2019 · 2 comments · Fixed by #2393
Labels
enhancement A general enhancement module: micrometer-core An issue that is related to our core module
Milestone

Comments

@john-karp
Copy link
Contributor

john-karp commented Feb 7, 2019

PushMeterRegistry uses scheduledExecutorService.scheduleAtFixedRate(this::publish, config.step().toMillis(), config.step().toMillis(), TimeUnit.MILLISECONDS) to schedule publish() with an initial delay of config.step().toMillis(). So while the period between invocations is fixed, they may run at any time with respect to the epoch.

StepDouble sets step boundaries relative to the epoch with final long stepTime = now / stepMillis.

As a result, PushMeterRegistry can schedule calls to publish() with arbitrary offsets relative to step boundaries. If we are unlucky, publish() will be scheduled right before a step boundary, so ordinary scheduling variations can lead to publish() sometimes being called before the boundary, and sometimes right after, leading to StepDouble sporadically duplicating or dropping data. Code example here:
https://gist.github.com/john-karp/93b86885b39f9d813345100be2a6496a

The simplest fix for PushMeterRegistry would set the initial delay of the executor service to align it with the steps. (Though I think ScheduledExecutorService cannot be trusted to keep that alignment in the long term when you have a server with ntpd active.)

@izeye
Copy link
Contributor

izeye commented Feb 5, 2021

This seems to be resolved via #2393.

@shakuzen
Copy link
Member

For a visual, let me provide a diagram illustrating the problem. - marks the passage of time, | marks step interval boundaries, p marks publications, s marks when start is called on the MeterRegistry, S1/S2 indicate ordinal step intervals since the start. When a publication happens, the values from the previous step interval are published. Step boundaries are relative to epoch time.

S1      S2         S3         S4
--s|---------p|----------|p-------p|---

In the above case, the MeterRegistry is started at the end of the step interval S1. The first publication will be scheduled one step interval later, at the end of S2. That publication will include any StepValue based metrics with values recorded during S1.

The problem described by this issue is illustrated by the next publish that may happen slightly later than exactly one step interval, pushing it across a step boundary. Therefore the second publication (the second p) happens in S4, publishing values recorded during S3. This means values recorded in S2 are never published. This is the dropping of data referred to in the original description.

Subsequently, if the next publication happened slightly sooner than scheduled, it could happen in the same step as the previous publication, resulting in publishing the same values as the previous publish. I'm not sure if this happens in practice. I can understand scheduling being late but I'm not sure how it would be earlier than expected.

The fix implemented here changed the situation to the following illustration.

S1      S2         S3         S4
--s|p---------|p---------|-p--------|p--

This avoids the above mentioned issues. The first publication happens at the start of the next step interval after start. If there is a delay in publishing, it hopefully is shorter than the step interval, like shown in the publication that happens during S4.

shakuzen added a commit to shakuzen/micrometer that referenced this issue Apr 10, 2023
This avoids different instances calling publish at the same time. It will avoid scheduling publishing at the last 20% of a step interval to avoid the issue described in micrometer-metricsgh-1218 where publishing could take long enough to finish in a subsequent step after it started.
An additional background thread is added for StepMeterRegistry that will poll all meters at the start of the step, separate from when publishing happens. This is so the values published are the ones recorded in the previous step. We need rollCount to be called for step meters as early as possible in the step to minimize the number of recordings reported for the wrong step. This is not perfect, but it is no worse than the status quo.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement A general enhancement module: micrometer-core An issue that is related to our core module
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants