Start publish at randomized time in next step #3750

shakuzen · 2023-04-10T11:00:18Z

This avoids different instances calling publish at the same time by scheduling them at a time randomly chosen within the next step upon start being called. It will avoid scheduling publishing at the last 20% of a step interval to avoid the issue described in gh-1218 where publishing could take long enough to finish in a subsequent step after it started. An additional background thread is added for StepMeterRegistry that will poll all meters at the start of the step, separate from when publishing happens. This is so the values published are the ones recorded in the previous step. We need rollCount to be called for step meters as early as possible in the step to minimize the number of recordings reported for the wrong step. This is not perfect, but it is no worse than the status quo.

Resolves gh-2818

micrometer-core/src/main/java/io/micrometer/core/instrument/push/PushMeterRegistry.java

micrometer-core/src/main/java/io/micrometer/core/instrument/step/StepMeterRegistry.java

sonatype-lift · 2023-04-10T11:16:35Z

🛠 Lift Auto-fix

Some of the Lift findings in this PR can be automatically fixed. You can download and apply these changes in your local project directory of your branch to review the suggestions before committing.¹

# Download the patch
curl https://lift.sonatype.com/api/patch/github.com/micrometer-metrics/micrometer/3750.diff -o lift-autofixes.diff

# Apply the patch with git
git apply lift-autofixes.diff

# Review the changes
git diff

Want it all in a single command? Open a terminal in your project's directory and copy and paste the following command:

curl https://lift.sonatype.com/api/patch/github.com/micrometer-metrics/micrometer/3750.diff | git apply

Once you're satisfied, commit and push your changes in your project.

You can preview the patch by opening the patch URL in the browser. ↩

This avoids different instances calling publish at the same time. It will avoid scheduling publishing at the last 20% of a step interval to avoid the issue described in micrometer-metricsgh-1218 where publishing could take long enough to finish in a subsequent step after it started. An additional background thread is added for StepMeterRegistry that will poll all meters at the start of the step, separate from when publishing happens. This is so the values published are the ones recorded in the previous step. We need rollCount to be called for step meters as early as possible in the step to minimize the number of recordings reported for the wrong step. This is not perfect, but it is no worse than the status quo.

lenin-jaganathan · 2023-04-10T13:58:33Z

micrometer-core/src/main/java/io/micrometer/core/instrument/push/PushMeterRegistry.java

 long stepMillis = config.step().toMillis();
- long initialDelayMillis = stepMillis - (clock.wallTime() % stepMillis) + 1;
+ long initialDelayMillis = calculateInitialDelay();


This in a way enforces registries to complete the publish() in 20% of the step time. And with the issues referenced in #3717, this could be a serious issue for users, unless there is an option to preserve the existing behavior.

If the step registries can publish within 20% of the step time, then I believe that alone would ensure that we are reading the data for right step.

If you aren't publishing within 20% of the step time, that's probably a separate issue to look into. I would think there must be something wrong with the registry implementation, high cardinality, not enough CPU/memory, or not enough resources for your metrics backend. Prior to Micrometer 1.7, we published at any arbitrary time in the step, and I don't think anyone reported publishing taking so long. If publishing takes a long time, I don't think the answer is to publish at the beginning of the step to give the most time possible. That just exacerbates the issue because it will slow down shared resources (network, metrics backend) when all of your instances try to use them at the same time.

If the step registries can publish within 20% of the step time, then I believe that alone would ensure that we are reading the data for right step.

Not if we're scheduling at the end of the step. That's the whole point. Registries still need to publish in a reasonable amount of time, and we need to look into cases where that isn't happening.

Not if we're scheduling at the end of the step.

I agree with this. But, what I meant is if 20% is reasonable enough time to publish we can do the publishing anytime. But, yeah that will bring the issue of Step's not reporting data for the particular step. Fair enough.

But, I would probably have some config to have the old behaviour for at least next 1 or 2 versions. I will leave you guys to decide on that.

We try very hard to not break compatibility, which means it is very hard for us to remove something once exposed as public API. This is one of the reasons I try very hard to not make things configurable that in theory we believe users should not need to adjust. I would like to get more feedback on this with RC1 and we can consider making it configurable in GA or the next release if there's a need. Let's look into it more.

lenin-jaganathan · 2023-04-10T14:11:23Z

micrometer-core/src/main/java/io/micrometer/core/instrument/step/StepMeterRegistry.java

+ // VisibleForTesting
+ void pollMetersToRollover() {
+ this.getMeters()
+ .forEach(m -> m.match(gauge -> null, Counter::count, Timer::count, DistributionSummary::count,


I see the point in just calling count on timers will be faster, but I also could see this can get trickier where the timer snapshots(max/histograms especially if things are step based) will not be in sync with the counts and totals.

I think I might be missing the point you're making. Max and histograms rotate on writing and reading, so they don't have the same need for this as count/total do. The max and histograms rotation are not aligned to the epoch step anyway.

Yeah, they are. Oh, wait this is on StepRegistry which should be fine. Thanks for the clarification. I might need to do some testing with the case of SignalFxTimer which is an abstract one in the Step registry. I will get back to that.

micrometer-core/src/main/java/io/micrometer/core/instrument/push/PushMeterRegistry.java

jonatan-ivanov · 2023-04-10T22:04:18Z

micrometer-core/src/main/java/io/micrometer/core/instrument/push/PushMeterRegistry.java

+ long randomOffsetWithinStep = (long) (stepMillis * random.nextDouble()
+ * PERCENT_RANGE_OF_RANDOM_PUBLISHING_OFFSET);
+ long offsetToStartOfNextStep = stepMillis - clock.wallTime() % stepMillis;
+ return offsetToStartOfNextStep + 2 + randomOffsetWithinStep;


Can we add a comment that says something like:
Wait till the next step + a little more randomly somewhere between 0 and 80% of the step interval + 2ms?

Btw why 2ms? Because the rollover is +1?

why 2ms? Because the rollover is +1?

Yes.

I added a polish commit with a comment explaining. I also changed the implementation slightly so that the 2ms isn't increasing the latest time a publish can be scheduled. It shouldn't matter for anything but very small steps (like the test case is using).

nit: jitter might be a good variable name for the random value.

micrometer-core/src/main/java/io/micrometer/core/instrument/step/StepMeterRegistry.java

jonatan-ivanov · 2023-04-10T22:14:16Z

micrometer-core/src/main/java/io/micrometer/core/instrument/step/StepMeterRegistry.java

+ * This will poll the values from meters, which will cause a roll over for Step-meters
+ * if past the step boundary. This gives some control over when roll over happens
+ * separate from when publishing happens.


I'm thinking if we can get into trouble when the rollover and the publisher schedulers are "too close" to each other and since there is some indeterminism in them (they are not scheduling at the exact time), at one point, the order of the schedules are the opposite so we lose one step worth of data: publish, rollover, rollover, publish.

I'm thinking maybe instead of publishing between 0 and 80% of the step, we should do it between 10% (or 5%?) and 80%.
With 10-80%, this would look like:

static final double PERCENT_RANGE_OF_RANDOM_PUBLISHING_LOW_OFFSET = 0.1; static final double PERCENT_RANGE_OF_RANDOM_PUBLISHING_OFFSET_SIZE = 0.7; // HIGH - LOW = 0.8 - 0.1 = 0.7 long randomOffsetWithinStep = (long) ((stepMillis * random.nextDouble() * PERCENT_RANGE_OF_RANDOM_PUBLISHING_OFFSET_SIZE) + (stepMillis * PERCENT_RANGE_OF_RANDOM_PUBLISHING_LOW_OFFSET));

With a step equal to 60s, this would mean random publishing between 6s (10%) and 48s (80%) in the current step.

Or maybe just a constant 10-100ms is enough instead of the current 2ms. I'm also thinking how stop-the-world/app stalls can mess things up.

I agree that this can be reordered. But is that really a worry? Maybe or May not be.

The calls to poll() of a StepMeter is meant to be idempotent for a give step. We take a snapshot once and use that for calls until next step approach which doesn't care if it is publisher or pollService. The advantage with pollService is that it will make these measurements align more to step i,e since the pollService only rollsOver the step, it is meant to be faster and finish rolling over all the meters within some ms so the data polled during publishing is exactly for 00:00:00.xxx to 00:00:01.yyy. Here xxx the completion of last poll and yyy is the completion of current poll.

Please ignore, if your question is completely different 🙂.

@shakuzen Do you think it makes sense to have StepMeter interface have something like rollOver() method which will specifically do that to make things more explicit? or may be that is too much for what we are doing ?

I talked to @jonatan-ivanov about this a little bit ago. This is fine since polling things multiple times in the same step interval has no effect beyond the first time. It doesn't matter which of the scheduled tasks happens first. We just want polling to happen at the beginning of the step.

Do you think it makes sense to have StepMeter interface have something like rollOver() method which will specifically do that to make things more explicit? or may be that is too much for what we are doing ?

I considered it, but I think it's too much for now. If we see a need for it later we can introduce it, but hopefully it is clear enough now what we are doing and why.

…ep/StepMeterRegistry.java Co-authored-by: Jonatan Ivanov <jonatan.ivanov@gmail.com>

…sh/PushMeterRegistry.java Co-authored-by: Jonatan Ivanov <jonatan.ivanov@gmail.com>

See micrometer-metricsgh-3750

See gh-3750

sonatype-lift bot reviewed Apr 10, 2023

View reviewed changes

micrometer-core/src/main/java/io/micrometer/core/instrument/push/PushMeterRegistry.java Show resolved Hide resolved

sonatype-lift bot reviewed Apr 10, 2023

View reviewed changes

micrometer-core/src/main/java/io/micrometer/core/instrument/step/StepMeterRegistry.java Show resolved Hide resolved

sonatype-lift bot reviewed Apr 10, 2023

View reviewed changes

micrometer-core/src/main/java/io/micrometer/core/instrument/step/StepMeterRegistry.java Show resolved Hide resolved

shakuzen force-pushed the randomize-publish-time branch from 028ed96 to 6fb78b5 Compare April 10, 2023 11:28

lenin-jaganathan reviewed Apr 10, 2023

View reviewed changes

jonatan-ivanov approved these changes Apr 10, 2023

View reviewed changes

shakuzen and others added 4 commits April 11, 2023 12:16

Update micrometer-core/src/main/java/io/micrometer/core/instrument/st…

9356bbf

…ep/StepMeterRegistry.java Co-authored-by: Jonatan Ivanov <jonatan.ivanov@gmail.com>

Update micrometer-core/src/main/java/io/micrometer/core/instrument/pu…

3b2afb4

…sh/PushMeterRegistry.java Co-authored-by: Jonatan Ivanov <jonatan.ivanov@gmail.com>

Polish

5bee281

Polish

874cfa3

shakuzen merged commit c167402 into micrometer-metrics:main Apr 11, 2023

shakuzen deleted the randomize-publish-time branch April 11, 2023 04:34

This was referenced Apr 11, 2023

Publish relative to PushMeterRegistry initialization time and align StepMeter boundaries to that #3450

Closed

DeltaHistogram in SignalFx registry doesn't align with count and total #3774

Closed

lenin-jaganathan mentioned this pull request Apr 30, 2023

Avoid global publishing alignment in OTLP registry #3794

Merged

jonatan-ivanov mentioned this pull request May 4, 2023

Backport fix for partial step values on closing publish #3759

Closed

This was referenced May 11, 2023

"Duplicate call to publish" error since 1.11.0-RC1 #3832

Closed

Current step data is not published on shutdown when publishing is in progress #3846

Closed

lenin-jaganathan mentioned this pull request May 19, 2023

Align DeltaHistogram in SignalFx registry with count and total #3799

Merged

2 tasks

lenin-jaganathan mentioned this pull request May 30, 2023

StepRegistry publish on close before scheduled publish loses previous step data #3863

Closed

izeye mentioned this pull request May 31, 2023

Polish "Start publish at randomized time in next step" #3871

Merged

izeye added a commit to izeye/micrometer that referenced this pull request May 31, 2023

Polish micrometer-metricsgh-3750

c3a8ae7

See micrometer-metricsgh-3750

jonatan-ivanov pushed a commit that referenced this pull request May 31, 2023

Polish gh-3750 (#3871)

cd31d94

See gh-3750

lenin-jaganathan mentioned this pull request Jun 6, 2023

Micrometer with <1.5 seconds step size no longer sends metrics to Influx in version 1.11.0 #3879

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Start publish at randomized time in next step #3750

Start publish at randomized time in next step #3750

shakuzen commented Apr 10, 2023

sonatype-lift bot commented Apr 10, 2023

lenin-jaganathan Apr 10, 2023

shakuzen Apr 10, 2023

lenin-jaganathan Apr 10, 2023

shakuzen Apr 11, 2023

lenin-jaganathan Apr 10, 2023

shakuzen Apr 10, 2023

lenin-jaganathan Apr 10, 2023

jonatan-ivanov Apr 10, 2023 •

edited

Loading

shakuzen Apr 11, 2023

shakuzen Apr 11, 2023

ttddyy Apr 11, 2023

jonatan-ivanov Apr 10, 2023 •

edited

Loading

lenin-jaganathan Apr 11, 2023

shakuzen Apr 11, 2023

Start publish at randomized time in next step #3750

Start publish at randomized time in next step #3750

Conversation

shakuzen commented Apr 10, 2023

sonatype-lift bot commented Apr 10, 2023

🛠 Lift Auto-fix

Footnotes

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jonatan-ivanov Apr 10, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jonatan-ivanov Apr 10, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jonatan-ivanov Apr 10, 2023 •

edited

Loading

jonatan-ivanov Apr 10, 2023 •

edited

Loading