-
Notifications
You must be signed in to change notification settings - Fork 981
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Start publish at randomized time in next step #3750
Start publish at randomized time in next step #3750
Conversation
micrometer-core/src/main/java/io/micrometer/core/instrument/push/PushMeterRegistry.java
Show resolved
Hide resolved
micrometer-core/src/main/java/io/micrometer/core/instrument/step/StepMeterRegistry.java
Show resolved
Hide resolved
micrometer-core/src/main/java/io/micrometer/core/instrument/step/StepMeterRegistry.java
Show resolved
Hide resolved
🛠 Lift Auto-fixSome of the Lift findings in this PR can be automatically fixed. You can download and apply these changes in your local project directory of your branch to review the suggestions before committing.1 # Download the patch
curl https://lift.sonatype.com/api/patch/github.com/micrometer-metrics/micrometer/3750.diff -o lift-autofixes.diff
# Apply the patch with git
git apply lift-autofixes.diff
# Review the changes
git diff Want it all in a single command? Open a terminal in your project's directory and copy and paste the following command: curl https://lift.sonatype.com/api/patch/github.com/micrometer-metrics/micrometer/3750.diff | git apply Once you're satisfied, commit and push your changes in your project. Footnotes |
This avoids different instances calling publish at the same time. It will avoid scheduling publishing at the last 20% of a step interval to avoid the issue described in micrometer-metricsgh-1218 where publishing could take long enough to finish in a subsequent step after it started. An additional background thread is added for StepMeterRegistry that will poll all meters at the start of the step, separate from when publishing happens. This is so the values published are the ones recorded in the previous step. We need rollCount to be called for step meters as early as possible in the step to minimize the number of recordings reported for the wrong step. This is not perfect, but it is no worse than the status quo.
028ed96
to
6fb78b5
Compare
long stepMillis = config.step().toMillis(); | ||
long initialDelayMillis = stepMillis - (clock.wallTime() % stepMillis) + 1; | ||
long initialDelayMillis = calculateInitialDelay(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This in a way enforces registries to complete the publish() in 20% of the step time. And with the issues referenced in #3717, this could be a serious issue for users, unless there is an option to preserve the existing behavior.
If the step registries can publish within 20% of the step time, then I believe that alone would ensure that we are reading the data for right step.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you aren't publishing within 20% of the step time, that's probably a separate issue to look into. I would think there must be something wrong with the registry implementation, high cardinality, not enough CPU/memory, or not enough resources for your metrics backend. Prior to Micrometer 1.7, we published at any arbitrary time in the step, and I don't think anyone reported publishing taking so long. If publishing takes a long time, I don't think the answer is to publish at the beginning of the step to give the most time possible. That just exacerbates the issue because it will slow down shared resources (network, metrics backend) when all of your instances try to use them at the same time.
If the step registries can publish within 20% of the step time, then I believe that alone would ensure that we are reading the data for right step.
Not if we're scheduling at the end of the step. That's the whole point. Registries still need to publish in a reasonable amount of time, and we need to look into cases where that isn't happening.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not if we're scheduling at the end of the step.
I agree with this. But, what I meant is if 20% is reasonable enough time to publish we can do the publishing anytime. But, yeah that will bring the issue of Step's not reporting data for the particular step. Fair enough.
But, I would probably have some config to have the old behaviour for at least next 1 or 2 versions. I will leave you guys to decide on that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We try very hard to not break compatibility, which means it is very hard for us to remove something once exposed as public API. This is one of the reasons I try very hard to not make things configurable that in theory we believe users should not need to adjust. I would like to get more feedback on this with RC1 and we can consider making it configurable in GA or the next release if there's a need. Let's look into it more.
// VisibleForTesting | ||
void pollMetersToRollover() { | ||
this.getMeters() | ||
.forEach(m -> m.match(gauge -> null, Counter::count, Timer::count, DistributionSummary::count, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see the point in just calling count on timers will be faster, but I also could see this can get trickier where the timer snapshots(max/histograms especially if things are step based) will not be in sync with the counts and totals.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I might be missing the point you're making. Max and histograms rotate on writing and reading, so they don't have the same need for this as count/total do. The max and histograms rotation are not aligned to the epoch step anyway.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, they are. Oh, wait this is on StepRegistry which should be fine. Thanks for the clarification. I might need to do some testing with the case of SignalFxTimer which is an abstract one in the Step registry. I will get back to that.
micrometer-core/src/main/java/io/micrometer/core/instrument/push/PushMeterRegistry.java
Outdated
Show resolved
Hide resolved
long randomOffsetWithinStep = (long) (stepMillis * random.nextDouble() | ||
* PERCENT_RANGE_OF_RANDOM_PUBLISHING_OFFSET); | ||
long offsetToStartOfNextStep = stepMillis - clock.wallTime() % stepMillis; | ||
return offsetToStartOfNextStep + 2 + randomOffsetWithinStep; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we add a comment that says something like:
Wait till the next step + a little more randomly somewhere between 0 and 80% of the step interval + 2ms?
Btw why 2ms? Because the rollover is +1?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why 2ms? Because the rollover is +1?
Yes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added a polish commit with a comment explaining. I also changed the implementation slightly so that the 2ms isn't increasing the latest time a publish can be scheduled. It shouldn't matter for anything but very small steps (like the test case is using).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: jitter might be a good variable name for the random value.
micrometer-core/src/main/java/io/micrometer/core/instrument/step/StepMeterRegistry.java
Outdated
Show resolved
Hide resolved
* This will poll the values from meters, which will cause a roll over for Step-meters | ||
* if past the step boundary. This gives some control over when roll over happens | ||
* separate from when publishing happens. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm thinking if we can get into trouble when the rollover and the publisher schedulers are "too close" to each other and since there is some indeterminism in them (they are not scheduling at the exact time), at one point, the order of the schedules are the opposite so we lose one step worth of data: publish, rollover, rollover, publish.
I'm thinking maybe instead of publishing between 0 and 80% of the step, we should do it between 10% (or 5%?) and 80%.
With 10-80%, this would look like:
static final double PERCENT_RANGE_OF_RANDOM_PUBLISHING_LOW_OFFSET = 0.1;
static final double PERCENT_RANGE_OF_RANDOM_PUBLISHING_OFFSET_SIZE = 0.7; // HIGH - LOW = 0.8 - 0.1 = 0.7
long randomOffsetWithinStep = (long) ((stepMillis * random.nextDouble() * PERCENT_RANGE_OF_RANDOM_PUBLISHING_OFFSET_SIZE) + (stepMillis * PERCENT_RANGE_OF_RANDOM_PUBLISHING_LOW_OFFSET));
With a step equal to 60s, this would mean random publishing between 6s (10%) and 48s (80%) in the current step.
Or maybe just a constant 10-100ms is enough instead of the current 2ms. I'm also thinking how stop-the-world/app stalls can mess things up.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that this can be reordered. But is that really a worry? Maybe or May not be.
The calls to poll() of a StepMeter is meant to be idempotent for a give step. We take a snapshot once and use that for calls until next step approach which doesn't care if it is publisher or pollService. The advantage with pollService is that it will make these measurements align more to step i,e since the pollService only rollsOver the step, it is meant to be faster and finish rolling over all the meters within some ms so the data polled during publishing is exactly for 00:00:00.xxx to 00:00:01.yyy. Here xxx the completion of last poll and yyy is the completion of current poll.
Please ignore, if your question is completely different 🙂.
@shakuzen Do you think it makes sense to have StepMeter interface have something like rollOver() method which will specifically do that to make things more explicit? or may be that is too much for what we are doing ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I talked to @jonatan-ivanov about this a little bit ago. This is fine since polling things multiple times in the same step interval has no effect beyond the first time. It doesn't matter which of the scheduled tasks happens first. We just want polling to happen at the beginning of the step.
Do you think it makes sense to have StepMeter interface have something like rollOver() method which will specifically do that to make things more explicit? or may be that is too much for what we are doing ?
I considered it, but I think it's too much for now. If we see a need for it later we can introduce it, but hopefully it is clear enough now what we are doing and why.
…ep/StepMeterRegistry.java Co-authored-by: Jonatan Ivanov <jonatan.ivanov@gmail.com>
…sh/PushMeterRegistry.java Co-authored-by: Jonatan Ivanov <jonatan.ivanov@gmail.com>
This avoids different instances calling publish at the same time by scheduling them at a time randomly chosen within the next step upon start being called. It will avoid scheduling publishing at the last 20% of a step interval to avoid the issue described in gh-1218 where publishing could take long enough to finish in a subsequent step after it started. An additional background thread is added for StepMeterRegistry that will poll all meters at the start of the step, separate from when publishing happens. This is so the values published are the ones recorded in the previous step. We need rollCount to be called for step meters as early as possible in the step to minimize the number of recordings reported for the wrong step. This is not perfect, but it is no worse than the status quo.
Resolves gh-2818