Publish relative to PushMeterRegistry initialization time and align StepMeter boundaries to that #3450
56c03c2 to 14cd011
This looks good, except the initial scheduling delay doesn't look right to me, and I would like to add tests to verify that logic. I will try to improve the naming of some things. I would also like to switch this new behavior to be the default, with a config option to go back to the previous behavior. I will follow up with commits for that, and I'll rebase against and target the PR at 1.8.x.
Thank you for the contribution.
micrometer-core/src/main/java/io/micrometer/core/instrument/push/PushMeterRegistry.java
@@ -77,12 +80,18 @@ public void start(ThreadFactory threadFactory) {
             scheduledExecutorService = Executors.newSingleThreadScheduledExecutor(threadFactory);
             // time publication to happen just after StepValue finishes the step
             long stepMillis = config.step().toMillis();
-            long initialDelayMillis = stepMillis - (clock.wallTime() % stepMillis) + 1;
+            long initialDelayMillis = config.publishAtStep() ? stepMillis - (clock.wallTime() % stepMillis) + 1
+                    : (stepMillis - registryStartMillis) + 1;
This doesn't look correct to me. Since the meters' step will be shifted by registryStartMillis, shouldn't the initial delay be stepMillis + 1 to publish after one step interval, regardless of the epoch interval? Actually, it is more complicated than that for the case when the registry is stopped and started, or if at least a millisecond has elapsed between initialization and scheduling. For a generic start time, I think it would be:
(edit: updated) stepMillis - ((clock.wallTime() - registryStartMillis) % stepMillis) + 1
Except this also doesn't work if the registry is restarted in between an epoch step boundary and a meter step boundary: in that case it will be stepMillis too long. There is probably a simpler way to capture that in an expression that I'm overlooking, but I have a commit that works per my testing.
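For readers following the arithmetic, here is a minimal sketch of the "generic start time" expression from the comment above, written as a standalone helper. The class and method names are hypothetical, and this is not the code that was committed. It also assumes that registryStartMillis is an absolute wall-clock timestamp, and it uses Math.floorMod instead of % so the remainder stays non-negative. It does not attempt to handle the restart edge case described in the comment.

```java
/** Illustrative only; hypothetical names, not Micrometer's actual implementation. */
final class InitialDelaySketch {

    /**
     * Milliseconds to wait so the first publish lands 1 ms after the next
     * meter step boundary, assuming boundaries fall at
     * registryStartMillis + n * stepMillis.
     */
    static long initialDelayMillis(long wallTimeMillis, long registryStartMillis, long stepMillis) {
        // How far we already are into the current registry-relative step.
        // floorMod keeps the result in [0, stepMillis) even for a negative difference.
        long elapsedInStep = Math.floorMod(wallTimeMillis - registryStartMillis, stepMillis);
        return stepMillis - elapsedInStep + 1;
    }

    public static void main(String[] args) {
        long step = 60_000;
        long start = 10_250;
        // Scheduled in the same millisecond the registry was created.
        System.out.println(initialDelayMillis(start, start, step));
        // Scheduled 45 seconds after the registry was created.
        System.out.println(initialDelayMillis(start + 45_000, start, step));
    }
}
```

With a 60-second step, scheduling in the same millisecond as initialization gives stepMillis + 1, which matches the first suggestion in the comment. Scheduling later shortens the delay by the time already elapsed in the step.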
After flipping the default, there is a test failure. I've run out of time to dig into it, and I'll be on vacation next week. I can come back to this once I'm back from vacation, but feel free to look into the failure in the meantime.
@shakuzen If the default behavior is made to publish relative to the start time, a few tests will fail because of the way they were written. I will try to address those tests.
@shakuzen Addressed the test failures in the SignalFx library.
Added an additional commit to make the metrics timestamp synchronized across instances (report the timestamp as the start of the current step) for the SignalFx meter registry.
ca88535 to 6706f5f
@@ -104,7 +104,7 @@ else if ("https".equals(apiUri.getScheme())) {

     @Override
     protected void publish() {
-        final long timestamp = clock.wallTime();
+        final long timestamp = (clock.wallTime() / config.step().toMillis()) * config.step().toMillis();
This will be the epoch-based step boundary. Depending on the configuration, that will not be correct (such as the default configuration). I have a fix for it locally I will push.
After thinking more about it, I'm wondering why we would change this. Before this change it is sending the timestamp as the time when publishing starts. Why would we change it to the start of the step with this change? That seems like a separate decision to make not directly related to the change of this pull request. I think I'll revert this change. Please open a separate issue or pull request to consider making this change.
Earlier, the reporting was done at the top of the step (for simplicity, let's assume a 1-minute step). If I understand correctly, all the instances would report nearly the same timestamp (i.e., the start of the minute). With this change, the timestamps will be de-synchronized and will no longer be the start of the minute. What I tried to do was bring the timestamp back to the top of the minute.
Because when sending data from multiple instances without instanceId (or a unique identifier for an instance), the back-end of SignalFx might be able to calculate which step a particular metric is reported at. Having said that, I will follow up on this with a different PR after confirming this with someone from SignalFx.
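To make the two notions of "start of the step" in this thread concrete, here is a small sketch. The class and method names are hypothetical. epochStepStart reproduces the flooring expression from the diff above, while offsetStepStart assumes step boundaries are shifted by an offset in the range [0, stepMillis), such as the registry start time modulo the step. Neither is presented as what the SignalFx registry does.

```java
/** Illustrative only; hypothetical helpers, not part of Micrometer. */
final class StepStartSketch {

    /** Start of the epoch-aligned step containing wallTimeMillis (same arithmetic as the diff). */
    static long epochStepStart(long wallTimeMillis, long stepMillis) {
        return (wallTimeMillis / stepMillis) * stepMillis;
    }

    /**
     * Start of the step containing wallTimeMillis when boundaries fall at
     * offsetMillis + n * stepMillis (an assumed model of registry-relative steps).
     */
    static long offsetStepStart(long wallTimeMillis, long stepMillis, long offsetMillis) {
        // floorDiv rounds toward negative infinity, so times before the offset still map correctly.
        return Math.floorDiv(wallTimeMillis - offsetMillis, stepMillis) * stepMillis + offsetMillis;
    }

    public static void main(String[] args) {
        long step = 60_000;
        long offset = 10_250; // hypothetical: registry start time modulo the step
        long now = 125_000;
        System.out.println(epochStepStart(now, step));
        System.out.println(offsetStepStart(now, step, offset));
    }
}
```

For the same wall time the two functions generally return different timestamps, which is the mismatch pointed out in the first review comment on this hunk.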
@@ -457,13 +456,13 @@ void shouldNotExportCumulativeHistogramDataByDefault_Timer() {
         mockClock.add(config.step().minus(Duration.ofMillis(1)));

         assertThat(getDataPoints(registry, mockClock.wallTime())).hasSize(8)
-                .has(gaugePoint("my.timer.avg", 0.2525), atIndex(0)).has(counterPoint("my.timer.count", 2), atIndex(1))
+                .has(gaugePoint("my.timer.avg", 2.525), atIndex(0)).has(counterPoint("my.timer.count", 2), atIndex(1))
My review comment about investigating the cause of the failures meant understanding exactly how the behavior changed and making sure the change and the test made sense. This appears to change the values so the test passes, but I don't think it demonstrates what the test intended. We need to be clear about what changed and whether that's acceptable. I'll write up a summary in another comment.
When investigating this, I found a bug in the assertion logic for hasValue which caused it to always return true for double values that would have an int value of 0 (like 0.2525). I'll fix that in a separate issue and rebase this pull request on it.
As you pointed out, hasValue had a bug all along, and the tests would have failed had it been correct. The other issue is with the TimeWindowFixedBoundaryHistogram and its buffer length. That is the whole reason the SignalFx registry went for a delta histogram implementation.
I've changed the tests back to assert what they did before as I think that's what the intention of the tests was. I've had to change the test to record later than the beginning of the step to avoid the histogram counts being rotated out when checking the published data. That whole problem is a preexisting issue that was worked around in the tests by checking the published data at a specific millisecond before the histogram would rotate but after the step-based values (count/sum/avg) would return the desired new values. We can consider if the behavior of histogram rotation is best or not separately from this pull request.
Thanks.
bafeb45 to ee2afe9
@shakuzen Do you want me to address anything more on this PR?
     * @return how many milliseconds publishing is offset from Unix Epoch-based step
     * intervals
     */
    protected long getPushOffsetFromEpochStepMillis() {
I changed the name of this method and added a JavaDoc in hopes of making things more clear.
     * @return false if publishing should be scheduled relative to registry instantiation
     * time. Default is {@code false} to avoid the documented resource exhaustion issue.
     */
    default boolean isPushAlignedGlobally() {
We could probably spend endless time coming up with better and better names, but here's a third iteration where I've avoided using "epoch" in the name because I think it will be less clear what this does compared to focusing on the effect of pushes being aligned globally if you set this to true. Anyone who has feedback about the naming, feel free to share (before we release it, ideally).
I've updated the JavaDoc to explain the issue that motivated this configuration being introduced and the reason for its default.
I don't think so. I'm asking for some review internally. Once that is done, we should be good to go. If you have any thoughts on naming as I've recently updated, please share. Or if you have thoughts on the below. On how we'll apply this where, we'll keep the default for the configuration
The publishing should not happen relative to the epoch step with publishAtStep set to false. Rather it should happen relative to the meters' step, which is offset from the epoch steps by `registryStartMillis`.
We know the previous behavior causes issues with many instances running, which makes it not a good default. We will leave this configuration for now so that users have a way to opt out of the new behavior if it causes unforeseen issues for them. Adds a warning to the JavaDoc about the side effect of aligning to the Epoch.
The previous test was flaky and relied on system-clock timings. This is replaced with a more direct test of calculating the delay by extracting the logic to a package-private method.
We can consider such a change separate from this changeset.
Record in the middle of the step to avoid histogram counts being rotated out by the time of the simulated publication in the affected tests.
This updates names to hopefully make more clear their usage, while also polishing the JavaDocs to give more background.
8f4fa54 to 3752129
Putting this in draft mode. Will close this PR in a week. For additional info, see the linked issue for why this fix might not be desired.
Fixed by #3750
This PR tries to solve issue #2818. The start time is captured when the push meter registry is initialized. For step-based meters, this is passed to the StepValue and StepTuple to align them to the registry start time and not the step start time (see #1218).
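As a rough illustration of what aligning a step-based value to the registry start time means, here is a deliberately simplified, hypothetical counter. It is not Micrometer's StepValue, and it is not thread-safe. It assumes step boundaries fall at registryStartMillis + n * stepMillis, and it takes the current time as a parameter instead of reading a clock.

```java
/** Illustrative only; a hypothetical stand-in for a step-based value, not Micrometer's StepValue. */
final class OffsetStepCounterSketch {

    private final long stepMillis;
    private final long offsetMillis;   // registry start time modulo the step
    private long currentStepIndex;     // index of the step currently being accumulated
    private double current;            // sum recorded in the current step
    private double previous;           // sum of the last completed step (0 if it was skipped)

    OffsetStepCounterSketch(long stepMillis, long registryStartMillis) {
        this.stepMillis = stepMillis;
        this.offsetMillis = Math.floorMod(registryStartMillis, stepMillis);
        this.currentStepIndex = Math.floorDiv(registryStartMillis - offsetMillis, stepMillis);
    }

    void increment(long nowMillis, double amount) {
        rollIfNeeded(nowMillis);
        current += amount;
    }

    /** Returns the total of the most recently completed registry-relative step. */
    double poll(long nowMillis) {
        rollIfNeeded(nowMillis);
        return previous;
    }

    private void rollIfNeeded(long nowMillis) {
        long index = Math.floorDiv(nowMillis - offsetMillis, stepMillis);
        if (index > currentStepIndex) {
            // Only the immediately preceding step carries over; older data is dropped.
            previous = (index == currentStepIndex + 1) ? current : 0;
            current = 0;
            currentStepIndex = index;
        }
    }

    public static void main(String[] args) {
        OffsetStepCounterSketch counter = new OffsetStepCounterSketch(60_000, 10_250);
        counter.increment(20_000, 3.0);
        System.out.println(counter.poll(60_001)); // epoch boundary passed, registry step still open
        System.out.println(counter.poll(70_250)); // registry-relative boundary reached
    }
}
```

In this model, crossing an epoch-aligned minute boundary does not roll the value over. The rollover happens one full step after the registry start time, so instances started at different times roll over at different moments.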
TODO:
1.8.x
alignToEpoch
config flipped