Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Consistently use baseTimeUnit in StatsD #5687

Draft
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

coreyoconnor
Copy link

Draft: Opening for discussion with (incomplete) proposed changes.

There is a baseTimeUnit for Timers that appears to be inconsistently used. Given the definition "The base time unit of the timer to which all published metrics will be scaled" I'd expect that a value, eg 2000 sec, reported as the value 2000 with the unit seconds would be emitted to statsd as the value in the base time units, 2000 in this example. However, the actual value is 2,000,000. Which is the corresponding millisecond value for the seconds expected to be recorded.

As that explainer above is a challenge to read: I've included some (untested or even compiled) changes to illustrate.

@pivotal-cla
Copy link

@coreyoconnor Please sign the Contributor License Agreement!

Click here to manually synchronize the status of this Pull Request.

See the FAQ for frequently asked questions.

@pivotal-cla
Copy link

@coreyoconnor Thank you for signing the Contributor License Agreement!

@jonatan-ivanov
Copy link
Member

jonatan-ivanov commented Nov 15, 2024

I think there might be a misunderstanding since Timers does not really have a time unit from user's perspective.
(They have a unit under the hood: always nanoseconds but that is not available for the users, they don't need to know about this internal detail).

The expectation is that a recorded value is converted to the unit of the registry before publication. When you record a value using a timer and a unit, the unit you used for the recording has nothing to do with the unit used for publishing. (Btw we don't recommend measuring and recording duration like that, but we recommend using Timer.Sample, see the docs.)

This means that if the unit of the registry is milliseconds, it does not matter if you record in seconds, millis or micros, everything will be converted and published in millis.
StatsD happens to use milliseconds by default:

@Override
protected TimeUnit getBaseTimeUnit() {
return TimeUnit.MILLISECONDS;
}

So using your example, if you do this:

timer.record(2_000, TimeUnit.SECONDS);  // about 30 min

It will be (recorded as nanos under the hood and) published in the unit of the registry. Since StatsD uses milliseconds as the unit, you will see 2,000,000ms when published.

This is the expected behavior which is also guarded by tests and that's why that your change in AbstractTimer fails the tests. If you want to override the time unit of a registry, you can override the getBaseTimeUnit method I linked above but based on your diff it seems that millis is hardcoded in a few places in case of StatsD which we can fix but is there any backend that can use StatsD with a non-milliseconds unit?

@jonatan-ivanov jonatan-ivanov added the waiting for feedback We need additional information before we can continue label Nov 15, 2024
@coreyoconnor
Copy link
Author

Starting from the end as that will explain things:

If you want to override the time unit of a registry, you can override the getBaseTimeUnit method I linked above but based on your diff it seems that millis is hardcoded in a few places in case of StatsD which we can fix

indeed! I am using with getBaseTimeUnit overriden in the registery. Which I would expect to override (ha!) the time unit used. As you can tell: no such luck.

But what backend is this for? Datadog

There are several other issues where users have encountered the same unexpected behavior from datadog. I can pull these up later.

Specifically: Datadog does not support units for timers. Timers are unit-less gauges.

So, setting the units has no effect - datadog is going to accept them as unit-less and presumes you've pre-scaled them to whatever you want or manually changed the metric config in the UI.

Obviously, changing the metric config in datadog for every timer after they've been submitted is a non-starter.

So what to do? Seems to me like overriding the unit should do what is documented as: Overrides the unit used. "The base time unit of the timer to which all published metrics will be scaled". Which, if it was true, would do what is expected for datadog.

So even if there is no room for adjusting micrometer to scale the values nicely: The documentation is a bit misleading no?

@jonatan-ivanov
Copy link
Member

As I said above, it seems that "millis" is hardcoded in a few places in case of StatsD which we can fix. Please remove your change from AbstractTimer and replace hardcoded TimeUnit.MILLISECONDS to baseTimeUnit(). I think doing that should make you able to override the base unit. (I can help to write a tests for this use-case.)

Could you please also run ./gradlew format and ./gradlew build to see if your changes break the build or not? (Right now they do.)

The documentation is a bit misleading no?

If I understand this correctly, I don't think the documentation is misleading, I think the docs are ok but we have a bug: I agree with the intention of the docs but the current behavior does not follow that. :) After fixing the issue, and you still think the docs are misleading, please open an issue/PR.

@jonatan-ivanov jonatan-ivanov changed the title fix: consistently use baseTimeUnit Consistently use baseTimeUnit in StatsD Nov 25, 2024
@jonatan-ivanov jonatan-ivanov added bug A general bug registry: statsd A StatsD Registry related issue and removed waiting for feedback We need additional information before we can continue labels Nov 25, 2024
@jonatan-ivanov jonatan-ivanov added this to the 1.13.9 milestone Nov 25, 2024
Copy link
Member

@jonatan-ivanov jonatan-ivanov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please also fix StatsdFunctionTimer and StatsdLongTaskTimer?

@@ -258,7 +258,7 @@ public void record(Runnable f) {
@Override
public final void record(long amount, TimeUnit unit) {
if (amount >= 0) {
histogram.recordLong(TimeUnit.NANOSECONDS.convert(amount, unit));
histogram.recordLong(baseTimeUnit().convert(amount, unit));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please remove this change, this should be recorded as nanos and converted to the right unit at publish-time.

@@ -43,9 +43,9 @@ public class StatsdTimer extends AbstractTimer {

StatsdTimer(Id id, StatsdLineBuilder lineBuilder, FluxSink<String> sink, Clock clock,
DistributionStatisticConfig distributionStatisticConfig, PauseDetector pauseDetector, TimeUnit baseTimeUnit,
long stepMillis) {
long stepBaseUnits) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
long stepBaseUnits) {
long amount) {

super(id, clock, distributionStatisticConfig, pauseDetector, baseTimeUnit, false);
this.max = new StepDouble(clock, stepMillis);
this.max = new StepDouble(clock, stepBaseUnits);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
this.max = new StepDouble(clock, stepBaseUnits);
this.max = new StepDouble(clock, amount);

@coreyoconnor
Copy link
Author

Yes, certainly: I will resolve the tests. As well as the other issues. Where would be good to document this aspect of datadog? At least enough to point somebody in the right direction. The statsd registry or config?

As you mentioned statsd works in milliseconds. So, Datadog is an exceptional case then? I'm not familiar with the standards but sounds like it! In that case, should this be a datadog specific config somehow? Or is the override of the base time unit sufficient to capture the exceptional aspect?

for reference, the related motivating tickets.

These are duplicates - just adding here for completeness

discussion about the the unitless of timers:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug A general bug registry: statsd A StatsD Registry related issue
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants