Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bugfix]: Fix histogram merge #324

Merged
merged 3 commits into from
May 14, 2024

Conversation

wildum
Copy link
Contributor

@wildum wildum commented Apr 11, 2024

The current implementation is not merging the histogram correctly. The sum is calculated as follow:

count * (existingMean + currentMean) / 2

Now let's say that the existing histogram has a count of 1000 and an existing mean of 10000, the incoming histogram has a count of 100 and a mean of 1000.

newCount = 1000 + 100 = 1100
newMean = (10000 + 1000) / 2 = 5500
newSum = 1100 * newMean = 6_050_000

Now the next incoming histogram has a count of 1000 and an existing mean of 100:

newCount = 1100 + 1000 = 2100
newMean = (5500 +100) / 2 = 2800
newSum = 2100 * 2800 = 5_880_000

By definition, the sum of the histogram is supposed to behave like a counter (A counter is a cumulative metric that represents a single monotonically increasing counter whose value can only increase or be reset to zero on restart.)

In the example above, the sum decreased (5_880_000 < 6_050_000) which is not something that should happen.

You can see it happening with real values:

image

With this PR, the computation of the sum changes to the following:

sum = existingSum + (currentMean * count)

Taking the previous example, we would have:

previousSum = 10_000_000
newSum = previousSum * (1000 * 100) = 10_100_000

previousSum = 10_100_000
newSum = previousSum * (100 * 1000) = 10_200_000

The sum is increasing and the computation is correct: (the sum of 100 buckets with a mean value of 1000 should be equal to the sum of 1000 buckets with a mean value of 100)

kgeckhart and others added 3 commits April 11, 2024 12:15
…rying to merge two means

Signed-off-by: William Dumont <william.dumont@grafana.com>
Signed-off-by: William Dumont <william.dumont@grafana.com>
Signed-off-by: William Dumont <william.dumont@grafana.com>
@enisoc
Copy link

enisoc commented May 14, 2024

This looks like it will also fix a separate bug that I came here to report:

We've observed that the exported histogram _count was wrong when delta aggregation is turned on. The _count should be equal to the _bucket{le="+Inf"} value (which was itself correct), but instead the _count value was equal to the sum of all bucket counts.

I think that bug is due to this code:

var count uint64
for _, v := range current.Buckets {
count += v
}

This PR replaces that code entirely, and the new code looks correct:

	// Increment totals based on incoming totals
	h.Sum += other.Sum
	h.Count += other.Count

Copy link
Contributor

@SuperQ SuperQ left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the long review delay, this looks good.

Thanks!

@SuperQ SuperQ merged commit f002db5 into prometheus-community:master May 14, 2024
3 checks passed
SuperQ added a commit that referenced this pull request May 15, 2024
* [BUGFIX] Fix histogram merge #324

Signed-off-by: SuperQ <superq@gmail.com>
@SuperQ SuperQ mentioned this pull request May 15, 2024
SuperQ added a commit that referenced this pull request May 15, 2024
* [BUGFIX] Fix histogram merge #324

Signed-off-by: SuperQ <superq@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants