Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pkg/otlp/metrics: fix otel.* namespaced metrics #142

Closed
wants to merge 3 commits into from

Conversation

gbbr
Copy link
Contributor

@gbbr gbbr commented Aug 2, 2023

This change fixes a problem where sytem.* and process.* metrics were correctly namespaced with otel.*, but the original metric was not kept.

This change fixes a problem where sytem.* and process.* metrics were
correctly namespaced with otel.*, but the original metric was not kept.
@gbbr gbbr requested a review from a team as a code owner August 2, 2023 14:19
@gbbr gbbr requested a review from songy23 August 2, 2023 14:19
@@ -70,7 +70,9 @@ func remapSystemMetrics(all pmetric.MetricSlice, m pmetric.Metric) {
copyMetric(all, m, "system.disk.in_use", 1)
}
// process.* and system.* metrics need to be prepended with the otel.* namespace
m.SetName("otel." + m.Name())
newm := all.AppendEmpty()
Copy link
Member

@songy23 songy23 Aug 2, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Chatted with @liustanley , only the metrics from the previous extractSystemMetrics should be kept: https://github.com/open-telemetry/opentelemetry-collector-contrib/pull/23445/files#diff-59a3a2cae4395f15cad4800806f9dc7680aed0f303dd896c0ca8663e0fd0ac35L56

For process.* metrics, only the ones with otel.* prefix are expected, original ones are expected to be overriden.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry but I don't see the difference between this implementation and the one in the linked code. All metrics that start with system.* and process.* seem to get the prefix. I don't see any special case for process.*

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both of system.* and process.* metrics get the otel. prefix - that's right. In addition to that, the original system.* metrics (no otel. prefix) should be preserved / copied, while process.* metrics don't.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand, but I don't see that in the old code. The behaviour seems the same for both process and system

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Either way, in my latest commit, I think that I have applied what you meant. PTAL

@songy23 songy23 requested a review from liustanley August 2, 2023 15:21
@gbbr gbbr requested a review from songy23 August 4, 2023 12:41
@@ -69,7 +69,12 @@ func remapSystemMetrics(all pmetric.MetricSlice, m pmetric.Metric) {
case "system.filesystem.utilization":
copyMetric(all, m, "system.disk.in_use", 1)
}
// process.* and system.* metrics need to be prepended with the otel.* namespace
if strings.HasPrefix(name, "system.") {
// we keep the original metric for system.* metrics
Copy link
Member

@songy23 songy23 Aug 4, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I looked again at the old code, I think the behavior is still different: we prepend otel. to the original names, but we keep the remapped names with no otel. prefix. https://github.com/open-telemetry/opentelemetry-collector-contrib/pull/23445/files#diff-59a3a2cae4395f15cad4800806f9dc7680aed0f303dd896c0ca8663e0fd0ac35L115-L129

For example, in the old code:

  • if we have "system.filesystem.utilization", first we remap it to "system.disk.in_use" and append the new one to slice
  • then we prepend otel. to "system.filesystem.utilization" but not
    "system.disk.in_use"
  • in the end we got "otel.system.filesystem.utilization" and "system.disk.in_use"

vs.

  • Before this PR, in the end we got "otel.system.filesystem.utilization" and "otel.system.disk.in_use"
  • With the current PR, in the end we got "system.filesystem.utilization", "system.disk.in_use", "otel.system.filesystem.utilization" and "otel.system.disk.in_use"

Both of ^ are wrong. In short, we prepend otel. to the original metrics, but don't prepend it to the remapped metrics.

Sorry for the confusion, I was also trying to understand the old behavior.

Copy link
Contributor Author

@gbbr gbbr Aug 7, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you are misunderstanding the current PR. It works exactly like you expect (in your "old code" example above). I have simplified the tests for you to make it easier for you to understand. Feel free to set up a call if it's still not clear.

Now, in the tests:

  • The in value contains the original metric
  • The out value contains any new (!) metrics added. In the case of system.* metrics you will notice there is a copy of the original.
  • The test asserts that the original metric (from in, not out) is prepended with otel.* when needed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I'm not mistaken: using the same example, now after remapMetrics we're getting 3 metrics: "otel.system.filesystem.utilization", "system.disk.in_use" and "system.filesystem.utilization"?

If so, the last one "system.filesystem.utilization" shouldn't be there

in: metric("system.cpu.load_average.1m", point{f: 1}),
out: []pmetric.Metric{
metric("system.load.1", point{f: 1}),
metric("system.cpu.load_average.1m", point{f: 1}),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't this be otel.system.cpu.load_average.1m ?

require.Equal(t, out, dest.At(dest.Len()-len(tt.out)+i))
}
t.Run("", func(t *testing.T) {
dest := pmetric.NewMetricSlice()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be more clear if in is in dest which is consistent with:

@@ -69,7 +69,12 @@ func remapSystemMetrics(all pmetric.MetricSlice, m pmetric.Metric) {
case "system.filesystem.utilization":
copyMetric(all, m, "system.disk.in_use", 1)
}
// process.* and system.* metrics need to be prepended with the otel.* namespace
if strings.HasPrefix(name, "system.") {
// we keep the original metric for system.* metrics
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I'm not mistaken: using the same example, now after remapMetrics we're getting 3 metrics: "otel.system.filesystem.utilization", "system.disk.in_use" and "system.filesystem.utilization"?

If so, the last one "system.filesystem.utilization" shouldn't be there

},
},
{
in: metric("system.filesystem.utilization", point{f: 15}),
out: []pmetric.Metric{metric("system.disk.in_use", point{f: 15})},
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmm I feel I'm getting more confused now - looks like the previous test is doing the the right thing, but how comes the resulting metric in datadog exporter got an otel. prefix? open-telemetry/opentelemetry-collector-contrib#23445 (comment)

@songy23
Copy link
Member

songy23 commented Aug 8, 2023

I finally found the root cause - the bug isn't in this file metrics_remapping.go, but in metrics_translator.go:

for k := 0; k < metricsArray.Len(); k++ {
md := metricsArray.At(k)
if v, ok := runtimeMetricsMappings[md.Name()]; ok {
metadata.Languages = extractLanguageTag(md.Name(), metadata.Languages)
for _, mp := range v {
if mp.attributes == nil {
// duplicate runtime metrics as Datadog runtime metrics
cp := metricsArray.AppendEmpty()
md.CopyTo(cp)
cp.SetName(mp.mappedName)
break
}
if md.Type() == pmetric.MetricTypeSum {
mapSumRuntimeMetricWithAttributes(md, metricsArray, mp)
} else if md.Type() == pmetric.MetricTypeGauge {
mapGaugeRuntimeMetricWithAttributes(md, metricsArray, mp)
} else if md.Type() == pmetric.MetricTypeHistogram {
mapHistogramRuntimeMetricWithAttributes(md, metricsArray, mp)
}
}
}
if t.cfg.withRemapping {
remapMetrics(metricsArray, md)
}

The for loop above is unfortunately an anti-pattern: you are appending more elements to metricsArray while iterating over it. For example,

  1. say initially metricsArray == [system.filesystem.utilization]
  2. in the first iteration, you called remapMetrics so system.disk.in_use gets appended to metricsArray, and system.filesystem.utilization gets renamed to otel.system.filesystem.utilization. Now metricsArray == [otel.system.filesystem.utilization, system.disk.in_use]. If we stop here, everything is correct
  3. however the outer for-loop will continue, so in the second iteration, you called remapMetrics again on system.disk.in_use so it gets prepended with otel. too. Now metricsArray == [otel.system.filesystem.utilization, otel.system.disk.in_use]. The second metrics is now incorrect.

Sent out #148 to fix.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants