[processor/spanmetrics] Support for specifying Resource Attributes on metrics generated from traces #6717
Conversation
```diff
@@ -5,12 +5,16 @@
 ## 🛑 Breaking changes 🛑

 - `memcachedreceiver`: Update metric names (#6594)
+- `spanmetricsprocessor`: `service.name` is added as a default resource attribute instead of an attribute on generated metrics (#6717)
```
I don't think you need to update the changelog, the maintainers will do this upon a new release
I saw in Zhihao's PR that there was a changelog added (see comment #6503 (review)). @MovieStoreGuy can you confirm whether we update the changelog or the maintainers do?
I think it is fine for us to update the changelog, we have the greatest overall idea of what has changed so I think it makes sense to add it. However, If a maintainer says otherwise, I am also fine with it.
We would definitely prefer that you add a changelog entry, and there is CI automation that will complain if you don't! It generally makes it easier for us to prepare releases when we don't need to track down all the changes that have been made and think about how to describe them.
```diff
-	latencyCount        map[metricKey]uint64
-	latencySum          map[metricKey]float64
-	latencyBucketCounts map[metricKey][]uint64
+	latencyCount        map[resourceKey]map[metricKey]uint64
```
Why does this logic need to change, and why does the map structure need to become more complex? Currently it works in a way where each unique metric key is mapped to a unique attribute set. Can the metricKey just take into account the resource attributes, rather than having an entirely new key?
This enables quicker lookup. The way OTel defines the structure that holds metrics is roughly like the following (I've omitted the instrumentation library to keep things simple; if you want to see the actual structure, check out the metric proto definition):
```
Resource 1
  resource attributes (service.name == my.service.1, etc.)
  metric 1
    data point 1
    data point 2
  metric 2
    data point 3
Resource 2
  resource attributes (service.name == my.service.2, etc.)
  metric 3
    data point 4
...
```
This means we need to index on the resource key: to build that metric structure, we need to find all the metrics/data points that share the same resource attributes and insert them together. (There wasn't any way to look up into that structure AFAICT; it's just indexed by integers, so we need to insert all the relevant metrics/data points at the time the resource/resource-attribute structure is created.)
If we kept the single flat map structure that was used before, we'd need to iterate over the entire metric key set and check whether each key is prefixed with the same resource attributes, for every resource attribute set, which would increase the time complexity.
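A minimal sketch of the nested-index idea, using simplified key types and a plain counter as stand-ins for the real processor state (the names here are illustrative, not the actual fields): all metrics for one resource become reachable with a single map lookup, instead of scanning a flat key set for a matching resource prefix.

```go
package main

import "fmt"

// Hypothetical key types mirroring the PR's resourceKey/metricKey split.
type resourceKey string
type metricKey string

// updateCallSum increments a call count in a nested resource->metric index,
// creating the inner map on first use. This sketches the two-level lookup
// the PR introduces in place of a single flat map keyed by metricKey.
func updateCallSum(callSum map[resourceKey]map[metricKey]uint64, rk resourceKey, mk metricKey) {
	if _, ok := callSum[rk]; !ok {
		callSum[rk] = make(map[metricKey]uint64)
	}
	callSum[rk][mk]++
}

func main() {
	callSum := make(map[resourceKey]map[metricKey]uint64)
	updateCallSum(callSum, "svc-1", "GET /users")
	updateCallSum(callSum, "svc-1", "GET /users")
	updateCallSum(callSum, "svc-2", "GET /orders")

	// Building the per-resource metric tree is now a direct lookup: every
	// metric belonging to one resource is found without iterating all keys.
	for rk, metrics := range callSum {
		fmt.Printf("resource %s: %d metric(s)\n", rk, len(metrics))
	}
}
```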
```go
resourceAttrKey := resourceKey(key)
resourceAttributesMap := p.resourceKeyToDimensions[resourceAttrKey]

// if the service name doesn't exist, we treat it as invalid and do not generate a trace
```
Would this be a breaking change for those using this processor without `service.name`?
Roughly answered here: #6717 (comment), but `service.name` is defined as a mandatory resource attribute by the OTel semantic conventions, so it should be there. Happy to remove this if we want; it shouldn't break any existing logic. The main breaking change is that `service.name` moves from being extracted from a resource attribute and turned into a normal metric attribute (dimension) to being extracted from a resource attribute and kept as a metric resource attribute by default.
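A sketch of the skip-on-missing-`service.name` behaviour being discussed, using a plain string map in place of the real pdata attribute API; the function and key names are illustrative, not the processor's actual code.

```go
package main

import "fmt"

// serviceNameKey follows the OTel resource semantic conventions, which
// define service.name as a mandatory resource attribute.
const serviceNameKey = "service.name"

// resourceKeyFor returns a key built from the resource's service name, or
// ok=false when service.name is absent; the caller then treats the resource
// as invalid and skips metric generation for it.
func resourceKeyFor(attrs map[string]string) (key string, ok bool) {
	name, ok := attrs[serviceNameKey]
	if !ok || name == "" {
		return "", false
	}
	// A null byte separates segments, matching the processor's key style.
	return serviceNameKey + "\x00" + name, true
}

func main() {
	if _, ok := resourceKeyFor(map[string]string{"host.name": "h1"}); !ok {
		fmt.Println("skipping resource without service.name")
	}
	if k, ok := resourceKeyFor(map[string]string{serviceNameKey: "my.service"}); ok {
		fmt.Printf("resource key has %d bytes\n", len(k))
	}
}
```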
```go
ilm.InstrumentationLibrary().SetName("spanmetricsprocessor")

// build metrics per resource
p.collectCallMetrics(ilm, resourceAttrKey)
```
I feel like we can (probably?) create new metrics under new resources without complicating the maps with `resourceAttrKey`. Happy to be shown otherwise.
But can't we iterate through the resources and still maintain the same metric keys? They would still be unique and include the resource attributes.
It's just quite a big change, and if you want this merged upstream I worry that, being big and complex, it could get some pushback.
Yeah, we could in theory create a new resource for every metric even if they have the same resource attributes but this way is more optimised for the processor in terms of memory. It's a trade-off I guess, simplicity of code vs memory. Personally I think memory optimisation is more important and I don't think the changes are that complex (albeit definitely more complex than the original). Do you think it's more worth it to have simple code?
```go
	}
}

	k := metricKey(metricKeyBuilder.String())
	return k
}

// buildResourceAttrKey builds the metric key from the service name and will attempt to add any additional resource attributes
```
Are we making the assumption that every resource revolves around the service name? If so, what is the reasoning behind this?
OTel defines service.name as a mandatory resource attribute (see https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/resource/semantic_conventions/README.md). Originally, the processor defined service.name as a mandatory attribute and fetched it from the resource attributes then set it as a normal attribute. This didn't make much sense to me so I altered the behaviour so that it's a default resource attribute.
I need to go through another pass of this since a lot has changed; however, I worry that access to internal data structures is not atomic and could ultimately lead to either incorrect data being emitted or, worst case, a panic.
Could I please ask you to double-check each method that reads or depends on values stored inside the processor?
```diff
-func (p *processorImp) collectLatencyMetrics(ilm pdata.InstrumentationLibraryMetrics) {
-	for key := range p.latencyCount {
+func (p *processorImp) collectLatencyMetrics(ilm pdata.InstrumentationLibraryMetrics, resAttrKey resourceKey) {
+	for mKey := range p.latencyCount[resAttrKey] {
```
I am not certain but you may have to hold the read lock for this loop since it could change underneath your feet.
The lock is held in the code that calls this function so it should be fine right?
```diff
-func (p *processorImp) collectCallMetrics(ilm pdata.InstrumentationLibraryMetrics) {
-	for key := range p.callSum {
+func (p *processorImp) collectCallMetrics(ilm pdata.InstrumentationLibraryMetrics, resAttrKey resourceKey) {
+	for mKey := range p.callSum[resAttrKey] {
```
I think this is the same as well, you'll need to hold a lock here.
The lock is held in the code that calls this function so it should be fine right?
Yeah this is fine
```diff
-func (p *processorImp) updateCallMetrics(key metricKey) {
-	p.callSum[key]++
+func (p *processorImp) updateCallMetrics(resourceAttrKey resourceKey, mKey metricKey) {
+	if _, ok := p.callSum[resourceAttrKey]; ok {
```
Potential for lock as well.
The lock is held in the code that calls this function so it should be fine right?
I worry that someone could mistake these functions as safe for concurrent access and trip up.
From what I understand, it depends on the caller to make sure this happens?
It seems okay since the method is not public; however, it does raise the question of whether this should be a separate function or just one big function.
This processor is stateful and should not be executed concurrently. We are also using a single mutex for the processor, so we should treat all of its methods as concurrency-unsafe. Does it make more sense to take care of lock/unlock only in the top-level caller method, `func (p *processorImp) ConsumeTraces`?
`ConsumeTraces()` should be concurrent-safe and that's it. IMO it's kind of impractical to expect every function here to be safe, and it's reasonable to expect the caller to take concurrency into account. That is how the component was designed before we touched it, anyway.
I've put the change into this commit: fd42b76
```go
	return dims
}

func concatDimensionValue(metricKeyBuilder *strings.Builder, value string) {
```
I like your suggestion of simplifying this and I propose to build on this further with the following goals in mind:
- Remove the assumption that a new/empty `metricKeyBuilder` should be supplied to `concatDimensionValue` in order to avoid prefixing the separator.
- Remove the assumption that `metricKeyBuilder` is not `nil`.
- Remove the need to provide a reference to `metricKeyBuilder` each time, which is a means of maintaining the key building "in progress" state.
- Make the usage simpler and more intuitive with a more idiomatic function name, and avoid repetition through variadic parameters.
So instead of:

```go
var metricKeyBuilder strings.Builder
concatDimensionValue(&metricKeyBuilder, span.Name())
concatDimensionValue(&metricKeyBuilder, span.Kind().String())
concatDimensionValue(&metricKeyBuilder, span.Status().Code().String())
```
The usage could be simplified to:

```go
mkb := keybuilder.NewMetricKeyBuilder()
mkb.Append(
	serviceName,
	span.Name(),
	span.Kind().String(),
	span.Status().Code().String(),
)
```
My proposal is to separate and encapsulate the stateful concern of metric key building from the processor into a package `processor/spanmetricsprocessor/keybuilder`:

```go
package keybuilder

import "strings"

const (
	metricKeySeparator = string(byte(0))
)

type (
	MetricKeyBuilder interface {
		Append(values ...string)
		String() string
	}

	metricKeyBuilder struct {
		sb        strings.Builder
		separator string
	}
)

func NewMetricKeyBuilder() MetricKeyBuilder {
	return &metricKeyBuilder{}
}

func (mkb *metricKeyBuilder) Append(values ...string) {
	for _, value := range values {
		// It's worth noting that from pprof benchmarks, WriteString is the most
		// expensive operation of this processor. Specifically, the need to grow
		// the underlying []byte slice to make room for the appended string.
		mkb.sb.WriteString(mkb.separator)
		mkb.sb.WriteString(value)
		mkb.separator = metricKeySeparator
	}
}

func (mkb *metricKeyBuilder) String() string {
	return mkb.sb.String()
}
```

Please let me know what you think.
I really like this approach; the only thing I would change is the method name of `keybuilder.NewMetricKeyBuilder()` to `keybuilder.New()` to avoid package name stuttering.
I am fairly certain that `strings.Builder` has a `Grow(n int)` method that we can use in the constructor, with a package-level default for what we consider a reasonably sized byte slice?
> the only thing I would change is the method name of keybuilder.NewMetricKeyBuilder() to keybuilder.New() to avoid package name stuttering.

> I am fairly certain that strings.Builder has a .Grow(by int) method that we can use on the NewMethod which can define a package level default of what we consider to be a reasonable size byte slice?
+1 to both suggestions.
This PR was marked stale due to lack of activity. It will be closed in 14 days.
Force-pushed from 70aaed2 to 91c8b7d.
- Use LRU cache for the "resource_attributes".
- Add document/usage for the "resource_attributes" configuration.
Force-pushed from ef982d6 to 8ea124c.
```diff
@@ -234,13 +249,13 @@ func (p *processorImp) ConsumeTraces(ctx context.Context, traces pdata.Traces) e
 		return err
 	}

+	p.reset()
```
Are you sure resetting here will not affect the metrics in `m`? If there are reference types, I would want to be sure that they wouldn't be affected in the following line.
Yes, I am sure `m` should not be impacted. `m` is a pointer, but it's not held by the processor.
I have moved it to a defer function so that downstream components get the result a little bit quicker.
Actually, I'm kind of concerned with this `reset` since it's not under any lock. There was a reason I previously reset exemplar data only inside `buildMetrics()`. I'm not sure how this got to be called outside of `buildMetrics()`, but I think that's a mistake. Ideally it would use the same write lock that is obtained in `buildMetrics()` to avoid race conditions. It would be nice to have this concurrency verified by tests too.
```go
	p.reset()
	p.lock.Unlock()
}()
```
Say there are threads (a) and (b) both calling `ConsumeTraces()`:

1. Thread (a) obtains the write lock, runs `aggregateMetrics()`, and releases the lock.
2. Thread (b) obtains the write lock, runs `aggregateMetrics()`, releases the lock, and then also goes on to `buildMetrics()`.
3. Now thread (a) calls `buildMetrics()` with the data from both (a) and (b). Specifically, in the delta case this would produce incorrect metrics.

Do you agree? Potentially `buildMetrics()` and `aggregateMetrics()` should be under the same lock? @MovieStoreGuy any thoughts?
Yeah, it also makes sense. The processor is stateful and should not be used concurrently. In practice, it should not be used concurrently either based on the design of OTEL Collector.
I would change the lock to make sure the processor can only be executed serially.
Regarding this, and from @MovieStoreGuy's comment:

> Locks protect data. If there is no data to protect, you don't need a lock. If something isn't thread safe, then there's data being accessed, by definition.
If the data is "the entire `processorImp`", that's fine, but it should be documented as such, and locked as such.
Also, if you're saying something isn't thread safe, you don't need a lock. When you declare something as not thread safe, you're explicitly pushing that concern to the caller - if they want to make it thread safe, they can add their own lock. If you have a lock (and you use it properly!), it is thread safe by definition, because that's what locks give you.
> Yeah, it also makes sense. The processor is stateful and should not be used concurrently. In practice, it should not be used concurrently either based on the design of OTEL Collector. I would change the lock to make sure the processor can only be executed serially.
I think I agree with the first statement here, that the processor should not be used concurrently, but I'm not sure about the second. There is nothing in the design of the collector that prevents, prohibits, or even discourages concurrent execution. Any component can spawn new goroutines at any time, and it should be anticipated that receivers will do so regularly. While a pipeline is conceptually a serial process, there may be many instantiations of the pipeline concurrently executing and any component in a pipeline may trigger delayed or asynchronous execution of the next step in the pipeline.
I would fully expect that this processor's `ConsumeTraces()` method will be invoked concurrently, and thus it should take appropriate measures to ensure data consistency. If it needs a consistent view of its state, without the potential for changes by other concurrent invocations, a lock should probably be held for the duration of an invocation.
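A minimal sketch of holding the lock for a whole `ConsumeTraces()` invocation, so that aggregation, metric building, and the deferred reset all see one consistent view; the fields, signature, and counting logic here are illustrative stand-ins, not the processor's real ones.

```go
package main

import (
	"fmt"
	"sync"
)

// processorImp is a toy stand-in: a mutex plus one aggregation map.
type processorImp struct {
	lock    sync.Mutex
	callSum map[string]uint64
}

// ConsumeTraces holds the lock for the entire invocation. The deferred reset
// runs under the same lock, so a concurrent caller can never observe (or
// contribute to) a half-built delta, avoiding the interleaving described above.
func (p *processorImp) ConsumeTraces(spans []string) int {
	p.lock.Lock()
	defer func() {
		p.reset() // reset under the same lock, then release
		p.lock.Unlock()
	}()

	for _, s := range spans {
		p.callSum[s]++ // stands in for aggregateMetrics()
	}
	// Stands in for buildMetrics(): sees only this invocation's aggregates.
	return len(p.callSum)
}

func (p *processorImp) reset() {
	p.callSum = make(map[string]uint64)
}

func main() {
	p := &processorImp{callSum: make(map[string]uint64)}
	fmt.Println(p.ConsumeTraces([]string{"a", "b", "a"}))
}
```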
Thank you @Aneurysm9 for clearing this up. Fully in favour of holding the lock for the whole invocation, then.
By the way, this PR has moved to #7075 (@Tenaria, can you close this one now you're back?)
@tiedotguy regarding the lock, it is not protecting `processorImp` but making it synchronized. I've updated the comment and hope that's more explicit now: https://github.com/open-telemetry/opentelemetry-collector-contrib/pull/7075/files#diff-78f9e7446e0e4fb474786b4b2c262486887bf27479f1f8bc598376d6c3c29b72R72.
But as we discussed here, we should probably consider running multiple sub-processors so that each sub-processor is itself synchronized (still locked by a mutex) while the sub-processors can be executed concurrently.
See previous comment.
Description:
Currently there is a bug / no logic to differentiate between resource attributes. This PR adds the feature to allow users to optionally specify resource attributes to append similar to the existing dimensions mechanism.
The logic now also indexes by the specified resource attributes to ensure accurate aggregation of data.
Service Name has been moved to become a default resource attribute instead of attribute as per the resource semantic convention https://github.com/open-telemetry/opentelemetry-specification/tree/main/specification/resource/semantic_conventions.
There is a bit of discrepancy between the use of the terms "Dimensions" and "Attributes". It seems like "Attributes" is more commonly used, so I have used that term in this PR for adding resource attributes. I propose that we rename dimensions to attributes in the future, although some backwards compatibility will need to be added to the config to support this.
Link to tracking Issue:
#6486
Testing:
Tests added to ensure the new structure/hierarchy of metrics (instrumentationLibraryMetrics under Resource) is generated correctly.
Documentation:
Usage of the new config option `resource_attributes` added to README.md.
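A hypothetical README-style snippet of the option described above; the surrounding keys (`metrics_exporter`, `dimensions`) follow the processor's existing config shape, but the exact attribute names and values here are illustrative, not taken from the merged README.

```yaml
processors:
  spanmetrics:
    metrics_exporter: prometheus
    dimensions:
      - name: http.method
    # Optional resource attributes to attach to generated metrics, mirroring
    # the existing dimensions mechanism (attribute names are illustrative).
    resource_attributes:
      - name: deployment.environment
      - name: k8s.namespace.name
```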