[processor/spanmetrics] Support for specifying Resource Attributes on metrics generated from traces #6717

Conversation

@Tenaria Tenaria commented Dec 13, 2021

Description:
Currently there is no logic to differentiate between resource attributes, which is effectively a bug. This PR adds a feature that lets users optionally specify resource attributes to append, similar to the existing dimensions mechanism.

The logic now also indexes by the specified resource attributes to ensure accurate aggregation of data.

Service name has been moved to become a default resource attribute instead of a regular metric attribute, as per the resource semantic conventions: https://github.com/open-telemetry/opentelemetry-specification/tree/main/specification/resource/semantic_conventions.

There is a bit of discrepancy between the use of the terms "Dimensions" and "Attributes". It seems like "Attributes" is more commonly used, so I have used that term in this PR for adding resource attributes. I propose that we rename dimensions to attributes in the future, although some backwards compatibility will need to be added to the config to support this.

Link to tracking Issue:
#6486

Testing:
Tests added to ensure the new hierarchy of metrics (instrumentationLibraryMetrics under Resource) is generated correctly.

Documentation:
Usage of the new config option resource_attributes added to README.md
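For context, the existing dimensions option looks roughly like the snippet below, and resource_attributes follows the same list shape. This is an illustrative sketch only; the README.md change in this PR is the authoritative reference.

processors:
  spanmetrics:
    metrics_exporter: otlp/spanmetrics
    dimensions:
      - name: http.method
        default: GET
      - name: http.status_code
    # Illustrative only: appends the listed resource attributes to the
    # generated metrics, analogous to the dimensions mechanism above.
    resource_attributes:
      - name: deployment.environment
        default: unknown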

@Tenaria Tenaria changed the title Obc 256 resource attributes merge upstream [processor/spanmetricsprocessor] Support for specifying Resource Attributes on metrics generated from traces Dec 13, 2021
@@ -5,12 +5,16 @@
## 🛑 Breaking changes 🛑

- `memcachedreceiver`: Update metric names (#6594)
- `spanmetricsprocessor`: service.name is now added as a default resource attribute instead of a metric attribute in generated metrics (#6717)
Contributor

I don't think you need to update the changelog; the maintainers will do this upon a new release.

Author

I saw in Zhihao's PR that there was a changelog added (see comment #6503 (review)). @MovieStoreGuy can you confirm whether we update the changelog or the maintainers do?

Contributor

I think it is fine for us to update the changelog; we have the best overall idea of what has changed, so it makes sense for us to add it. However, if a maintainer says otherwise, I am also fine with that.

Member

We would definitely prefer that you add a changelog entry, and there is CI automation that will complain if you don't! It generally makes it easier for us to prepare releases when we don't need to track down all the changes that have been made and think about how to describe them.

latencyCount map[metricKey]uint64
latencySum map[metricKey]float64
latencyBucketCounts map[metricKey][]uint64
latencyCount map[resourceKey]map[metricKey]uint64
Contributor

Why does this logic need to change, and why does the map structure need to become more complex? Currently it works in a way where each unique metric key is mapped to a unique attribute set. Can the metricKey just take into account the resource attributes, rather than having an entirely new key?

Author
@Tenaria Tenaria Dec 13, 2021

This enables quicker lookup time. The way OTel defines the structure to hold metrics is something roughly like this (I've omitted the instrumentation library to keep things simple; if you want to see the actual structure, check out the metric proto definition):

-> Resource 1
   -> resource attributes (service.name == my.service.1, etc)
   -> metric 1
      -> data point 1
      -> data point 2
   -> metric 2
      -> data point 3
-> Resource 2
   -> resource attributes (service.name == my.service.2, etc)
   -> metric 3
      -> data point 4
etc

This means we need to index by resource key: to build that metric structure we need to find all the metrics/data points that share the same resource attributes and insert them together. (There isn't any way to look up on this structure AFAICT; it's just indexed by integers, so we need to insert all the relevant metrics/data points at the time the resource/resource attribute structure is created.)

If we continued to use the single map structure that was there before, we'd need to iterate over the entire metric key set and check whether each key is prefixed with the same resource attributes, for every set of resource attributes - which would increase the time complexity.
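A rough sketch of the two-level map described above (type and field names here are illustrative, not the exact code in this PR):

package main

import "fmt"

type (
	resourceKey string
	metricKey   string
)

// aggregations indexes counts first by resource attribute key, then by metric key,
// mirroring the Resource -> Metric -> DataPoint hierarchy of the metrics proto.
type aggregations struct {
	callSum map[resourceKey]map[metricKey]uint64
}

// updateCallMetrics bumps the call count for one (resource, metric) pair,
// creating the inner map on first use.
func (a *aggregations) updateCallMetrics(rKey resourceKey, mKey metricKey) {
	if _, ok := a.callSum[rKey]; !ok {
		a.callSum[rKey] = make(map[metricKey]uint64)
	}
	a.callSum[rKey][mKey]++
}

// collectCallMetrics emits every data point belonging to a single resource.
// Because the outer map is keyed by resource, no scan over unrelated metric keys is needed.
func (a *aggregations) collectCallMetrics(rKey resourceKey) {
	for mKey, count := range a.callSum[rKey] {
		fmt.Printf("resource=%q metric=%q calls=%d\n", rKey, mKey, count)
	}
}

func main() {
	a := &aggregations{callSum: make(map[resourceKey]map[metricKey]uint64)}
	a.updateCallMetrics("service.name=my.service.1", "GET /users")
	a.updateCallMetrics("service.name=my.service.1", "GET /users")
	a.updateCallMetrics("service.name=my.service.2", "GET /orders")
	a.collectCallMetrics("service.name=my.service.1")
}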

resourceAttrKey := resourceKey(key)
resourceAttributesMap := p.resourceKeyToDimensions[resourceAttrKey]

// if the service name doesn't exist, we treat it as invalid and do not generate a trace
Contributor

Would this be a breaking change to those using this processor without service.name?

Author
@Tenaria Tenaria Dec 13, 2021

roughly answered here: #6717 (comment)

But service.name is defined as a mandatory resource attribute by the OTel semantic conventions, so it should be there. Happy to remove this check if we want; it shouldn't break any existing logic. The main breaking change is that, as default behaviour, service.name is now extracted from a resource attribute and kept as a resource attribute on the generated metrics, instead of being turned into a normal metric attribute (dimension).
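As a rough illustration of that default behaviour (a hypothetical, simplified helper; the actual logic lives in processor.go and reads the attributes from pdata):

package main

import "fmt"

// metricResourceAttributes is a hypothetical stand-in for the processor's logic;
// resourceAttrs represents the span's resource attributes.
func metricResourceAttributes(resourceAttrs map[string]string) (map[string]string, bool) {
	serviceName, ok := resourceAttrs["service.name"]
	if !ok {
		// No service.name: the resource is treated as invalid and no metrics are generated.
		return nil, false
	}
	// service.name is carried over as a resource attribute on the generated metrics
	// (the new default), instead of being converted into a metric dimension.
	return map[string]string{"service.name": serviceName}, true
}

func main() {
	attrs, ok := metricResourceAttributes(map[string]string{"service.name": "my.service.1"})
	fmt.Println(attrs, ok)
}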

ilm.InstrumentationLibrary().SetName("spanmetricsprocessor")

// build metrics per resource
p.collectCallMetrics(ilm, resourceAttrKey)
Contributor

I feel like we can (probably?) create new metrics under new resources without complicating the maps with resourceAttrKey. Happy to be shown otherwise.

But can't we iterate through the resources and still maintain the same metric keys? They will still be unique and include the resource attributes.

It's just quite a big change, and if you want this to be merged upstream I worry that, being big and complex, it could get some push back.

Author

Yeah, in theory we could create a new resource for every metric even if they have the same resource attributes, but this way is more optimised for the processor in terms of memory. It's a trade-off I guess: simplicity of code vs memory. Personally I think the memory optimisation is more important, and I don't think the changes are that complex (albeit definitely more complex than the original). Do you think it's more worth it to have simple code?

}
}

k := metricKey(metricKeyBuilder.String())
return k
}

// buildResourceAttrKey builds the resource attribute key from the service name and will attempt to add any additional resource attributes
Contributor

Are we making the assumption that every resource revolves around the service name? If so, what is the reasoning behind this?

Author

OTel defines service.name as a mandatory resource attribute (see https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/resource/semantic_conventions/README.md). Originally, the processor defined service.name as a mandatory attribute, fetched it from the resource attributes, and set it as a normal attribute. This didn't make much sense to me, so I altered the behaviour so that it's a default resource attribute instead.

Contributor
@MovieStoreGuy MovieStoreGuy left a comment

I need to go through another pass of this since a lot has changed. However, I worry that access to internal data is not atomic, which would ultimately lead to either incorrect data being emitted or, worst case, a panic.

Could I please ask you to double check each method that reads or depends on values stored inside the processor?

func (p *processorImp) collectLatencyMetrics(ilm pdata.InstrumentationLibraryMetrics) {
for key := range p.latencyCount {
func (p *processorImp) collectLatencyMetrics(ilm pdata.InstrumentationLibraryMetrics, resAttrKey resourceKey) {
for mKey := range p.latencyCount[resAttrKey] {
Contributor

I am not certain but you may have to hold the read lock for this loop since it could change underneath your feet.

Author

The lock is held in the code that calls this function, so it should be fine, right?

func (p *processorImp) collectCallMetrics(ilm pdata.InstrumentationLibraryMetrics) {
for key := range p.callSum {
func (p *processorImp) collectCallMetrics(ilm pdata.InstrumentationLibraryMetrics, resAttrKey resourceKey) {
for mKey := range p.callSum[resAttrKey] {
Contributor

I think this is the same as well, you'll need to hold a lock here.

Author

The lock is held in the code that calls this function, so it should be fine, right?

Contributor

Yeah this is fine

func (p *processorImp) updateCallMetrics(key metricKey) {
p.callSum[key]++
func (p *processorImp) updateCallMetrics(resourceAttrKey resourceKey, mKey metricKey) {
if _, ok := p.callSum[resourceAttrKey]; ok {
Contributor

Potential for lock as well.

Author

The lock is held in the code that calls this function, so it should be fine, right?

Contributor

I worry that someone could mistake these functions as safe for concurrent access and trip up.

From what I understand, it depends on the caller to make sure this happens?

It seems okay since the method is not public; however, it does raise the question of whether this should be a separate function or just one big function.

Contributor

This processor is stateful and should not be executed concurrently. We are also using a single mutex for the processor, so we should treat all of its methods as concurrency-unsafe. Would it make more sense to take care of lock/unlock only in the top-level caller method, func (p *processorImp) ConsumeTraces?

Contributor

ConsumeTraces() should be concurrent-safe and that's it. IMO it's kind of impractical to expect every function here to be safe, and it's reasonable to expect the caller to take concurrency into account. That is how the component was designed before we touched it, anyway.

Contributor

I've put the change into this commit: fd42b76

return dims
}

func concatDimensionValue(metricKeyBuilder *strings.Builder, value string) {
Contributor
@albertteoh albertteoh Dec 14, 2021

I like your suggestion of simplifying this and I propose to build on this further with the following goals in mind:

  • Remove the assumption that a new/empty metricKeyBuilder should be supplied to concatDimensionValue in order to avoid prefixing the separator.
  • Remove the assumption that metricKeyBuilder is not nil.
  • Remove the need to provide a reference to metricKeyBuilder each time, which is a means of maintaining the key building "in progress" state.
  • Make the usage simpler and more intuitive with a more idiomatic function name and avoid repetition through variadic parameters.

So instead of:

	var metricKeyBuilder strings.Builder
	concatDimensionValue(&metricKeyBuilder, span.Name())
	concatDimensionValue(&metricKeyBuilder, span.Kind().String())
	concatDimensionValue(&metricKeyBuilder, span.Status().Code().String())

The usage could be simplified to:

	mkb := keybuilder.NewMetricKeyBuilder()
	mkb.Append(
		serviceName,
		span.Name(),
		span.Kind().String(),
		span.Status().Code().String(),
	)

My proposal is to separate and encapsulate the stateful concern of metrics key building from the processor into a package processor/spanmetricsprocessor/keybuilder:

package keybuilder

import "strings"

const (
	metricKeySeparator = string(byte(0))
)

type (
	MetricKeyBuilder interface {
		Append(values ...string)
		String() string
	}

	metricKeyBuilder struct {
		sb        strings.Builder
		separator string
	}
)

func NewMetricKeyBuilder() MetricKeyBuilder {
	return &metricKeyBuilder{}
}

func (mkb *metricKeyBuilder) Append(values ...string) {
	for _, value := range values {
		// It's worth noting that from pprof benchmarks, WriteString is the most expensive operation of this processor.
		// Specifically, the need to grow the underlying []byte slice to make room for the appended string.
		mkb.sb.WriteString(mkb.separator)
		mkb.sb.WriteString(value)
		mkb.separator = metricKeySeparator
	}
}

func (mkb *metricKeyBuilder) String() string {
	return mkb.sb.String()
}

Please let me know what you think.

Contributor

I really like this approach; the only thing I would change is the constructor name, from keybuilder.NewMetricKeyBuilder() to keybuilder.New(), to avoid package name stuttering.

I am fairly certain that strings.Builder has a .Grow(n int) method that we can use in New, which could define a package-level default for what we consider a reasonably sized byte slice?

Contributor

the only thing I would change is the method name of keybuilder.NewMetricKeyBuilder() to keybuilder.New() to avoid package name stuttering.

I am fairly certain that strings.Builder has a .Grow(by int) method that we can use on the NewMethod which can define a package level default of what we consider to be a reasonable size byte slice?

+1 to both suggestions.
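A sketch of what the package could look like with both suggestions applied (the initial capacity below is an assumed default, not something agreed in this thread):

package keybuilder

import "strings"

const (
	metricKeySeparator = string(byte(0))

	// defaultInitialSize is an assumed package-level default, chosen to hold a
	// typical key without forcing the underlying byte slice to grow repeatedly.
	defaultInitialSize = 128
)

type metricKeyBuilder struct {
	sb        strings.Builder
	separator string
}

// New returns a key builder with pre-allocated capacity, avoiding the repeated
// slice growth that showed up in the pprof benchmarks mentioned above.
func New() *metricKeyBuilder {
	mkb := &metricKeyBuilder{}
	mkb.sb.Grow(defaultInitialSize)
	return mkb
}

func (mkb *metricKeyBuilder) Append(values ...string) {
	for _, value := range values {
		mkb.sb.WriteString(mkb.separator)
		mkb.sb.WriteString(value)
		mkb.separator = metricKeySeparator
	}
}

func (mkb *metricKeyBuilder) String() string {
	return mkb.sb.String()
}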

@jpkrohling jpkrohling changed the title [processor/spanmetricsprocessor] Support for specifying Resource Attributes on metrics generated from traces [processor/spanmetrics] Support for specifying Resource Attributes on metrics generated from traces Dec 16, 2021
@github-actions

This PR was marked stale due to lack of activity. It will be closed in 14 days.

@github-actions github-actions bot added the Stale label Dec 31, 2021
@jpkrohling jpkrohling removed the Stale label Jan 3, 2022
@chenzhihao chenzhihao force-pushed the OBC-256-Resource-Attributes-Merge-Upstream branch from 70aaed2 to 91c8b7d on January 5, 2022 03:31
@chenzhihao chenzhihao force-pushed the OBC-256-Resource-Attributes-Merge-Upstream branch from ef982d6 to 8ea124c on January 6, 2022 04:25
@@ -234,13 +249,13 @@ func (p *processorImp) ConsumeTraces(ctx context.Context, traces pdata.Traces) e
return err
}

p.reset()
Contributor

Are you sure resetting here will not affect the metrics in m? If there are reference types I would want to be sure that they wouldn't be affected in the following line.

Contributor
@chenzhihao chenzhihao Jan 6, 2022

Yes, I am sure m is not impacted; m is a pointer, but it's not held by the processor.

I have moved the reset into a deferred function so that downstream components get the result a little bit quicker.

Contributor
@jamesmoessis jamesmoessis Jan 6, 2022

Actually, I'm kind of concerned with this reset since it's not under any lock. There was a reason I previously reset exemplar data only inside buildMetrics(). I'm not sure how this came to be called outside of buildMetrics(), but I think that's a mistake. Ideally it would use the same write lock that is obtained in buildMetrics() to avoid race conditions. It would be nice to have this concurrency verified by tests too.

p.reset()
p.lock.Unlock()
}()

Contributor

Say there is thread (a) and (b) both calling ConsumeTraces().

  1. Thread (a) obtains the write lock, runs aggregateMetrics(), and releases the lock.
  2. Thread (b) obtains the write lock, runs aggregateMetrics(), releases the lock, and then also goes on to buildMetrics().
  3. Now thread (a) calls buildMetrics() with the data from both (a) and (b). Specifically, in the delta case this would produce incorrect metrics.

Do you agree? Potentially buildMetrics and aggregateMetrics should be under the same lock? @MovieStoreGuy any thoughts?

Contributor
@chenzhihao chenzhihao Jan 7, 2022

Yeah, that also makes sense. The processor is stateful and should not be used concurrently. In practice, based on the design of the OTel Collector, it shouldn't be invoked concurrently either.

I would change the lock to make sure the processor can only be executed serially.


Regarding this, and from @MovieStoreGuy 's comment:

Locks protect data. If there is no data to protect, you don't need a lock. If something isn't thread safe, then there's data being accessed, by definition.

If the data is "the entire processorImp", that's fine, but it should be documented as such, and locked as such.

Also, if you're saying something isn't thread safe, you don't need a lock. When you declare something as not thread safe, you're explicitly pushing that concern to the caller - if they want to make it thread safe, they can add their own lock. If you have a lock (and you use it properly!), it is thread safe by definition, because that's what locks give you.

Member

Yeah, it also makes sense. The processor is stateful and should not be used concurrently. In practice, it should not be used concurrently either based on the design of OTEL Collector.

I would change the lock to make sure the processor can only be executed serially.

I think I agree with the first statement here, that the processor should not be used concurrently, but I'm not sure about the second. There is nothing in the design of the collector that prevents, prohibits, or even discourages concurrent execution. Any component can spawn new goroutines at any time, and it should be anticipated that receivers will do so regularly. While a pipeline is conceptually a serial process, there may be many instantiations of the pipeline concurrently executing and any component in a pipeline may trigger delayed or asynchronous execution of the next step in the pipeline.

I would fully expect that this processor's ConsumeTraces() method will be invoked concurrently and thus it should ensure that appropriate measures are taken to ensure data consistency. If it needs to ensure a consistent view of its state without the potential for changes by other concurrent invocations, a lock should probably be held for the duration of an invocation.
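A minimal sketch of that approach, assuming field names like lock and metricsConsumer and the helpers discussed in this thread (simplified signatures, not the exact code that landed in #7075):

// ConsumeTraces is the only entry point expected to be called concurrently, so it
// holds the processor's mutex for the whole invocation: aggregation, metric building,
// and the reset of accumulated state all happen under the same lock.
func (p *processorImp) ConsumeTraces(ctx context.Context, traces pdata.Traces) error {
	p.lock.Lock()
	defer func() {
		p.reset()
		p.lock.Unlock()
	}()

	p.aggregateMetrics(traces)

	m, err := p.buildMetrics()
	if err != nil {
		return err
	}
	return p.metricsConsumer.ConsumeMetrics(ctx, *m)
}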

Contributor

Thank you @Aneurysm9 for clearing this up. Fully in favour of holding the lock for the whole invocation, then.

By the way, this PR has moved to #7075 (@Tenaria, can you close this one now that you're back?)

Contributor

@tiedotguy regarding the lock, it is not protecting the processorImp data as such but making execution synchronized. I've updated the comment and hope it's more explicit now: https://github.com/open-telemetry/opentelemetry-collector-contrib/pull/7075/files#diff-78f9e7446e0e4fb474786b4b2c262486887bf27479f1f8bc598376d6c3c29b72R72.

But as we discussed here, we should probably consider running multiple sub-processors so that each sub-processor itself is synchronized (still locked by a mutex) while the sub-processors can be executed concurrently.

@chenzhihao
Contributor

As @Tenaria is away (she will be back next week), we are moving this PR to #7075 so that we can edit the PR's description and rename the branch.

Sorry for the inconvenience.

@Tenaria
Author

Tenaria commented Jan 12, 2022

see previous comment.
