[processor/spanmetrics] Support for specifying Resource Attributes on metrics generated from traces #6717

Conversation

@Tenaria Tenaria commented Dec 13, 2021

Description:
Currently there is no logic to differentiate between resource attributes, which is effectively a bug. This PR adds a feature that lets users optionally specify resource attributes to append, similar to the existing dimensions mechanism.

The logic now also indexes by the specified resource attributes to ensure accurate aggregation of data.

Service name has been moved to become a default resource attribute instead of a regular metric attribute, as per the resource semantic conventions: https://github.com/open-telemetry/opentelemetry-specification/tree/main/specification/resource/semantic_conventions.

There is a bit of discrepancy between the use of the terms "Dimensions" and "Attributes". It seems like "Attributes" is more commonly used, so I have used that term in this PR for adding resource attributes. I propose that we rename dimensions to attributes in the future, although some backwards compatibility will need to be added to the config to support this.

Link to tracking Issue:
#6486

Testing:
Tests added to ensure the new hierarchy of metrics (instrumentationLibraryMetrics under Resource) is generated correctly.

Documentation:
Usage of the new config option resource_attributes added to README.md
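For context, the existing dimensions option looks roughly like the snippet below, and resource_attributes follows the same list shape. This is an illustrative sketch only; the README.md change in this PR is the authoritative reference.

processors:
  spanmetrics:
    metrics_exporter: otlp/spanmetrics
    dimensions:
      - name: http.method
        default: GET
      - name: http.status_code
    # Illustrative only: appends the listed resource attributes to the
    # generated metrics, analogous to the dimensions mechanism above.
    resource_attributes:
      - name: deployment.environment
        default: unknown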

@Tenaria Tenaria changed the title Obc 256 resource attributes merge upstream [processor/spanmetricsprocessor] Support for specifying Resource Attributes on metrics generated from traces Dec 13, 2021
@@ -5,12 +5,16 @@
## 🛑 Breaking changes 🛑

- `memcachedreceiver`: Update metric names (#6594)
- `spanmetricsprocessor`: service.name is now added as a default resource attribute instead of a metric attribute in generated metrics (#6717)
Contributor

I don't think you need to update the changelog; the maintainers will do this upon a new release.

Author

I saw in Zhihao's PR that there was a changelog added (see comment #6503 (review)). @MovieStoreGuy can you confirm whether we update the changelog or the maintainers do?

Contributor

I think it is fine for us to update the changelog; we have the best overall idea of what has changed, so it makes sense for us to add it. However, if a maintainer says otherwise, I am also fine with that.

Member

We would definitely prefer that you add a changelog entry, and there is CI automation that will complain if you don't! It generally makes it easier for us to prepare releases when we don't need to track down all the changes that have been made and think about how to describe them.

latencyCount map[metricKey]uint64
latencySum map[metricKey]float64
latencyBucketCounts map[metricKey][]uint64
latencyCount map[resourceKey]map[metricKey]uint64
Contributor

Why does this logic need to change, and why does the map structure need to become more complex? Currently it works in a way where each unique metric key is mapped to a unique attribute set. Can the metricKey just take into account the resource attributes, rather than having an entirely new key?

Author
@Tenaria Tenaria Dec 13, 2021

This enables quicker lookup time. The way OTel defines the structure to hold metrics is something roughly like this (I've omitted the instrumentation library to keep things simple; if you want to see the actual structure, check out the metric proto definition):

-> Resource 1
   -> resource attributes (service.name == my.service.1, etc)
   -> metric 1
      -> data point 1
      -> data point 2
   -> metric 2
      -> data point 3
-> Resource 2
   -> resource attributes (service.name == my.service.2, etc)
   -> metric 3
      -> data point 4
etc

This means we need to index by resource key: to build that metric structure we need to find all the metrics/data points that share the same resource attributes and insert them together. (There isn't any way to look up on this structure AFAICT; it's just indexed by integers, so we need to insert all the relevant metrics/data points at the time the resource/resource attribute structure is created.)

If we continued to use the single map structure that was there before, we'd need to iterate over the entire metric key set and check whether each key is prefixed with the same resource attributes, for every set of resource attributes - which would increase the time complexity.
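A rough sketch of the two-level map described above (type and field names here are illustrative, not the exact code in this PR):

package main

import "fmt"

type (
	resourceKey string
	metricKey   string
)

// aggregations indexes counts first by resource attribute key, then by metric key,
// mirroring the Resource -> Metric -> DataPoint hierarchy of the metrics proto.
type aggregations struct {
	callSum map[resourceKey]map[metricKey]uint64
}

// updateCallMetrics bumps the call count for one (resource, metric) pair,
// creating the inner map on first use.
func (a *aggregations) updateCallMetrics(rKey resourceKey, mKey metricKey) {
	if _, ok := a.callSum[rKey]; !ok {
		a.callSum[rKey] = make(map[metricKey]uint64)
	}
	a.callSum[rKey][mKey]++
}

// collectCallMetrics emits every data point belonging to a single resource.
// Because the outer map is keyed by resource, no scan over unrelated metric keys is needed.
func (a *aggregations) collectCallMetrics(rKey resourceKey) {
	for mKey, count := range a.callSum[rKey] {
		fmt.Printf("resource=%q metric=%q calls=%d\n", rKey, mKey, count)
	}
}

func main() {
	a := &aggregations{callSum: make(map[resourceKey]map[metricKey]uint64)}
	a.updateCallMetrics("service.name=my.service.1", "GET /users")
	a.updateCallMetrics("service.name=my.service.1", "GET /users")
	a.updateCallMetrics("service.name=my.service.2", "GET /orders")
	a.collectCallMetrics("service.name=my.service.1")
}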

resourceAttrKey := resourceKey(key)
resourceAttributesMap := p.resourceKeyToDimensions[resourceAttrKey]

// if the service name doesn't exist, we treat it as invalid and do not generate a trace
Contributor

Would this be a breaking change to those using this processor without service.name?

Author
@Tenaria Tenaria Dec 13, 2021

roughly answered here: #6717 (comment)

But service.name is defined as a mandatory resource attribute by the OTel semantic conventions, so it should be there. Happy to remove this check if we want; it shouldn't break any existing logic. The main breaking change is that, as default behaviour, service.name is now extracted from a resource attribute and kept as a resource attribute on the generated metrics, instead of being turned into a normal metric attribute (dimension).
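As a rough illustration of that default behaviour (a hypothetical, simplified helper; the actual logic lives in processor.go and reads the attributes from pdata):

package main

import "fmt"

// metricResourceAttributes is a hypothetical stand-in for the processor's logic;
// resourceAttrs represents the span's resource attributes.
func metricResourceAttributes(resourceAttrs map[string]string) (map[string]string, bool) {
	serviceName, ok := resourceAttrs["service.name"]
	if !ok {
		// No service.name: the resource is treated as invalid and no metrics are generated.
		return nil, false
	}
	// service.name is carried over as a resource attribute on the generated metrics
	// (the new default), instead of being converted into a metric dimension.
	return map[string]string{"service.name": serviceName}, true
}

func main() {
	attrs, ok := metricResourceAttributes(map[string]string{"service.name": "my.service.1"})
	fmt.Println(attrs, ok)
}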

ilm.InstrumentationLibrary().SetName("spanmetricsprocessor")

// build metrics per resource
p.collectCallMetrics(ilm, resourceAttrKey)
Contributor

I feel like we can (probably?) create new metrics under new resources without complicating the maps with resourceAttrKey. Happy to be shown otherwise.

But can't we iterate through the resources and still maintain the same metric keys? They will still be unique and include the resource attributes.

It's just quite a big change, and if you want this to be merged upstream I worry that, being big and complex, it could get some push back.

Author

Yeah, in theory we could create a new resource for every metric even if they have the same resource attributes, but this way is more optimised for the processor in terms of memory. It's a trade-off I guess: simplicity of code vs memory. Personally I think the memory optimisation is more important, and I don't think the changes are that complex (albeit definitely more complex than the original). Do you think it's more worth it to have simple code?

}
}

k := metricKey(metricKeyBuilder.String())
return k
}

// buildResourceAttrKey builds the resource attribute key from the service name and will attempt to add any additional resource attributes
Contributor

Are we making the assumption that every resource revolves around the service name? If so, what is the reasoning behind this?

Author

OTel defines service.name as a mandatory resource attribute (see https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/resource/semantic_conventions/README.md). Originally, the processor defined service.name as a mandatory attribute, fetched it from the resource attributes, and set it as a normal attribute. This didn't make much sense to me, so I altered the behaviour so that it's a default resource attribute instead.

Contributor
@MovieStoreGuy MovieStoreGuy left a comment

I need to go through another pass of this since a lot has changed. However, I worry that access to internal data is not atomic, which would ultimately lead to either incorrect data being emitted or, worst case, a panic.

Could I please ask you to double check each method that reads or depends on values stored inside the processor?

func (p *processorImp) collectLatencyMetrics(ilm pdata.InstrumentationLibraryMetrics) {
for key := range p.latencyCount {
func (p *processorImp) collectLatencyMetrics(ilm pdata.InstrumentationLibraryMetrics, resAttrKey resourceKey) {
for mKey := range p.latencyCount[resAttrKey] {
Contributor

I am not certain but you may have to hold the read lock for this loop since it could change underneath your feet.

Author

The lock is held in the code that calls this function, so it should be fine, right?

func (p *processorImp) collectCallMetrics(ilm pdata.InstrumentationLibraryMetrics) {
for key := range p.callSum {
func (p *processorImp) collectCallMetrics(ilm pdata.InstrumentationLibraryMetrics, resAttrKey resourceKey) {
for mKey := range p.callSum[resAttrKey] {
Contributor

I think this is the same as well, you'll need to hold a lock here.

Author

The lock is held in the code that calls this function, so it should be fine, right?

Contributor

Yeah this is fine

func (p *processorImp) updateCallMetrics(key metricKey) {
p.callSum[key]++
func (p *processorImp) updateCallMetrics(resourceAttrKey resourceKey, mKey metricKey) {
if _, ok := p.callSum[resourceAttrKey]; ok {
Contributor

Potential for lock as well.

Author

The lock is held in the code that calls this function, so it should be fine, right?

Contributor

I worry that someone could mistake these functions as safe for concurrent access and trip up.

From what I understand, it depends on the caller to make sure this happens?

It seems okay since the method is not public; however, it does raise the question of whether this should be a separate function or just one big function.

Contributor

This processor is stateful and should not be executed concurrently. We are also using a single mutex for the processor, so we should treat all of its methods as concurrency-unsafe. Would it make more sense to take care of lock/unlock only in the top-level caller method, func (p *processorImp) ConsumeTraces?

Contributor

ConsumeTraces() should be concurrent-safe and that's it. IMO it's kind of impractical to expect every function here to be safe, and it's reasonable to expect the caller to take concurrency into account. That is how the component was designed before we touched it, anyway.

Contributor

I've put the change into this commit: fd42b76

return dims
}

func concatDimensionValue(metricKeyBuilder *strings.Builder, value string) {
Contributor
@albertteoh albertteoh Dec 14, 2021

I like your suggestion of simplifying this and I propose to build on this further with the following goals in mind:

  • Remove the assumption that a new/empty metricKeyBuilder should be supplied to concatDimensionValue in order to avoid prefixing the separator.
  • Remove the assumption that metricKeyBuilder is not nil.
  • Remove the need to provide a reference to metricKeyBuilder each time, which is a means of maintaining the key building "in progress" state.
  • Make the usage simpler and more intuitive with a more idiomatic function name and avoid repetition through variadic parameters.

So instead of:

	var metricKeyBuilder strings.Builder
	concatDimensionValue(&metricKeyBuilder, span.Name())
	concatDimensionValue(&metricKeyBuilder, span.Kind().String())
	concatDimensionValue(&metricKeyBuilder, span.Status().Code().String())

The usage could be simplified to:

	mkb := keybuilder.NewMetricKeyBuilder()
	mkb.Append(
		serviceName,
		span.Name(),
		span.Kind().String(),
		span.Status().Code().String(),
	)

My proposal is to separate and encapsulate the stateful concern of metrics key building from the processor into a package processor/spanmetricsprocessor/keybuilder:

package keybuilder

import "strings"

const (
	metricKeySeparator = string(byte(0))
)

type (
	MetricKeyBuilder interface {
		Append(values ...string)
		String() string
	}

	metricKeyBuilder struct {
		sb        strings.Builder
		separator string
	}
)

func NewMetricKeyBuilder() MetricKeyBuilder {
	return &metricKeyBuilder{}
}

func (mkb *metricKeyBuilder) Append(values ...string) {
	for _, value := range values {
		// It's worth noting that from pprof benchmarks, WriteString is the most expensive operation of this processor.
		// Specifically, the need to grow the underlying []byte slice to make room for the appended string.
		mkb.sb.WriteString(mkb.separator)
		mkb.sb.WriteString(value)
		mkb.separator = metricKeySeparator
	}
}

func (mkb *metricKeyBuilder) String() string {
	return mkb.sb.String()
}

Please let me know what you think.

Contributor

I really like this approach; the only thing I would change is the constructor name, from keybuilder.NewMetricKeyBuilder() to keybuilder.New(), to avoid package name stuttering.

I am fairly certain that strings.Builder has a .Grow(n int) method that we can use in New, which could define a package-level default for what we consider a reasonably sized byte slice?

Contributor

the only thing I would change is the method name of keybuilder.NewMetricKeyBuilder() to keybuilder.New() to avoid package name stuttering.

I am fairly certain that strings.Builder has a .Grow(by int) method that we can use on the NewMethod which can define a package level default of what we consider to be a reasonable size byte slice?

+1 to both suggestions.
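A sketch of what the package could look like with both suggestions applied (the initial capacity below is an assumed default, not something agreed in this thread):

package keybuilder

import "strings"

const (
	metricKeySeparator = string(byte(0))

	// defaultInitialSize is an assumed package-level default, chosen to hold a
	// typical key without forcing the underlying byte slice to grow repeatedly.
	defaultInitialSize = 128
)

type metricKeyBuilder struct {
	sb        strings.Builder
	separator string
}

// New returns a key builder with pre-allocated capacity, avoiding the repeated
// slice growth that showed up in the pprof benchmarks mentioned above.
func New() *metricKeyBuilder {
	mkb := &metricKeyBuilder{}
	mkb.sb.Grow(defaultInitialSize)
	return mkb
}

func (mkb *metricKeyBuilder) Append(values ...string) {
	for _, value := range values {
		mkb.sb.WriteString(mkb.separator)
		mkb.sb.WriteString(value)
		mkb.separator = metricKeySeparator
	}
}

func (mkb *metricKeyBuilder) String() string {
	return mkb.sb.String()
}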

@jpkrohling jpkrohling changed the title [processor/spanmetricsprocessor] Support for specifying Resource Attributes on metrics generated from traces [processor/spanmetrics] Support for specifying Resource Attributes on metrics generated from traces Dec 16, 2021
@github-actions

This PR was marked stale due to lack of activity. It will be closed in 14 days.

@github-actions github-actions bot added the Stale label Dec 31, 2021
@jpkrohling jpkrohling removed the Stale label Jan 3, 2022
@chenzhihao chenzhihao force-pushed the OBC-256-Resource-Attributes-Merge-Upstream branch from 70aaed2 to 91c8b7d on January 5, 2022 03:31
@chenzhihao chenzhihao force-pushed the OBC-256-Resource-Attributes-Merge-Upstream branch from ef982d6 to 8ea124c on January 6, 2022 04:25
@@ -234,13 +249,13 @@ func (p *processorImp) ConsumeTraces(ctx context.Context, traces pdata.Traces) e
return err
}

p.reset()
Contributor

Are you sure resetting here will not affect the metrics in m? If there are reference types I would want to be sure that they wouldn't be affected in the following line.

Contributor
@chenzhihao chenzhihao Jan 6, 2022

Yes, I am sure m is not impacted; m is a pointer, but it's not held by the processor.

I have moved the reset into a deferred function so that downstream components get the result a little bit quicker.

Contributor
@jamesmoessis jamesmoessis Jan 6, 2022

Actually, I'm kind of concerned with this reset since it's not under any lock. There was a reason I previously reset exemplar data only inside buildMetrics(). I'm not sure how this came to be called outside of buildMetrics(), but I think that's a mistake. Ideally it would use the same write lock that is obtained in buildMetrics() to avoid race conditions. It would be nice to have this concurrency verified by tests too.

p.reset()
p.lock.Unlock()
}()

Contributor

Say there is thread (a) and (b) both calling ConsumeTraces().

  1. Thread (a) obtains the write lock, runs aggregateMetrics(), and releases the lock.
  2. Thread (b) obtains the write lock, runs aggregateMetrics(), releases the lock, and then also goes on to buildMetrics().
  3. Now thread (a) calls buildMetrics() with the data from both (a) and (b). Specifically, in the delta case this would produce incorrect metrics.

Do you agree? Potentially buildMetrics and aggregateMetrics should be under the same lock? @MovieStoreGuy any thoughts?

Contributor
@chenzhihao chenzhihao Jan 7, 2022

Yeah, that also makes sense. The processor is stateful and should not be used concurrently. In practice, based on the design of the OTel Collector, it shouldn't be invoked concurrently either.

I would change the lock to make sure the processor can only be executed serially.


Regarding this, and from @MovieStoreGuy 's comment:

Locks protect data. If there is no data to protect, you don't need a lock. If something isn't thread safe, then there's data being accessed, by definition.

If the data is "the entire processorImp", that's fine, but it should be documented as such, and locked as such.

Also, if you're saying something isn't thread safe, you don't need a lock. When you declare something as not thread safe, you're explicitly pushing that concern to the caller - if they want to make it thread safe, they can add their own lock. If you have a lock (and you use it properly!), it is thread safe by definition, because that's what locks give you.

Member

Yeah, it also makes sense. The processor is stateful and should not be used concurrently. In practice, it should not be used concurrently either based on the design of OTEL Collector.

I would change the lock to make sure the processor can only be executed serially.

I think I agree with the first statement here, that the processor should not be used concurrently, but I'm not sure about the second. There is nothing in the design of the collector that prevents, prohibits, or even discourages concurrent execution. Any component can spawn new goroutines at any time, and it should be anticipated that receivers will do so regularly. While a pipeline is conceptually a serial process, there may be many instantiations of the pipeline concurrently executing and any component in a pipeline may trigger delayed or asynchronous execution of the next step in the pipeline.

I would fully expect that this processor's ConsumeTraces() method will be invoked concurrently and thus it should ensure that appropriate measures are taken to ensure data consistency. If it needs to ensure a consistent view of its state without the potential for changes by other concurrent invocations, a lock should probably be held for the duration of an invocation.
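A minimal sketch of that approach, assuming field names like lock and metricsConsumer and the helpers discussed in this thread (simplified signatures, not the exact code that landed in #7075):

// ConsumeTraces is the only entry point expected to be called concurrently, so it
// holds the processor's mutex for the whole invocation: aggregation, metric building,
// and the reset of accumulated state all happen under the same lock.
func (p *processorImp) ConsumeTraces(ctx context.Context, traces pdata.Traces) error {
	p.lock.Lock()
	defer func() {
		p.reset()
		p.lock.Unlock()
	}()

	p.aggregateMetrics(traces)

	m, err := p.buildMetrics()
	if err != nil {
		return err
	}
	return p.metricsConsumer.ConsumeMetrics(ctx, *m)
}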

Contributor

Thank you @Aneurysm9 for clearing this up. Fully in favour of holding the lock for the whole invocation, then.

By the way, this PR has moved to #7075 (@Tenaria, can you close this one now that you're back?)

Contributor

@tiedotguy regarding the lock, it is not protecting the processorImp data as such but making execution synchronized. I've updated the comment and hope it's more explicit now: https://github.com/open-telemetry/opentelemetry-collector-contrib/pull/7075/files#diff-78f9e7446e0e4fb474786b4b2c262486887bf27479f1f8bc598376d6c3c29b72R72.

But as we discussed here, we should probably consider running multiple sub-processors so that each sub-processor itself is synchronized (still locked by a mutex) while the sub-processors can be executed concurrently.

@chenzhihao
Contributor

As @Tenaria is away (she will be back next week), we are moving this PR to #7075 so that we can edit the PR's description and rename the branch.

Sorry for the inconvenience.

@Tenaria
Author

Tenaria commented Jan 12, 2022

see previous comment.
