Proposal: Span Stats processor #403

ashwinidulams · 2020-07-07T04:13:58Z

Span Statistics Proposal

A processor for aggregating spans to derive RED (Request, Error, Distribution) metrics across a set of dimensions (dynamic or otherwise) and exporting the results to a metrics sink (prometheus endpoint, stream etc).

Motivation

A trace contains end-to-end data about the service call chain in the context of a request/transaction. This data can be mined to derive context rich metrics that can help us identify and isolate the problem as quickly as possible. For example, span level aggregation can be used to detect any performance problems within the process boundary defined by the span.

Instead of forwarding the spans to a separate processor (stream processing application) for aggregation, a otel processor can process the spans in flight and emit RED metrics at configurable time intervals.

Proposed Configuration

processors:
  # name of the processor 
  spanstatsprocessor:
    # time window for aggregation in seconds.
    timewindowseconds: 15
    # dimensions for aggregation. could be an array of aggregations as well.
    # for each set of dimensions, RED (request, error and distribution) metrics will be emitted aggregated per timewindow.
    dimensions:
    - http.url
    - custom_tag1
    # sink for metrics
    sink:
      # prometheus sink with a endpoint exposed at port 24880
      # alternately could be a kafka stream.
      prometheusport: 24880

jrcamp · 2020-07-30T22:11:25Z

Just so it's not lost, @jmacd wrote in the original issue:

See open-telemetry/opentelemetry-specification#657, which aims to specify conventions for this kind of processor.

jmacd · 2020-08-20T16:56:01Z

There is a related discussion open-telemetry/opentelemetry-specification#739.

Also open-telemetry/opentelemetry-specification#381.

This is a great thing and we should have such a processor. I would expect this processor to output OTLP metrics into a metrics pipeline, which I believe is implied by the config snippet above. 👍

pauldraper · 2020-08-28T20:24:17Z

What are the metrics? span.count and span.duration?

Does it make sense to allow attributes as metrics (e.g. response.bytes)?

albertteoh · 2020-10-23T08:08:36Z

I'd like to take on this task if no one's put their hand up yet.

I would expect this processor to output OTLP metrics into a metrics pipeline

@jmacd, this suggestion makes a lot of sense, as I've recently learned from @jpkrohling. As I'm new to OTEL, what do you suggest would be the best approach to achieving this? I haven't put too much thought into it, but my initial thinking is to copy the OTLP exporter's metrics exporting functionality.

If this doesn't sound like the right approach, can you please point me in the right general direction?

albertteoh · 2020-12-10T05:50:51Z

I've put together a Proof of Concept. Here's a screenshot of a Grafana dashboard based on the R.E.D metrics aggregated from span data. It contains a line graph of hit counts (Request rate), Error counts and rates and a histogram of latencies for a specific endpoint (Duration):

albertteoh · 2020-12-10T07:09:51Z

Wondering if I could get some early feedback on this from the community, particularly, the approach taken to transform trace telemetry to metrics, which I've illustrated here:

The approach is to have a common trace receiver that feeds two pipelines, a trace-only pipeline for trace ingestion, and a trace->metrics pipeline that performs aggregations on span data and writes them out to a configured metrics exporter.

The exporter is chosen using a processor configuration like so:

exporters:
   prometheus:
     endpoint: "0.0.0.0:8889"
     ...

processors:
  spanmetrics:
    metrics_exporter: prometheus
    ...

And the code that performs the metrics exporter discovery:

// Start implements the component.Component interface.
func (p *processorImp) Start(ctx context.Context, host component.Host) error {
	exporters := host.GetExporters()

	for k, exp := range exporters[configmodels.MetricsDataType] {
                ...
                // Check if the exporter k has the same name as the configured exporter. e.g. prometheus
		if k.Name() == p.config.MetricsExporter {
			p.metricsExporter = metricsExp
			break
		}
	}
	if p.metricsExporter == nil {
		return fmt.Errorf( "failed to find metrics exporter: '%s'", p.config.MetricsExporter)
	}
	return nil
}

That all seems okay to me so far, though please suggest any improvements to the above proposed approach.

The part that I'm less sure of is the user experience when configuring pipelines, which seems less than ideal due to the following OTEL collector constraints (which make sense by the way):

Needing at least one receiver and at least one exporter in a pipeline.
Exporters are only discoverable if they belong to a pipeline.

Here's the relevant sections of my POC config. Note the need for:

a dummy otlp/spanmetrics receiver (requirement 1.).
an unused exporter (I just used logger) in the traces/spanmetrics pipeline, which aggregates span data to metrics (requirement 1.).
a dummy metrics/spanmetrics pipeline to "house" the prometheus metrics exporter to ensure it's discoverable in the processor Start() function (requirement 2.).

 receivers:
   # Dummy receiver that's never used, because a pipeline is required to have one.
   otlp/spanmetrics:
     protocols:
       grpc:
         endpoint: "localhost:12345"

 service:
   pipelines:

     # Trace-only pipeline for ingesting traces/spans.
     traces:
       receivers: [jaeger]
       processors: [batch, queued_retry]
       exporters: [logging]

     # Dedicated pipeline for receiving spans and emitting metrics.
     # Note the logging exporter is not really used, but is mandatory for pipelines (must have at least one receiver and exporter)
     traces/spanmetrics:
       receivers: [jaeger]
       processors: [spanmetrics]
       exporters: [logging]

     # A dummy 'internal' pipeline dedicated to exposing a metrics exporter for the spanmetrics pipeline.
     # The exporter name must match the metrics_exporter name.
     # The receiver is just a dummy and never used; added because it's mandatory.
     metrics/spanmetrics:
       receivers: [otlp/spanmetrics]
       exporters: [prometheus]

To me, the need for these dummy configuration entries makes for a less than ideal configuration experience for those using this processor and wondering if folks have any suggestions to improve this?

objectiser · 2020-12-10T16:59:43Z

@albertteoh Could there be a concept of a local metrics receiver, that any component could generate metrics for? So then the spanmetrics processor could be part of the normal trace pipeline, and the local metrics receiver could be used to trigger a single metrics pipeline.

receivers:
  otlp:
    protocols:
      grpc:
         endpoint: "localhost:12345"

  local/mymetrics:

exporters:
   prometheus:
     endpoint: "0.0.0.0:8889"
     ...

processors:
  spanmetrics:
    metrics_receiver: local/mymetrics

service:
  pipelines:
   traces:
     receivers: [otlp]
     processors: [spanmetrics, batch, queued_retry]
     exporters: [logging]
   metrics:
     receivers: [local/mymetrics]
     exporters: [prometheus]

albertteoh · 2020-12-11T08:56:55Z

Thanks @objectiser, that config looks cleaner. Please correct me if I'm wrong; I think the spanmetrics processor still needs an exporter as they expose the ConsumeMetrics function for sending metrics (i.e. to trigger a pipeline), whereas receivers don't offer such means.

Your suggestion got me thinking though... I'd like to understand the motivation for needing at least one receiver in a pipeline? Removing this requirement would mean a local receiver is no longer needed for the use case of producing metrics from traces, and we simply configure what's necessary (@tigrannajaryan @bogdandrutu, what do you think?):

receivers:
   jaeger:
     protocols:
       thrift_http:
         endpoint: "0.0.0.0:14278"

exporters:
   prometheus:
     endpoint: "0.0.0.0:8889"
     ...

processors:
  spanmetrics:
    metrics_exporter: prometheus

service:
  pipelines:
   traces:
     receivers: [jaeger]
     processors: [spanmetrics, batch, queued_retry]
     exporters: [logging]
   metrics:
     exporters: [prometheus]

objectiser · 2020-12-11T09:07:23Z

@albertteoh My suggestions were not based on the current internal implementation, but focusing first on how it might be simplified for configuration - and then address how to deal with this "local" receiver as a second consideration.

I like your updated suggestion as well - my only comment would be that it would not allow any processors to be defined on the metrics generated by the spanmetrics processor - whereas if those metrics could be passed to a local receiver, it still offers the full metrics pipeline to process the metrics and potentially export to multiple destinations.

djaglowski · 2020-12-17T00:59:34Z

Here is a very rough draft of an alternate proposal. I'm providing here in case it may generate useful discussion or ideas.

objectiser · 2020-12-17T11:21:56Z

@djaglowski Sounds reasonable - although for clarity could you add an example OTC config that addresses this use case, for example:

tracing data, processed to derive metrics
tracing data then needs to be further processed before being exported
derived metrics then exported to another metrics pipeline that maybe does additional processing before exporting to the desired backend

djaglowski · 2020-12-17T14:44:41Z

@objectiser, I've updated the doc to include your example scenario and a possible configuration for it. This example highlights a gap in my proposal, which I've also addressed at the end. Thanks for reviewing, and for the suggestion.

objectiser · 2020-12-17T15:58:15Z

@djaglowski Looks good, thanks for the update.

albertteoh · 2020-12-21T11:11:17Z

Thanks for putting together the proposal, @djaglowski.

I feel an important requirement for trace -> metrics aggregation is having access to all span data before any filtering (like tail-based sampling) and potentially after span modification processors are applied.

If I understand correctly, the signal translators are implemented as exporters, which I believe would be subject to all processors in the pipeline being executed first before reaching the exporter and hence potentially result in lower-fidelity metrics.

djaglowski · 2020-12-22T15:45:14Z

@albertteoh I believe my proposal covers this, though the usability may be a bit poor.

In the example shown in the doc, copied below, pipeline a is a forking mechanism for the raw traces, so that they can be 1) passed along to pipeline b without modification, and 2) processed into metrics and passed to pipeline c.

receivers:
  some/tracingreceiver:
  otlp/tracingreceiver:
    endpoint: "0.0.0.0:1111"
  otlp/metricsreceiver:
    endpoint: "0.0.0.0:2222"
  
processors:
  some/tracingprocessor:
  some/metricsprocessor:
  
exporters:
  some/tracingexporter:
  some/metricsexporter:
  otlp/tracingexporter:
    endpoint: "0.0.0.0:1111"
  tracestometrics: # perform some derivation and behave as otlp/metricsexporter
    endpoint: "0.0.0.0:2222"

service:
  pipelines:
    a:
      receivers: [some/tracingreceiver]
      exporters: [otlp/tracingexporter, tracestometrics]
    b:
      receivers: [some/tracingreceiver]
      processors: [some/tracingprocessor]
      exporters: [some/tracingexporter]
    c:
      receivers: [some/metricsreceiver]
      processors: [some/metricsprocessor]
      exporters: [some/metricsexporter]

Diagram:

djaglowski · 2020-12-22T17:40:32Z

A new type of component would simplify the above proposal. This is pretty similar to the "local receiver" idea proposed by @objectiser.

I've expanded upon this in the doc, but at a high level, the idea is that translators could be used to link pipelines together:

receivers:
  some/tracingreceiver:
  
processors:
  some/tracingprocessor:
  some/metricsprocessor:
  
exporters:
  some/tracingexporter:
  some/metricsexporter:

translators: # must be used as both exporter and receiver
  tracestotraces:
  tracestometrics:

service:
  pipelines:
    a:
      receivers: [some/tracingreceiver]
      exporters: [tracestotraces, tracestometrics]
    b:
      receivers: [tracestotraces]
      processors: [some/tracingprocessor]
      exporters: [some/tracingexporter]
    c:
      receivers: [tracestometrics]
      processors: [some/metricsprocessor]
      exporters: [some/metricsexporter]

objectiser · 2020-12-22T19:22:11Z

@djaglowski Haven't looked in the doc, so possibly explained there - but what is the purpose of the Traces to Traces Translater? Seems redundant as just dealing with the trace pipeline.

djaglowski · 2020-12-22T19:38:25Z

@objectiser It would probably make more sense if described as a Traces Forwarder. The reasoning for this to be included here at all, is that it would be a bridge between pipelines.

Ultimately, this is all in the context of trying to work with the existing pipeline format, where a fan out happens before Exporters. A "Translator" would be an abstraction of an Exporter/Receiver combo that bridges two pipelines.

objectiser · 2020-12-22T19:45:29Z

@djaglowski Ok thanks for the explanation - I'm not familiar with the internals of the collector so may be wrong - but I thought Traces Receiver could just be shared by pipeline A and B (i.e. also used as the receiver for pipeline B instead of the Traces to Traces Translator) to achieve the same result?

djaglowski · 2020-12-22T19:54:31Z

@objectiser You might be right about being able to reuse components.

However, for the purpose of chaining together pipelines, I believe it would become ambiguous which pipeline you intend to emit to.

The Traces Forwarder would be an abstraction on both a receiver and an exporter, both of which are configured to communicate on a specific port.

So the idea here was to encapsulate all the details of linking the two together, and just have a single component instead.

albertteoh · 2020-12-29T07:20:09Z

Thanks for the additional details @djaglowski. I like the translator/forwarder/bridge concept, especially given it tightly couples the exporter and receiver link between pipelines which avoids ambiguity on where exporters will route requests to, and hence a better user experience. It satisfies the requirement for chaining pipelines and "mixed" telemetry type pipelines, while IIUC, maintaining backwards compatibility with existing interfaces.

I also agree that your proposal satisfies the requirement of aggregating traces to metrics before processing.

tigrannajaryan · 2021-01-06T16:25:03Z

@djaglowski I commented on the design doc.

**Description:** Adds spanmetricsprocessor logic as a follow-up to #1917. **Link to tracking Issue:** #403 **Testing:** - Unit and integration tested locally (by adding to `cmd/otelcontribcol/components.go`). - Performance tested (added metrics key cache to improve performance) and added benchmark function. **Documentation:** Add relevant documentation to exported structs and functions.

tigrannajaryan · 2021-01-28T23:16:58Z

Is this now done or more work is planned?

albertteoh · 2021-01-29T00:26:26Z

@tigrannajaryan, I've been following the contributing guidelines so there's one last PR required which is to enable the spanmetrics processor component, if you're happy for me to go ahead with it.

Signed-off-by: albertteoh <albert.teoh@logz.io> **Description:** Enables the spanmetricsprocessor for use in otel collector configs as a follow-up to #2012. **Link to tracking Issue:** Fixes #403 **Testing:** Integration tested locally against a prebuilt Grafana dashboard to confirm metrics look correct.

halayli · 2021-09-11T00:23:59Z

If the goal of this processor is to convert traces to a metrics, then It should focus on extraction and shouldn't do any aggregation but rather pass the metrics to a metrics processor. And as far as I can tell, in its current implementation it's ignoring the span events which are as important as span attributes.

albertteoh · 2021-11-02T08:49:02Z

@halayli could you elaborate more on what you mean, particularly about extraction and span events?

Tenaria · 2022-01-20T01:15:38Z

Agree with @halayli's comment above that the span metrics processor shouldn't aggregate but simply extract the metrics for the trace and forward. I feel like the span metrics processor is becoming a bit convoluted due to the added layer of complexity introduced by aggregation that is further complicated as we add in more features.

In my opinion if aggregation is desired, that can be a part of a separate processor - I saw this open issue #4968 which is highly relevant in that case. That way we don't have all these processors that are all doing multiple things (e.g it sounds like the metric transform processor is also doing some aggregation). I think it'll help us moving forward.

What are other people's thoughts around this (maintainers/contributors/other users /etc.), would you be receptive to this change in removing the aggregation?

tonychoe · 2022-01-20T20:58:22Z

I agree that we better keep this processor to generating span-metrics only. I prefer to do aggregation via the metrics backend or other otelcol processor.

albertteoh · 2022-01-21T09:44:31Z

Thanks @Tenaria for bringing up this topic again.

+1 to separating the span to metrics translation from the metrics aggregation concern.

Reading over the comments again, I think I've got a better appreciation of @halayli's the suggestion (although I'm still not sure what span events refer to). I noticed in the last few PRs that there was a fair bit of discussion around locking, concurrency and performance, and I suspect these concerns should be mitigated if we remove aggregation.

When spanmetrics processor first came to being, the goal was just to get something working quickly to prove its utility. Clearly, it's evolved to the point where the need to decouple the aggregation and translation concerns is quite evident.

Fortunately, in @tigrannajaryan's greater wisdom, the spanmetrics processor is still in "experimental" mode so I'm not too concerned about introducing breaking changes at this stage, and this will be a breaking change.

tigrannajaryan · 2022-01-21T14:57:17Z

I support the idea of separating the concerns of extracting metric values from spans (which can be this specialized processor) vs aggregating the extracted metrics (which can be a generic metric aggregation processor).

However if we are doing breaking change I suggest that we first announce it before start making changes (in Slack and in SIG meeting), put together a 1-pager that describes how the new approach will work, how it will substitute the current approach and how the changes will be rolled out (very rough plan with rollout steps would be great to have). Since the new approach relies on having the metric aggregation processor we must ensure that metric aggregation processor exists before the aggregation features are removed from this processor (otherwise we will end up in temporary a state where we don't have feature parity with the current processor).

If anyone is willing to drive this change then the maintainers/approvers will be happy to review.

jamesmoessis · 2022-01-23T23:59:03Z

@tigrannajaryan it seems the metrictransformprocessor has aggregation functionalities, albeit it looks a little outdated from first glance. Is there still interest in a completely new and separate metricsaggregatorprocessor?

tigrannajaryan · 2022-01-24T15:24:04Z

I would expect that metrictransformprocessor should be able to do the aggregation that we need.

There is an effort to refine the metrictransformprocessor, which I am not following closely, so I don't know where exactly they are right now: https://github.com/open-telemetry/opentelemetry-collector-contrib/issues?q=is%3Aissue+is%3Aopen++label%3A%22proc%3A+metricstransformprocessor%22+

jmacd · 2022-01-24T20:02:52Z

What I'd like to see in the future / it would be cool if a span exporter would use an OpenTelemetry metrics API MeterProvider configured to use a private instance of an OTel-Go Metrics SDK configured to output to a metrics receiver.

There's a 1:1 correspondence between Spans arriving and Meter(library_name).Counter(span_name).Add(adjusted_count) API events, so the span receiver would be extremely simple and stateless itself.

The OTel-Go SDK would do the aggregation for you, and note that it's performing aggregation over counter events, not arbitrary metrics data points. The logic required inside an SDK for aggregating events into OTLP is substantially simpler than a generic metrics aggregation processor, that's why I think this is a good idea.

tigrannajaryan · 2022-01-25T00:12:47Z

What I'd like to see in the future / it would be cool if a span exporter would use an OpenTelemetry metrics API MeterProvider configured to use a private instance of an OTel-Go Metrics SDK configured to output to a metrics receiver.

It will likely not have the best performance due to extra conversions and (de)serialization. For better performance we need to stay in pdata space.

MovieStoreGuy · 2022-01-25T03:14:24Z

It will likely not have the best performance due to extra conversions and (de)serialization. For better performance we need to stay in pdata space.

Did someone say generics?

But I agree, moving towards a scoped meter that can remain in pdata space sounds like the best solution for this.

…nic when merging into a empty one (#402) (#403)

ambition-consulting · 2022-10-31T12:23:23Z

I've put together a Proof of Concept. Here's a screenshot of a Grafana dashboard based on the R.E.D metrics aggregated from span data. It contains a line graph of hit counts (Request rate), Error counts and rates and a histogram of latencies for a specific endpoint (Duration):

Hey @albertteoh, do you have the grafana config for this available somewhere? I'm using your plugin and would like to visualize those metrcis now that I got it working.

Sweet work btw. :)

albertteoh · 2022-11-01T09:43:57Z

@ambition-consulting you can try: https://snapshots.raintank.io/dashboard/snapshot/J1haxtp7oBF1BpgL7JBqHzZgVNxlAlWW

ambition-consulting · 2022-11-02T13:59:19Z

@ambition-consulting you can try: https://snapshots.raintank.io/dashboard/snapshot/J1haxtp7oBF1BpgL7JBqHzZgVNxlAlWW

@albertteoh this looks great, but the values get hard coded on export. can I please (!) get a json with the functions being applied on the output of the spanmetricsprocessor?

my asumption is that this is what the dashboard does? or are those just dummy values that don't come from spanmetric processor?

ashwinidulams added the feature request label Jul 7, 2020

jrcamp mentioned this issue Jul 30, 2020

Proposal: Span Stats Processor open-telemetry/opentelemetry-collector#1263

Closed

jrcamp added the help wanted Extra attention is needed label Jul 30, 2020

jrcamp added this to the Backlog milestone Jul 30, 2020

This was referenced Dec 28, 2020

Add spanmetricsprocessor #1914

Closed

Add spanmetricsprocessor readme, config , factory, tests #1917

Merged

djaglowski mentioned this issue Jan 6, 2021

Signal Translation Proprosal open-telemetry/opentelemetry-collector#2336

Closed

andrewhsu added the enhancement New feature or request label Jan 6, 2021

albertteoh mentioned this issue Jan 21, 2021

Create a metrics query endpoint jaegertracing/jaeger#2736

Closed

albertteoh mentioned this issue Jan 29, 2021

Enable spanmetricsprocessor component #2225

Merged

tigrannajaryan closed this as completed in #2225 Jan 29, 2021

albertteoh mentioned this issue Feb 9, 2021

Request for minor release 0.20.0 #2309

Closed

chenzhihao mentioned this issue Jan 20, 2022

Inherit instrumentation library name atlassian-forks/opentelemetry-collector-contrib#1364

Merged

chenzhihao mentioned this issue Jan 24, 2022

[processor/spanmetrics] Resource attributes support #7075

Closed

ljmsc referenced this issue in ljmsc/opentelemetry-collector-contrib Feb 21, 2022

initialize checkpoint when creating ddsketch aggregator to prevent pa…

4f88422

…nic when merging into a empty one (#402) (#403)

gbbr mentioned this issue Oct 5, 2022

[spanmetricsprocessor] All metrics extracted from spans has the same resource attributes, despite the spans' resources are different #6486

Closed

codeboten pushed a commit that referenced this issue Nov 23, 2022

Update redis instrumentation to follow semantic conventions (#403)

634c2ac

thpierce mentioned this issue Jan 25, 2023

Introduce means of producing metrics from all Spans regardless of sampling decision. open-telemetry/opentelemetry-specification#3145

Open

Proposal: Span Stats processor #403

Proposal: Span Stats processor #403

Comments

ashwinidulams commented Jul 7, 2020

Span Statistics Proposal

Motivation

Proposed Configuration

jrcamp commented Jul 30, 2020

jmacd commented Aug 20, 2020

pauldraper commented Aug 28, 2020 • edited Loading

albertteoh commented Oct 23, 2020

albertteoh commented Dec 10, 2020 • edited Loading

albertteoh commented Dec 10, 2020 • edited Loading

objectiser commented Dec 10, 2020

albertteoh commented Dec 11, 2020 • edited Loading

objectiser commented Dec 11, 2020

djaglowski commented Dec 17, 2020

objectiser commented Dec 17, 2020

djaglowski commented Dec 17, 2020

objectiser commented Dec 17, 2020

albertteoh commented Dec 21, 2020 • edited Loading

djaglowski commented Dec 22, 2020 • edited Loading

djaglowski commented Dec 22, 2020 • edited Loading

objectiser commented Dec 22, 2020

djaglowski commented Dec 22, 2020 • edited Loading

objectiser commented Dec 22, 2020

djaglowski commented Dec 22, 2020

albertteoh commented Dec 29, 2020

tigrannajaryan commented Jan 6, 2021

tigrannajaryan commented Jan 28, 2021

albertteoh commented Jan 29, 2021

halayli commented Sep 11, 2021

albertteoh commented Nov 2, 2021

Tenaria commented Jan 20, 2022 • edited Loading

tonychoe commented Jan 20, 2022

albertteoh commented Jan 21, 2022

tigrannajaryan commented Jan 21, 2022

jamesmoessis commented Jan 23, 2022

tigrannajaryan commented Jan 24, 2022

jmacd commented Jan 24, 2022

tigrannajaryan commented Jan 25, 2022

MovieStoreGuy commented Jan 25, 2022 • edited Loading

ambition-consulting commented Oct 31, 2022 • edited Loading

albertteoh commented Nov 1, 2022

ambition-consulting commented Nov 2, 2022 • edited Loading

pauldraper commented Aug 28, 2020 •

edited

Loading

albertteoh commented Dec 10, 2020 •

edited

Loading

albertteoh commented Dec 10, 2020 •

edited

Loading

albertteoh commented Dec 11, 2020 •

edited

Loading

albertteoh commented Dec 21, 2020 •

edited

Loading

djaglowski commented Dec 22, 2020 •

edited

Loading

djaglowski commented Dec 22, 2020 •

edited

Loading

djaglowski commented Dec 22, 2020 •

edited

Loading

Tenaria commented Jan 20, 2022 •

edited

Loading

MovieStoreGuy commented Jan 25, 2022 •

edited

Loading

ambition-consulting commented Oct 31, 2022 •

edited

Loading

ambition-consulting commented Nov 2, 2022 •

edited

Loading