[@opentelemetry/instrumentation-http] Configure custom metric attributes #4107

Closed
klippx opened this issue Sep 1, 2023 · 5 comments

klippx commented Sep 1, 2023

Is your feature request related to a problem? Please describe.

We are deploying canaries and rely heavily on metrics and dashboards to measure how they compare to the stable version currently in production. However, there is no way to distinguish the Node.js HTTP auto-instrumentation metrics: they all end up with the same metric labels.

I have created a discussion (#4088) but no one was able to help me.

Describe the solution you'd like

I want a clear and intuitive API for configuring custom metric attributes that the user can inject from outside of a span; an illustrative sketch follows.
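To make the request concrete, here is a hypothetical sketch of what such a configuration could look like. The option name metricAttributesFromRequest and the currentRole helper are invented for illustration only; no such hook exists in @opentelemetry/instrumentation-http today.

// Hypothetical sketch: the option name and the helper below are made up.
import { HttpInstrumentation } from '@opentelemetry/instrumentation-http';

// App-side helper resolving the pod's current role when a metric is recorded.
const currentRole = () => process.env.DEPLOYMENT_ROLE ?? 'stable';

const httpInstrumentation = new HttpInstrumentation({
  // Invented hook: called for every recorded duration measurement,
  // returning extra attributes to attach to the metric data points.
  metricAttributesFromRequest: (request) => ({
    'deployment.role': currentRole(), // 'canary' or 'stable'
  }),
});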

Describe alternatives you've considered

Inject the attributes from the Prometheus exporter

This feels wrong for two reasons:

  1. Everything else is programmatically configurable, so metrics should be extendable too.
  2. A pod might change from "canary" to "stable" during its lifecycle, and we only know about this in the app, so the attribute needs to come from the app; it would be too hard to do in the low-level Prometheus exporter.

Change the pod name

By changing the pod name we could get exported_job to become different, but that would be bad because they are all the same app, just variants of it, and the resulting breaking changes in dashboards would be too wonky to deal with.

@pichlermarc
Member

Hi, thanks for opening this issue 🙂

Could be that I'm misunderstanding, but I'm thinking https://opentelemetry.io/docs/instrumentation/js/resources/ might be what you're looking for. Is it possible to distinguish canary and stable by the time the container starts? If yes, then it should be possible to use a resource for it. 🙂

Or does that incur the same problems as with adding it to the Prometheus exporter?
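For reference, a minimal sketch of that suggestion, assuming a DEPLOYMENT_ROLE environment variable (a made-up name) is set per pod before the process starts:

import { Resource } from '@opentelemetry/resources';
import { NodeSDK } from '@opentelemetry/sdk-node';

const sdk = new NodeSDK({
  resource: new Resource({
    // Any custom attribute key works; 'service.role' here is just an example.
    'service.role': process.env.DEPLOYMENT_ROLE ?? 'stable',
  }),
});
sdk.start();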


klippx commented Sep 4, 2023

Thanks for the review. I realise that, as a maintainer, it is sensitive to add new APIs/features willy-nilly, so hopefully if something is added it is done in a reasonable manner that is in line with the existing library's ergonomics.

The closest we have managed to solving this with a resource is:

  import { Resource } from '@opentelemetry/resources';
  import { SemanticResourceAttributes } from '@opentelemetry/semantic-conventions';
  import { View, ExplicitBucketHistogramAggregation } from '@opentelemetry/sdk-metrics';
  import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
  import * as opentelemetry from '@opentelemetry/sdk-node';

  // traceConfiguration, attributesFromEnv, metricReader and histogramNames
  // come from the app's own configuration (omitted here).
  const resource = new Resource({
    [SemanticResourceAttributes.SERVICE_NAME]: traceConfiguration.serviceName,
    [SemanticResourceAttributes.SERVICE_INSTANCE_ID]:
      process.env.POD_NAME ?? traceConfiguration.serviceName,
    [SemanticResourceAttributes.SERVICE_VERSION]: 'v666', // hardcoded example
    [SemanticResourceAttributes.SERVICE_NAMESPACE]: 'canary', // hardcoded example
    ...attributesFromEnv,
  });

  // ...
  const sdk = new opentelemetry.NodeSDK({
    metricReader,
    resource,
    instrumentations: [ getNodeAutoInstrumentations({ ...  }) ],
    // ...
  });

  // Configure the meter provider with explicit histogram buckets.
  sdk.configureMeterProvider({
    views: histogramNames.map(
      (instrumentName) =>
        new View({
          instrumentName,
          aggregation: new ExplicitBucketHistogramAggregation([
            50, 100, 250, 500, 1000, 2500, 5000, 10000, 20000,
          ]),
        }),
    ),
  });

  sdk.start();

The metrics recorded then look like this:

fed_graph_http_client_duration_bucket{
  exported_instance="product-enrichment-dgs", 
  exported_job="canary/product-enrichment-dgs", 
  http_flavor="1.1", 
  http_method="GET", 
  http_status_code="200", 
  instance="otel-collector:8083", 
  job="opentelemetry-collector", 
  le="+Inf", 
  net_peer_name="cce-internal.euwest1.staging.xxx.care", 
  net_peer_port="443"
}

So it seems we got "canary" into the namespace, which could work (albeit as a hack) if this were possible to determine at boot time.

However, we see a number of issues with this:

  1. It feels like this isn't a namespace. We would rather have a separate attribute called "role" that we can control.
  • If we have more attributes, such as the service version, do we have to "hack" them into the namespace to get them working?
  2. Why weren't other attributes such as SERVICE_VERSION added as a tag? In the example it was hardcoded to "v666".
  • We are happy they weren't, though, because that would explode the dimensions of the metrics; as users we appreciate being in control. But maybe we want the version to distinguish different canary versions from each other? We should be able to tweak it.
  3. We are using Argo Rollouts, which means a pod can be promoted from "canary" to "stable" at runtime, so it is not determined by the time the container starts.
  • We would have to "reconfigure" tracing on a hook, such as in this pseudo code:
// How to detect this? Can this even be reconfigured after sdk.start() was run?
// Or do we have to shut down and recreate it completely?
const onPromotion = () => {
  sdk.configureTracerProvider(
    {
      resource, // resource was updated with namespace "stable"
    },
    spanProcessor,
  );
};
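For comparison, a rough sketch of the shutdown-and-recreate variant raised in the pseudo code above. Here buildSdk and resourceWithRole are hypothetical app-side helpers that rebuild the full NodeSDK configuration; only sdk.shutdown() and sdk.start() are real SDK calls.

// Hypothetical helpers: buildSdk(resource) constructs a fresh NodeSDK with the
// app's metricReader, instrumentations and the given resource; resourceWithRole
// builds a Resource whose role attribute reflects the current deployment role.
let sdk = buildSdk(resourceWithRole('canary'));
sdk.start();

async function onPromotion() {
  // Flush and stop the current SDK, then start a new one with the updated resource.
  await sdk.shutdown();
  sdk = buildSdk(resourceWithRole('stable'));
  sdk.start();
}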

We think that a way of extending attributes as a configuration option at the instrumentation level is in line with the other hooks available in the Node auto-instrumentation; the question is whether we have nailed the API usage/ergonomics.


pichlermarc commented Sep 5, 2023

Hmm, you're right; it does not really fit the namespace. I have to say that Prometheus is not really my area of expertise, but IIRC resources are usually exported via the target_info metric, which contains all resource attributes as labels; one should then join with target_info. I've found this blog post that shows how to do this. (Note: the post linked above also recommends using an OTel Collector and its transform processor to copy the labels from the Resource onto the telemetry; that would also be an option, but it would require deploying another component.)

If I understand correctly, this should bring the concept of resource attributes into Prometheus, and then it is possible to split by resource attributes, so it should work with any resource attribute instead.

Of course, if the value of the resource attribute is not determined by the time the container starts, then that's challenging, because I think we don't have a concept for that in OpenTelemetry at the moment. As such, I think the whole topic should be addressed at the specification level rather than at the level of an individual instrumentation, as this would most likely concern not only the telemetry emitted from the http instrumentation but all other instrumentations (across all languages) as well. I'm aware of open-telemetry/oteps#207, but I think that proposal may require some hacking to make it work for this use case. Then there's also #4045, which prototypes "global attributes" (only for traces and logs at the moment) and sounds more like what would be necessary here. 🤔

Thinking outside what's available in the app, I'm wondering if collector-contrib's kubernetes attributes processor would be able to apply a stable/canary resource attribute based on a k8s label? 🤔


dyladan commented Sep 12, 2023

> I have created a discussion (#4088) but no one was able to help me.

Have to admit I always forget the discussions feature exists

github-actions bot commented Nov 13, 2023

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days.

github-actions bot added the stale label Nov 13, 2023
klippx closed this as completed Nov 22, 2023