
Errors not being reported for kong plugin Opentelemetry #13776

Open
pawandhiman10 opened this issue Oct 20, 2024 · 15 comments

@pawandhiman10

pawandhiman10 commented Oct 20, 2024

Is there an existing issue for this?

  • I have searched the existing issues

Kong version ($ kong version)

3.7.1

Current Behavior

We have installed this plugin for traces in Kong.
Plugin link

We are sending traces to Datadog with it, but the traces only cover requests; errors are not captured at all.
[Screenshot attached: 2024-10-20 at 10:50 PM]

Expected Behavior

Errors should be captured and reported in the UI.

Steps To Reproduce

Install this plugin using Helm.
We have attached this plugin to a service with the Kong annotation konghq.com/plugins: opentelemetry.
The KongPlugin is configured following the configuration reference on the plugin definition page.
It is connected to the OTLP endpoint exposed by the Datadog agent on HTTP port 4318; a rough sketch of the setup is shown below.
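For reference, a minimal KongPlugin sketch of the setup described above might look like the following; the agent hostname and the exact config field name (endpoint here) are assumptions and depend on your cluster and plugin version:

apiVersion: configuration.konghq.com/v1
kind: KongPlugin
metadata:
  name: opentelemetry
plugin: opentelemetry
config:
  # Hypothetical Datadog agent service address; 4318 is the OTLP/HTTP port mentioned above.
  endpoint: "http://datadog-agent.monitoring.svc.cluster.local:4318/v1/traces"

The plugin is then attached via the konghq.com/plugins: opentelemetry annotation on the Service, as noted above.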

Anything else?

No response

@ProBrian
Contributor

@pawandhiman10 can you attach your config file for the OTEL plugin?
@samugi can you help take a look at this issue?

ProBrian added the "pending author feedback" label (waiting for the issue author to get back to a maintainer with findings, more details, etc.) on Oct 21, 2024
@samugi
Member

samugi commented Oct 21, 2024

hello @pawandhiman10
errors in traces are captured and reported by the OpenTelemetry plugin as span events using the exception event key and the exception.message attribute as defined in the OpenTelemetry specification.

Based on this, I believe Datadog might be expecting slightly different fields. Are you using OTLP ingestion with the Datadog agent? In that case, I would expect the OTLP ingestion process to take care of parsing any errors and translating any fields as needed.

I can confirm that errors are displayed correctly by other tools such as Jaeger and Grafana.

Also: could you share an example of an error that is being reported by your system and you are expecting to be visible in the UI?
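To illustrate the shape described above, a span carrying such an error would have an error status plus an exception event, roughly like this (a conceptual sketch, not literal exporter output; the plugin name and message are hypothetical):

name: kong.access.plugin.my-plugin
status:
  code: STATUS_CODE_ERROR
events:
  - name: exception
    attributes:
      exception.message: "kong/plugins/my-plugin/handler.lua:12: fail"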

@pawandhiman10
Author

@samugi Errors here mean 5xx status codes.
As the earlier screenshot shows, only the 2xx requests are displayed, not the 5xx. These are not exceptions as such, but the responses are 5xx, and they are missing.
The screenshot below shows that 5xx responses occur regularly and sometimes increase, yet we cannot see these numbers in Datadog even though they are visible in the load balancer logs.
[Screenshot attached: 2024-10-21 at 12:36 PM]

Yes, the ingestion is currently OTLP ingestion via the Datadog agent. Could you point me to some reference docs on how to get the errors (5xx) parsed?

@samugi
Member

samugi commented Oct 22, 2024

@pawandhiman10 in that case, it might be expected. 500 response status codes are not always reported as errors. Today, errors are reported for failures (exceptions) that occur during the execution of plugin phase handlers, errors returned by the internal HTTP client, and DNS failures.

Could you test with a 500 status code that originates from an exception in a plugin? This can be tested with a plugin that throws an explicit error, e.g. error("fail"), from its access phase. The expected behavior is for this to generate a span named kong.access.plugin.yourplugin with an error state and an event that contains the exception message; a minimal handler sketch is shown below. If Datadog is parsing things correctly, this should be displayed in the UI as well.
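A minimal sketch of such a test plugin's handler.lua (the plugin name fail-demo and the file layout are hypothetical):

-- kong/plugins/fail-demo/handler.lua (sketch)
local FailDemoHandler = {
  PRIORITY = 1000,
  VERSION = "0.1.0",
}

function FailDemoHandler:access(conf)
  -- Throwing here should produce a kong.access.plugin.fail-demo span
  -- with an error status and an "exception" event, as described above.
  error("fail")
end

return FailDemoHandler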

@pawandhiman10
Author

pawandhiman10 commented Oct 26, 2024

@samugi It seems the error tag is missing for the 5xx status codes. Is there any way to add this tag (span.error: true) for requests with a status code >= 500?
I'm not sure about this, but is there any relation to any of these headers: w3c, b3, jaeger, ot, aws, datadog, gcp?
I can see the error code there, but it comes through with an OK status.
[Screenshot attached: 2024-10-27 at 2:07 AM]

@samugi
Member

samugi commented Oct 27, 2024

@pawandhiman10 what you describe is currently expected. Status codes are reported (whether they are 2xx, 4xx, 5xx, etc.), but spans are not set to an error state when this happens. Errors are currently reserved for actual errors (exceptions) occurring during the execution of plugin code and a few other scenarios that do not involve specific response codes.

Would you be interested in setting an error state on your root span in case of a 5xx status code from your upstream? If so, such a change should probably be configurable: for example, some might expect 4xx response codes from time to time, while others would consider them error conditions.

We are always interested in improving/updating our tracing instrumentation, but I cannot guarantee this change in particular will be made.
To achieve this behavior today, I would recommend writing your own code (in a custom or serverless plugin) that applies this simple logic to your trace. You can use the tracing and the response PDK modules for this; see the sketch below.
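A rough sketch of what that custom logic could look like in a plugin's header_filter phase. This is an assumption-heavy illustration: kong.response.get_status() is part of the response PDK, but the exact tracing PDK span accessor and status-setting methods used below (active_span, set_status, set_attribute) should be verified against your Kong version:

-- kong/plugins/error-status/handler.lua (hypothetical plugin; illustrative only)
local ErrorStatusHandler = {
  PRIORITY = 10,       -- run late in the phase so the final status code is known
  VERSION = "0.1.0",
}

function ErrorStatusHandler:header_filter(conf)
  local status = kong.response.get_status()      -- response PDK
  if status and status >= 500 then
    -- Assumed tracing PDK accessor and span methods; check your Kong version.
    local span = kong.tracing and kong.tracing.active_span()
    if span then
      span:set_status(2)                          -- assumed: 2 maps to an OTel-style ERROR status
      span:set_attribute("error", true)           -- extra attribute some backends key off
    end
  end
end

return ErrorStatusHandler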

@julianocosta89

@pawandhiman10 if you have access to the collector configuration, you can use the transformprocessor to do that:

processors:
  transform:
    error_mode: ignore
    trace_statements:
      - context: span
        statements:
          - set(status.code, 2) where attributes["http.status_code"] >= 500

Here are the available enums, and the explanation of why I suggested set(status.code, 2):
https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/pkg/ottl/contexts/ottlspan#enums

@pawandhiman10
Author

pawandhiman10 commented Oct 29, 2024

@julianocosta89 This is managed by the Datadog agent currently, without any collector config.

@samugi Could you also help me with setting up OTEL_TRACES_SAMPLER=always_on with this plugin?
Setting it as an environment variable did not work as expected (OpenTelemetry reference).

As for the error (5xx) reporting in Datadog, I will experiment with the code. Thanks for that.

Oyami-Srk removed the "pending author feedback" label on Nov 4, 2024
@samugi
Member

samugi commented Nov 11, 2024

@pawandhiman10 at the moment the sampling strategy is not configurable for this plugin. What behavior in particular are you attempting to configure?

@pawandhiman10
Author

@samugi I believe this is required, per the OTel page, in order to set the following environment variables in Datadog (ref: Link).

We are enabling OpenTelemetry to send traces to Datadog with probabilistic sampling, and we need to control the sampling rate of these traces. For that, the following environment variables need to be set (link):
DD_APM_PROBABILISTIC_SAMPLER_ENABLED
DD_APM_PROBABILISTIC_SAMPLER_SAMPLING_PERCENTAGE
DD_APM_PROBABILISTIC_SAMPLER_HASH_SEED

I'm not quite sure how this works, but I'm trying to reduce the number of ingested spans while using this OpenTelemetry plugin.
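For illustration, these variables would be set on the Datadog Agent itself, for example via a Helm values override; the values and the agents.containers.agent.env placement below are assumptions about a typical chart layout:

agents:
  containers:
    agent:
      env:
        - name: DD_APM_PROBABILISTIC_SAMPLER_ENABLED
          value: "true"
        - name: DD_APM_PROBABILISTIC_SAMPLER_SAMPLING_PERCENTAGE
          value: "20"   # illustrative percentage
        - name: DD_APM_PROBABILISTIC_SAMPLER_HASH_SEED
          value: "22"   # illustrative seed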

@julianocosta89

Just to give more context here: if you want to use Datadog with probabilistic sampling, you need to ensure all traces are sent to the Datadog agent, and the agent will make the sampling decision based on your configuration.

The agent can only act on the traces that it receives.

If different services have different sampling decisions, that will probably break some traces.

@samugi
Member

samugi commented Nov 28, 2024

@pawandhiman10 @julianocosta89 does the following describe your scenario?

If you use a mixed setup of Datadog tracing libraries and OTel SDKs:
Probabilistic sampling will apply to spans originating from both Datadog and OTel tracing libraries.
If you send spans both to the Datadog Agent and OTel collector instances, set the same seed between Datadog Agent (DD_APM_PROBABILISTIC_SAMPLER_HASH_SEED) and OTel collector (hash_seed) to ensure consistent sampling.

Kong's OpenTelemetry plugin only supports head sampling today, so your options for sampling, depending on your architecture choices, would be:

  1. Rely solely on Kong's head sampling via tracing_sampling_rate
  2. Configure tracing_sampling_rate to 1 (sample all) and do tail sampling on the Datadog Agent
  3. Configure tracing_sampling_rate to 1 (sample all), use an OpenTelemetry collector as a sink to receive Kong's tracing data, then configure probabilistic sampling in the OTel collector and the Datadog Agent, as suggested in the documentation you have shared (a configuration sketch follows below)
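A rough configuration sketch for option 3, under the assumption that Kong is deployed with the official Helm chart (where kong.conf properties are set through the env map) and that an OpenTelemetry Collector sits between Kong and the Datadog Agent; all concrete values are illustrative:

# Kong Helm values: sample everything at the source
env:
  tracing_instrumentations: all
  tracing_sampling_rate: 1

# OpenTelemetry Collector: probabilistic sampling before exporting to Datadog
processors:
  probabilistic_sampler:
    sampling_percentage: 20   # illustrative
    hash_seed: 22             # keep in sync with DD_APM_PROBABILISTIC_SAMPLER_HASH_SEED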

@julianocosta89

So setting tracing_sampling_rate to 1 is the same as OTEL_TRACES_SAMPLER=always_on

@samugi
Member

samugi commented Nov 29, 2024

That's right @julianocosta89: 1 stands for 100%, i.e. all requests will be sampled, traced, and reported.

@julianocosta89

So setting tracing_sampling_rate to 1 is the same as OTEL_TRACES_SAMPLER=always_on

@pawandhiman10 that's what you are looking for
