
[Bug?]: (OpenTelemetry) As of Redwood 6, a single active span appears to be wrapping the entire API server lifetime #9043

Open
pvenable opened this issue Aug 16, 2023 · 8 comments
Labels
bug/needs-info More information is needed for reproduction

Comments

@pvenable
Contributor

pvenable commented Aug 16, 2023

What's not working?

We're using a customized OpenTelemetry plugin in our app that starts an active span for each GraphQL operation. It's similar to the bundled Redwood OpenTelemetry plugin, but for various reasons we maintain our own version. (Some relevant history/context here and here.) This has generally worked great for us throughout the 4.x and 5.x release cycles.

We recently deployed Redwood v6 and then noticed that our previously distinct per-operation traces are now all bundled under a single parent trace that appears to span the entire lifetime of our API server -- rather than a distinct trace per operation.

I suspect the source of this new parent trace is this code in the CLI package:

```js
// Execute CLI within a span, this will be the root span
const tracer = trace.getTracer('redwoodjs')
await tracer.startActiveSpan('cli', async (span) => {
```

but I can't say so with certainty.

If my analysis is correct, can I prevent this behavior somehow? Or is there another change in v6 that's likely to have caused this?

How do we reproduce the bug?

No response

What's your environment? (If it applies)

yarn rw serve api

Are you interested in working on this?

  • I'm interested in working on this
@pvenable pvenable added the bug/needs-info More information is needed for reproduction label Aug 16, 2023
@Josh-Walker-GM Josh-Walker-GM self-assigned this Aug 16, 2023
@Josh-Walker-GM
Collaborator

Thanks @pvenable, I'll try to reproduce this and see if I can fix it this week.

@Josh-Walker-GM
Collaborator

Hey @pvenable, could you help me a little? How are you starting your OpenTelemetry SDK when you use yarn rw serve api?

@pvenable
Contributor Author

@Josh-Walker-GM Here's some info I hope will be helpful. If this doesn't answer your question, please let me know and I'll see what I can do.

We have three "touchpoints" with opentelemetry in our Redwood app:

  • A file wherein we set up our provider, processor, and exporter as appropriate for the environment.
    • This just exports a createProvider function that returns our properly configured NodeTracerProvider instance based on a parameter we specify per environment.
    • I don't think anything in this configuration is relevant to this issue; I'm just mentioning this for completeness.
  • A customOpenTelemetryPlugin similar to Redwood's useRedwoodOpenTelemetry.
    • We've had to customize this beyond what the built-in plugin currently supports -- for example, to set some custom span attributes.
    • Our plugin calls tracer.startActiveSpan in onExecute -- similar to useRedwoodOpenTelemetry, though our implementation is a bit different.
    • This plugin is added to extraPlugins in our createGraphQLHandler function in api/src/functions/graphql.ts.
    • I don't think this is directly relevant, as we found that even when we disabled our plugin, the reported issue persisted.
  • Some code in our api side that adds a traceparent header to our backend HTTP requests. Our backend services will share the same Trace ID so we can trace across our system when we provide this header. The traceparent value is obtained like this:
```ts
import * as opentelemetry from '@opentelemetry/api';

const getTraceparent = () => {
  const output: { traceparent?: string } = {};
  opentelemetry.propagation.inject(opentelemetry.context.active(), output);
  return output.traceparent;
};
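To illustrate what that header carries (with a made-up value, not output from our system): a W3C `traceparent` value is just `version-traceId-spanId-flags`, and a downstream service joins the same trace by reusing the `traceId` portion.

```typescript
// W3C Trace Context "traceparent" layout: version-traceId-spanId-flags.
// A service receiving this header continues the trace by reusing the
// 32-hex-digit traceId. The value below is illustrative only.
const traceparent = '00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01';
const [version, traceId, parentSpanId, flags] = traceparent.split('-');
console.log(version, traceId.length, parentSpanId.length, flags); // 00 32 16 01
```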

I believe what's happening here is that the cli package's active span is spanning the entire server lifetime and superseding our per-request active span. I believe this is a case of "there can be only one", and I think the outermost active span wins.

I'm not familiar with why this active span was added to the CLI, but if the span itself is indeed necessary, my hunch is that were it not "active" it might not cause this issue. I don't have many other ideas yet other than requesting a way to opt out of this outer span.
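As a rough model of that "outermost active span wins" behavior -- this is a stdlib sketch of the context mechanism, not the real OTel API, though OTel's Node context manager is itself built on `AsyncLocalStorage`:

```typescript
// Sketch: startActiveSpan stores the new span in the ambient async context,
// so any span started later inherits it as a parent unless the context is
// explicitly reset. (Simplified stand-in types, not @opentelemetry/api.)
import { AsyncLocalStorage } from 'node:async_hooks';

interface Span { name: string; parent?: string }
const activeSpan = new AsyncLocalStorage<Span>();

function startActiveSpan<T>(name: string, fn: (span: Span) => T): T {
  const span: Span = { name, parent: activeSpan.getStore()?.name };
  return activeSpan.run(span, () => fn(span));
}

// The CLI wraps everything in one long-lived active span...
const parents: (string | undefined)[] = [];
startActiveSpan('cli', () => {
  // ...so every per-request span silently becomes its child.
  startActiveSpan('graphql-op-1', (s) => parents.push(s.parent));
  startActiveSpan('graphql-op-2', (s) => parents.push(s.parent));
});
console.log(parents); // [ 'cli', 'cli' ]
```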

Thanks, and please let me know if you have any other questions.

@Josh-Walker-GM
Collaborator

Thanks so much for the detailed response, very helpful!

I am still hoping to understand better how/when your SDK was created/invoked. I think this would be related to:

> A file wherein we set up our provider, processor, and exporter as appropriate for the environment.

I think the OTel docs tend to recommend something like --require sdkFile.js (or similar) so that the code that sets up OTel runs before your application code. For yarn rw dev we do this for you if you have OTel set up, but I haven't yet had the time to enable OTel support like this in production - there are technical blockers we have to address first. This is why I'm not too sure how you have the OTel SDK set up when you run yarn rw serve api.
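For context, that pattern looks roughly like this (the file name otel-setup.js is hypothetical; the point is that the bootstrap module registers the provider before any application module is evaluated):

```shell
# Hypothetical bootstrap: otel-setup.js creates the provider, attaches the
# span processor/exporter, and calls provider.register() at module load.
# --require makes Node evaluate it before the server entry point.
node --require ./otel-setup.js ./api/dist/server.js
```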

@pvenable
Contributor Author

Hey, @Josh-Walker-GM -- thanks for sticking with this.

I am still hoping to understand better how/when your SDK was created/invoked. I think this would be related to:

> A file wherein we set up our provider, processor, and exporter as appropriate for the environment.

Alright, I'll share a somewhat redacted version of this.

Imports:

```ts
import { credentials, Metadata } from '@grpc/grpc-js';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-grpc';
import { Resource } from '@opentelemetry/resources';
import {
  BatchSpanProcessor,
  ConsoleSpanExporter,
  NodeTracerProvider,
} from '@opentelemetry/sdk-trace-node';
```

(In case it's relevant, note that we don't generally use ConsoleSpanExporter, but it's a mode we support in case we want to use it in development.)

A createProvider export that actually creates the provider based on the env-configured mode, though I've omitted that from this example:

```ts
export const createProvider = () => {
  const provider = new NodeTracerProvider({
    resource: Resource.default().merge(
      new Resource({
        // Omitted: `service.name` and `env` properties
      })
    ),
  });

  const exporter = createExporter();
  provider.addSpanProcessor(new BatchSpanProcessor(exporter));
  provider.register();

  return provider;
};
```

An internal createExporter function, as referenced above. In reality this is parameterized to decide which Exporter to configure, but this is roughly what we do in production-like environments:

```ts
const createExporter = () => {
  const metadata = new Metadata();

  // Omitted: `metadata.set(...)` a couple values here

  return new OTLPTraceExporter({
    credentials: credentials.createSsl(),
    metadata,
    // Omitted: `url: 'grpc://...'`
  });
};
```

This is referenced only from api/src/functions/graphql.ts, wherein we do basically this at the top level:

```ts
const extraPlugins: Plugin[] = [];

if (/* Omitted: conditions under which we would initialize our plugin */) {
  const provider = createProvider();
  const plugin = customOpenTelemetryPlugin(provider);

  extraPlugins.push(plugin);
}
```

And extraPlugins slots right into the createGraphQLHandler call as previously mentioned.

I can't think of anything else we're doing that would constitute starting or invoking the SDK. We've never passed anything like --require sdkFile.js in our startup command, but everything worked fine before v6. Is there cause to think that would be necessary now when it wasn't before?

I hope something I've shared here is useful. Let me know if you think there's any other info that might help. Thanks!

@pvenable
Contributor Author

pvenable commented Sep 6, 2023

I had a chance to look into this again this morning. One of the first things I tried was passing --telemetry false to yarn rw serve api, and this completely resolved the issue for us. We now have a trace per GraphQL operation as we did before the Redwood v6 upgrade.

I'm not sure whether this issue should be closed, because from my perspective there's still unexpected behavior here (even if I can opt out), but if anyone else runs into this, yarn rw serve api --telemetry false seems to be an effective workaround.

@Josh-Walker-GM
Collaborator

Sorry for being slow with this one!

I'd keep it open, because this is not the behaviour I want to support. Your debugging seems to have pretty much nailed the cause, so I'll try to find some time to sort out a fix.

@Josh-Walker-GM
Copy link
Collaborator

I took a look at this tonight and I could kinda reproduce this. I'll need to have a think about the best way to fix it. Thanks for dealing with this in the meantime with the --telemetry=false quick fix.
