
[Bug?]: (OpenTelemetry) As of Redwood 6, a single active span appears to be wrapping the entire API server lifetime #9043

Open
pvenable opened this issue Aug 16, 2023 · 8 comments
Labels
bug/needs-info More information is needed for reproduction

Comments

@pvenable
Contributor

pvenable commented Aug 16, 2023

What's not working?

We're using a customized OpenTelemetry plugin in our app that starts an active span for each GraphQL operation. It's similar to the bundled Redwood OpenTelemetry plugin, but for various reasons we maintain our own version. (Some relevant history/context here and here.) This has generally worked great for us throughout the 4.x and 5.x release cycles.

We recently deployed Redwood v6 and then noticed that our previously distinct per-operation traces are now all bundled under a single parent trace that appears to span the entire lifetime of our API server -- rather than a distinct trace per operation.

I suspect the source of this new parent trace is this code in the CLI package:

```js
// Execute CLI within a span, this will be the root span
const tracer = trace.getTracer('redwoodjs')
await tracer.startActiveSpan('cli', async (span) => {
```

but I can't say so with certainty.

If my analysis is correct, can I prevent this behavior somehow? Or is there another change in v6 that's likely to have caused this?

How do we reproduce the bug?

No response

What's your environment? (If it applies)

yarn rw serve api

Are you interested in working on this?

  • I'm interested in working on this
@pvenable pvenable added the bug/needs-info More information is needed for reproduction label Aug 16, 2023
@Josh-Walker-GM Josh-Walker-GM self-assigned this Aug 16, 2023
@Josh-Walker-GM
Collaborator

Thanks @pvenable, I'll try to reproduce this and see if I can fix it this week.

@Josh-Walker-GM
Collaborator

Hey @pvenable, could you help me a little? How are you starting your OpenTelemetry SDK when you use yarn rw serve api?

@pvenable
Contributor Author

@Josh-Walker-GM Here's some info I hope will be helpful. If this doesn't answer your question, please let me know and I'll see what I can do.

We have three "touchpoints" with opentelemetry in our Redwood app:

  • A file wherein we set up our provider, processor, and exporter as appropriate for the environment.
    • This just exports a createProvider function that returns our properly configured NodeTracerProvider instance based on a parameter we specify per environment.
    • I don't think anything in this configuration is relevant to this issue; I'm just mentioning this for completeness.
  • A customOpenTelemetryPlugin similar to Redwood's useRedwoodOpenTelemetry.
    • We've had to customize this beyond what the built-in plugin currently supports -- for example, to set some custom span attributes.
    • Our plugin calls tracer.startActiveSpan in onExecute -- similar to useRedwoodOpenTelemetry, though our implementation is a bit different.
    • This plugin is added to extraPlugins in our createGraphQLHandler function in api/src/functions/graphql.ts.
    • I don't think this is directly relevant, as we found that even when we disabled our plugin, the reported issue persisted.
  • Some code in our api side that adds a traceparent header to our backend HTTP requests. Our backend services will share the same Trace ID so we can trace across our system when we provide this header. The traceparent value is obtained like this:
```ts
import * as opentelemetry from '@opentelemetry/api';

const getTraceparent = () => {
  const output: { traceparent?: string } = {};
  opentelemetry.propagation.inject(opentelemetry.context.active(), output);
  return output.traceparent;
};
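To illustrate what that header carries (with a made-up value, not output from our system): a W3C `traceparent` value is just `version-traceId-spanId-flags`, and a downstream service joins the same trace by reusing the `traceId` portion.

```typescript
// W3C Trace Context "traceparent" layout: version-traceId-spanId-flags.
// A service receiving this header continues the trace by reusing the
// 32-hex-digit traceId. The value below is illustrative only.
const traceparent = '00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01';
const [version, traceId, parentSpanId, flags] = traceparent.split('-');
console.log(version, traceId.length, parentSpanId.length, flags); // 00 32 16 01
```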

I believe what's happening here is that the cli package's active span is spanning the entire server lifetime and superseding our per-request active span. I believe this is a case of "there can be only one", and I think the outermost active span wins.

I'm not familiar with why this active span was added to the CLI, but if the span itself is indeed necessary, my hunch is that were it not "active" it might not cause this issue. I don't have many other ideas yet other than requesting a way to opt out of this outer span.
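As a rough model of that "outermost active span wins" behavior -- this is a stdlib sketch of the context mechanism, not the real OTel API, though OTel's Node context manager is itself built on `AsyncLocalStorage`:

```typescript
// Sketch: startActiveSpan stores the new span in the ambient async context,
// so any span started later inherits it as a parent unless the context is
// explicitly reset. (Simplified stand-in types, not @opentelemetry/api.)
import { AsyncLocalStorage } from 'node:async_hooks';

interface Span { name: string; parent?: string }
const activeSpan = new AsyncLocalStorage<Span>();

function startActiveSpan<T>(name: string, fn: (span: Span) => T): T {
  const span: Span = { name, parent: activeSpan.getStore()?.name };
  return activeSpan.run(span, () => fn(span));
}

// The CLI wraps everything in one long-lived active span...
const parents: (string | undefined)[] = [];
startActiveSpan('cli', () => {
  // ...so every per-request span silently becomes its child.
  startActiveSpan('graphql-op-1', (s) => parents.push(s.parent));
  startActiveSpan('graphql-op-2', (s) => parents.push(s.parent));
});
console.log(parents); // [ 'cli', 'cli' ]
```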

Thanks, and please let me know if you have any other questions.

@Josh-Walker-GM
Collaborator

Thanks so much for the detailed response, very helpful!

I am still hoping to understand better how/when your SDK was created/invoked. I think this would be related to:

> A file wherein we set up our provider, processor, and exporter as appropriate for the environment.

I think the OTel docs tend to recommend something like --require sdkFile.js (or similar) so that the code that sets up OTel runs before your application code. For yarn rw dev we do this for you if you have OTel set up, but I haven't yet had the time to enable OTel support like this in production - there are technical blockers we have to address first. This is why I'm not too sure how you have the OTel SDK set up when you run yarn rw serve api.
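For context, that pattern looks roughly like this (the file name otel-setup.js is hypothetical; the point is that the bootstrap module registers the provider before any application module is evaluated):

```shell
# Hypothetical bootstrap: otel-setup.js creates the provider, attaches the
# span processor/exporter, and calls provider.register() at module load.
# --require makes Node evaluate it before the server entry point.
node --require ./otel-setup.js ./api/dist/server.js
```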

@pvenable
Contributor Author

Hey, @Josh-Walker-GM -- thanks for sticking with this.

I am still hoping to understand better how/when your SDK was created/invoked. I think this would be related to:

> A file wherein we set up our provider, processor, and exporter as appropriate for the environment.

Alright, I'll share a somewhat redacted version of this.

Imports:

```ts
import { credentials, Metadata } from '@grpc/grpc-js';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-grpc';
import { Resource } from '@opentelemetry/resources';
import {
  BatchSpanProcessor,
  ConsoleSpanExporter,
  NodeTracerProvider,
} from '@opentelemetry/sdk-trace-node';
```

(In case it's relevant, note that we don't generally use ConsoleSpanExporter, but it's a mode we support in case we want to use it in development.)

A createProvider export that actually creates the provider based on the env-configured mode, though I've omitted that from this example:

```ts
export const createProvider = () => {
  const provider = new NodeTracerProvider({
    resource: Resource.default().merge(
      new Resource({
        // Omitted: `service.name` and `env` properties
      })
    ),
  });

  const exporter = createExporter();
  provider.addSpanProcessor(new BatchSpanProcessor(exporter));
  provider.register();

  return provider;
};
```

An internal createExporter function, as referenced above. In reality this is parameterized to decide which Exporter to configure, but this is roughly what we do in production-like environments:

```ts
const createExporter = () => {
  const metadata = new Metadata();

  // Omitted: `metadata.set(...)` a couple values here

  return new OTLPTraceExporter({
    credentials: credentials.createSsl(),
    metadata,
    // Omitted: `url: 'grpc://...'`
  });
};
```

This is referenced only from api/src/functions/graphql.ts, wherein we do basically this at the top level:

```ts
const extraPlugins: Plugin[] = [];

if (/* Omitted: conditions under which we would initialize our plugin */) {
  const provider = createProvider();
  const plugin = customOpenTelemetryPlugin(provider);

  extraPlugins.push(plugin);
}
```

And extraPlugins slots right into the createGraphQLHandler call as previously mentioned.

I can't think of anything else we're doing that would constitute starting or invoking the SDK. We've never passed anything like --require sdkFile.js in our startup command, but everything worked fine before v6. Is there cause to think that would be necessary now when it wasn't before?

I hope something I've shared here is useful. Let me know if you think there's any other info that might help. Thanks!

@pvenable
Contributor Author

pvenable commented Sep 6, 2023

I had a chance to look into this again this morning. One of the first things I tried was passing --telemetry false to yarn rw serve api, and this completely resolved the issue for us. We now have a trace per GraphQL operation as we did before the Redwood v6 upgrade.

I'm not sure whether this issue should be closed, because from my perspective there's still unexpected behavior here (even if I can opt out), but if anyone else runs into this, yarn rw serve api --telemetry false seems to be an effective workaround.

@Josh-Walker-GM
Collaborator

Sorry for being slow with this one!

I'd keep it open, because this is not the behaviour I want to support. Your debugging seems to have pretty much nailed the cause, so I'll try to find some time to sort out a fix.

@Josh-Walker-GM
Copy link
Collaborator

I took a look at this tonight and I could kinda reproduce this. I'll need to have a think about the best way to fix it. Thanks for dealing with this in the meantime with the --telemetry=false quick fix.
