Cold starts double to quadruple when layer is included #228

Open
ianwremmel opened this issue Mar 1, 2022 · 40 comments
Comments

@ianwremmel

Is your feature request related to a problem? Please describe.
I'm just getting started with OpenTelemetry, and just the act of adding the ADOT NodeJS layer to my functions takes my simplest ping function's cold start from 1-2 seconds to 4-5 seconds, and my Remix app's cold start goes from 8 seconds to more than 30 (I've got the function set to time out at 30).

Is there any configuration that I can do to improve that or is that just the way it has to be?

Describe the solution you'd like
Minimal impact on cold start times.

Describe alternatives you've considered

  • use standard xray libs
  • use otel with non-AWS providers and send directly from application code
@NathanielRN
Contributor

Hi @ianwremmel, thanks for raising this issue!

We've actually gotten issues like this before. A user on the OpenTelemetry CNCF Slack did a useful deep dive investigating the cold start impact of Lambda with the ADOT NodeJS Lambda Layer.

Unfortunately, it seems the best solution is to bump up your Lambda's memory allocation, which does improve cold start time. If your Lambda is invoked frequently, this should prove worthwhile.

Let me know if that helps!

@ianwremmel
Author

Gotcha, thanks!

Could you share that post somehow? I don't seem to be able to access that Slack.

@ianwremmel
Author

After a bit of experimenting (and seeing a serendipitous tweet this morning), I've determined that my cold start times without the layer were bad due to running with the --source-map-support flag. Once that was removed, my ping function's cold start time was reduced to about 700ms with a MemorySize of 128MB, which seems pretty tolerable to me.

With the unexpected overhead of --source-map-support removed and the layer added, I still see ping taking 4 to 5 seconds and my remix app (MemorySize of 1024MB) still times out. If I up MemorySize to 8192 for all functions, I end up with ping at 3.6 seconds and the remix app at 11 seconds.

So, unless you've got other suggestions, it sounds like the AWS otel lambda support isn't viable for my project.

@NathanielRN
Contributor

Could you share that post somehow? I don't seem to be able to access that Slack.

[Screenshots of the Slack thread on the layer's cold start impact omitted.]

Here is the Discussion on OTel JS that the user linked.

@ianwremmel
Author

Thanks! Yeah, it looks like I'm in kind of the same boat. I didn't see the kind of speedup he did by increasing memory, but even if I had, that would be way too expensive. I don't think there's any way I can use AWS otel, given these numbers, but maybe there's still a chance to use otel instrumentation and ship directly to third-party receivers (e.g., honeycomb).

@NathanielRN
Contributor

One improvement you might try to reduce the cold start time is to use esbuild instead of a plain CommonJS build for your Node.js app on Lambda. esbuild is nicer because it performs operations like tree-shaking that seriously reduce the size of imported libraries, which reduces memory usage and improves performance.

There are discussions upstream on OpenTelemetry JS Contrib about improving support for esbuild, because it requires more careful configuration. You might find the comment thread at open-telemetry/opentelemetry-js-contrib#647 (comment) helpful towards this goal; it should reduce your Node app's memory impact in a way that makes the ADOT Lambda layer usable for your purpose 🙂

The solution utilizes esbuild's "external" option and a custom plugin. "external" is an array of packages you tell esbuild not to bundle, and the plugin picks up those external packages and installs them into a raw node_modules directory at build time. OpenTelemetry needs to monkey-patch methods, so you can't bundle the libraries it needs to instrument (e.g. http, express, graphql, etc.), but you can bundle everything else.
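As a rough illustration of the "external" part only (not the custom plugin from the linked thread), a minimal build script might look like the sketch below. The entry point, output path, and exact list of externals are placeholders, and the excluded packages still have to be shipped in node_modules alongside the bundle.

// build.ts - a sketch, not the exact setup from the linked thread
import { build } from 'esbuild';

build({
  entryPoints: ['src/handler.ts'], // placeholder entry point
  bundle: true,
  platform: 'node',
  target: 'node16',
  outfile: 'dist/handler.js', // placeholder output path
  // Keep anything OpenTelemetry must monkey-patch out of the bundle so the
  // layer can still hook require() for these modules at runtime.
  external: ['@opentelemetry/*', 'express', 'graphql'],
}).catch(() => process.exit(1));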

@ianwremmel
Author

I'm already using rollup to do pretty aggressive treeshaking and I've tried putting all the otel packages on my dependencies layer. Unfortunately, as soon as I include the AWS Open Telemetry layer, cold boot times skyrocket.

I've made a few attempts at switching my lambdas over to mjs, but things go wrong with dependencies every time.

Note that the ping lambda I've been talking about above is about 40 KB after being processed by rollup.

@ianwremmel
Author

Came back to poke at this a little bit more, and it seems like most of the overhead comes from adding http: to protocols in opentelemetry.yml. If I stick to only grpc:, then cold starts are only about 10% worse rather than 400% worse.

@ianwremmel
Author

Another interesting detail. This line from the docs site, when used with a remix app, seems to mostly prevent the function from returning. From what I can tell, initialization takes more than 30 seconds. (For reasons not entirely clear to me, I couldn't get the lambda timeout to be more than 30 seconds, but that might be a sleep-deprivation issue on my part).

const tracerProvider = new NodeTracerProvider(tracerConfig);

@ianwremmel
Author

Actually, that's the wrong line. I think it's this line:

import {NodeTracerProvider} from '@opentelemetry/node';

Apparently there's some side effect somewhere in @opentelemetry/node that completely tanks lambda initialization when Remix is involved.

@ianwremmel
Author

For reasons not entirely clear to me, I couldn't get the lambda timeout to be more than 30 seconds, but that might be a sleep-deprivation issue on my part

It's the API Gateway timeout that has a hard limit of 30 seconds. Looking at the corresponding X-Ray trace, it looks like the lambda takes 1.2 minutes to initialize when both remix and @opentelemetry/node are present.

Admittedly, remix is only the most obvious package in this lambda. There are any number of frontend packages that I could imagine might not play well with otel; the bundle file is 3MB.

@ianwremmel
Author

After a bunch of refactoring to get AWSXRayIdGenerator and AWSXRayPropagator working, both of the problems I mentioned above seem to have gone away: the http receiver and @opentelemetry/node + remix are no longer making initialization times skyrocket. I don't really know what meaningfully changed.

@willarmiros
Contributor

After a bunch of refactoring to get AWSXRayIdGenerator and AWSXRayPropagator working

@ianwremmel why did you need to refactor to get these components to work? They should work out of the box with the Lambda layer instrumentation

@ianwremmel
Author

ianwremmel commented Apr 12, 2022

@willarmiros I was having a whole lot of trouble getting the layer to actually collect my spans and send them anywhere.

First, the layer doesn't configure a few things that one would otherwise expect it to configure:

  • it only configures otel to talk to the collector via http, not grpc
  • it does not configure a ContextProvider

A side effect is that following, say, the aws otel javascript tracing guide has no effect, because a tracer has already been configured by the layer. Since otel doesn't provide a way to get access to the original tracerProvider, there's no way to add grpc after the fact. (It took me a while, but once I actually found the aws and otel repos for the layer, I was able to configure things using the hooks provided by the upstream otel layer to address the issues I mentioned above.)

Second, again because there's no way to get the tracerProvider after initialization, I had to piggyback on one of the config hooks to attach the tracerProvider to global so that I could call global.tracerProvider.forceFlush() at the end of each of my lambda functions.

Finally, there have been a lot of API changes since the majority of aws-otel.github.io was written, so most of it is slightly incorrect.

In any case, after working through those issues, my original coldstart issue and the super high duration on my Remix lambdas went away.

This is what my final set of config overrides looks like:

// Imports added for completeness; package names assume the standard
// OpenTelemetry JS distribution of this era.
import { diag } from '@opentelemetry/api';
import { AsyncHooksContextManager } from '@opentelemetry/context-async-hooks';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-grpc';
import { AWSXRayIdGenerator } from '@opentelemetry/id-generator-aws-xray';
import { AwsLambdaInstrumentationConfig } from '@opentelemetry/instrumentation-aws-lambda';
import { Resource } from '@opentelemetry/resources';
import { BatchSpanProcessor, SDKRegistrationConfig } from '@opentelemetry/sdk-trace-base';
import { NodeTracerConfig, NodeTracerProvider } from '@opentelemetry/sdk-trace-node';
import { SemanticResourceAttributes } from '@opentelemetry/semantic-conventions';

// Copied from
// https://github.com/open-telemetry/opentelemetry-lambda/blob/0a83149fe2f23a7dab64df6108cfa35f18cc2ae5/nodejs/packages/layer/src/wrapper.ts#L42-L50
declare global {
  // in case of downstream configuring span processors etc
  function configureTracerProvider(tracerProvider: NodeTracerProvider): void;

  function configureTracer(defaultConfig: NodeTracerConfig): NodeTracerConfig;

  function configureSdkRegistration(
    defaultSdkRegistration: SDKRegistrationConfig
  ): SDKRegistrationConfig;

  function configureLambdaInstrumentation(
    config: AwsLambdaInstrumentationConfig
  ): AwsLambdaInstrumentationConfig;
}

// And this needs to be declared so that we can flush at the end of each
// invocation.
declare global {
  // `var` seems to be the only TypeScript keyword that works here.
  // eslint-disable-next-line no-var
  var tracerProvider: NodeTracerProvider;
}

// Remove some noise from the service name ('my-stack-name' stands in for the
// actual stack prefix).
const serviceName = process.env.AWS_LAMBDA_FUNCTION_NAME?.replace(
  'my-stack-name',
  ''
)?.split('-')?.[0];

// This produces the config passed to new NodeTracerProvider()
global.configureTracer = ({resource, ...config}) => {
  diag.debug('Telemetry: generating tracer config');

  if (serviceName) {
    const r = new Resource({
      [SemanticResourceAttributes.DEPLOYMENT_ENVIRONMENT]:
        process.env.STAGE_NAME ?? 'development',
      [SemanticResourceAttributes.FAAS_INSTANCE]:
        process.env.AWS_LAMBDA_LOG_STREAM_NAME ?? '',
      [SemanticResourceAttributes.FAAS_MAX_MEMORY]:
        process.env.AWS_LAMBDA_FUNCTION_MEMORY_SIZE ?? '',
      [SemanticResourceAttributes.SERVICE_NAME]: serviceName,
    });
    resource = resource ? resource.merge(r) : r;
  }

  return {
    ...config,
    idGenerator: new AWSXRayIdGenerator(),
    resource,
  };
};

// This gets applied to the tracerProvider
global.configureTracerProvider = (tracerProvider) => {
  diag.debug('Telemetry: configuring tracer provider');
  const exporter = new OTLPTraceExporter({
    url: 'grpc://localhost:4317',
  });

  tracerProvider.addSpanProcessor(new BatchSpanProcessor(exporter));

  global.tracerProvider = tracerProvider;
};

// This produces the config passed to tracerProvider.register();
global.configureSdkRegistration = (config) => {
  diag.debug('Telemetry: generating SDK registration config');
  const contextManager = new AsyncHooksContextManager();
  contextManager.enable();
  return {
    ...config,
    contextManager
  };
};
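For completeness, here's a sketch (mine, not part of the original comment) of how a handler can then use the tracerProvider stashed on global above to flush before the execution environment is frozen; the handler name and return shape are illustrative.

// Illustrative handler; the forceFlush() call in the finally block is the point.
export const handler = async (event: unknown) => {
  try {
    // ... actual application work ...
    return { statusCode: 200, body: 'pong' };
  } finally {
    // Flush spans queued by the BatchSpanProcessor configured above.
    if (global.tracerProvider) {
      await global.tracerProvider.forceFlush();
    }
  }
};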

@ianwremmel
Author

So, it wasn't really AWSXRayIdGenerator and AWSXRayPropagator that weren't working; it was just that the layer doesn't quite work out of the box. When I wrote the original comment, I thought the issue was with the AWS pieces, but it was mostly the fact that the layer doesn't flush on its own.

@kmihaltsov

Having a similar performance issue. Node.js Lambda with 1536 MB: cold start goes from 500ms with the X-Ray SDK to 1700ms with this layer and the default configuration from the AWS docs.

What drives me crazy is that the AWS documentation suggests using this OTel layer over X-Ray while it still has such issues.

@a-h

a-h commented Jul 24, 2022

Similar situation with Go.

I was testing out the advice to switch to OpenTelemetry, spent a few hours troubleshooting, etc., only to find that the Lambda layer adds 200ms(!) to cold start times compared to using the X-Ray SDK.

Disappointed to see the advice to look at OpenTelemetry without any warning about the performance penalty.

The only reason I tried was that I thought I could cut out some of the cold start time (the Go AWS X-Ray SDK adds about 6MB of dependency bloat for me).

With 1GB of RAM, I'm seeing init duration go from around 100ms to 300ms.

I don't see how this is workable. 🤷🏻

I have a test Lambda function and used CloudWatch Logs Insights to compare before and after.

fields @timestamp, @message
| filter @message like /Init Duration/
| parse @message /Init Duration. (?<@init>\d+\.\d+) ms/
| sort @timestamp desc
| limit 20

[Screenshot of the Logs Insights query results omitted.]

(The first 50ms coming off the cold start time in the data is due to other optimisations).

@RichiCoder1

@a-h It might be worth cross-posting that to https://github.com/open-telemetry/opentelemetry-lambda/ (which is the underlying repo), as that means the collector itself is taking that long to boot.

@github-actions

This issue is stale because it has been open 90 days with no activity. If you want to keep this issue open, please just leave a comment below and auto-close will be canceled

@github-actions github-actions bot added the stale label Oct 23, 2022
@RichiCoder1

Still a very valid issue.

@ianwremmel
Author

In fact, I think it’s gotten worse with the latest releases.

@adambiggs

Any updates? Is this being looked into?

@mattfysh

Achievement unlocked: it's not every day you stumble upon a GitHub thread with screenshots of your Slack conversation!

I've been looking into this problem again, and it does seem as though ADOT performance on cold starts is still rather poor, but my guess is that it's being caused by the underlying repo, https://github.com/open-telemetry/opentelemetry-lambda.

If anyone is interested, I'm putting together some automated benchmarks to find the optimal memory setting and implementation when tracing in Lambda: https://github.com/mattfysh/lambda-trace-perf

I do find it ironic that we're seeing these bottlenecks. Tracing instrumentation is typically used to measure and monitor application performance, so it gives me pause to see vendors not fine-tuning the shared libraries and runtimes that developers depend on.

@ianwremmel
Author

I do find it ironic that we're seeing these bottlenecks.

Agreed. Honestly, I think the real solution here is "stop running the collector in a Lambda layer". I haven't looked into it yet, but I've been noodling on how to run the collector in ECS. I did some manual benchmarking a few weeks ago (I don't have the numbers anymore), and just adding the ADOT layer, without even setting the env vars to enable instrumentation, adds nearly 800ms.

Somewhere I heard that AWS was planning to make X-Ray more OTel-native, so hopefully someday it'll be possible to just let the X-Ray infrastructure do the collection.

@RichiCoder1

RichiCoder1 commented Mar 25, 2023

I think it's still worth having an "agent" in a layer to take advantage of things like the Telemetry API and to handle batching. That said, the agent doesn't necessarily need to be the OTel collector. Maybe something more purpose-built would make more sense, like a very thin Rust agent that collects logs and traces, and batches and forwards OTel data?

(I promise I'm not mentioning Rust because it's the cool kid, but because of real-world perf in Lambda: https://maxday.github.io/lambda-perf/.)

Edit: Native, AWS/Lambda-level support for OTel would definitely be the best solution, so there's no (user-visible) overhead whatsoever.

@ianwremmel
Author

Yeah, there are certainly advantages, but at least for my purposes, the overhead of using any layer is too much of a performance impact to justify it. I've jumped through a lot of goofy hoops to remove multiple layers. From what I can tell, layers only have "good" performance once you hit a scale where cold starts are a small enough portion of your traffic that they average their way off of your dashboards.

@github-actions

This issue is stale because it has been open 90 days with no activity. If you want to keep this issue open, please just leave a comment below and auto-close will be canceled

@github-actions github-actions bot added the stale label Jun 25, 2023
@RichiCoder1

Bump. As far as I'm aware, this issue hasn't been resolved even with the introduction of the new lambda repo.

@mattfysh

mattfysh commented Jul 7, 2023

Performance of OTel in Lambda is receiving attention again. If anyone would like to benchmark the current solutions, I put together a rudimentary repo here: https://github.com/mattfysh/lambda-trace-perf

I've not looked at it in a while (I've since moved away from serverless because of various cold boot issues, including this one), but if anyone wants to work together on it, I'd be happy to jump back in.

@deki

deki commented Sep 6, 2023

See also: open-telemetry/opentelemetry-lambda#727

@adambiggs

Any updates?

@jnicholls

jnicholls commented Jan 23, 2024

I'm going to dig into this a little bit this week to try to understand where the startup time is rooted in ADOT's setup. I'll get a baseline with zero components registered and introduce them one by one.

If this isn't resolved by the team, I'm going to resort to running a central collector to which my lambdas export, and avoid a collector as a layer altogether. Until the Lambda runtime makes an OTel collector a first-class citizen like the X-Ray daemon is, I suspect an OTel collector is simply too heavyweight to be a layer.

Unless, of course, someone decides to write an OTel collector in Rust, which would be substantially better IMHO.
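For anyone weighing the external-collector route, here is a minimal sketch (an editor's illustration, not jnicholls's actual setup) of exporting directly from application code to a self-hosted collector with no layer involved; the endpoint hostname is a placeholder.

import { NodeTracerProvider } from '@opentelemetry/sdk-trace-node';
import { BatchSpanProcessor } from '@opentelemetry/sdk-trace-base';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';

// Export over OTLP/HTTP to a collector you run yourself (e.g. on ECS/EC2);
// the fallback URL below is a made-up internal hostname.
const provider = new NodeTracerProvider();
provider.addSpanProcessor(
  new BatchSpanProcessor(
    new OTLPTraceExporter({
      url:
        process.env.OTEL_EXPORTER_OTLP_TRACES_ENDPOINT ??
        'http://collector.internal:4318/v1/traces',
    })
  )
);
provider.register();

export const handler = async (event: unknown) => {
  // ... application work ...
  // Flush before the execution environment is frozen between invocations.
  await provider.forceFlush();
  return { statusCode: 200, body: 'ok' };
};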

@github-actions

This issue is stale because it has been open 90 days with no activity. If you want to keep this issue open, please just leave a comment below and auto-close will be canceled

@github-actions github-actions bot added the stale label Apr 28, 2024
@RichiCoder1

This is still a concern as far as I'm aware.

@github-actions github-actions bot removed the stale label May 5, 2024
@stevemao

This is a blocking issue for our use of OpenTelemetry in Lambda.

@felipenervo

Any updates on this issue? We had to remove the layer from our lambdas due to these issues.

@jnicholls

I too removed the layer. It was a nonstarter. An external collector makes more sense for Lambda at this time.

@ianwremmel
Author

@jnicholls would you be able to share anything about your external collector setup? I've been interested in doing something like that, but the pricing has always looked prohibitive.

@github-actions

This issue is stale because it has been open 90 days with no activity. If you want to keep this issue open, please just leave a comment below and auto-close will be canceled

@github-actions github-actions bot added the stale label Sep 22, 2024
@ianwremmel
Author

This is still a major issue and it's causing me to run new workloads on fly.io instead of lambda.

@github-actions github-actions bot removed the stale label Sep 29, 2024