Slow start-up? #263

Open
a-h opened this issue Jul 24, 2022 · 27 comments

Comments

@a-h
Contributor

a-h commented Jul 24, 2022

I tried migrating away from AWS's X-Ray SDK for Lambda, but the OpenTelemetry Lambda layer appears to add a significant amount to cold start time, which I didn't expect.

It was suggested I cross post, since this is actually the repo with the layers in.

aws-observability/aws-otel-lambda#228 (comment)

Here's the data for reference:

Screenshot 2022-07-24 at 01 19 18

I don't see any documentation on performance, comparison to X-Ray performance etc.

Is there a plan to reduce this? I didn't expect to have to add a Lambda layer to get OpenTelemetry working, I thought it would be included in the Lambda runtime as a first class thing, rather than being a sort-of add-on.

@mhausenblas
Member

Hi, ADOT PM here. Thanks a lot, we're already in the process of diving deep on this and I will report back once we have some shareable data (ETA: early August 2022).

@adambartholomew

Is there an expected startup time for the ADOT collector and instrumentation? Running the latest nodejs layer (1.6.0:1) I am still witnessing 2+ second startups. Tested with both 1024MB and 1536MB memory.

Same test requests go from 3 seconds to 100ms after warming up:
image

Sample initialization:
image

@mhausenblas
Member

@adambartholomew we've identified the issue with cold starts and are considering ways to address this. Thanks for sharing your data points. We currently do not publish expected startup times.

@sam-goodwin

Any update on this? Is there an ETA for a fix? Can we expect a solution that is comparable to native EMF?

@adambiggs

Also eager to hear any updates. Are there any workarounds in the meantime?

@Sutty100

Sutty100 commented Dec 21, 2022

AWS SnapStart would seem to remove this as a problem. Benchmarking I've done shows cold starts are largely removed as a factor when using SnapStart.

@a-h
Contributor Author

a-h commented Dec 21, 2022

@Sutty100 - SnapStart is Java only, and specifically only java11, so it doesn't solve anything for most people using Lambda.

@RichiCoder1

RichiCoder1 commented Dec 21, 2022

AWS SnapStart would seem to remove this as a problem. Benchmarking I've done shows cold starts are largely removed as a factor when using SnapStart.

I believe even if/when it's expanded too, it doesn't currently support or address Lambda Extensions so it wouldn't benefit this issue right now either.

@bilalq

bilalq commented Jan 29, 2023

Related issue in aws-otel-lambda repo: aws-observability/aws-otel-lambda#228

@RichiCoder1

@mhausenblas not to be too noisy, but is there any update to this? Or a plan to provide an update? This makes using the Otel layer close to a no-go for a number of latency sensitive cases.

@mhausenblas
Member

@RichiCoder1 no problem at all, yes we're working on it and should be able to share details soon. Overall, our plan is to address the issues in Q1, what we need to verify is to what extent.

@tsloughter
Member

To give some feedback, I believe this is due to auto-instrumentation. So you may be able to improve your startup now by building your own, narrower, layer.

@a-h
Contributor Author

a-h commented Mar 3, 2023

@tsloughter - do you mean "the 200ms cold start time is caused by auto-instrumentation"?

I can't see how that could be the case. Since https://opentelemetry.io/docs/instrumentation/go/libraries/ says:

Go does not support truly automatic instrumentation like other languages today.

And the Lambda layer is written in Go.

@tsloughter
Member

@a-h ah, I didn't see any mention of the language in use. You are right, in Go there is no auto instrumentation.

@disfluxly

Hey @mhausenblas - any updates on the timeline for this by chance?

@sangalli

I was trying to use ADOT with lambda for nodejs+NestJS, but the auto-instrumentation performed by ADOT was adding seconds to the cold start time. @mhausenblas, please let us know if you have any updates on the timeline for this issue.

@ithompson-gp

ithompson-gp commented Jul 1, 2023

Hi, in our tests we are seeing slow invocation starts due to Collector extension registration (~800-2000 ms) and, on emit (POST) of telemetry from the function invocation to the Collector extension, a further ~200-450 ms.

The initialisation duration will, of course, drop on subsequent invocations, but the POST latency (the ~200 ms) will remain for all invocations.

Screenshot 2023-07-01 at 15 51 16

Is there any news/update on remedies for this, @mhausenblas?

(is there any suggestion from AWS on the best course of action here with Lambda; is emit via the OTEL SDK [no local Agent] to a central Collector seen as a better go-forward?)

Thanks

@rapphil
Contributor

rapphil commented Jul 6, 2023

Hi, how are you measuring the latency for the subsequent invocations after the initialization? Is POST the http verb or is it something else?

Since you have a test setup in place, what is the latency when you don't use a layer?
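
For anyone wanting to produce comparable numbers with and without the layer, here is a minimal sketch of one way to tag each invocation as cold or warm and time the handler itself. The names `withTiming`, `Timing`, and `report` are hypothetical and not part of any layer or SDK. It only covers time spent inside the handler; the initialisation phase, where extension registration would show up, is reported separately by Lambda as "Init Duration" in the REPORT log line.

```typescript
type Handler<E, R> = (event: E) => Promise<R>;

interface Timing {
  coldStart: boolean; // true only for the first invocation in this execution environment
  durationMs: number; // wall-clock time spent inside the wrapped handler
}

// Hypothetical wrapper: calls `report` once per invocation with a Timing record.
function withTiming<E, R>(
  handler: Handler<E, R>,
  report: (t: Timing) => void,
): Handler<E, R> {
  let coldStart = true; // closure state survives across warm invocations
  return async (event: E): Promise<R> => {
    const wasCold = coldStart;
    coldStart = false;
    const start = Date.now();
    try {
      return await handler(event);
    } finally {
      // Runs whether the handler resolved or threw.
      report({ coldStart: wasCold, durationMs: Date.now() - start });
    }
  };
}
```

Deploying the same wrapped handler once with the layer and once without, and comparing the reported records alongside Init Duration, should separate the one-off start-up cost from any per-invocation overhead.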

@a-h
Contributor Author

a-h commented Oct 31, 2023

Hi @rapphil, didn't see your message in July. On the screenshot, there's a red line. Above the line is when I added the Otel layer, and the cold start increased from around 100ms to 300ms.

@silpamittapalli

Hi there, we use Lambda serverless workloads in Financial Services with tight execution time SLAs, which makes the overhead caused by introducing AWS ADOT or a custom extension layer for the OTel SDK or OTel collector unacceptable. We are trying to cut down the cold start time by minimizing layers and using just the SDK without the collector, but it looks like we won't be able to cut the overhead down to an acceptable level.

If others have run into similar challenges, I'd be interested in learning how you were able to work around this and still collect distributed traces for such workloads. Thanks

@sam-goodwin

@silpamittapalli the BaseLime folks have already tried to strip this down as much as possible and package it as two dependencies:

It still depends on these libraries, but it's worth looking at: https://github.com/baselime/node-opentelemetry/blob/b3331d5040bf35ca633c3634c186a2a5304a201d/package.json#L61-L68

I think a full rewrite is in order. It should be a concise JS library optimized for ESM bundling.

@Ankcorn

Ankcorn commented Apr 2, 2024

Thanks for the shout-out @sam-goodwin

We could make it smaller, but opted not to make some changes we knew we could not upstream, to keep things maintainable. As it stands, our OTEL setup, including the extension (so we have 0 runtime latency overhead), adds around 180ms of cold start.

I think it's possible to get sub-100ms cold starts whilst still being based on OpenTelemetry.

There are a few dependencies that can be patched or cut that would not change behavior too much for most use cases, and I'm sure some other bits could be slimmed down a bit.

@silpamittapalli if you want to chat through your use case I'd be happy to help with this :)

On doing a complete optimized rewrite - it's easy to underestimate how much work has gone into OTEL and how much it provides. It is a general solution though so not optimized for lambda or other environments that prioritize a quick startup.

Here is our bundle - it's easy to see how much we have done vs how much we rely on the work by the OpenTelemetry team.

image

There are some quick wins in there: semver could be replaced with something purpose-built that is just a few kB, and semantic attributes could be tree-shaken better. I suspect resources and the resource detectors can be improved too, but the rest will be quite hard.
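
To illustrate what "something purpose-built" for the semver case might look like, here is a minimal sketch. It is an assumption about the kind of check needed (plain `MAJOR.MINOR.PATCH` versions, same major, not older), not what the semver package or any OpenTelemetry package actually does. The function names `parseVersion` and `isCompatible` are hypothetical.

```typescript
// Parses "MAJOR.MINOR.PATCH" into numbers; returns null for anything else.
// Pre-release tags, ranges, and build metadata are deliberately unsupported.
function parseVersion(v: string): [number, number, number] | null {
  const m = /^(\d+)\.(\d+)\.(\d+)$/.exec(v);
  return m ? [Number(m[1]), Number(m[2]), Number(m[3])] : null;
}

// True when `installed` has the same major version as `required`
// and is not older than it. Unparseable input is treated as incompatible.
function isCompatible(installed: string, required: string): boolean {
  const a = parseVersion(installed);
  const b = parseVersion(required);
  if (!a || !b || a[0] !== b[0]) return false;
  if (a[1] !== b[1]) return a[1] > b[1];
  return a[2] >= b[2];
}
```

Whether a check this narrow is enough depends on which semver features the bundled code really relies on, which would need verifying before swapping anything out.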

@bhaskarbanerjee

Has any memory profiling been done for this Lambda layer? Any recommendations for Node v12, v16, v18 or v20? How much memory overhead does using this layer add?

@silpamittapalli

Thank you @sam-goodwin @Ankcorn. @bhaskarbanerjee from my team tried out Baselime but we haven't had any success with it yet, probably because it is customized for their proprietary software. We are trying out a few other approaches: 1) manual instrumentation to eliminate layers altogether and 2) minimizing the SDK and/or layer by stripping unused code/dependencies.
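
As a rough sketch of what approach 1 (manual instrumentation) can boil down to, the helper below wraps one unit of work in a span and guarantees the span is ended. `SpanLike`, `TracerLike`, and `withSpan` are hypothetical names, loosely modelled on the general shape of a tracing API rather than copied from the OpenTelemetry one. Context propagation (making the span "active" for child spans) is not handled here.

```typescript
// Minimal stand-ins for the tracing surface this sketch needs (hypothetical).
interface SpanLike {
  setAttribute(key: string, value: string | number | boolean): void;
  recordException(err: unknown): void;
  end(): void;
}

interface TracerLike {
  startSpan(name: string): SpanLike;
}

// Runs `fn` inside a span: failures are recorded and rethrown,
// and the span is always ended, even when `fn` throws.
async function withSpan<T>(
  tracer: TracerLike,
  name: string,
  fn: (span: SpanLike) => Promise<T>,
): Promise<T> {
  const span = tracer.startSpan(name);
  try {
    return await fn(span);
  } catch (err) {
    span.recordException(err);
    throw err;
  } finally {
    span.end();
  }
}
```

Only the code paths wrapped this way produce spans, which is the trade-off against auto-instrumentation: less start-up work, more manual effort.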

@bhaskarbanerjee

Has anyone here used the protobuf/http exporter and compared its performance with that of the grpc exporter, both for Lambda cold start time and response time?

Ref https://github.com/open-telemetry/opentelemetry-lambda/blob/main/nodejs/packages/layer/src/wrapper.ts#L24
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-proto' seems to be very fast, but
if we do import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-grpc' that seems to take at least 100ms more. Seeking your advice.
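
One way to put a number on the import cost described above is a small timing helper like the sketch below. `timeImport` is a hypothetical name, and the package names in the comments are simply the two from this comment. Because modules are cached after the first load, each measurement should be taken in a fresh process.

```typescript
// Hypothetical helper: times how long a lazy module load takes.
// `load` is any function that triggers the import, so the same helper
// can be pointed at either exporter package.
async function timeImport(
  label: string,
  load: () => Promise<unknown>,
): Promise<{ label: string; ms: number }> {
  const start = Date.now(); // millisecond resolution is enough for ~100 ms gaps
  await load();
  return { label, ms: Date.now() - start };
}

// Possible use in a throwaway script (one package per process run):
//   timeImport('proto', () => import('@opentelemetry/exporter-trace-otlp-proto'))
//     .then((r) => console.log(r));
//   timeImport('grpc', () => import('@opentelemetry/exporter-trace-otlp-grpc'))
//     .then((r) => console.log(r));
```

This only measures module load time on the machine running it; cold start inside Lambda also depends on memory size and bundling, so the figures are indicative at best.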

@stevemao

This is a blocking issue for us in using OpenTelemetry in Lambda.

@andrei-cdl

We're currently in a similar boat here. The primary use case for Lambdas was previously async work. Some systems now put APIs directly behind API Gateway and have a fully serverless (Lambda-powered) API. In this case the additional ~400ms for the OTEL Lambda layer isn't really acceptable due to service->service calls, which, yes, is not a good practice, but that's a lot of rework to fix. This results in potentially 1.5+ seconds easily on top of the standard cold start values, which already look like anywhere between 3-5 seconds.

There are some workarounds like provisioned concurrency (but that's gonna burn $ fast). I'm attempting a PoC to instrument the Lambda with just the OTEL SDK and have a collector in Fargate to receive those signals. Unfortunately this does mean the Lambda would need to run in a VPC, which also affects cold starts, but not as dramatically.
