Slow start-up? #263
Comments
Hi, ADOT PM here. Thanks a lot, we're already in the process of diving deep on this, and I will report back once we have some shareable data (ETA: early August 2022).
@adambartholomew we've identified the issue with cold starts and are considering ways to address it. Thanks for sharing your data points; we currently do not publish expected startup times.
Any update on this? Is there an ETA for a fix? Can we expect a solution that is comparable to native EMF?
Also eager to hear any updates. Are there any workarounds in the meantime?
AWS SnapStart would seem to remove this as a problem. Benchmarking I've done shows that cold starts are largely removed as a factor when using SnapStart.
@Sutty100 - SnapStart is Java only, and specifically only
I believe that even if/when it's expanded beyond Java, SnapStart doesn't currently support or address Lambda Extensions, so it wouldn't benefit this issue right now either.
Related issue in the aws-otel-lambda repo: aws-observability/aws-otel-lambda#228
@mhausenblas not to be too noisy, but is there any update on this? Or a plan to provide an update? This makes using the OTel layer close to a no-go for a number of latency-sensitive cases.
@RichiCoder1 no problem at all, yes we're working on it and should be able to share details soon. Overall, our plan is to address the issues in Q1; what we need to verify is to what extent.
To give some feedback: this is believed to be due to auto-instrumentation, so you may be able to improve your startup now by building your own, narrower layer.
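For a Node.js function, a narrower setup might look roughly like this minimal sketch, which registers only the instrumentations the function actually uses instead of the full auto-instrumentation bundle; the choice of the HTTP and AWS SDK instrumentations is just an assumption for the example:

```typescript
// Illustrative sketch: register only a couple of instrumentations instead of
// the full auto-instrumentation set, to reduce work done at cold start.
// Pick whichever instrumentations your function actually needs.
import { registerInstrumentations } from '@opentelemetry/instrumentation';
import { HttpInstrumentation } from '@opentelemetry/instrumentation-http';
import { AwsInstrumentation } from '@opentelemetry/instrumentation-aws-sdk';

registerInstrumentations({
  instrumentations: [
    new HttpInstrumentation(),
    new AwsInstrumentation(),
  ],
});
// Note: a tracer provider still has to be registered separately for the
// spans produced by these instrumentations to be exported anywhere.
```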
@tsloughter - do you mean "the 200ms cold start time is caused by auto-instrumentation"? I can't see how that could be the case. Since https://opentelemetry.io/docs/instrumentation/go/libraries/ says:
And the Lambda layer is written in Go.
@a-h ah, I didn't see any mention of the language in use. You are right, in Go there is no auto-instrumentation.
Hey @mhausenblas - any updates on the timeline for this by chance?
I was trying to use ADOT with Lambda for Node.js + NestJS, but the auto-instrumentation performed by ADOT was adding seconds to the cold start time. @mhausenblas, please let us know if you have any updates on the timeline for this issue.
Hi, in our tests we are seeing issues with slow invocation start-up due to Collector extension registration (~800-2000 ms), plus additional overhead on emit. The initialisation duration will, of course, drop on subsequent invocations. Is there any news/update on remedies for this, @mhausenblas? (Is there any suggestion from AWS on the best course of action here with Lambda; is emitting via the OTel SDK [no local agent] to a central Collector seen as a better go-forward?) Thanks
Hi, how are you measuring the latency for the subsequent invocations after the initialization? Since you have a test setup in place, what is the latency when you don't use a layer?
Hi @rapphil, I didn't see your message in July. In the screenshot there's a red line. Above the line is when I added the OTel layer, and the cold start increased from around 100ms to 300ms.
Hi there, we use Lambda serverless workloads in financial services with tight execution-time SLAs, which makes the overhead caused by introducing AWS ADOT or a custom extension layer for the OTel SDK or OTel Collector unacceptable. We are trying to cut down the cold start time by minimizing layers and using just the SDK without the Collector, but it looks like we won't be able to reduce the overhead to an acceptable level. If others have run into similar challenges, I'd be interested in learning how you were able to work around this and still collect distributed traces for such workloads. Thanks
@silpamittapalli the Baselime folks have already tried to strip this down as much as possible and package it as two dependencies. It still has dependencies on these libraries, though: https://github.com/baselime/node-opentelemetry/blob/b3331d5040bf35ca633c3634c186a2a5304a201d/package.json#L61-L68. But it is worth looking at.

I think a full rewrite is in order. It should be a concise JS library optimized for ESM bundling.
Thanks for the shout-out @sam-goodwin. We can make it smaller, but we opted not to make some changes we knew we could not upstream, to keep things maintainable. As it stands, our OTEL setup, including the extension (so we have 0 runtime latency overhead), adds around 180ms of cold start.

I think it's possible to get sub-100ms cold starts whilst still being based on OpenTelemetry. There are a few dependencies that can be patched or cut that would not change behaviour too much for most use cases, and I'm sure some other bits could be slimmed down a bit. @silpamittapalli if you want to chat through your use case I'd be happy to help with this :)

On doing a complete optimized rewrite - it's easy to underestimate how much work has gone into OTEL and how much it provides. It is a general solution, though, so it is not optimized for Lambda or other environments that prioritize a quick startup.

Here is our bundle - it's easy to see how much we have done vs how much we rely on the work by the OpenTelemetry team. There are some quick wins in there: semver could be replaced with something purpose-built that is just a few kB, and semantic attributes could be tree-shaken better. I suspect resources and the resource detectors can be improved too, but the rest will be quite hard.
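As a rough illustration of the bundling side of this, a generic esbuild configuration along these lines is what lets unused exports (for example unused semantic attributes) be tree-shaken out of the deployed artifact; the entry point, output file, and Node target below are assumptions, not Baselime's actual build:

```typescript
// build.mts - generic ESM bundling sketch (file names and target are assumed).
import { build } from 'esbuild';

await build({
  entryPoints: ['src/handler.ts'], // assumed handler entry point
  outfile: 'dist/handler.mjs',     // single ESM output file for the Lambda
  bundle: true,                    // pull the OTel packages into the bundle
  format: 'esm',                   // ESM output enables better tree-shaking
  platform: 'node',
  target: 'node18',                // assumed Lambda runtime
  treeShaking: true,
  minify: true,
});
```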
Has any memory profiling been done for this Lambda layer? Any recommendations for Node v12, v16, v18 or v20? What is the memory overhead of using this layer?
Thank you @sam-goodwin @Ankcorn. @bhaskarbanerjee from my team tried out Baselime, but we haven't had any success with it yet, which is probably because it is customized for their proprietary software. We are trying out a few other approaches: 1) manual instrumentation to eliminate layers altogether, and 2) minimizing the SDK and/or layer by stripping unused code/dependencies.
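For approach (1), a minimal sketch of what manual instrumentation without any layer can look like, assuming the standard OpenTelemetry JS packages are bundled with the function and the 1.x `sdk-trace-node` API; the tracer name, environment variable, and handler shape are placeholders:

```typescript
import { trace } from '@opentelemetry/api';
import { NodeTracerProvider } from '@opentelemetry/sdk-trace-node';
import { BatchSpanProcessor } from '@opentelemetry/sdk-trace-base';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';

// Initialise once per sandbox (outside the handler) so the cost is paid only
// on cold start. OTEL_COLLECTOR_URL is a placeholder environment variable.
const provider = new NodeTracerProvider();
provider.addSpanProcessor(
  new BatchSpanProcessor(
    new OTLPTraceExporter({ url: process.env.OTEL_COLLECTOR_URL })
  )
);
provider.register();

const tracer = trace.getTracer('my-lambda'); // placeholder tracer name

export const handler = async (event: unknown) => {
  const span = tracer.startSpan('handler');
  try {
    // ... business logic goes here ...
    return { statusCode: 200 };
  } finally {
    span.end();
    // Flush before the execution environment is frozen between invocations.
    await provider.forceFlush();
  }
};
```

The trade-off is that only hand-written spans are produced (no automatic HTTP or AWS SDK spans), which is exactly what keeps the cold-start cost down.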
Has anyone here used this? Ref: https://github.com/open-telemetry/opentelemetry-lambda/blob/main/nodejs/packages/layer/src/wrapper.ts#L24
This is a blocking issue for us to use OpenTelemetry in Lambda.
Currently in a similar boat here. The primary use case of Lambdas previously was async work; some systems now serve APIs directly behind API Gateways and have a fully serverless (Lambda) powered API. In this case the additional ~400ms for the OTel Lambda layer isn't really acceptable due to service-to-service calls, which, yes, is not a good practice, but that's a lot of re-work to fix. This can easily add 1.5+ seconds on top of the standard cold start values, which already look anywhere between 3-5 seconds. There are some workarounds like provisioned concurrency (but that's going to burn $ fast).

I'm attempting a PoC to instrument the Lambda with just the OTel SDK and have a collector in Fargate receive those signals. Unfortunately this does mean the Lambda would need to run in a VPC, which also affects cold starts, but not as dramatically.
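For that PoC shape (SDK in the function, Collector running centrally in Fargate), the OpenTelemetry-specific wiring is mostly just pointing the OTLP exporter at the Collector's in-VPC endpoint. A hedged sketch, where the hostname and the COLLECTOR_URL variable are assumptions:

```typescript
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';

// The URL is a placeholder for whatever fronts the Fargate collector inside
// the VPC (internal ALB, Cloud Map name, etc.); 4318 is the OTLP/HTTP default
// port. Without a url option the exporter also honours the standard
// OTEL_EXPORTER_OTLP_ENDPOINT environment variable.
const exporter = new OTLPTraceExporter({
  url: process.env.COLLECTOR_URL ?? 'http://otel-collector.internal:4318/v1/traces',
});
```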
I tried migrating away from AWS's X-Ray SDK for Lambda, but the OpenTelemetry Lambda layer appears to add a significant amount to the cold start time, which I didn't expect.
It was suggested that I cross-post, since this is actually the repo that contains the layers.
aws-observability/aws-otel-lambda#228 (comment)
Here's the data for reference:
I don't see any documentation on performance, comparison to X-Ray performance, etc.
Is there a plan to reduce this? I didn't expect to have to add a Lambda layer to get OpenTelemetry working; I thought it would be included in the Lambda runtime as a first-class thing, rather than being a sort-of add-on.