AWS lambda - process is killed before the span is being exported #1295

Closed · obecny opened this issue on Jul 9, 2020 · 17 comments · Fixed by #1296
Labels: bug (Something isn't working), Discussion (Issue or PR that needs/is extended discussion)

obecny (Member) commented on Jul 9, 2020

This is a simple scenario of a Lambda function:

exports.handler = async (event, ctx) => {
    const span = tracer.startSpan('test');
    span.end();
    return 'done';
};

The span is not exported correctly, as Lambda seems to kill the process before the export happens.
When some timeout is added before the handler returns, the span is exported:

exports.handler = async (event, ctx) => {
    const span = tracer.startSpan('test');
    span.end();
    await new Promise((resolve) => {
        setTimeout(() => {
            resolve();
        }, 500);
    });

    return 'done';
};

So the idea is to implement some async method on either the tracer or the exporter, or to change the exporter's shutdown method to return a promise that resolves when the last export has been sent successfully.
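
A minimal sketch of the intended usage, assuming a promise-returning forceFlush on the active span processor (a hypothetical API here; defining it is exactly what this issue is about):

exports.handler = async (event, ctx) => {
    const span = tracer.startSpan('test');
    span.end();
    // hypothetical: resolves once the span processor has handed everything to the exporter
    await provider.getActiveSpanProcessor().forceFlush();
    return 'done';
};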

WDYT?

obecny added the bug (Something isn't working) and Discussion (Issue or PR that needs/is extended discussion) labels on Jul 9, 2020
obecny (Member, Author) commented on Jul 9, 2020

@open-telemetry/javascript-maintainers && @open-telemetry/javascript-approvers

dyladan (Member) commented on Jul 9, 2020

I think this can be solved by adding a callback to forceFlush. This will help us in other shutdown scenarios like the one described in #1290.
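
Roughly, a callback-style forceFlush could be bridged into an async handler like this (a sketch only; the callback signature shown is an assumption, not the merged API):

// wrap the callback-style flush so an async handler can await it
function flushTraces(spanProcessor) {
  return new Promise((resolve) => {
    // assumed signature: forceFlush(cb) calls cb once the exporter has sent everything
    spanProcessor.forceFlush(() => resolve());
  });
}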

obecny self-assigned this on Jul 9, 2020
lykkin (Contributor) commented on Jul 9, 2020

FWIW, when I was working on something similar to this it was required to have a synchronous export pathway for the data to escape through. Lambda will freeze execution of the process after the lambda has been marked as complete (i.e. the response callback was called, the promise returned by the handler settles, or the process attempts to exit due to inactivity). In the case where the response callback is called, the async work that was pending will start back up the next time the process is unfrozen, which will result in unpredictable export timings. @michaelgoin or @astorm might also have more context to add to this.

dyladan (Member) commented on Jul 9, 2020

The behavior when the callback is called is dictated by the context.callbackWaitsForEmptyEventLoop property. If it is true, then any work on the event loop (like a flush) will keep the invocation running until it completes (or times out). If it is false, the execution environment is frozen immediately.

When using an async handler, the function is frozen immediately when the returned promise resolves/rejects. It may or may not be reused. If it is, async work continues to process.

In both cases, if the user calls force flush in their handler code and only finishes the handler after the force flush is complete, then this is fixed.
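
For example, a callback-style handler that waits for the flush before signalling completion (sketch; the callback-taking forceFlush is the assumed API from the sketch above):

module.exports.handler = (event, context, cb) => {
  const span = tracer.startSpan('handler');
  span.end();
  // finish the invocation only after the export work has completed
  provider.getActiveSpanProcessor().forceFlush(() => cb(null, 'done'));
};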

note: it may be tempting to call the handler callback early and have the force flush continue in the background, but the lambda runtime actually doesn't return the response to the user until the whole function invocation is complete. This is a very frustrating situation, but it is true. For instance, the following handler for an API gateway request will return the response to the user in ~500ms rather than ~0ms as you would expect:

module.exports.handler = (event, context, cb) => {
  cb(); // return immediately
  setTimeout(() => {}, 500); // keep tasks on the event loop for 500ms
};

$ time curl ...
curl ...  0.00s user 0.01s system 2% cpu 0.621 total

obecny (Member, Author) commented on Jul 9, 2020

@dyladan do you know if there is a way to get access to this callbackWaitsForEmptyEventLoop property, or to somehow patch the whole flow, so that we could have a plugin that does this automatically and waits until all the exporters have shut down correctly?

obecny (Member, Author) commented on Jul 9, 2020

Or maybe we can patch the whole exports.handler.

lykkin (Contributor) commented on Jul 9, 2020

@obecny callbackWaitsForEmptyEventLoop is just a property on the context parameter passed into the handler, so it would suffice to have a method that sets it for the user, if we are worried about users misconfiguring it. Depending on how much (or how little) code we expect the user to write, we could also patch the handler like you said. Something like:

function patchLambdaHandler(handler) {
  return async function wrappedHandler(event, context, cb) {
    context.callbackWaitsForEmptyEventLoop = true; // this could still be overwritten by the user
    try {
      return await handler(event, context, cb);
    } finally {
      await otel.forceFlush(); // flush before the invocation is allowed to finish
    }
  };
}

This does potentially modify the execution semantics of the Lambda out from under the user, but it is roughly what we would want to do.
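
Usage would then look something like this (names are illustrative):

// wrap the real handler once, at module load time
exports.handler = patchLambdaHandler(async (event, context) => {
  const span = tracer.startSpan('business-logic');
  span.end();
  return 'done';
});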

dyladan (Member) commented on Jul 9, 2020

In an autoinstrumentation I would suggest you either:

  1. patch the handler, and prevent the wrapped handler from finishing until the export is complete
  2. patch the underlying runtime itself. This is more risky, but it turns out the API of the RAPIDClient (their internal name for the runtime environment) is much simpler and more straightforward. This is going to be what I do for AWS Lambda (opentelemetry-js-contrib#132).

For this issue, I think it is sufficient to give the user a way to manually flush and not worry about wrapping handler or patching anything.

dyladan (Member) commented on Jul 9, 2020

> 2. patch the underlying runtime itself. This is more risky, but it turns out the API of the RAPIDClient (their internal name for the runtime environment) is much simpler and more straightforward. This is going to be what I do for AWS Lambda (opentelemetry-js-contrib#132).

Actually I have already been working on this and will have a PR shortly.

dyladan (Member) commented on Jul 9, 2020

@obecny if you have not started working on this, I actually already have an implementation of forceFlush callbacks that I can open a PR with.

dyladan (Member) commented on Jul 9, 2020

I'll just open the PR; let me know if I should close it.

obecny (Member, Author) commented on Jul 9, 2020

@dyladan great, no, I haven't started working on that. I think the final fix/solution will probably need to do several things, so I will wait until you open the PR to see what you have discovered and done :).

dyladan (Member) commented on Jul 9, 2020

Ok, here it is: #1296.

All this PR does is provide a mechanism to wait for the flush to complete. The lambda plugin PR will be coming later. Probably early next week.

obecny assigned dyladan and unassigned himself on Jul 9, 2020
longility (Contributor) commented:

> The lambda plugin PR will be coming later. Probably early next week.

@dyladan Do you have a reference to this lambda plugin, or guidance on the AWS Lambda implementation? I'm noticing that my Lambda is not exporting the spans.

dyladan (Member) commented on Aug 27, 2021

Actually someone else ended up writing the lambda plugin so I never had to :)

You can find it in the contrib repo.
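
For reference, wiring it up looks roughly like this, assuming the @opentelemetry/instrumentation-aws-lambda package from the contrib repo (check its README for the current API and the required span processor/exporter setup):

const { NodeTracerProvider } = require('@opentelemetry/sdk-trace-node');
const { registerInstrumentations } = require('@opentelemetry/instrumentation');
const { AwsLambdaInstrumentation } = require('@opentelemetry/instrumentation-aws-lambda');

const provider = new NodeTracerProvider();
provider.register();

registerInstrumentations({
  // the instrumentation wraps the Lambda handler and flushes spans before the invocation ends
  instrumentations: [new AwsLambdaInstrumentation()],
});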

longility (Contributor) commented:
Weird, I have that in there but something isn't working still. Let me see if I can dig deeper. Feel free to share any debug techniques or what to look for in the logs.

dyladan (Member) commented on Aug 27, 2021

I would open an issue on the contrib repo and ping @willarmiros or @anuraaga, since they maintain that instrumentation.
