
Lambda support appears to be broken in v2.x #827

Closed
watson opened this issue Jan 24, 2019 · 47 comments

@watson
Contributor

watson commented Jan 24, 2019

See https://discuss.elastic.co/t/apm-server-transport-error-error-write-after-end/165484 for details

@watson watson added the bug label Jan 24, 2019
@alvarolobato

@Qard let's timebox some investigation on this and see if we get to an easy fix.

@alvarolobato

Related to #798

@davebond
Contributor

Has there been any progress on this issue? I can submit our workaround as a PR, but I'm not sure that it's the best long-term solution.

@Qard
Contributor

Qard commented Feb 28, 2019

I was unable to reproduce it in my initial testing. I did, however, recently do some restructuring to support the async/await style, which has not yet been released. That might impact it.

@davebond
Contributor

Unfortunately it is still an issue. From my poking around, the cause is:

  • Client connects transport to APM and functions for the first lambda execution
  • Lambda is "frozen", during which the socket is interrupted
  • On next lambda invoke, event loop kicks back in and transport throws an ECONNRESET
  • Client attempts to write transaction data to the socket which is now closed due to the error state, resulting in the write after end issue

Our workaround is to track this error state and ensure that the transport is open before sending any transactions: status200@629cd9f

Though it seems like a larger restructure to track the lifecycle and state of the transport would be more appropriate as this could also be an issue for clients running long enough to encounter network hiccups.

@Qard
Contributor

Qard commented Mar 2, 2019

I'm able to reproduce the ECONNRESET, but the client just reconnects and seems to send the new transaction just fine. I'm not getting the write after end error at all.

Are you able to provide a code example that can reproduce the write after end?

@HuddleHouse

HuddleHouse commented Mar 8, 2019

For what it is worth, I got it working by editing the code from the docs. I'm running Node 8.10 on Lambda.

What the docs show and caused "APM Server transport error: Error: write after end":

exports.handler = apm.lambda(function handler (payload, context, callback) {
  callback(null, `Hello, ${payload.name}!`)
})

What worked for me:

exports.handler = apm.lambda(async (payload, context, callback) => {
  callback(null, `Hello, ${payload.name}!`)
})

@davebond
Contributor

davebond commented Mar 8, 2019

Sorry for the delay, @HuddleHouse is correct, the example from the docs still produces the issue.

There are two methods I can see which stop this:

// Pass promise to lambda
exports.handler = apm.lambda(async (payload, context, callback) => {
  callback(null, `Hello, ${payload.name}!`)
})

// callbackWaitsForEmptyEventLoop
exports.handler = apm.lambda(function handler (payload, context, callback) {
  context.callbackWaitsForEmptyEventLoop = false;
  callback(null, `Hello, ${payload.name}!`)
})

Both of these seem to depend on Lambda's handling of the event loop, and I wouldn't like to count on that behaviour staying consistent over time.

@tallakh

tallakh commented Apr 15, 2019

I still get this error when using the latest apm in a lambda. I start apm like this:

exports.handler = apm.lambda(async (event, context, callback) => {
	...
	callback(null, `Successfully processed ${items} events`);
})

2019-04-15T10:48:16.268Z 4f47d833-61f4-4af4-b824-cf46e5d0880d APM Server transport error: Error: write after end
at writeAfterEnd (/var/task/node_modules/readable-stream/lib/_stream_writable.js:257:12)
at Client.Writable.write (/var/task/node_modules/readable-stream/lib/_stream_writable.js:301:21)
at Client.sendTransaction (/var/task/node_modules/elastic-apm-http-client/index.js:276:15)
at Instrumentation.addEndedTransaction (/var/task/node_modules/elastic-apm-node/lib/instrumentation/index.js:185:22)
at Instrumentation.wrappedAddEndedTransaction [as addEndedTransaction] (/var/task/node_modules/elastic-apm-node/lib/instrumentation/async-hooks.js:53:34)
at Transaction.end (/var/task/node_modules/elastic-apm-node/lib/instrumentation/transaction.js:218:32)
at captureAndMakeCompleter (/var/task/node_modules/elastic-apm-node/lib/lambda.js:48:11)
at wrappedLambdaCallback (/var/task/node_modules/elastic-apm-node/lib/lambda.js:60:20)
at exports.handler.apm.lambda (/var/task/index.js:38:9)
at
at process._tickDomainCallback (internal/process/next_tick.js:228:7)

@Qard
Contributor

Qard commented Apr 15, 2019

Yes, it's still a known issue. It's proving to be complicated to fix. Basically, when Lambda freezes the VM it disconnects any sockets still in-use and, when the process wakes up again, it's unaware the sockets have been closed until after it tries to write to them. This is resulting in the write after end happening when the next message is sent to the apm-server.

@watson
Contributor Author

watson commented Apr 15, 2019

I've been thinking that we need a way to either force-close the underlying TCP socket after each lambda function execution (though that would increase the function running-time overhead), or somehow ignore those errors when running in a lambda context (though that might hide some real errors).

We should probably also update the docs to suggest that you disable metrics when running in lambda, as the metrics collector will keep trying to reopen the TCP socket - and metrics inside a lambda don't make much sense anyway.
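Disabling metrics per the suggestion above looks roughly like the following. This is a minimal sketch based on the agent's documented `metricsInterval` option (`'0s'` turns metrics collection off); the handler shape follows the docs example quoted earlier in this thread.

```javascript
// Start the agent with metrics disabled so it doesn't keep trying to
// reopen the TCP socket between Lambda invocations.
const apm = require('elastic-apm-node').start({
  metricsInterval: '0s' // no periodic metrics collection inside Lambda
})

exports.handler = apm.lambda(async (event, context, callback) => {
  callback(null, `Hello, ${event.name}!`)
})
```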

@martinduris

Hi all,
so the current situation, as I understand it, is the following - for everybody reading this issue (please correct me):

  • Do NOT upgrade to/use Elastic Stack 7.x, because elastic-apm-node 1.x is not supported with APM Server 7.x
  • If using elastic-apm-node in AWS Lambda, use Elastic Stack < 6.5.x and elastic-apm-node < 2.x

Martin

@willbrazil

Hi @watson,

We should probably also update the docs do suggest that you disable metrics when running in lambda as that will keep trying to reopen the TCP socket

We've tried disabling metrics by passing the metricsInterval config option but we still experience the issue. Do you have other suggestions we may try?

We're running server v7.0.0 with node client ^2.11.5.

Thank you so much for all the work here!

@watson
Contributor Author

watson commented May 29, 2019

@willbrazil Setting metricsInterval: 0 will only reduce the likelihood that this error appears, not completely eliminate it. As it stands currently, you might see this error when the VM is being resumed after having been suspended by AWS Lambda. Unfortunately, we can't detect when this happens. However, all data being sent afterward, while the VM is still running, should make it to the APM Server successfully.

If your lambda function is only called once every 20 minutes, this might mean that no data, or only very little, gets through. But if your lambda function is called several times a minute or more, 99+% of the collected data should get through as expected.

@willbrazil

Thank you for such a quick response @watson! Some of our lambdas are handling ~60 TPS.

We've reproduced this in a more controlled environment where we sent a request every second, and unfortunately we still experience the issue. We'll keep digging here. I know this is a tricky one, especially because it's hard to repro locally.

Thank you again for all the info!

@24601

24601 commented Feb 16, 2020

Any thoughts on the wisdom of the approach New Relic took (basically, CloudWatch log ingestion) versus writing a transport class that writes the serialized form of each message to EventBridge or SQS before the lambda returns, with a worker or another lambda picking up the serialized form and batch-sending it?

I suspect the performance penalty of this versus CloudWatch, which is highly optimized, is non-trivial (awaiting even a publish to EventBridge or SQS probably takes much longer than a console.log, which likely exits quickly and gets handled later), and CloudWatch is probably cheaper than paying the EventBridge event and/or SQS message tax.

Anyone with more knowledge here care to confirm or disagree with my thoughts on the approach? We are considering writing a solution and would be happy to PR it/work on a collective effort if it makes sense.
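The "log then forward" idea discussed above can be sketched as follows. This is a hypothetical illustration, not an agreed design: each event is serialized to stdout (which lands in CloudWatch Logs essentially for free), and a subscribed forwarder later parses the marked lines and batch-sends them to the APM Server. The `APM_PAYLOAD` marker is an assumption, not any standard.

```javascript
// Hypothetical marker used to distinguish APM events from ordinary logs.
const MARKER = 'APM_PAYLOAD'

// Inside the Lambda: serialize the event to a log line instead of
// holding a network socket open across freeze/thaw cycles.
function emitForForwarder (event, write = console.log) {
  const line = `${MARKER} ${JSON.stringify(event)}`
  write(line)
  return line
}

// Inside the forwarder (log-subscription Lambda, Kinesis consumer, etc.):
// recover the original event from a marked log line, ignore the rest.
function parseForwardedLine (line) {
  if (!line.startsWith(MARKER + ' ')) return null
  return JSON.parse(line.slice(MARKER.length + 1))
}
```

The trade-off matches the comment above: a `console.log` adds almost no latency to the invoke, at the cost of a second component that must batch and ship the events asynchronously.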

@davebond
Contributor

Sorry for my absence here :)

@Qard There's not much I can share; the only thing that isn't specific to our usage is calling sendTransaction/sendSpan/sendError on the elastic-apm-http-client package. @jonminter looks to be on the right track with it.

@24601 You're right: using anything other than the stdio stream to CloudWatch incurs the network tax, adding either extra latency to each lambda invoke or the possibility of dropped transactions.

New Relic's solution was a pain to implement in an enterprise environment. They provide a Python script which sets everything up, which sounds like an alright idea until:

  • The security team vetoes running a third-party Python script against the production AWS account
  • The ops team vetoes using a Python interface outside of Terraform to manage the APM (with good reason)

I had to pull it apart and reimplement everything above the lambda code itself to get it deployed.

I think publishing an app to SAR [0] would be the cleanest approach:

  • Allow the app to be namespaced on deployment (one of the issues with New Relic)
  • Output the lambda ARN from the deployed stack, so that the CloudWatch events can be attached by the parent orchestration
  • This could be deployed manually or by CloudFormation / Terraform (Terraform would need to invoke CloudFormation [1], as there is no direct support)

[0] https://aws.amazon.com/serverless/serverlessrepo/
[1] hashicorp/terraform-provider-aws#3981

@chemalopezp

@Qard @watson in case it helps with prioritization: we are new users, so we have Elastic v7, and it seems we need to use the latest APM agent. Most of our infrastructure is on lambda, so it seems we've reached a dead end here until APM supports lambda again. Do you have any date on when this will happen?

@basepi

basepi commented May 8, 2020

@chemalopezp No timeline yet, but we currently have an active working group designing our future serverless support. Stay tuned!

@chemalopezp

Thank you @basepi, that's great news. I'll keep an eye out for updates here ;)

@spectorar

Similar to the lambda -> cloudwatch -> lambda -> elasticsearch approach, a lambda -> cloudwatch -> kinesis -> functionbeat -> elasticsearch flow could work nicely as well, if we could somehow configure our existing functionbeat to handle the APM messages.

@chemalopezp

@spectorar that would be great. Currently we are able to stream logs from lambda -> cloudwatch -> functionbeat lambda -> elasticsearch (even without Kinesis, the functionbeat lambda seems to work just fine). I really like this approach, which could be used for anything streaming into a CloudWatch log (the functionbeat lambda simply subscribes to that log group and streams into elasticsearch).

@chemalopezp

@basepi do you have by chance any news regarding AWS Lambda support for the APM agent? I can imagine the actual implementation might take much longer, but it would be great to know that it will eventually be supported in the near term. Thanks!

@Qard Qard removed their assignment Jun 1, 2020
@basepi

basepi commented Jun 2, 2020

No ETA, but we have an active working group finalizing the design for AWS.

@chemalopezp

Hi @basepi ! Any updates regarding the APM support for AWS Lambda? Maybe some scope of work and estimated date, or at least if there was a decision to support it to some extent? Thank you very much!

@basepi

basepi commented Jul 9, 2020

I think I can safely say we are planning to support it, but we don't have anything to share yet on roadmap timing.

@chemalopezp

Hi @basepi I hope you don't mind bugging you again! Is there by chance any update about APM support for AWS Lambda? Thanks!

@basepi

basepi commented Aug 20, 2020

No update at the moment. Sorry for the slow progress here, between summer vacations and some additional testing we still haven't settled on the design.

@basepi

basepi commented Oct 2, 2020

Quick update here -- while we have a pretty good idea of the design, I don't know when the actual development work will reach the top of the roadmap. As a result, I still can't give any clear idea of when we'll support AWS lambda.

@chemalopezp if you're still interested in this, would you mind opening an issue in https://github.com/elastic/apm/issues ? That repo is a good place for cross-agent feature requests and is used heavily for roadmapping and gauging community interest. Thanks!

@trentm
Member

trentm commented Oct 30, 2020

I'm closing this in favour of elastic/apm#352.
