
Lambda support appears to be broken in v2.x #827

Closed
watson opened this issue Jan 24, 2019 · 47 comments

@watson
Contributor

watson commented Jan 24, 2019

See https://discuss.elastic.co/t/apm-server-transport-error-error-write-after-end/165484 for details

@watson watson added the bug label Jan 24, 2019
@alvarolobato

@Qard let's timebox some investigation on this and see if we get to an easy fix.

@alvarolobato

Related to #798

@davebond
Contributor

Has there been any progress on this issue? I can submit our workaround as a PR, but I'm not sure that it's the best long-term solution.

@Qard
Contributor

Qard commented Feb 28, 2019

I was unable to reproduce it in my initial testing. I did, however, recently do some restructuring to support the async/await style, which has not yet been released. That might impact it.

@davebond
Contributor

Unfortunately it is still an issue. From my poking around, the cause is:

  • Client connects transport to APM and functions for the first lambda execution
  • Lambda is "frozen", during which the socket is interrupted
  • On next lambda invoke, event loop kicks back in and transport throws an ECONNRESET
  • Client attempts to write transaction data to the socket which is now closed due to the error state, resulting in the write after end issue

Our workaround is to track this error state and ensure that the transport is open before sending any transactions: status200@629cd9f

Though it seems like a larger restructure to track the lifecycle and state of the transport would be more appropriate as this could also be an issue for clients running long enough to encounter network hiccups.

@Qard
Contributor

Qard commented Mar 2, 2019

I'm able to reproduce the ECONNRESET, but the client just reconnects and seems to send the new transaction just fine. I'm not getting the write after end error at all.

Are you able to provide a code example that can reproduce the write after end?

@HuddleHouse

HuddleHouse commented Mar 8, 2019

For what it is worth, I got it working by editing the code from the docs. I'm running Node 8.10 on Lambda.

What the docs show and caused "APM Server transport error: Error: write after end":

exports.handler = apm.lambda(function handler (payload, context, callback) {
  callback(null, `Hello, ${payload.name}!`)
})

What worked for me:

exports.handler = apm.lambda(async (payload, context, callback) => {
  callback(null, `Hello, ${payload.name}!`)
})

@davebond
Contributor

davebond commented Mar 8, 2019

Sorry for the delay, @HuddleHouse is correct, the example from the docs still produces the issue.

There are two methods I can see which stop this:

// Pass promise to lambda
exports.handler = apm.lambda(async (payload, context, callback) => {
  callback(null, `Hello, ${payload.name}!`)
})

// callbackWaitsForEmptyEventLoop
exports.handler = apm.lambda(function handler (payload, context, callback) {
  context.callbackWaitsForEmptyEventLoop = false;
  callback(null, `Hello, ${payload.name}!`)
})

Both of these seem to depend on Lambda's handling of the event loop, and I wouldn't like to count on that behaviour staying consistent over time.

@tallakh

tallakh commented Apr 15, 2019

I still get this error when using the latest apm in a lambda. I start apm like this:

exports.handler = apm.lambda(async (event, context, callback) => {
	...
	callback(null, `Successfully processed ${items} events`);
})

2019-04-15T10:48:16.268Z 4f47d833-61f4-4af4-b824-cf46e5d0880d APM Server transport error: Error: write after end
at writeAfterEnd (/var/task/node_modules/readable-stream/lib/_stream_writable.js:257:12)
at Client.Writable.write (/var/task/node_modules/readable-stream/lib/_stream_writable.js:301:21)
at Client.sendTransaction (/var/task/node_modules/elastic-apm-http-client/index.js:276:15)
at Instrumentation.addEndedTransaction (/var/task/node_modules/elastic-apm-node/lib/instrumentation/index.js:185:22)
at Instrumentation.wrappedAddEndedTransaction [as addEndedTransaction] (/var/task/node_modules/elastic-apm-node/lib/instrumentation/async-hooks.js:53:34)
at Transaction.end (/var/task/node_modules/elastic-apm-node/lib/instrumentation/transaction.js:218:32)
at captureAndMakeCompleter (/var/task/node_modules/elastic-apm-node/lib/lambda.js:48:11)
at wrappedLambdaCallback (/var/task/node_modules/elastic-apm-node/lib/lambda.js:60:20)
at exports.handler.apm.lambda (/var/task/index.js:38:9)
at
at process._tickDomainCallback (internal/process/next_tick.js:228:7)

@Qard
Contributor

Qard commented Apr 15, 2019

Yes, it's still a known issue. It's proving to be complicated to fix. Basically, when Lambda freezes the VM it disconnects any sockets still in-use and, when the process wakes up again, it's unaware the sockets have been closed until after it tries to write to them. This is resulting in the write after end happening when the next message is sent to the apm-server.

@watson
Contributor Author

watson commented Apr 15, 2019

I've been thinking that we need a way to either force-close the underlying TCP socket after each lambda function execution (though that would increase the function running-time overhead), or somehow ignore those errors when running in a lambda context (though that might hide some real errors).

We should probably also update the docs to suggest that you disable metrics when running in lambda, as the metrics collector will keep trying to reopen the TCP socket - and metrics inside a lambda don't make much sense anyway.
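Disabling metrics per the suggestion above looks roughly like the following. This is a minimal sketch based on the agent's documented `metricsInterval` option (`'0s'` turns metrics collection off); the handler shape follows the docs example quoted earlier in this thread.

```javascript
// Start the agent with metrics disabled so it doesn't keep trying to
// reopen the TCP socket between Lambda invocations.
const apm = require('elastic-apm-node').start({
  metricsInterval: '0s' // no periodic metrics collection inside Lambda
})

exports.handler = apm.lambda(async (event, context, callback) => {
  callback(null, `Hello, ${event.name}!`)
})
```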

@martinduris

Hi all,
so the current situation, as I understand it, is the following - for everybody reading this issue (please correct me):

  • Do NOT upgrade to/use Elastic Stack 7.x, because elastic-apm-node 1.x is not supported with APM Server 7.x
  • If using elastic-apm-node in AWS Lambda, use Elastic Stack < 6.5.x and elastic-apm-node < 2.x

Martin

@willbrazil

Hi @watson,

We should probably also update the docs do suggest that you disable metrics when running in lambda as that will keep trying to reopen the TCP socket

We've tried disabling metrics by passing the metricsInterval config option but we still experience the issue. Do you have other suggestions we may try?

We're running server v7.0.0 with node client ^2.11.5.

Thank you so much for all the work here!

@watson
Contributor Author

watson commented May 29, 2019

@willbrazil Setting metricsInterval: 0 will only reduce the likelihood that this error appears, not completely eliminate it. As it stands currently, you might see this error when the VM is being resumed after having been suspended by AWS Lambda. Unfortunately, we can't detect when this happens. However, all data being sent afterward, while the VM is still running, should make it to the APM Server successfully.

If your lambda function is only called once every 20 minutes, this might mean that no data, or only very little, gets through. But if your lambda function is called several times a minute or more, 99+% of the collected data should get through as expected.

@willbrazil

Thank you for such a quick response @watson! Some of our lambdas are handling ~60 TPS.

We've reproduced this in a more controlled environment where we sent a request every second, and unfortunately we still experience the issue. We'll keep digging here. I know this is a tricky one, especially because it's hard to repro locally.

Thank you again for all the info!

@24601

24601 commented Feb 16, 2020

Any thoughts on the wisdom of the approach New Relic took (basically, CloudWatch log ingestion) versus writing a transport class that writes the serialized form of each message to EventBridge or SQS before the lambda returns, with a worker or another lambda picking up the serialized form and batch-sending it?

I suspect the performance penalty of this versus CloudWatch, which is highly optimized, is non-trivial (awaiting even a publish to EventBridge or SQS probably takes much longer than a console.log, which likely exits quickly and gets handled later), and CloudWatch is probably cheaper than paying the EventBridge event and/or SQS message tax.

Anyone with more knowledge here care to confirm or disagree with my thoughts on the approach? We are considering writing a solution and would be happy to PR it/work on a collective effort if it makes sense.
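The "log then forward" idea discussed above can be sketched as follows. This is a hypothetical illustration, not an agreed design: each event is serialized to stdout (which lands in CloudWatch Logs essentially for free), and a subscribed forwarder later parses the marked lines and batch-sends them to the APM Server. The `APM_PAYLOAD` marker is an assumption, not any standard.

```javascript
// Hypothetical marker used to distinguish APM events from ordinary logs.
const MARKER = 'APM_PAYLOAD'

// Inside the Lambda: serialize the event to a log line instead of
// holding a network socket open across freeze/thaw cycles.
function emitForForwarder (event, write = console.log) {
  const line = `${MARKER} ${JSON.stringify(event)}`
  write(line)
  return line
}

// Inside the forwarder (log-subscription Lambda, Kinesis consumer, etc.):
// recover the original event from a marked log line, ignore the rest.
function parseForwardedLine (line) {
  if (!line.startsWith(MARKER + ' ')) return null
  return JSON.parse(line.slice(MARKER.length + 1))
}
```

The trade-off matches the comment above: a `console.log` adds almost no latency to the invoke, at the cost of a second component that must batch and ship the events asynchronously.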

@davebond
Contributor

Sorry for my absence here :)

@Qard There's not much I can share; the only thing that isn't specific to our usage is calling sendTransaction/sendSpan/sendError on the elastic-apm-http-client package. @jonminter looks to be on the right track with it.

@24601 You're right: using anything other than the stdio stream to CloudWatch incurs the network tax, adding either extra latency to each lambda invoke or the possibility of dropped transactions.

New Relic's solution was a pain to implement in an enterprise environment. They provide a Python script which sets everything up, which sounds like an alright idea until:

  • The security team vetoes running a third-party Python script against the production AWS account
  • The ops team vetoes using a Python interface outside of Terraform to manage the APM (with good reason)

I had to pull it apart and reimplement everything above the lambda code itself to get it deployed.

I think publishing an app to SAR [0] would be the cleanest approach:

  • Allow the app to be namespaced on deployment (one of the issues with New Relic)
  • Output the lambda ARN from the deployed stack, so that the CloudWatch events can be attached by the parent orchestration
  • This could be deployed manually or by CloudFormation / Terraform (Terraform would need to invoke CloudFormation [1], as there is no direct support)

[0] https://aws.amazon.com/serverless/serverlessrepo/
[1] hashicorp/terraform-provider-aws#3981

@chemalopezp

@Qard @watson in case it helps with prioritization: we are new users, so we have Elastic v7, and it seems we need to use the latest APM agent. Most of our infrastructure is on lambda, so it seems we've reached a dead end here until APM supports lambda again. Do you have any date on when this will happen?

@basepi

basepi commented May 8, 2020

@chemalopezp No timeline yet, but we currently have an active working group designing our future serverless support. Stay tuned!

@chemalopezp

Thank you @basepi, that's great news. I'll keep an eye out for updates here ;)

@spectorar

Similar to the lambda -> cloudwatch -> lambda -> elasticsearch approach, a lambda -> cloudwatch -> kinesis -> functionbeat -> elasticsearch flow could work nicely as well, if we could somehow configure our existing functionbeat to handle the APM messages.

@chemalopezp

@spectorar that would be great. Currently we are able to stream logs from lambda -> cloudwatch -> functionbeat lambda -> elasticsearch (even without Kinesis, the functionbeat lambda seems to work just fine). I really like this approach, which could be used for anything streaming into a CloudWatch log (the functionbeat lambda simply subscribes to that log group and streams into elasticsearch).

@chemalopezp

@basepi do you have by chance any news regarding AWS Lambda support for the APM agent? I can imagine the actual implementation might take much longer, but it would be great to know that it will eventually be supported in the near term. Thanks!

@Qard Qard removed their assignment Jun 1, 2020
@basepi

basepi commented Jun 2, 2020

No ETA, but we have an active working group finalizing the design for AWS.

@chemalopezp

Hi @basepi ! Any updates regarding the APM support for AWS Lambda? Maybe some scope of work and estimated date, or at least if there was a decision to support it to some extent? Thank you very much!

@basepi

basepi commented Jul 9, 2020

I think I can safely say we are planning to support it, but we don't have anything to share yet on roadmap timing.

@chemalopezp

Hi @basepi I hope you don't mind bugging you again! Is there by chance any update about APM support for AWS Lambda? Thanks!

@basepi

basepi commented Aug 20, 2020

No update at the moment. Sorry for the slow progress here, between summer vacations and some additional testing we still haven't settled on the design.

@basepi

basepi commented Oct 2, 2020

Quick update here -- while we have a pretty good idea of the design, I don't know when the actual development work will reach the top of the roadmap. As a result, I still can't give any clear idea of when we'll support AWS lambda.

@chemalopezp if you're still interested in this, would you mind opening an issue in https://github.com/elastic/apm/issues ? That repo is a good place for cross-agent feature requests and is used heavily for roadmapping and gauging community interest. Thanks!

@trentm
Member

trentm commented Oct 30, 2020

I'm closing this in favour of elastic/apm#352.
