Observations regarding instability of native fetch in AWS Lambda x86_64 nodejs20.x #3133
Comments
Can you try to use the undici package instead of the bundled version, to see if the problem still occurs in newer versions of undici?
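For illustration, a minimal sketch of what that looks like, assuming undici is installed as a regular dependency (npm install undici) so that its fetch is used instead of the copy bundled with the Node.js runtime:

```js
// Sketch: import fetch from the standalone undici package rather than
// relying on the globalThis.fetch bundled with Node.js.
const { fetch } = require('undici');

async function callRemoteApi(url) {
  const response = await fetch(url);
  if (!response.ok) {
    throw new Error(`unexpected status ${response.status}`);
  }
  return response.json();
}

module.exports = { callRemoteApi };
```

This makes the undici version explicit in package.json, so upgrades can be tested independently of the Lambda runtime's Node version.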
Because the issues are intermittent, and do not occur locally, we cannot really test for them. Realistically, we can only get log information from failed invocations in production, over a number of hours or days. We just deployed a new version that should log the stack trace of the cause. I understand that rolling out a version that uses the latest version of undici would help to see whether the problem still occurs there, but because this is an annoying production issue (~3.5% of interactions failing), we must focus on stability first. Our plan is to revert to using node-fetch. In the meantime, we will report further information when the errors appear with more log information. We understand the report is not very concrete, and regret we cannot be of any more help. Yet we believed the observations we made are important enough to share here, if only for reference if other people encounter similar issues (e.g., #2964).
@jandppw Have you considered simply using undici.request?
@ronag No, we haven’t. We weren’t aware that undici.request existed. We were under the impression that native fetch was the way to go.
Not sure. Where would you expect that to be documented?
fetch was added due to popular demand, not due to technical merits. It's never going to be as fast as undici.request.
But that would require adding undici as a dependency. If that is the case, in the current project, at this time, we’d rather revert to proven safe ground (i.e., node-fetch).

Before the above occurred, we were not familiar with undici. For future work, no dependency trumps any dependency, unless there are very good reasons (such as working with promises instead of consuming events ourselves). In that space, now that we have learned about it, undici.request is a contender.

But, to answer the question “Where would you expect that to be documented?”: well, first of all in blogs that I hit with a Google search about doing API calls inside a Node application. This would point me to undici @ npm. None of this is mentioned there. Next, we would have to choose between the contenders. For us, this is probably a bit late. We've been using node-fetch for years.

Now, you mentioning that we should stay away from using native fetch comes as a surprise.
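For reference, a minimal sketch of the undici.request API discussed above; the URL handling and error policy are illustrative, not taken from the project in question:

```js
// Sketch: undici.request exposes the status code, headers and body stream
// directly, skipping the spec-compliance work that fetch performs.
const { request } = require('undici');

async function getJson(url) {
  const { statusCode, body } = await request(url, { method: 'GET' });
  if (statusCode !== 200) {
    throw new Error(`unexpected status ${statusCode}`);
  }
  return body.json(); // the body mixin also offers .text() and .arrayBuffer()
}
```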
I would really love for the AWS Lambda team to help out on this one. We lack visibility into the Node.js distribution they are running. Essentially, they are likely doing certain things to the Node.js runtime that are causing this.
Given that you are using Node.js for free but paying for AWS Lambda, I recommend you contact AWS support.
I suspect it's due to keep-alive connections. Lambda suspends the process and all its state after a response is sent. Undici keeps the connections alive, but when Lambda resumes the process we do not know whether we need to reset all the connections. There is no event for us to hook into on resume or suspend. Try adding a shorter keep-alive timeout to see if it helps.
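A minimal sketch of one way to act on that suggestion, assuming the undici package is installed and that a shorter keep-alive timeout is what is meant; the 1-second value is an arbitrary example:

```js
// Sketch: shorten undici's keep-alive window so that sockets left idle while
// the Lambda execution environment was frozen are less likely to be reused.
const { Agent, setGlobalDispatcher } = require('undici');

setGlobalDispatcher(new Agent({
  keepAliveTimeout: 1_000,    // drop idle connections after ~1 second
  keepAliveMaxTimeout: 1_000, // cap keep-alive hints sent by the server
}));
```

Whether the Node-bundled fetch picks up this dispatcher depends on the Node.js and undici versions in play, so treat it as a starting point for experimentation rather than a guaranteed fix.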
@mcollina You are absolutely right that this might very well be an AWS issue, or an interop issue (as noted in the report). We just thought that sharing our observations publicly might help other people who encounter the same issue (as in “no, I’m not crazy”). AWS support will be made aware of the issue. Your hint about Lambda resuming seems plausible.
@mcollina Actually, it turns out: no. There seems to be no easy and free way to make AWS aware of the issue for this client. Our main focus is getting stability back for our users. I share your frustration about AWS not being involved, but there seems to be no easy way for us to get AWS on the case without paying extra. The only way to get AWS involved seems to be public side channels, such as social media. This public report might be a first effort. Any response from AWS now (if I were in their place anyway) would be that native fetch is still experimental in Node 20. Apart from all that, there are no ill feelings here regarding undici. In the meantime, this report might help other people realize that it probably is not a good idea to use native fetch on AWS Lambda nodejs20.x for the time being.
Out of curiosity, I attempted to replicate this in us-east-1 and couldn't. I set up a CloudFront distribution with a Lambda named PrivateLambda (invoked over HTTPS via a function URL) as the origin. Then I set up a second Lambda named PublicLambda (with a function URL) that calls the CloudFront distribution 25 times in a loop. I called PublicLambda repeatedly using k6, ramping concurrency up and down to suss out cold/warm start issues. Every request succeeded:
Over 35 minutes this made 49,645 requests to PublicLambda, which means we successfully made 1,241,125 fetch calls. Hopefully I did my math right and this little experiment only cost me a few dollars 😅 Attached are the CloudFormation template for the setup and the k6 test file:
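For readers who want to run a similar experiment, a rough sketch of what the calling Lambda could look like; the handler shape, the loop of 25 calls, and the CLOUDFRONT_URL environment variable are assumptions, not the actual template attached above:

```js
// Sketch of a "PublicLambda"-style handler: call the CloudFront-fronted
// endpoint 25 times per invocation and report how many calls succeeded.
exports.handler = async () => {
  const url = process.env.CLOUDFRONT_URL; // assumed configuration
  let ok = 0;
  for (let i = 0; i < 25; i++) {
    const response = await fetch(url);
    await response.arrayBuffer(); // drain the body so the socket can be reused
    if (response.ok) ok += 1;
  }
  return { statusCode: 200, body: JSON.stringify({ ok }) };
};
```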
The issues shared seem to follow a pattern of constant connections and abrupt disconnections; I believe I'm seeing a similar issue to what @mcollina describes. I'd suggest giving his suggestion a try and starting from there. The Lambda runtime might be doing constant cleanups while resuming, causing intermittent connect issues.
Yesterday we deployed a version that logs the stack trace of the cause.
These again were the first fetch calls of their respective invocations.
We are using
The AWS Lambda function is configured to use
@ajinkyashukla29dj [1] https://docs.aws.amazon.com/lambda/latest/dg/runtimes-modify.html
One more error came in. This again was the first fetch call during the invocation; the time between the fetch call and the logging of the error was <3ms.
@jandppw I think your best course of action is to swap out native fetch for another HTTP client.

btw: not an undici code contributor, just curious at this point.
3 more errors came in. A new version is now rolled out on production that no longer uses native fetch.
@mweberxyz Thank you for your inputs. I don't see the
@ajinkyashukla29dj Sorry, I can't help with that error; it looks like it's coming from a different component. Can you provide some more details about your Lambda so I can try to replicate it?
For me this is not AWS related; it happens on macOS "all the time" (very often) for a very simple scraper project that does about 500 fetch calls or so. To confirm, I ported it to another implementation of fetch and it worked without stability issues there. The issue I have is that ETIMEDOUT and similar errors are not propagated to my try/catch but instead just crash Node.js somewhere inside undici. It can be reproduced with await fetch inside try/catch leading to a crashed Node.js process.
@uNetworkingAB Can you give a code sample? Knowing which remote servers you are having trouble with would help with replication.
Please verify if undici v6.14.1 fixes this. |
@jandppw My name is Luca, I work for AWS as a Serverless Specialist. I'd love to connect with you to review your challenge with the service team. Would you mind pinging me in private so we can arrange a chat?
@lucamezzalira Sure, Luca. Sending you an email now.
It might be helpful to collect logs with more verbose debugging enabled.
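If undici-level detail is what is wanted, one option (an assumption on my part, not necessarily what the commenter had in mind) is to subscribe to the diagnostics_channel events that undici publishes and write them to the Lambda logs:

```js
// Sketch: log undici request/connection lifecycle events to help correlate
// failures with connection reuse after a frozen execution environment resumes.
const diagnosticsChannel = require('node:diagnostics_channel');

for (const name of [
  'undici:request:create',
  'undici:request:error',
  'undici:client:connectError',
]) {
  diagnosticsChannel.subscribe(name, (message) => {
    console.log(name, message && message.error ? message.error : '');
  });
}
```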
Chiming in to say I get "cause EPIPE" on macOS when issuing fetch requests in a tight loop. So this appears to be a buffering issue:

```js
(async () => {
  while (packet != null) {
    await new Promise(resolve => setTimeout(resolve, 0.2));
    overrideNow(packet.date);
    if (packet.type === 'client') {
      replayClientCommand(packet.command);
    } else {
      replayServerMessage(packet.message);
    }
    packet = reader.next();
  }
})();
```

Basically the above is some logic I have to replay sessions from captured session files. If I remove the timeout/await then I run into the EPIPE error.
Lambda on Node 22 is available now. Native fetch is no longer an experimental feature, I believe. Can anybody give feedback on whether this issue was finally diagnosed, handled, and fixed?
I think this should be solved, but I'll let others confirm.
We observe consistent intermittent errors when using native fetch in an AWS Lambda x86_64 nodejs20.x runtime. We were unable to find the particular place in undici code that is the culprit of this behavior.

The intention of this submission is to report the issues and to record our observations as accurately as possible, to enable people with more detailed knowledge to isolate and possibly fix the issues, if they are not already fixed (fetch is still experimental in Node 20).
Stable precursor
Versions of this service have been in production since 2022-05-19 on AWS Lambda x86_64 nodejs16.x using node-fetch@2.6.7 without many issues. Notably, in the last 6 months of 2023 exactly 1 communication error calling the remote API was logged. The service handles ~700 invocations per day.
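For context, a sketch of what that stable setup amounts to, assuming node-fetch v2's CommonJS API; the URL, payload and error handling are illustrative:

```js
// Sketch of the node-fetch@2 usage pattern that ran without issues on nodejs16.x.
const fetch = require('node-fetch');

async function callRemoteApi(url, payload) {
  const response = await fetch(url, {
    method: 'POST',
    headers: { 'content-type': 'application/json' },
    body: JSON.stringify(payload),
  });
  if (!response.ok) {
    throw new Error(`remote API answered ${response.status}`);
  }
  return response.json();
}
```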
Where are the issues observed?
The functionality of the service requires it to do ±25 remote API calls per invocation. The remote API is hosted on a Windows VM through IIS 8, and is accessed through a CloudFront distribution, set up to use http2and3.

We observe this issue, or these issues, only in AWS Lambda x86_64 nodejs20.x, and have not observed them during development on macOS. (This does not guarantee that the issues are limited to AWS Lambda x86_64 nodejs20.x, only that it is possible that they are.)

A new version of the code was released on 2024-03-25, but this did not have any impact on the issues. It seems AWS Lambda x86_64 nodejs20.x changed Node version on 2024-04-03. We cannot report on the exact Node version (and thus undici version) because sadly we do not log that. Given that the latest LTS version of Node uses undici@5.28.4, this reports on versions undici@<=5.28.4. The issues might have been fixed already in later releases, although we don't see any relevant mention of it in the release notes since v5.25.1. Between 2024-02-06 and 2024-04-03 the errors are reported on node:internal/deps/undici/undici:12344:11. Between 2024-04-03 and 2024-04-15 the errors are reported on node:internal/deps/undici/undici:12345:11.

~1% of invocations fail with 4 different types of failures

The version of this service that shows the issues has been in production since 2024-02-06. It is deployed on AWS Lambda x86_64 nodejs20.x, and replaced node-fetch with native fetch. Between 2024-02-06 and 2024-04-15 about 49000 Lambda invocations have been made and logged, of which 338 failed. That is about 0.7%. When an invocation does not fail, it calls fetch ~25 times. This means ~0.03% of fetch calls fail, but, as you will see, the impact might be greater, because most failures appear to happen on the first fetch call of an invocation of a fresh Lambda instance.

All failures are logged as 1 of 4 “fetch failed” cause variants:

- { cause: { errno: -110, code: 'ETIMEDOUT', syscall: 'write' } } (163)
- UND_ERR_SOCKET (152)
- { cause: { name: 'ConnectTimeoutError', code: 'UND_ERR_CONNECT_TIMEOUT', message: 'Connect Timeout Error' } } (16)
- { cause: { errno: -32, code: 'EPIPE', syscall: 'write' } } (7)

The UND_ERR_SOCKET causes are all of the same general form. Sometimes the remote address and remote port are filled out, sometimes not.
Sadly, we cannot pinpoint where the exact errors occur, since our logs do not store the stack trace of cause. Finding out through code inspection did not turn up anything — the usage of processResponse is a bit opaque.

ETIMEDOUT and EPIPE

All samples of the ETIMEDOUT and EPIPE cause we examined show < 15ms between the FETCHING… log and the ERROR log, and these calls are not registered in the remote API access log. We see the FETCHING… log, but never the RESPONSE_RECEIVED log in these cases (see below).

In all samples we examined, the error occurs in the first call to fetch in the invocation. We assume these calls never left the Lambda environment, and they might occur on the first call to fetch after spawning a Lambda instance. This makes us believe these cases are bugs in undici, or in the AWS Lambda x86_64 nodejs20.x OS.

UND_ERR_CONNECT_TIMEOUT and UND_ERR_SOCKET

The UND_ERR_CONNECT_TIMEOUT and UND_ERR_SOCKET errors do not follow this pattern. Multiple calls of the remote API with fetch succeed before the error occurs. We assume these calls do go out to the remote API. Yet, such errors rarely occurred when we used this code on AWS Lambda x86_64 nodejs16.x with node-fetch@2.6.7 (1 such error in the last 6 months of 2023).

Code

fetch is used twice in this code base, but all errors spawn from the same use:

The log.info code is:

Note that now in the logged JSON is determined synchronously, immediately before the call to fetch.

The try-catch around the call to fetch is several levels higher in the call stack:

The log.error code is:

Note that now in the logged JSON is determined here.

From the logged now in the FETCHING… log and the logged now in the ERROR log we can determine an upper bound of the duration between our fetch call and the occurrence of the error.

Possibly related issues
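The code blocks referred to in the Code section were not captured in this extract. As an illustration only (not the author's actual code), the pattern described there amounts to roughly the following; the function and logger names are invented:

```js
// Illustrative reconstruction: `now` for the FETCHING… log is taken
// synchronously just before fetch; the try/catch that logs ERROR (with its
// own `now`) sits several levels higher in the call stack.
async function callRemoteApi(url, log) {
  log.info({ now: new Date().toISOString(), message: 'FETCHING…', url });
  const response = await fetch(url);
  log.info({ now: new Date().toISOString(), message: 'RESPONSE_RECEIVED', status: response.status });
  return response.json();
}

async function handleInvocation(urls, log) {
  try {
    for (const url of urls) {
      await callRemoteApi(url, log);
    }
  } catch (error) {
    // `now` here bounds the time between the fetch call and the error.
    log.error({ now: new Date().toISOString(), message: 'ERROR', error, cause: error.cause });
    throw error;
  }
}
```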