
ECONNRESET exceptions when running in Lambda environment #1196

Closed
monken opened this issue May 20, 2020 · 35 comments · Fixed by serverless-nextjs/serverless-next.js#720 or #1693

@monken

monken commented May 20, 2020

Describe the bug

import { S3 } from '@aws-sdk/client-s3';
import { Handler, Context, S3Event } from 'aws-lambda';

const s3 = new S3({})

export const handler: Handler = async (event: S3Event, context: Context) => {
  await s3.getObject({
    Bucket: event.Records[0].s3.bucket.name,
    Key: event.Records[0].s3.object.key,
  });
}

We have this very basic Lambda function that reads a file from S3 when a new file is uploaded (we actually consume the Body stream too, but left that out for brevity). The function is called intermittently, meaning that sometimes we get a new Lambda instance (i.e. a cold start) and sometimes the Lambda container is reused. When the container is reused, we sometimes see an ECONNRESET exception such as this one:

2020-05-20T16:50:28.107Z	d7a43394-afad-4267-a4a4-5ad3633a1db8	ERROR	Error: socket hang up
    at connResetException (internal/errors.js:608:14)
    at TLSSocket.socketOnEnd (_http_client.js:460:23)
    at TLSSocket.emit (events.js:322:22)
    at endReadableNT (_stream_readable.js:1187:12)
    at processTicksAndRejections (internal/process/task_queues.js:84:21) {
  code: 'ECONNRESET',
  '$metadata': { retries: 0, totalRetryDelay: 0 }
}

I'm pretty confident that this is due to the keep-alive nature of the https connection. Lambda processes are frozen after they execute and their host seems to terminate open sockets after ~10 minutes. The next time the S3 client tries to reuse the socket, the exception is thrown.

We are running into similar issues with connections to our Aurora database, which also terminate intermittently with the same error message (see brianc/node-postgres#2112). It's an error we can easily recover from by reopening the socket, but aws-sdk-v3 seems to prefer to throw instead.
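As a stopgap, here is a minimal sketch of the blunt workaround of disabling keep-alive entirely (assuming NodeHttpHandler accepts an httpsAgent option, as it did in the gamma/RC builds); this avoids stale-socket reuse at the cost of a TLS handshake per request:

import * as https from "https";
import { S3 } from "@aws-sdk/client-s3";
import { NodeHttpHandler } from "@aws-sdk/node-http-handler";

// Disable keep-alive so a frozen-then-thawed Lambda container never tries to
// reuse a socket that the remote side has already closed.
const s3 = new S3({
  requestHandler: new NodeHttpHandler({
    httpsAgent: new https.Agent({ keepAlive: false }),
  }),
});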

Is the issue in the browser/Node.js?
Node.js 12.x on AWS Lambda

@PoBuchi

PoBuchi commented May 26, 2020

Having a similar issue using the DynamoDB client.

Using:
Node.js 12.x on AWS Lambda
"@aws-sdk/client-dynamodb": "^1.0.0-beta.5"

Error: write ECONNRESET
    at WriteWrap.onWriteComplete [as oncomplete] (internal/stream_base_commons.js:92:16)
    at writevGeneric (internal/stream_base_commons.js:132:26)
    at TLSSocket.Socket._writeGeneric (net.js:782:11)
    at TLSSocket.Socket._writev (net.js:791:8)
    at doWrite (_stream_writable.js:440:12)
    at clearBuffer (_stream_writable.js:558:5)
    at TLSSocket.Writable.uncork (_stream_writable.js:337:7)
    at ClientRequest._flushOutput (_http_outgoing.js:861:10)
    at ClientRequest._flush (_http_outgoing.js:830:22)
    at _http_client.js:315:47 {
  errno: 'ECONNRESET',
  code: 'ECONNRESET',
  syscall: 'write',
  '$metadata': { retries: 0, totalRetryDelay: 0 }
}

@samirda

samirda commented May 26, 2020

Info
AWS lambda Node.js 12.x
"@aws-sdk/client-dynamodb": "^1.0.0-gamma.1"

Lambda

import { DynamoDBClient, DescribeTableCommand } from "@aws-sdk/client-dynamodb"

const dynamo = new DynamoDBClient({})

export const tempDebug = async (): Promise<object> => {
  const res = await dynamo.send(new DescribeTableCommand({
    TableName: '<TableName>'
  }))

  return Promise.resolve(res.Table)
}

Local

import { LambdaClient, InvokeCommand } from "@aws-sdk/client-lambda"

declare const TextDecoder
const lambda = new LambdaClient({})

;(async () => {
  let counter = 0
  // eslint-disable-next-line no-constant-condition
  while (true) {
    console.log(counter)
    counter++
    const res = await lambda.send(new InvokeCommand({
      FunctionName: '<FunctionName>'
    }))
    
    const obj = JSON.parse(new TextDecoder("utf-8").decode(res.Payload))
    if (obj.errorType === 'Error') {
      console.log(obj)
      break
    }

    //await new Promise(resolve => setTimeout(resolve, 5 * 60 * 1000))
    await new Promise(resolve => setTimeout(resolve, 90 * 1000))
  }
})()

Produces the following errors consistently when run with 90-second intervals. The first call works; the second call, 90 seconds later, produces one of the following two errors.
Error logs are from CloudWatch.

{
    "errorType": "Error",
    "errorMessage": "write EPIPE",
    "code": "EPIPE",
    "errno": "EPIPE",
    "syscall": "write",
    "$metadata": {
        "retries": 0,
        "totalRetryDelay": 0
    },
    "stack": [
        "Error: write EPIPE",
        "    at WriteWrap.onWriteComplete [as oncomplete] (internal/stream_base_commons.js:92:16)",
        "    at writevGeneric (internal/stream_base_commons.js:132:26)",
        "    at TLSSocket.Socket._writeGeneric (net.js:782:11)",
        "    at TLSSocket.Socket._writev (net.js:791:8)",
        "    at doWrite (_stream_writable.js:401:12)",
        "    at clearBuffer (_stream_writable.js:519:5)",
        "    at TLSSocket.Writable.uncork (_stream_writable.js:338:7)",
        "    at ClientRequest._flushOutput (_http_outgoing.js:862:10)",
        "    at ClientRequest._flush (_http_outgoing.js:831:22)",
        "    at _http_client.js:315:47"
    ]
}
{
    "errorType": "Error",
    "errorMessage": "socket hang up",
    "code": "ECONNRESET",
    "$metadata": {
        "retries": 0,
        "totalRetryDelay": 0
    },
    "stack": [
        "Error: socket hang up",
        "    at connResetException (internal/errors.js:608:14)",
        "    at TLSSocket.socketOnEnd (_http_client.js:453:23)",
        "    at TLSSocket.emit (events.js:322:22)",
        "    at endReadableNT (_stream_readable.js:1187:12)",
        "    at processTicksAndRejections (internal/process/task_queues.js:84:21)"
    ]
}

Works as expected when run with 1 minute intervals.

@samirda

samirda commented May 28, 2020

Edit: Better workaround #1196 (comment)

I also changed err.code === 'ECONNRESET' -> (err.code === 'ECONNRESET' || err.code === 'EPIPE')


I think I found a workaround: the client needs to call the destroy method.

dynamo.destroy()

import { DynamoDBClient, DescribeTableCommand } from "@aws-sdk/client-dynamodb"

const dynamo = new DynamoDBClient({})

export const tempDebug = async (): Promise<object> => {
  const res = await dynamo.send(new DescribeTableCommand({
    TableName: '<TableName>'
  }))

  dynamo.destroy()

  return Promise.resolve(res.Table)
}

I have not been able to reproduce the issue when running with ts-node v8.8.1 or with a Docker image (Node 12.17.0).
I have only had the issue in AWS Lambda Node.js 12.x.
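A slight variation on the same workaround (illustrative only): calling destroy() in a finally block so the pooled sockets are torn down even when send() rejects:

import { DynamoDBClient, DescribeTableCommand } from "@aws-sdk/client-dynamodb"

const dynamo = new DynamoDBClient({})

export const tempDebugSafe = async (): Promise<object | undefined> => {
  try {
    const res = await dynamo.send(new DescribeTableCommand({ TableName: '<TableName>' }))
    return res.Table
  } finally {
    // Tear down sockets even if send() rejects, so the next (possibly thawed)
    // invocation starts with a fresh connection.
    dynamo.destroy()
  }
}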

@monken
Author

monken commented May 28, 2020

@samirda thanks for that. However, for highly trafficked lambdas this will eliminate the benefits of keep-alive.

Either way, ECONNRESET is a retry-able error and should be handled that way.

@samirda

samirda commented May 28, 2020

@monken I agree, but the workaround at least keeps things error-safe until a proper fix is in place.

@studds

studds commented Jun 1, 2020

I've managed to work around this using this configuration (updated for gamma):

import {
    StandardRetryStrategy,
    defaultRetryDecider,
} from '@aws-sdk/middleware-retry';
import { SdkError } from '@aws-sdk/smithy-client';

const retryDecider = (err: SdkError & { code?: string }) => {
    if (
        'code' in err &&
        (err.code === 'ECONNRESET' ||
            err.code === 'EPIPE' ||
            err.code === 'ETIMEDOUT')
    ) {
        return true;
    } else {
        return defaultRetryDecider(err);
    }
};
// eslint-disable-next-line @typescript-eslint/require-await
const retryStrategy = new StandardRetryStrategy(async () => 3, {
    retryDecider,
});
export const defaultClientConfig = {
    maxRetries: 3,
    retryStrategy,
};

It would be nice if this was built into defaultRetryDecider. Although, is there an argument for this being built into the NodeHttpHandler, since these are Node-specific errors, and ones where the handler should probably "just work"?
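For completeness, usage is just spreading that config into any client constructor; for example (the DynamoDBClient here is only illustrative):

import { DynamoDBClient } from '@aws-sdk/client-dynamodb';

// defaultClientConfig is the object exported in the snippet above
const dynamo = new DynamoDBClient({ ...defaultClientConfig });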

@abierbaum

@studds Thanks for the elegant solution. This is working perfect for me now.

@polovi

polovi commented Jul 4, 2020

Is there anything new, or any plans to fix this? At peak, every second or third Lambda call with the DynamoDB client ends with one of these errors:

Error: write ETIMEDOUT
    at WriteWrap.onWriteComplete [as oncomplete] (internal/stream_base_commons.js:92:16)
    at writevGeneric (internal/stream_base_commons.js:132:26)
    at TLSSocket.Socket._writeGeneric (net.js:782:11)
    at TLSSocket.Socket._writev (net.js:791:8)
    at doWrite (_stream_writable.js:401:12)
    at clearBuffer (_stream_writable.js:519:5)
    at TLSSocket.Writable.uncork (_stream_writable.js:338:7)
    at ClientRequest._flushOutput (_http_outgoing.js:862:10)
    at ClientRequest._flush (_http_outgoing.js:831:22)
    at _http_client.js:315:47 {
  errno: 'ETIMEDOUT',
  code: 'ETIMEDOUT',
  syscall: 'write',
  '$metadata': { retries: 0, totalRetryDelay: 0 }
}
Error: write EPIPE
    at WriteWrap.onWriteComplete [as oncomplete] (internal/stream_base_commons.js:92:16)
    at writevGeneric (internal/stream_base_commons.js:132:26)
    at TLSSocket.Socket._writeGeneric (net.js:782:11)
    at TLSSocket.Socket._writev (net.js:791:8)
    at doWrite (_stream_writable.js:401:12)
    at clearBuffer (_stream_writable.js:519:5)
    at TLSSocket.Writable.uncork (_stream_writable.js:338:7)
    at ClientRequest._flushOutput (_http_outgoing.js:862:10)
    at ClientRequest._flush (_http_outgoing.js:831:22)
    at _http_client.js:315:47 {
  errno: 'EPIPE',
  code: 'EPIPE',
  syscall: 'write',
  '$metadata': { retries: 0, totalRetryDelay: 0 }
}
Error: socket hang up
    at connResetException (internal/errors.js:608:14)
    at TLSSocket.socketOnEnd (_http_client.js:453:23)
    at TLSSocket.emit (events.js:322:22)
    at endReadableNT (_stream_readable.js:1187:12)
    at processTicksAndRejections (internal/process/task_queues.js:84:21) {
  code: 'ECONNRESET',
  '$metadata': { retries: 0, totalRetryDelay: 0 }
}

So I'm unable to use SDK v3 as a replacement for the old SDK.

@trivikr
Member

trivikr commented Jul 9, 2020

Clients in 1.0.0-gamma.3 now retry in case of transient errors.

The check doesn't cover ECONNRESET, ETIMEDOUT, or EPIPE though.
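For illustration only, a rough sketch of what extending that check to these socket-level codes could look like (the constant names mirror @aws-sdk/service-error-classification, but the exact exports are an assumption):

// Error names the SDK already treats as transient (per service-error-classification).
const TRANSIENT_ERROR_NAMES = ["TimeoutError", "RequestTimeout", "RequestTimeoutException"];
// Node.js socket-level error codes reported in this issue, not covered in gamma.3.
const NODE_SOCKET_ERROR_CODES = ["ECONNRESET", "EPIPE", "ETIMEDOUT"];

const isRetryableSocketError = (error: Error & { code?: string }): boolean =>
  TRANSIENT_ERROR_NAMES.includes(error.name) ||
  (error.code !== undefined && NODE_SOCKET_ERROR_CODES.includes(error.code));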

@maxgr0

maxgr0 commented Jul 27, 2020

Got the same problem using 1.0.0-gamma.5 in @aws-sdk/client-dynamodb and @aws-sdk/client-s3.
Don't know where the error comes from, as it is extremely random (around every 10th-15th request). With the "old" aws-sdk this never happens.

@polovi

polovi commented Jul 30, 2020

I confirm, it still happens with 1.0.0-gamma.5

@samirda

samirda commented Aug 11, 2020

I think this has been resolved in 1.0.0-gamma.6

@monken
Author

monken commented Aug 11, 2020

There is no mention of a fix in the release notes: https://github.com/aws/aws-sdk-js-v3/releases/tag/v1.0.0-gamma.6

@samirda

samirda commented Aug 11, 2020

I know but I have not been able to reproduce the errors.

@maxgr0

maxgr0 commented Aug 12, 2020

Issues are still happening in 1.0.0-gamma.6 😕

@polovi

polovi commented Aug 12, 2020

me too, still errors with 1.0.0-gamma.6

@samirda

samirda commented Aug 17, 2020

I was too quick; errors are still happening in 1.0.0-gamma.6.

@maxgr0

maxgr0 commented Aug 25, 2020

Anyone made some progress here? @samirda @polovi @monken

@polovi

polovi commented Aug 31, 2020

There's 1.0.0-gamma.7, but I haven't had time to test it yet.

@polovi

polovi commented Oct 4, 2020

I am testing 1.0.0-gamma.10 in production, with logging via a custom retry strategy.

@polovi

polovi commented Oct 9, 2020

1.0.0-gamma.10 still produces the same errors.

@rraziel

rraziel commented Oct 21, 2020

Has anyone from AWS or a maintainer even commented on this issue?
It seems like this should be a priority, given that it happens in most use cases unless you rarely call your Lambdas.

@monken
Author

monken commented Oct 21, 2020

@AllanFly120 would you mind chiming in?

@martinjuhasz

I'm also seeing this regularly. I increased maxAttempts without any significant improvement.

const dynamodbClient = new DynamoDB({ maxAttempts: 10 })

@monken
Author

monken commented Nov 6, 2020

So we are at release candidate 4 and this problem has not even been acknowledged 😞

@martinjuhasz

martinjuhasz commented Nov 10, 2020

Using the fix from serverless-nextjs fixed it for me. This is not a permanent solution at all, as it simply retries whenever one of the matched error codes is returned.

TS implementation:

import type { SdkError } from '@aws-sdk/smithy-client'
import {
	defaultRetryDecider,
	StandardRetryStrategy,
} from '@aws-sdk/middleware-retry'

// fix error in SDK release candidate
// see: https://github.com/aws/aws-sdk-js-v3/issues/1196
// see: https://github.com/serverless-nextjs/serverless-next.js/pull/720/files
export const retryStrategy = new StandardRetryStrategy(async () => 3, {
	retryDecider: (err: SdkError & { code?: string }) => {
		if (
			'code' in err &&
			(err.code === 'ECONNRESET' ||
				err.code === 'EPIPE' ||
				err.code === 'ETIMEDOUT')
		) {
			return true
		} else {
			return defaultRetryDecider(err)
		}
	},
})

import { DynamoDB } from '@aws-sdk/client-dynamodb'
const dynamodbClient = new DynamoDB({ retryStrategy })

The retryStrategy prop is available in all clients, AFAIK.

Hoping for an actual fix in the next RC.

@trivikr trivikr self-assigned this Nov 11, 2020
@trivikr
Member

trivikr commented Nov 12, 2020

Documentation of the errors discussed in this thread:

  • ECONNRESET (Connection reset by peer): A connection was forcibly closed by a peer. This normally results from a loss of the connection on the remote socket due to a timeout or reboot. Commonly encountered via the http and net modules.
  • EPIPE (Broken pipe): A write on a pipe, socket, or FIFO for which there is no process to read the data. Commonly encountered at the net and http layers, indicative that the remote side of the stream being written to has been closed.
  • ETIMEDOUT (Operation timed out): A connect or send request failed because the connected party did not properly respond after a period of time. Usually encountered by http or net. Often a sign that a socket.end() was not properly called.

Node.js docs: https://nodejs.org/api/errors.html#errors_common_system_errors

@trivikr
Member

trivikr commented Nov 16, 2020

Findings with the example code given in #1196 (comment):

  • The return value from DescribeTableCommand is deserialized in deserializeAws_json1_0DescribeTableCommand:

    export const deserializeAws_json1_0DescribeTableCommand = async (
      output: __HttpResponse,
      context: __SerdeContext
    ): Promise<DescribeTableCommandOutput> => {
      if (output.statusCode >= 300) {
        return deserializeAws_json1_0DescribeTableCommandError(output, context);
      }
      const data: any = await parseBody(output.body, context);
      let contents: any = {};
      contents = deserializeAws_json1_0DescribeTableOutput(data, context);
      const response: DescribeTableCommandOutput = {
        $metadata: deserializeMetadata(output),
        ...contents,
      };
      return Promise.resolve(response);
    };

  • However, when the error is re-thrown by the handler, the deserializer is not called.

The smithy-client doesn't re-throw the error:

} else {
  return handler(command).then((result) => result.output);
}

Question to answer: when the requestHandler throws an error, why is the deserialize step not called?

@trivikr
Member

trivikr commented Nov 16, 2020

Question to answer: when the requestHandler throws an error, why is the deserialize step not called?

The deserializer is not called because the next function in deserializerMiddleware throws:

const { logger, outputFilterSensitiveLog } = context;
const { response } = await next(args);
const parsed = await deserializer(response, options);

As explained in the hack in #1196 (comment), this can be caught and retried in retryMiddleware, as it has a try/catch:

    } catch (err) {
      attempts++;
      if (this.shouldRetry(err as SdkError, attempts, maxAttempts)) {
        retryTokenAmount = this.retryQuota.retrieveRetryTokens(err);
        const delay = this.delayDecider(
          isThrottlingError(err) ? THROTTLING_RETRY_DELAY_BASE : DEFAULT_RETRY_DELAY_BASE,
          attempts
        );
        totalDelay += delay;
        await new Promise((resolve) => setTimeout(resolve, delay));
        continue;
      }
      if (!err.$metadata) {
        err.$metadata = {};
      }
      err.$metadata.attempts = attempts;
      err.$metadata.totalRetryDelay = totalDelay;
      throw err;
    }
  }

Question: should the deserializer be called when the next function throws an error?
Answer: No. The deserializer has no valid response to process if the next function throws, so this should be handled in retryMiddleware.

@rraziel

rraziel commented Nov 16, 2020

Just wondering (but nice to see the issue is being looked at): are you saying this is "just" an error that's not properly handled?

Because it seems to me like this is hiding the issue under the rug, and retrying when there is a failure after a (long) timeout: there is no reason for the request to fail, as there is no throttling or anything happening.

I can easily reproduce the bug by calling the code once every 10-ish minutes: the lambda process will still be around, there is no other load or anything in parallel, but the previous request was not properly "cleaned up".

Again - this has been mentioned before - simply swapping the new v3 request with the old aws-sdk one works right away without any issue.

We migrated a ton of services to use the v3 SDK and can't merge that branch to production because of this.

@trivikr
Member

trivikr commented Nov 16, 2020

Hi @rraziel, I'm currently looking into how JS SDK v2 handles this and will provide a fix in v3 accordingly.

are you saying this is "just" an error that's not properly handled?

The current behavior is undesirable, and the SDK should retry the error instead of asking the user to do it.

@trivikr
Member

trivikr commented Nov 16, 2020

The fix needs to be in node-http-handler, where it should detect ECONNRESET/EPIPE/ETIMEDOUT errors and reject them with the name "TimeoutError" so that they can be retried, the same way the existing TimeoutError for socket timeout is handled.
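A hedged, illustrative sketch of that approach (not the actual #1693 diff; the request/reject wiring is assumed from the surrounding handler code):

import { ClientRequest } from "http";

const NODE_SOCKET_ERROR_CODES = new Set(["ECONNRESET", "EPIPE", "ETIMEDOUT"]);

// Re-label low-level socket errors so the retry strategy classifies them as transient.
function rejectWithRetryableName(request: ClientRequest, reject: (err: Error) => void): void {
  request.on("error", (err: Error & { code?: string }) => {
    if (err.code && NODE_SOCKET_ERROR_CODES.has(err.code)) {
      reject(Object.assign(err, { name: "TimeoutError" }));
    } else {
      reject(err);
    }
  });
}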

@trivikr
Member

trivikr commented Nov 18, 2020

This issue is fixed in #1693, and will be published in rc.7 on Thursday 11/19

@trivikr
Member

trivikr commented Nov 20, 2020

Update: release v1.0.0-rc.7 (https://github.com/aws/aws-sdk-js-v3/releases/tag/v1.0.0-rc.7) is complete.

Verified that this issue is fixed using the example code given in #1196 (comment).

@github-actions

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs and link to relevant comments in this thread.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jan 16, 2021