Lambda servicecachehealth times out #5467
azul-service-anvilprod-servicecachehealth failed due to a timeout
@nadove-ucsc: "Spike to correlate the failed request with an invocation of the service lambda in CloudWatch Log Insights. Since the lambda timed out, the log entry may be missing from ApiGateway." |
This is similar behavior to what's observed in CloudWatch Logs Insights.
This might be the same underlying problem as in #5472 (comment).
The multiple requests to the various …
@nadove-ucsc: "Logs show that during the past weeks, when there is a delayed response from TDR the delay doesn't exceed 12 seconds. Because the TDR client is configured to attempt a maximum of two connect retries each with a five second timeout, the maximum latency we can observe from TDR is 15 seconds. The service cache health lambda has a timeout of 10 seconds, meaning that it may timeout even when the service lambda is within the bounds of its expected latency. We either need to increase the timeout of the service cache Lambda or reduce the timeout of the TDR retries." |
@hannes-ucsc: "Assignee to determine how often the service lambda times out in response to requests made by the servicecachehealth lambda, in terms of a ratio between number of total requests and timed out requests" |
From 08.01.23 to 09.07.23, the following number of … [
{
"bin(24h)": "2023-09-05 00:00:00.000",
"count(@requestId)": "121"
},
{
"bin(24h)": "2023-08-30 00:00:00.000",
"count(@requestId)": "2"
},
{
"bin(24h)": "2023-08-29 00:00:00.000",
"count(@requestId)": "1"
},
{
"bin(24h)": "2023-08-24 00:00:00.000",
"count(@requestId)": "114"
},
{
"bin(24h)": "2023-08-11 00:00:00.000",
"count(@requestId)": "2"
},
{
"bin(24h)": "2023-08-08 00:00:00.000",
"count(@requestId)": "1"
}
] … note that approximately 1440 service/indexer cachehealth executions run on a daily basis, with each execution making five service requests (one per entity endpoint); a rough ratio is sketched after the data below. [
{
"bin(24h)": "2023-09-05 00:00:00.000",
"count(@requestId)": "2"
},
{
"bin(24h)": "2023-08-30 00:00:00.000",
"count(@requestId)": "2"
},
{
"bin(24h)": "2023-08-10 00:00:00.000",
"count(@requestId)": "1"
},
{
"bin(24h)": "2023-08-09 00:00:00.000",
"count(@requestId)": "3"
},
{
"bin(24h)": "2023-08-08 00:00:00.000",
"count(@requestId)": "3"
}
] … and these are the counts of the service executions that took longer than 10 s … [
{
"bin(24h)": "2023-09-05 00:00:00.000",
"count(@requestId)": "590"
},
{
"bin(24h)": "2023-08-30 00:00:00.000",
"count(@requestId)": "2"
},
{
"bin(24h)": "2023-08-24 00:00:00.000",
"count(@requestId)": "572"
},
{
"bin(24h)": "2023-08-18 00:00:00.000",
"count(@requestId)": "1"
},
{
"bin(24h)": "2023-08-14 00:00:00.000",
"count(@requestId)": "1"
},
{
"bin(24h)": "2023-08-11 00:00:00.000",
"count(@requestId)": "9"
},
{
"bin(24h)": "2023-08-10 00:00:00.000",
"count(@requestId)": "1"
},
{
"bin(24h)": "2023-08-09 00:00:00.000",
"count(@requestId)": "4"
},
{
"bin(24h)": "2023-08-08 00:00:00.000",
"count(@requestId)": "6"
}
]
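For a rough sense of scale, a back-of-the-envelope sketch, assuming the daily request volume stated above and using the worst reported day as an illustration (this conflates service executions exceeding 10 s with requests originating from the cachehealth lambda, which is an approximation):

```python
# Back-of-the-envelope ratio; the daily volume is from the note above, the
# timeout count is the worst reported day (2023-09-05) and is illustrative.
executions_per_day = 1440        # cachehealth executions per day
requests_per_execution = 5       # one service request per entity endpoint
total_requests_per_day = executions_per_day * requests_per_execution   # 7200

slow_executions = 590            # service executions > 10 s on 2023-09-05

ratio = slow_executions / total_requests_per_day
print(f'{slow_executions} / {total_requests_per_day} = {ratio:.1%}')   # ≈ 8.2%
```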
@hannes-ucsc: "I'd like to tackle this by optimizing the cachehealth Lambda and our retry policy with the goal of reducing the number of false alarms to an acceptably low level: I don't think we need to make a service request for every entity type, (which also eliminates the need for a thread pool) and we shouldn't retry on read when the timing is restricted. The use of a thread pool for making service requests in parallel also subverts our caching of accessible sources." |
Assignee to consider next steps in light of the updated description.
@hannes-ucsc: "@achave11-ucsc and I research another occurrence. We looked at the S3 data events in the trail and could not find a data event for the boto3 PUT request that timed out. We also looked at the VPC flow logs between the Lambda and the NAT Gateway as well as the NAT Gateway and the external IPs but could not find anything out of the ordinary, specifically no dropped or rejected connection. While it's still possible that the network is flaky, the garbage collection hypothesis is more likely." |
Actually, googling around, there are reports of intermittent connection errors between a Lambda function in a VPC and S3, and of the fact that enabling a VPC endpoint for S3 fixes them. Currently, all traffic between our Lambda functions and AWS services is routed via a NAT gateway, and that seems to be causing the flakiness. There are two types of endpoints to choose from for S3: gateway endpoints and interface endpoints. The former operate at the routing level and, unlike the latter, don't require configuring clients to use a dedicated hostname for the endpoint. It appears that interface endpoints function like load balancers or proxies, and they do require special configuration of clients. Gateway endpoints are free; interface endpoints are not, but are still four times cheaper than NAT gateways. On the other hand, gateway endpoints are currently only available for S3 and DynamoDB. We should enable gateway endpoints for both to resolve this issue, and follow up with enabling interface endpoints for other AWS services heavily used by our Lambda functions; SQS comes to mind. We should also double-check that traffic between the Lambda functions and the ES domains isn't routed through the NAT gateway.
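For reference, a minimal sketch of creating an S3 gateway endpoint with boto3 (the VPC and route table IDs are placeholders; in practice this would presumably go into the Terraform config instead):

```python
# Sketch: enable a gateway VPC endpoint for S3. IDs are placeholders.
import boto3

ec2 = boto3.client('ec2', region_name='us-east-1')

response = ec2.create_vpc_endpoint(
    VpcId='vpc-0123456789abcdef0',              # placeholder
    ServiceName='com.amazonaws.us-east-1.s3',
    VpcEndpointType='Gateway',                  # routing-level, no client changes needed
    RouteTableIds=['rtb-0123456789abcdef0'],    # placeholder: private subnet route tables
)
print(response['VpcEndpoint']['VpcEndpointId'])
```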
Assignee to do that.
Assignee to run the VPC Reachability Analyzer for that. They may need to coordinate with me for assistance.
VPC Reachability Analyzer findings: the analyzer was run from the Lambda (in a personal deployment) directly to the IP of the ES domain.
The results exhibit the desired behavior: traffic between the Lambda and the ES domain is not routed through the NAT gateway.
For good measure, we also analyzed the connection between the Lambda and a known public IP (8.8.8.8), which, as expected, is routed through the NAT gateway.
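Such an analysis can also be run programmatically; a sketch using boto3's Network Insights APIs, with placeholder resource IDs (not the actual resources analyzed):

```python
# Sketch: run VPC Reachability Analyzer via boto3. The ENI IDs are placeholders.
import boto3

ec2 = boto3.client('ec2', region_name='us-east-1')

path = ec2.create_network_insights_path(
    Source='eni-0123456789abcdef0',        # placeholder: the Lambda's ENI
    Destination='eni-0fedcba9876543210',   # placeholder: the ES domain's ENI
    Protocol='tcp',
    DestinationPort=443,
)
analysis = ec2.start_network_insights_analysis(
    NetworkInsightsPathId=path['NetworkInsightsPath']['NetworkInsightsPathId'],
)
# Poll describe_network_insights_analyses until the analysis completes.
print(analysis['NetworkInsightsAnalysis']['NetworkInsightsAnalysisId'])
```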
… when requesting https://service.prod.anvil.gi.ucsc.edu/index/files?size=1 to build the health object (CloudWatch Logs Insights).
EDIT: (@achave11-ucsc)
Better logging has now allowed for the detection of an S3 PUT operation timing out. This is somewhat surprising, since the object isn't of significant size.
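One way to make such a hang surface quickly, instead of consuming the Lambda's entire timeout, would be to tighten the S3 client's timeouts and let botocore retry; a sketch only, with assumed values rather than Azul's actual configuration:

```python
# Sketch: tighter S3 client timeouts so a hung PUT fails fast and is retried.
# Values, bucket and key are assumptions, not Azul's configuration.
import boto3
from botocore.config import Config

s3 = boto3.client(
    's3',
    config=Config(
        connect_timeout=2,    # seconds
        read_timeout=5,       # seconds
        retries={'max_attempts': 2, 'mode': 'standard'},
    ),
)
s3.put_object(Bucket='example-bucket', Key='health/service', Body=b'{}')
```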
CloudWatch Logs Insights
region: us-east-1
log-group-names: /aws/lambda/azul-service-prod-servicecachehealth
start-time: 2024-01-04T18:34:57.841Z
end-time: 2024-01-05T04:01:43.597Z
query-string:
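A query like this can be run with boto3 against the log group above; a sketch, where the query string is an assumption reconstructed from the result columns shown earlier (bin(24h), count(@requestId)) rather than the query actually used here:

```python
# Sketch: run a Logs Insights query via boto3. The query string is an
# assumption; @duration is in milliseconds, so > 10000 means "longer than 10 s".
from datetime import datetime, timezone

import boto3

logs = boto3.client('logs', region_name='us-east-1')

query_id = logs.start_query(
    logGroupName='/aws/lambda/azul-service-prod-servicecachehealth',
    startTime=int(datetime(2024, 1, 4, 18, 34, 57, tzinfo=timezone.utc).timestamp()),
    endTime=int(datetime(2024, 1, 5, 4, 1, 43, tzinfo=timezone.utc).timestamp()),
    queryString='filter @type = "REPORT" and @duration > 10000 '
                '| stats count(@requestId) by bin(24h)',
)['queryId']

# In practice, poll until the query status is 'Complete'.
results = logs.get_query_results(queryId=query_id)
```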