Lambda servicecachehealth times out #5467

Closed
14 tasks done
achave11-ucsc opened this issue Aug 15, 2023 · 22 comments
Labels
+ [priority] High
bug [type] A defect preventing use of the system as specified
debt [type] A defect incurring continued engineering cost
demo [process] To be demonstrated at the end of the sprint
demoed [process] Successfully demonstrated to team
infra [subject] Project infrastructure like CI/CD, build and deployment scripts
noise [subject] Causing many false alarms
orange [process] Done by the Azul team
spike:1 [process] Spike estimate of one point

Comments

@achave11-ucsc
Member

achave11-ucsc commented Aug 15, 2023

… when requesting https://service.prod.anvil.gi.ucsc.edu/index/files?size=1 to build the health object.

Logs Insights

2023-08-11T15:28:48.006Z f3feb75c-6d88-4a66-9534-d82309b5ffd1 Task timed out after 10.02 seconds

EDIT: (@achave11-ucsc)

Better logging has now allowed for the detection of an S3 PUT operation timing out. It's somewhat surprising since the object isn't of significant size.

CloudWatch Logs Insights
region: us-east-1
log-group-names: /aws/lambda/azul-service-prod-servicecachehealth
start-time: 2024-01-04T18:34:57.841Z
end-time: 2024-01-05T04:01:43.597Z
query-string:

fields @timestamp
| filter @message like /0d162516/
| filter @message not like /END/
| parse @message '05fb54b90376*' as msg
| parse @log 'lambda/*' as lambda
| sort @timestamp asc
| limit 20

@timestamp msg lambda
2024-01-04 23:53:29.896 Version: $LATEST azul-service-prod-servicecachehealth
2024-01-04 23:53:29.896 azul.health Getting health property 'api_endpoints' azul-service-prod-servicecachehealth
2024-01-04 23:53:29.897 azul.health Making HEAD request to https://service.azul.data.humancellatlas.org/index/files?size=1 azul-service-prod-servicecachehealth
2024-01-04 23:53:34.963 azul.health Got 200 response after 5.065s from HEAD request to https://service.azul.data.humancellatlas.org/index/files?size=1 azul-service-prod-servicecachehealth
2024-01-04 23:53:34.964 azul.health Getting health property 'elasticsearch' azul-service-prod-servicecachehealth
2024-01-04 23:53:34.964 elasticsearch Making HEAD request to https://vpc-azul-index-prod-mjluewlvxikvhillefxotw2tbm.us-east-1.es.amazonaws.com:443/ azul-service-prod-servicecachehealth
2024-01-04 23:53:34.964 elasticsearch … without request body azul-service-prod-servicecachehealth
2024-01-04 23:53:35.001 elasticsearch Got 200 response after 0.037s from HEAD to https://vpc-azul-index-prod-mjluewlvxikvhillefxotw2tbm.us-east-1.es.amazonaws.com:443/ azul-service-prod-servicecachehealth
2024-01-04 23:53:35.001 elasticsearch … with response body '' azul-service-prod-servicecachehealth
2024-01-04 23:53:35.002 azul.boto3 s3.PutObject: Making PUT request to https://s3.amazonaws.com/edu-ucsc-gi-platform-hca-prod-storage-prod.us-east-1/health/service azul-service-prod-servicecachehealth
2024-01-04 23:53:35.002 azul.boto3 s3.PutObject: … with nonprintable body (<class '_io.BytesIO'>) azul-service-prod-servicecachehealth
2024-01-04 23:54:08.983 Task timed out after 39.09 seconds azul-service-prod-servicecachehealth
2024-01-04 23:54:08.983 Duration: 39087.51 ms Billed Duration: 39000 ms Memory Size: 128 MB Max Memory Used: 121 MB azul-service-prod-servicecachehealth
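To see how much of the invocation was spent waiting on S3, the timestamps in the output above can simply be subtracted. This is a minimal sketch that uses only values copied from that output; the helper name `seconds_between` is made up for illustration.

```python
from datetime import datetime

FMT = '%Y-%m-%d %H:%M:%S.%f'


def seconds_between(start: str, end: str) -> float:
    """Return the elapsed time in seconds between two log timestamps."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return delta.total_seconds()


# Timestamps copied from the Logs Insights output above
invocation_start = '2024-01-04 23:53:29.896'
put_started = '2024-01-04 23:53:35.002'
timed_out = '2024-01-04 23:54:08.983'

before_put = seconds_between(invocation_start, put_started)
waiting_on_put = seconds_between(put_started, timed_out)
print(f'{before_put:.3f}s before the PUT, {waiting_on_put:.3f}s waiting on it')
```

Under these numbers, nearly all of the invocation's duration falls between the start of the PUT request and the timeout.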

  • Security design review completed; the Resolution of this issue does not
    • … affect authentication; for example:
      • OAuth 2.0 with the application (API or Swagger UI)
      • Authentication of developers with Google Cloud APIs
      • Authentication of developers with AWS APIs
      • Authentication with a GitLab instance in the system
      • Password and 2FA authentication with GitHub
      • API access token authentication with GitHub
      • Authentication with
    • … affect the permissions of internal users like access to
      • Cloud resources on AWS and GCP
      • GitLab repositories, projects and groups, administration
      • an EC2 instance via SSH
      • GitHub issues, pull requests, commits, commit statuses, wikis, repositories, organizations
    • … affect the permissions of external users like access to
      • TDR snapshots
    • … affect permissions of service or bot accounts
      • Cloud resources on AWS and GCP
    • … affect audit logging in the system, like
      • adding, removing or changing a log message that represents an auditable event
      • changing the routing of log messages through the system
    • … affect monitoring of the system
    • … introduce a new software dependency like
      • Python packages on PYPI
      • Command-line utilities
      • Docker images
      • Terraform providers
    • … add an interface that exposes sensitive or confidential data at the security boundary
    • … affect the encryption of data at rest
    • … require persistence of sensitive or confidential data that might require encryption at rest
    • … require unencrypted transmission of data within the security boundary
    • … affect the network security layer; for example by
      • modifying, adding or removing firewall rules
      • modifying, adding or removing security groups
      • changing or adding a port a service, proxy or load balancer listens on
  • Documentation on any unchecked boxes is provided in comments below
@achave11-ucsc achave11-ucsc added the orange [process] Done by the Azul team label Aug 15, 2023
@achave11-ucsc achave11-ucsc changed the title from "Lambda azul-service-anvilprod-servicecachehealth errored due to a timeout" to "Lambda azul-service-anvilprod-servicecachehealth failed due to a timeout" Aug 15, 2023
@achave11-ucsc
Member Author

@nadove-ucsc: "Spike to correlate the failed request with an invocation of the service lambda in CloudWatch Log Insights. Since the lambda timed out, the log entry may be missing from ApiGateway."

@achave11-ucsc achave11-ucsc added the spike:1 [process] Spike estimate of one point label Aug 15, 2023
@achave11-ucsc achave11-ucsc self-assigned this Aug 15, 2023
@achave11-ucsc
Member Author

This is a similar behavior to what's observed in Lambda azul-service-dev timesout and causes an alarm #5472.
The service lambda has an execution timeout of 31 s, so HEAD responses from the API endpoints may take longer than 10 s, which is the execution timeout of the servicecachehealth lambda.

CloudWatch Logs Insights
region: us-east-1
log-group-names: /aws/lambda/azul-service-anvilprod-servicecachehealth, /aws/lambda/azul-service-anvilprod
start-time: 2023-08-11T15:28:29.000Z
end-time: 2023-08-11T15:29:09.000Z
query-string:

fields @timestamp, @message
| filter @message like /START|HEAD|REPORT/
| sort @timestamp asc
| limit 50

@timestamp @message
2023-08-11 15:28:37.986 START RequestId: f3feb75c-6d88-4a66-9534-d82309b5ffd1 Version: $LATEST
2023-08-11 15:28:38.623 START RequestId: ccfc09f7-47d6-4057-94ca-6a4093fe5b0e Version: $LATEST
2023-08-11 15:28:38.626 [INFO] 2023-08-11T15:28:38.624Z ccfc09f7-47d6-4057-94ca-6a4093fe5b0e azul.chalice Received HEAD request for '/index/datasets', with {"query": {"size": "1"}, "headers": {"accept": "/", "accept-encoding": "gzip, deflate", "cloudfront-forwarded-proto": "https", "cloudfront-is-desktop-viewer": "true", "cloudfront-is-mobile-viewer": "false", "cloudfront-is-smarttv-viewer": "false", "cloudfront-is-tablet-viewer": "false", "cloudfront-viewer-asn": "14618", "cloudfront-viewer-country": "US", "host": "service.prod.anvil.gi.ucsc.edu", "user-agent": "python-requests/2.31.0", "via": "1.1 2f66aa06710fece8ed203ab0ea81eb56.cloudfront.net (CloudFront)", "x-amz-cf-id": "TTN-ca7DEX86W3d1YwWZMvhzggTdRbNy-OUP5XPR8JjrgDfLiwEI2w==", "x-amzn-trace-id": "Root=1-64d653a6-3af69e3067c2d14c0b13975b", "x-forwarded-for": "35.168.152.160, 130.176.98.154", "x-forwarded-port": "443", "x-forwarded-proto": "https"}}.
2023-08-11 15:28:38.729 START RequestId: dca268db-84ff-4549-9bec-47aa109b0d39 Version: $LATEST
2023-08-11 15:28:38.729 [INFO] 2023-08-11T15:28:38.729Z dca268db-84ff-4549-9bec-47aa109b0d39 azul.chalice Received HEAD request for '/index/activities', with {"query": {"size": "1"}, "headers": {"accept": "/", "accept-encoding": "gzip, deflate", "cloudfront-forwarded-proto": "https", "cloudfront-is-desktop-viewer": "true", "cloudfront-is-mobile-viewer": "false", "cloudfront-is-smarttv-viewer": "false", "cloudfront-is-tablet-viewer": "false", "cloudfront-viewer-asn": "14618", "cloudfront-viewer-country": "US", "host": "service.prod.anvil.gi.ucsc.edu", "user-agent": "python-requests/2.31.0", "via": "1.1 e3e94284a800d30d02bd662be67e1bf2.cloudfront.net (CloudFront)", "x-amz-cf-id": "ocERjZmXR9yUTD6zZraMTFRzHrOelHU3BeTAZvwUAXwl15eqbI1SuA==", "x-amzn-trace-id": "Root=1-64d653a6-455f98b1610609da18e54a3e", "x-forwarded-for": "35.168.152.160, 130.176.98.85", "x-forwarded-port": "443", "x-forwarded-proto": "https"}}.
2023-08-11 15:28:38.811 START RequestId: a8ba3c0f-65fa-4bb3-948d-da2052e9b33a Version: $LATEST
2023-08-11 15:28:38.812 [INFO] 2023-08-11T15:28:38.812Z a8ba3c0f-65fa-4bb3-948d-da2052e9b33a azul.chalice Received HEAD request for '/index/donors', with {"query": {"size": "1"}, "headers": {"accept": "/", "accept-encoding": "gzip, deflate", "cloudfront-forwarded-proto": "https", "cloudfront-is-desktop-viewer": "true", "cloudfront-is-mobile-viewer": "false", "cloudfront-is-smarttv-viewer": "false", "cloudfront-is-tablet-viewer": "false", "cloudfront-viewer-asn": "14618", "cloudfront-viewer-country": "US", "host": "service.prod.anvil.gi.ucsc.edu", "user-agent": "python-requests/2.31.0", "via": "1.1 157ebd6865840045fc8b5ed1cce7e466.cloudfront.net (CloudFront)", "x-amz-cf-id": "BZMFJdu21Pi_Lm9oT4JqVwtB7rv8xkEupTvPu0uSogJelHR-Z4JuSg==", "x-amzn-trace-id": "Root=1-64d653a6-58743e3108b286402d7630d4", "x-forwarded-for": "35.168.152.160, 130.176.98.142", "x-forwarded-port": "443", "x-forwarded-proto": "https"}}.
2023-08-11 15:28:38.831 START RequestId: bb742e51-9730-4635-b5f5-52133b4d239c Version: $LATEST
2023-08-11 15:28:38.835 [INFO] 2023-08-11T15:28:38.832Z bb742e51-9730-4635-b5f5-52133b4d239c azul.chalice Received HEAD request for '/index/biosamples', with {"query": {"size": "1"}, "headers": {"accept": "/", "accept-encoding": "gzip, deflate", "cloudfront-forwarded-proto": "https", "cloudfront-is-desktop-viewer": "true", "cloudfront-is-mobile-viewer": "false", "cloudfront-is-smarttv-viewer": "false", "cloudfront-is-tablet-viewer": "false", "cloudfront-viewer-asn": "14618", "cloudfront-viewer-country": "US", "host": "service.prod.anvil.gi.ucsc.edu", "user-agent": "python-requests/2.31.0", "via": "1.1 ffa4b37ccdc94a8c62bf6b6414725210.cloudfront.net (CloudFront)", "x-amz-cf-id": "SkHSMFRSJY0En2S_jdgGAYrIweN-uRSTeoJ68UJz3sJEmeOxE7WM3g==", "x-amzn-trace-id": "Root=1-64d653a6-14ecea643dc679562802a771", "x-forwarded-for": "35.168.152.160, 130.176.98.139", "x-forwarded-port": "443", "x-forwarded-proto": "https"}}.
2023-08-11 15:28:38.915 START RequestId: 59cfe9e5-d3de-48db-87a5-1786b1bab6cc Version: $LATEST
2023-08-11 15:28:38.918 [INFO] 2023-08-11T15:28:38.916Z 59cfe9e5-d3de-48db-87a5-1786b1bab6cc azul.chalice Received HEAD request for '/index/files', with {"query": {"size": "1"}, "headers": {"accept": "/", "accept-encoding": "gzip, deflate", "cloudfront-forwarded-proto": "https", "cloudfront-is-desktop-viewer": "true", "cloudfront-is-mobile-viewer": "false", "cloudfront-is-smarttv-viewer": "false", "cloudfront-is-tablet-viewer": "false", "cloudfront-viewer-asn": "14618", "cloudfront-viewer-country": "US", "host": "service.prod.anvil.gi.ucsc.edu", "user-agent": "python-requests/2.31.0", "via": "1.1 b471d3775e81a9be536b52b99f39452a.cloudfront.net (CloudFront)", "x-amz-cf-id": "RP0UFv01s7G8NsgoEqxLm-MD-VJEKD4_n6cB0DleM41Ws9KeU10rlw==", "x-amzn-trace-id": "Root=1-64d653a6-6899384a697c2ea654b68edb", "x-forwarded-for": "35.168.152.160, 130.176.98.76", "x-forwarded-port": "443", "x-forwarded-proto": "https"}}.
2023-08-11 15:28:48.006 REPORT RequestId: f3feb75c-6d88-4a66-9534-d82309b5ffd1 Duration: 10020.03 ms Billed Duration: 10000 ms Memory Size: 128 MB Max Memory Used: 122 MB
2023-08-11 15:28:48.159 INIT_START Runtime Version: python:3.9.v26 Runtime Version ARN: arn:aws:lambda:us-east-1::runtime:130681a0855afedf31b2b3fbcc2fbf1ca62875e0500edb56fd16cad65045b05b
2023-08-11 15:28:49.747 REPORT RequestId: bb742e51-9730-4635-b5f5-52133b4d239c Duration: 10915.00 ms Billed Duration: 10916 ms Memory Size: 2048 MB Max Memory Used: 149 MB
2023-08-11 15:28:49.832 REPORT RequestId: 59cfe9e5-d3de-48db-87a5-1786b1bab6cc Duration: 10916.96 ms Billed Duration: 10917 ms Memory Size: 2048 MB Max Memory Used: 158 MB
2023-08-11 15:28:50.519 REPORT RequestId: a8ba3c0f-65fa-4bb3-948d-da2052e9b33a Duration: 11707.90 ms Billed Duration: 11708 ms Memory Size: 2048 MB Max Memory Used: 165 MB
2023-08-11 15:28:53.711 REPORT RequestId: ccfc09f7-47d6-4057-94ca-6a4093fe5b0e Duration: 15088.10 ms Billed Duration: 15089 ms Memory Size: 2048 MB Max Memory Used: 160 MB
2023-08-11 15:28:53.809 REPORT RequestId: dca268db-84ff-4549-9bec-47aa109b0d39 Duration: 15080.19 ms Billed Duration: 15081 ms Memory Size: 2048 MB Max Memory Used: 215 MB
2023-08-11 15:28:55.416 START RequestId: 466ba2a9-0366-47ac-adb2-2c6c3f39a257 Version: $LATEST
2023-08-11 15:28:55.417 START RequestId: 0792fe28-c7d5-4435-b154-4c0339fe9089 Version: $LATEST
2023-08-11 15:28:55.601 REPORT RequestId: 466ba2a9-0366-47ac-adb2-2c6c3f39a257 Duration: 184.29 ms Billed Duration: 185 ms Memory Size: 2048 MB Max Memory Used: 215 MB
2023-08-11 15:28:55.804 REPORT RequestId: 0792fe28-c7d5-4435-b154-4c0339fe9089 Duration: 387.19 ms Billed Duration: 388 ms Memory Size: 2048 MB Max Memory Used: 170 MB
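The REPORT lines above can be filtered mechanically for invocations that ran longer than the 10 s cachehealth timeout. The following is only a sketch: the function name and the regular expression are my own, and the sample lines are a subset of the output above.

```python
import re

# Matches the request ID and the (unbilled) duration of a Lambda REPORT line
REPORT = re.compile(
    r'REPORT RequestId: (?P<request_id>[0-9a-f-]+)\s+Duration: (?P<ms>[\d.]+) ms'
)


def slow_invocations(lines, threshold_s=10.0):
    """Return (request_id, seconds) for REPORT lines above the threshold."""
    result = []
    for line in lines:
        match = REPORT.search(line)
        if match is not None:
            seconds = float(match['ms']) / 1000
            if seconds > threshold_s:
                result.append((match['request_id'], seconds))
    return result


# A subset of the log lines shown above
sample = [
    '2023-08-11 15:28:48.006 REPORT RequestId: f3feb75c-6d88-4a66-9534-d82309b5ffd1 Duration: 10020.03 ms Billed Duration: 10000 ms Memory Size: 128 MB Max Memory Used: 122 MB',
    '2023-08-11 15:28:49.747 REPORT RequestId: bb742e51-9730-4635-b5f5-52133b4d239c Duration: 10915.00 ms Billed Duration: 10916 ms Memory Size: 2048 MB Max Memory Used: 149 MB',
    '2023-08-11 15:28:55.601 REPORT RequestId: 466ba2a9-0366-47ac-adb2-2c6c3f39a257 Duration: 184.29 ms Billed Duration: 185 ms Memory Size: 2048 MB Max Memory Used: 215 MB',
]

slow = slow_invocations(sample)
for request_id, seconds in slow:
    print(f'{request_id} took {seconds:.2f}s')
```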

@achave11-ucsc
Member Author

This might be the same underlying problem as in #5472 (comment).

@achave11-ucsc
Member Author

AWS Logs Insights

The requests to the various /index/{entity_type} endpoints took longer than expected because the /api/repository/v1/snapshots/roleMap requests that the service makes to Terra timed out. Each of these requests timed out after 5 seconds and was retried at least twice, with each retry again subject to a timeout of 5 seconds. Two such retries already put the servicecache lambda beyond its execution time limit.

@achave11-ucsc
Member Author

achave11-ucsc commented Aug 23, 2023

This may also be affecting azul-indexer-anvildev-indexercachehealth, which recently failed due to a Lambda execution timeout.
Screen Shot 2023-08-22 at 5 07 08 PM

Around this time, the GitLab anvildev instance was being rebooted as part of a scheduled volume backup, but I'm not sure that it's related.

@achave11-ucsc
Member Author

@nadove-ucsc: "Logs show that during the past weeks, when there is a delayed response from TDR, the delay doesn't exceed 12 seconds. Because the TDR client is configured to attempt a maximum of two connect retries, each with a five second timeout, the maximum latency we can observe from TDR is 15 seconds. The service cache health lambda has a timeout of 10 seconds, meaning that it may time out even when the service lambda is within the bounds of its expected latency. We either need to increase the timeout of the service cache Lambda or reduce the timeout of the TDR retries."
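The arithmetic behind the 15 second figure can be spelled out as follows. This is a sketch that assumes the initial attempt and every retry each run into the full timeout; the function name is made up for illustration.

```python
def worst_case_latency(timeout_s: float, max_retries: int) -> float:
    """Upper bound on the time spent if the first attempt and all retries time out."""
    attempts = 1 + max_retries
    return attempts * timeout_s


# Values taken from the comment above
tdr_worst_case = worst_case_latency(timeout_s=5, max_retries=2)
cachehealth_timeout = 10

print(f'TDR worst case: {tdr_worst_case}s, cachehealth timeout: {cachehealth_timeout}s')
print('cachehealth can time out first:', tdr_worst_case > cachehealth_timeout)
```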

@achave11-ucsc
Member Author

Degraded performance from TDR (503 responses) also causes the servicecache lambda to time out.
Screen Shot 2023-08-24 at 11 12 13 AM

@dsotirho-ucsc
Contributor

@hannes-ucsc: "Assignee to determine how often the service lambda times out in response to requests made by the servicecachehealth lambda, in terms of a ratio between number of total requests and timed out requests"

@achave11-ucsc
Member Author

From 08.01.23 to 09.07.23, the following number of azul-service-anvilprod-servicecachehealth executions failed…

[
    {
        "bin(24h)": "2023-09-05 00:00:00.000",
        "count(@requestId)": "121"
    },
    {
        "bin(24h)": "2023-08-30 00:00:00.000",
        "count(@requestId)": "2"
    },
    {
        "bin(24h)": "2023-08-29 00:00:00.000",
        "count(@requestId)": "1"
    },
    {
        "bin(24h)": "2023-08-24 00:00:00.000",
        "count(@requestId)": "114"
    },
    {
        "bin(24h)": "2023-08-11 00:00:00.000",
        "count(@requestId)": "2"
    },
    {
        "bin(24h)": "2023-08-08 00:00:00.000",
        "count(@requestId)": "1"
    }
]

… note that approximately 1440 service/indexer cachehealth executions run per day, with each execution making five service requests (one per entity endpoint).
For the same timeframe, the following number of service executions timed out…

[
    {
        "bin(24h)": "2023-09-05 00:00:00.000",
        "count(@requestId)": "2"
    },
    {
        "bin(24h)": "2023-08-30 00:00:00.000",
        "count(@requestId)": "2"
    },
    {
        "bin(24h)": "2023-08-10 00:00:00.000",
        "count(@requestId)": "1"
    },
    {
        "bin(24h)": "2023-08-09 00:00:00.000",
        "count(@requestId)": "3"
    },
    {
        "bin(24h)": "2023-08-08 00:00:00.000",
        "count(@requestId)": "3"
    }
]

… and the following is the count of service executions that took longer than 10 s …

[
    {
        "bin(24h)": "2023-09-05 00:00:00.000",
        "count(@requestId)": "590"
    },
    {
        "bin(24h)": "2023-08-30 00:00:00.000",
        "count(@requestId)": "2"
    },
    {
        "bin(24h)": "2023-08-24 00:00:00.000",
        "count(@requestId)": "572"
    },
    {
        "bin(24h)": "2023-08-18 00:00:00.000",
        "count(@requestId)": "1"
    },
    {
        "bin(24h)": "2023-08-14 00:00:00.000",
        "count(@requestId)": "1"
    },
    {
        "bin(24h)": "2023-08-11 00:00:00.000",
        "count(@requestId)": "9"
    },
    {
        "bin(24h)": "2023-08-10 00:00:00.000",
        "count(@requestId)": "1"
    },
    {
        "bin(24h)": "2023-08-09 00:00:00.000",
        "count(@requestId)": "4"
    },
    {
        "bin(24h)": "2023-08-08 00:00:00.000",
        "count(@requestId)": "6"
    }
]
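To express these counts as ratios, they can be divided by the daily totals mentioned above. This is a rough sketch with two assumptions of mine: that the figure of 1440 applies to the servicecachehealth Lambda alone, and that the service handles exactly five cachehealth requests per execution and nothing else, which understates the real denominator for the service.

```python
EXECUTIONS_PER_DAY = 1440    # approximate figure from the comment above
REQUESTS_PER_EXECUTION = 5   # one service request per entity endpoint

# Counts copied from the first and third result sets above
failed_cachehealth = {
    '2023-09-05': 121, '2023-08-30': 2, '2023-08-29': 1,
    '2023-08-24': 114, '2023-08-11': 2, '2023-08-08': 1,
}
slow_service = {
    '2023-09-05': 590, '2023-08-30': 2, '2023-08-24': 572,
    '2023-08-18': 1, '2023-08-14': 1, '2023-08-11': 9,
    '2023-08-10': 1, '2023-08-09': 4, '2023-08-08': 6,
}


def ratios(counts, total_per_day):
    """Map each day to its count as a fraction of the daily total."""
    return {day: count / total_per_day for day, count in counts.items()}


cachehealth_ratios = ratios(failed_cachehealth, EXECUTIONS_PER_DAY)
# Assumption: only cachehealth traffic is counted in the denominator
service_ratios = ratios(slow_service, EXECUTIONS_PER_DAY * REQUESTS_PER_EXECUTION)

for day, ratio in sorted(cachehealth_ratios.items()):
    print(f'{day}: {ratio:.2%} of cachehealth executions failed')
```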

@dsotirho-ucsc
Contributor

@hannes-ucsc: "I'd like to tackle this by optimizing the cachehealth Lambda and our retry policy with the goal of reducing the number of false alarms to an acceptably low level: I don't think we need to make a service request for every entity type (which also eliminates the need for a thread pool), and we shouldn't retry on read when the timing is restricted. The use of a thread pool for making service requests in parallel also subverts our caching of accessible sources."
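A single health request without retries could look roughly like the sketch below. It uses only the standard library and is not the project's actual implementation; the function name and the `opener` parameter (there to allow substituting a fake in tests) are hypothetical.

```python
import urllib.request


def endpoint_is_up(url: str, timeout_s: float, opener=urllib.request.urlopen) -> bool:
    """Make exactly one HEAD request with a hard timeout and no retries."""
    request = urllib.request.Request(url, method='HEAD')
    try:
        with opener(request, timeout=timeout_s) as response:
            return response.status == 200
    except OSError:
        # URLError, HTTPError and socket timeouts all derive from OSError.
        # A failure is reported right away instead of being retried.
        return False


# Example call, one entity type only (not executed here):
# endpoint_is_up('https://service.prod.anvil.gi.ucsc.edu/index/files?size=1', timeout_s=5)
```

Because there is only one request, no thread pool is needed, and the worst-case duration is bounded by the single timeout.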

@dsotirho-ucsc dsotirho-ucsc added bug [type] A defect preventing use of the system as specified debt [type] A defect incurring continued engineering cost infra [subject] Project infrastructure like CI/CD, build and deployment scripts + [priority] High and removed + [priority] High labels Sep 8, 2023
@achave11-ucsc achave11-ucsc added the noise [subject] Causing many false alarms label Sep 19, 2023
achave11-ucsc added a commit that referenced this issue Sep 29, 2023
@achave11-ucsc
Member Author

Assignee to consider next steps in light of updated description.

@hannes-ucsc
Member

Looking at the invocation that timed out, the one from the updated description: Memory usage is consistently high and close to the maximum allowed memory. The execution context was fairly old when the invocation timed out ("Init Duration" in the REPORT message marks the first invocation in an execution context). The invocation immediately following the timeout also took quite a long time (10s). It is possible that the Lambda was just GCing (and, as part of that, calling destructors with side effects, potentially for a large number of objects) when the timeout occurred. That may be an indication of some sort of resource leak in the Lambda.

image

Now that the timing is more relaxed the prevalence of these timeouts may be much reduced. If this occurs again and frequently enough we should consider bumping the max allowed memory to, say, 196 MiB. We could also try initiating GC at the beginning of every invocation.
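Initiating GC at the beginning of every invocation could be done with a small decorator around the handler. This is a minimal sketch; the decorator and handler names are hypothetical and the handler body is a stand-in.

```python
import functools
import gc


def collect_garbage_first(handler):
    """Run a full garbage collection before each invocation of the handler."""

    @functools.wraps(handler)
    def wrapper(event, context):
        # Collect up front so that a collection is less likely to kick in
        # in the middle of a time-critical request later in the invocation
        gc.collect()
        return handler(event, context)

    return wrapper


@collect_garbage_first
def update_health_cache(event, context):  # hypothetical handler name
    return {'up': True}
```

The trade-off is that every invocation pays for a full collection, so this would be worth measuring before adopting.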

@hannes-ucsc hannes-ucsc removed their assignment Jan 12, 2024
@achave11-ucsc
Member Author

@hannes-ucsc: "@achave11-ucsc and I researched another occurrence. We looked at the S3 data events in the trail and could not find a data event for the boto3 PUT request that timed out. We also looked at the VPC flow logs between the Lambda and the NAT Gateway as well as the NAT Gateway and the external IPs but could not find anything out of the ordinary, specifically no dropped or rejected connections. While it's still possible that the network is flaky, the garbage collection hypothesis is more likely."

@hannes-ucsc
Member

hannes-ucsc commented Jan 18, 2024

Actually, googling around, there are reports of intermittent connection errors between a Lambda function in a VPC and S3, and of the fact that enabling a VPC endpoint for S3 fixes them. Currently, all traffic from our Lambda functions and AWS services is routed via a NAT gateway and that seems to be causing the flakiness.

There are two types of endpoints to choose from for S3, gateway endpoints and interface endpoints. The former operate at the routing level and, unlike the latter, don't require configuring clients to use a dedicated hostname for the endpoint. It appears that interface endpoints function like load balancers or proxies and they do require special configuration for clients.

Gateway endpoints are free; interface endpoints are not, but they are still four times cheaper than NAT gateways. On the other hand, gateway endpoints are currently only available for S3 and DynamoDB. We should enable gateway endpoints for both to resolve this issue, and follow up with enabling interface endpoints for other AWS services heavily used by our Lambda functions; SQS comes to mind. We should also double check that traffic between the Lambda functions and the ES domains isn't routed through the NAT gateway.
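For illustration, the parameters for the two gateway endpoints could be assembled as sketched below and passed to boto3's `create_vpc_endpoint`. This is not how the project necessarily provisions infrastructure (it may well be managed declaratively instead); the function names and the VPC and route table IDs are placeholders, not real resources.

```python
def gateway_endpoint_params(vpc_id, route_table_ids, service, region='us-east-1'):
    """Build the keyword arguments for an EC2 create_vpc_endpoint call."""
    return {
        'VpcEndpointType': 'Gateway',
        'VpcId': vpc_id,
        'ServiceName': f'com.amazonaws.{region}.{service}',
        # Gateway endpoints work by adding routes to these route tables,
        # which is why clients need no dedicated hostname
        'RouteTableIds': list(route_table_ids),
    }


def create_gateway_endpoints(vpc_id, route_table_ids):
    """Create gateway endpoints for S3 and DynamoDB (requires boto3 and credentials)."""
    import boto3  # imported lazily so the helper above works without boto3
    ec2 = boto3.client('ec2')
    for service in ('s3', 'dynamodb'):
        ec2.create_vpc_endpoint(
            **gateway_endpoint_params(vpc_id, route_table_ids, service)
        )


# Placeholder IDs, for illustration only
params = gateway_endpoint_params('vpc-0123456789abcdef0', ['rtb-0123456789abcdef0'], 's3')
print(params['ServiceName'])
```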

@hannes-ucsc
Member

On the other hand, gateway endpoints are currently only available for S3 and DynamoDB. We should enable gateway endpoints for both to resolve this issue,

Assignee to do that.

We should also double check that traffic between the Lambda functions and the ES domains isn't routed through the NAT gateway.

Assignee to run the VPC Reachability Analyzer for that. They may need to coordinate with me for assistance.

@achave11-ucsc
Member Author

VPC Reachability Analyzer findings:

The analyzer was run from the Lambda (in a personal deployment) directly to the IP of the ES sandbox instance. The IP of the ES instance was obtained by running the host command on the VPC domain endpoint.

$ host vpc-azul-index-sandbox-mcwjphhhdivigzrsrdmxm2uude.us-east-1.es.amazonaws.com
vpc-azul-index-sandbox-mcwjphhhdivigzrsrdmxm2uude.us-east-1.es.amazonaws.com has address 172.71.0.63
vpc-azul-index-sandbox-mcwjphhhdivigzrsrdmxm2uude.us-east-1.es.amazonaws.com has address 172.71.2.113

The results show the desired behavior, namely that traffic between the Lambda and the ES domain is not routed through the NAT gateway:

aws ec2 describe-network-insights-analyses --network-insights-analysis-id nia-0579c6e281a3d9644
{
    "NetworkInsightsAnalyses": [
        {
            "NetworkInsightsAnalysisId": "nia-0579c6e281a3d9644",
            "NetworkInsightsAnalysisArn": "arn:aws:ec2:us-east-1:122796619775:network-insights-analysis/nia-0579c6e281a3d9644",
            "NetworkInsightsPathId": "nip-0cea7cb0c1d9a3398",
            "StartDate": "2024-01-30T20:44:11.674Z",
            "Status": "succeeded",
            "NetworkPathFound": true,
            "ForwardPathComponents": [
                {
                    "SequenceNumber": 1,
                    "Component": {
                        "Id": "eni-00ea1974cae084df4",
                        "Arn": "arn:aws:ec2:us-east-1:122796619775:network-interface/eni-00ea1974cae084df4"
                    },
                    "OutboundHeader": {
                        "DestinationAddresses": [
                            "172.71.2.113/32"
                        ],
                        "DestinationPortRanges": [
                            {
                                "From": 443,
                                "To": 443
                            }
                        ],
                        "Protocol": "6",
                        "SourceAddresses": [
                            "172.71.2.109/32"
                        ],
                        "SourcePortRanges": [
                            {
                                "From": 0,
                                "To": 65535
                            }
                        ]
                    },
                    "Subnet": {
                        "Id": "subnet-03e66f953199f819b",
                        "Arn": "arn:aws:ec2:us-east-1:122796619775:subnet/subnet-03e66f953199f819b",
                        "Name": "azul-gitlab_private_1"
                    },
                    "Vpc": {
                        "Id": "vpc-03aeb4a3542c7a5d3",
                        "Arn": "arn:aws:ec2:us-east-1:122796619775:vpc/vpc-03aeb4a3542c7a5d3",
                        "Name": "azul-gitlab"
                    }
                },
                {
                    "SequenceNumber": 2,
                    "Component": {
                        "Id": "sg-016f9bece0ab8f7c7",
                        "Arn": "arn:aws:ec2:us-east-1:122796619775:security-group/sg-016f9bece0ab8f7c7",
                        "Name": "azul-service-abrahamsc"
                    },
                    "SecurityGroupRule": {
                        "Cidr": "0.0.0.0/0",
                        "Direction": "egress",
                        "Protocol": "all"
                    }
                },
                {
                    "SequenceNumber": 3,
                    "Component": {
                        "Id": "sg-074d6489649695b48",
                        "Arn": "arn:aws:ec2:us-east-1:122796619775:security-group/sg-074d6489649695b48",
                        "Name": "azul-elasticsearch-sandbox"
                    },
                    "SecurityGroupRule": {
                        "Cidr": "172.71.0.0/16",
                        "Direction": "ingress",
                        "PortRange": {
                            "From": 443,
                            "To": 443
                        },
                        "Protocol": "tcp"
                    }
                },
                {
                    "SequenceNumber": 4,
                    "Component": {
                        "Id": "eni-0607e9a8015369a0e",
                        "Arn": "arn:aws:ec2:us-east-1:122796619775:network-interface/eni-0607e9a8015369a0e"
                    },
                    "InboundHeader": {
                        "DestinationAddresses": [
                            "172.71.2.113/32"
                        ],
                        "DestinationPortRanges": [
                            {
                                "From": 443,
                                "To": 443
                            }
                        ],
                        "Protocol": "6",
                        "SourceAddresses": [
                            "172.71.2.109/32"
                        ],
                        "SourcePortRanges": [
                            {
                                "From": 0,
                                "To": 65535
                            }
                        ]
                    },
                    "Subnet": {
                        "Id": "subnet-03e66f953199f819b",
                        "Arn": "arn:aws:ec2:us-east-1:122796619775:subnet/subnet-03e66f953199f819b",
                        "Name": "azul-gitlab_private_1"
                    },
                    "Vpc": {
                        "Id": "vpc-03aeb4a3542c7a5d3",
                        "Arn": "arn:aws:ec2:us-east-1:122796619775:vpc/vpc-03aeb4a3542c7a5d3",
                        "Name": "azul-gitlab"
                    }
                }
            ],
            "ReturnPathComponents": [
                {
                    "SequenceNumber": 1,
                    "Component": {
                        "Id": "eni-0607e9a8015369a0e",
                        "Arn": "arn:aws:ec2:us-east-1:122796619775:network-interface/eni-0607e9a8015369a0e"
                    },
                    "OutboundHeader": {
                        "DestinationAddresses": [
                            "172.71.2.109/32"
                        ],
                        "DestinationPortRanges": [
                            {
                                "From": 0,
                                "To": 65535
                            }
                        ],
                        "Protocol": "6",
                        "SourceAddresses": [
                            "172.71.2.113/32"
                        ],
                        "SourcePortRanges": [
                            {
                                "From": 443,
                                "To": 443
                            }
                        ]
                    },
                    "Subnet": {
                        "Id": "subnet-03e66f953199f819b",
                        "Arn": "arn:aws:ec2:us-east-1:122796619775:subnet/subnet-03e66f953199f819b",
                        "Name": "azul-gitlab_private_1"
                    },
                    "Vpc": {
                        "Id": "vpc-03aeb4a3542c7a5d3",
                        "Arn": "arn:aws:ec2:us-east-1:122796619775:vpc/vpc-03aeb4a3542c7a5d3",
                        "Name": "azul-gitlab"
                    }
                },
                {
                    "SequenceNumber": 2,
                    "Component": {
                        "Id": "sg-074d6489649695b48",
                        "Arn": "arn:aws:ec2:us-east-1:122796619775:security-group/sg-074d6489649695b48",
                        "Name": "azul-elasticsearch-sandbox"
                    }
                },
                {
                    "SequenceNumber": 3,
                    "Component": {
                        "Id": "sg-016f9bece0ab8f7c7",
                        "Arn": "arn:aws:ec2:us-east-1:122796619775:security-group/sg-016f9bece0ab8f7c7",
                        "Name": "azul-service-abrahamsc"
                    }
                },
                {
                    "SequenceNumber": 4,
                    "Component": {
                        "Id": "eni-00ea1974cae084df4",
                        "Arn": "arn:aws:ec2:us-east-1:122796619775:network-interface/eni-00ea1974cae084df4"
                    },
                    "InboundHeader": {
                        "DestinationAddresses": [
                            "172.71.2.109/32"
                        ],
                        "DestinationPortRanges": [
                            {
                                "From": 0,
                                "To": 65535
                            }
                        ],
                        "Protocol": "6",
                        "SourceAddresses": [
                            "172.71.2.113/32"
                        ],
                        "SourcePortRanges": [
                            {
                                "From": 443,
                                "To": 443
                            }
                        ]
                    },
                    "Subnet": {
                        "Id": "subnet-03e66f953199f819b",
                        "Arn": "arn:aws:ec2:us-east-1:122796619775:subnet/subnet-03e66f953199f819b",
                        "Name": "azul-gitlab_private_1"
                    },
                    "Vpc": {
                        "Id": "vpc-03aeb4a3542c7a5d3",
                        "Arn": "arn:aws:ec2:us-east-1:122796619775:vpc/vpc-03aeb4a3542c7a5d3",
                        "Name": "azul-gitlab"
                    }
                }
            ],
            "Tags": []
        }
    ]
}

For good measure, we also analyzed the path between the Lambda and a known public IP (8.8.8.8). As expected, that traffic is routed through the NAT gateway:

aws ec2 describe-network-insights-analyses --network-insights-analysis-id nia-004e5d470a71d66d3
{
    "NetworkInsightsAnalyses": [
        {
            "NetworkInsightsAnalysisId": "nia-004e5d470a71d66d3",
            "NetworkInsightsAnalysisArn": "arn:aws:ec2:us-east-1:122796619775:network-insights-analysis/nia-004e5d470a71d66d3",
            "NetworkInsightsPathId": "nip-0da3b654edc465468",
            "StartDate": "2024-01-30T20:47:13.330Z",
            "Status": "succeeded",
            "NetworkPathFound": true,
            "ForwardPathComponents": [
                {
                    "SequenceNumber": 1,
                    "Component": {
                        "Id": "eni-00ea1974cae084df4",
                        "Arn": "arn:aws:ec2:us-east-1:122796619775:network-interface/eni-00ea1974cae084df4"
                    },
                    "OutboundHeader": {
                        "DestinationAddresses": [
                            "8.8.8.8/32"
                        ],
                        "DestinationPortRanges": [
                            {
                                "From": 0,
                                "To": 65535
                            }
                        ],
                        "Protocol": "6",
                        "SourceAddresses": [
                            "172.71.2.109/32"
                        ],
                        "SourcePortRanges": [
                            {
                                "From": 0,
                                "To": 65535
                            }
                        ]
                    },
                    "Subnet": {
                        "Id": "subnet-03e66f953199f819b",
                        "Arn": "arn:aws:ec2:us-east-1:122796619775:subnet/subnet-03e66f953199f819b",
                        "Name": "azul-gitlab_private_1"
                    },
                    "Vpc": {
                        "Id": "vpc-03aeb4a3542c7a5d3",
                        "Arn": "arn:aws:ec2:us-east-1:122796619775:vpc/vpc-03aeb4a3542c7a5d3",
                        "Name": "azul-gitlab"
                    }
                },
                {
                    "SequenceNumber": 2,
                    "Component": {
                        "Id": "sg-016f9bece0ab8f7c7",
                        "Arn": "arn:aws:ec2:us-east-1:122796619775:security-group/sg-016f9bece0ab8f7c7",
                        "Name": "azul-service-abrahamsc"
                    },
                    "SecurityGroupRule": {
                        "Cidr": "0.0.0.0/0",
                        "Direction": "egress",
                        "Protocol": "all"
                    }
                },
                {
                    "SequenceNumber": 3,
                    "AclRule": {
                        "Cidr": "0.0.0.0/0",
                        "Egress": true,
                        "Protocol": "all",
                        "RuleAction": "allow",
                        "RuleNumber": 100
                    },
                    "Component": {
                        "Id": "acl-05c982d914728ee6a",
                        "Arn": "arn:aws:ec2:us-east-1:122796619775:network-acl/acl-05c982d914728ee6a"
                    }
                },
                {
                    "SequenceNumber": 4,
                    "Component": {
                        "Id": "rtb-03d5e85affa440cbe",
                        "Arn": "arn:aws:ec2:us-east-1:122796619775:route-table/rtb-03d5e85affa440cbe",
                        "Name": "azul-gitlab_1"
                    },
                    "RouteTableRoute": {
                        "DestinationCidr": "0.0.0.0/0",
                        "NatGatewayId": "nat-065371190014e1037",
                        "Origin": "createroute",
                        "State": "active"
                    }
                },
                {
                    "SequenceNumber": 5,
                    "AclRule": {
                        "Cidr": "0.0.0.0/0",
                        "Egress": false,
                        "Protocol": "all",
                        "RuleAction": "allow",
                        "RuleNumber": 100
                    },
                    "Component": {
                        "Id": "acl-05c982d914728ee6a",
                        "Arn": "arn:aws:ec2:us-east-1:122796619775:network-acl/acl-05c982d914728ee6a"
                    }
                },
                {
                    "SequenceNumber": 6,
                    "AttachedTo": {
                        "Id": "nat-065371190014e1037",
                        "Arn": "arn:aws:ec2:us-east-1:122796619775:natgateway/nat-065371190014e1037"
                    },
                    "Component": {
                        "Id": "eni-0becd601af14ac6eb",
                        "Arn": "arn:aws:ec2:us-east-1:122796619775:network-interface/eni-0becd601af14ac6eb"
                    },
                    "Subnet": {
                        "Id": "subnet-06a1e10523948ab2a",
                        "Arn": "arn:aws:ec2:us-east-1:122796619775:subnet/subnet-06a1e10523948ab2a",
                        "Name": "azul-gitlab_public_1"
                    },
                    "Vpc": {
                        "Id": "vpc-03aeb4a3542c7a5d3",
                        "Arn": "arn:aws:ec2:us-east-1:122796619775:vpc/vpc-03aeb4a3542c7a5d3",
                        "Name": "azul-gitlab"
                    }
                },
                {
                    "SequenceNumber": 7,
                    "Component": {
                        "Id": "nat-065371190014e1037",
                        "Arn": "arn:aws:ec2:us-east-1:122796619775:natgateway/nat-065371190014e1037"
                    },
                    "OutboundHeader": {
                        "SourceAddresses": [
                            "172.71.3.245/32"
                        ],
                        "SourcePortRanges": [
                            {
                                "From": 1024,
                                "To": 65535
                            }
                        ]
                    }
                },
                {
                    "SequenceNumber": 8,
                    "AttachedTo": {
                        "Id": "nat-065371190014e1037",
                        "Arn": "arn:aws:ec2:us-east-1:122796619775:natgateway/nat-065371190014e1037"
                    },
                    "Component": {
                        "Id": "eni-0becd601af14ac6eb",
                        "Arn": "arn:aws:ec2:us-east-1:122796619775:network-interface/eni-0becd601af14ac6eb"
                    },
                    "Subnet": {
                        "Id": "subnet-06a1e10523948ab2a",
                        "Arn": "arn:aws:ec2:us-east-1:122796619775:subnet/subnet-06a1e10523948ab2a",
                        "Name": "azul-gitlab_public_1"
                    },
                    "Vpc": {
                        "Id": "vpc-03aeb4a3542c7a5d3",
                        "Arn": "arn:aws:ec2:us-east-1:122796619775:vpc/vpc-03aeb4a3542c7a5d3",
                        "Name": "azul-gitlab"
                    }
                },
                {
                    "SequenceNumber": 9,
                    "AclRule": {
                        "Cidr": "0.0.0.0/0",
                        "Egress": true,
                        "Protocol": "all",
                        "RuleAction": "allow",
                        "RuleNumber": 100
                    },
                    "Component": {
                        "Id": "acl-05c982d914728ee6a",
                        "Arn": "arn:aws:ec2:us-east-1:122796619775:network-acl/acl-05c982d914728ee6a"
                    }
                },
                {
                    "SequenceNumber": 10,
                    "Component": {
                        "Id": "rtb-0bf39b4e467da13db",
                        "Arn": "arn:aws:ec2:us-east-1:122796619775:route-table/rtb-0bf39b4e467da13db"
                    },
                    "RouteTableRoute": {
                        "DestinationCidr": "0.0.0.0/0",
                        "GatewayId": "igw-0f4231a8570673a3d",
                        "Origin": "createroute",
                        "State": "active"
                    }
                },
                {
                    "SequenceNumber": 11,
                    "Component": {
                        "Id": "igw-0f4231a8570673a3d",
                        "Arn": "arn:aws:ec2:us-east-1:122796619775:internet-gateway/igw-0f4231a8570673a3d",
                        "Name": "azul-gitlab"
                    },
                    "OutboundHeader": {
                        "DestinationAddresses": [
                            "8.8.8.8/32"
                        ],
                        "DestinationPortRanges": [
                            {
                                "From": 0,
                                "To": 65535
                            }
                        ],
                        "Protocol": "6",
                        "SourceAddresses": [
                            "3.214.122.182/32"
                        ],
                        "SourcePortRanges": [
                            {
                                "From": 1024,
                                "To": 65535
                            }
                        ]
                    },
                    "Vpc": {
                        "Id": "vpc-03aeb4a3542c7a5d3",
                        "Arn": "arn:aws:ec2:us-east-1:122796619775:vpc/vpc-03aeb4a3542c7a5d3",
                        "Name": "azul-gitlab"
                    }
                }
            ],
            "ReturnPathComponents": [
                {
                    "SequenceNumber": 1,
                    "Component": {
                        "Id": "igw-0f4231a8570673a3d",
                        "Arn": "arn:aws:ec2:us-east-1:122796619775:internet-gateway/igw-0f4231a8570673a3d",
                        "Name": "azul-gitlab"
                    },
                    "OutboundHeader": {
                        "DestinationAddresses": [
                            "172.71.3.245/32"
                        ]
                    },
                    "InboundHeader": {
                        "DestinationAddresses": [
                            "3.214.122.182/32"
                        ],
                        "DestinationPortRanges": [
                            {
                                "From": 1024,
                                "To": 65535
                            }
                        ],
                        "Protocol": "6",
                        "SourceAddresses": [
                            "8.8.8.8/32"
                        ],
                        "SourcePortRanges": [
                            {
                                "From": 0,
                                "To": 65535
                            }
                        ]
                    },
                    "Vpc": {
                        "Id": "vpc-03aeb4a3542c7a5d3",
                        "Arn": "arn:aws:ec2:us-east-1:122796619775:vpc/vpc-03aeb4a3542c7a5d3",
                        "Name": "azul-gitlab"
                    }
                },
                {
                    "SequenceNumber": 2,
                    "AclRule": {
                        "Cidr": "0.0.0.0/0",
                        "Egress": false,
                        "Protocol": "all",
                        "RuleAction": "allow",
                        "RuleNumber": 100
                    },
                    "Component": {
                        "Id": "acl-05c982d914728ee6a",
                        "Arn": "arn:aws:ec2:us-east-1:122796619775:network-acl/acl-05c982d914728ee6a"
                    }
                },
                {
                    "SequenceNumber": 3,
                    "AttachedTo": {
                        "Id": "nat-065371190014e1037",
                        "Arn": "arn:aws:ec2:us-east-1:122796619775:natgateway/nat-065371190014e1037"
                    },
                    "Component": {
                        "Id": "eni-0becd601af14ac6eb",
                        "Arn": "arn:aws:ec2:us-east-1:122796619775:network-interface/eni-0becd601af14ac6eb"
                    },
                    "Subnet": {
                        "Id": "subnet-06a1e10523948ab2a",
                        "Arn": "arn:aws:ec2:us-east-1:122796619775:subnet/subnet-06a1e10523948ab2a",
                        "Name": "azul-gitlab_public_1"
                    },
                    "Vpc": {
                        "Id": "vpc-03aeb4a3542c7a5d3",
                        "Arn": "arn:aws:ec2:us-east-1:122796619775:vpc/vpc-03aeb4a3542c7a5d3",
                        "Name": "azul-gitlab"
                    }
                },
                {
                    "SequenceNumber": 4,
                    "Component": {
                        "Id": "nat-065371190014e1037",
                        "Arn": "arn:aws:ec2:us-east-1:122796619775:natgateway/nat-065371190014e1037"
                    },
                    "OutboundHeader": {
                        "DestinationAddresses": [
                            "172.71.2.109/32"
                        ],
                        "DestinationPortRanges": [
                            {
                                "From": 0,
                                "To": 65535
                            }
                        ]
                    }
                },
                {
                    "SequenceNumber": 5,
                    "AttachedTo": {
                        "Id": "nat-065371190014e1037",
                        "Arn": "arn:aws:ec2:us-east-1:122796619775:natgateway/nat-065371190014e1037"
                    },
                    "Component": {
                        "Id": "eni-0becd601af14ac6eb",
                        "Arn": "arn:aws:ec2:us-east-1:122796619775:network-interface/eni-0becd601af14ac6eb"
                    },
                    "Subnet": {
                        "Id": "subnet-06a1e10523948ab2a",
                        "Arn": "arn:aws:ec2:us-east-1:122796619775:subnet/subnet-06a1e10523948ab2a",
                        "Name": "azul-gitlab_public_1"
                    },
                    "Vpc": {
                        "Id": "vpc-03aeb4a3542c7a5d3",
                        "Arn": "arn:aws:ec2:us-east-1:122796619775:vpc/vpc-03aeb4a3542c7a5d3",
                        "Name": "azul-gitlab"
                    }
                },
                {
                    "SequenceNumber": 6,
                    "AclRule": {
                        "Cidr": "0.0.0.0/0",
                        "Egress": true,
                        "Protocol": "all",
                        "RuleAction": "allow",
                        "RuleNumber": 100
                    },
                    "Component": {
                        "Id": "acl-05c982d914728ee6a",
                        "Arn": "arn:aws:ec2:us-east-1:122796619775:network-acl/acl-05c982d914728ee6a"
                    }
                },
                {
                    "SequenceNumber": 7,
                    "Component": {
                        "Id": "rtb-0bf39b4e467da13db",
                        "Arn": "arn:aws:ec2:us-east-1:122796619775:route-table/rtb-0bf39b4e467da13db"
                    },
                    "RouteTableRoute": {
                        "DestinationCidr": "172.71.0.0/16",
                        "GatewayId": "local",
                        "Origin": "createroutetable",
                        "State": "active"
                    }
                },
                {
                    "SequenceNumber": 8,
                    "AclRule": {
                        "Cidr": "0.0.0.0/0",
                        "Egress": false,
                        "Protocol": "all",
                        "RuleAction": "allow",
                        "RuleNumber": 100
                    },
                    "Component": {
                        "Id": "acl-05c982d914728ee6a",
                        "Arn": "arn:aws:ec2:us-east-1:122796619775:network-acl/acl-05c982d914728ee6a"
                    }
                },
                {
                    "SequenceNumber": 9,
                    "Component": {
                        "Id": "sg-016f9bece0ab8f7c7",
                        "Arn": "arn:aws:ec2:us-east-1:122796619775:security-group/sg-016f9bece0ab8f7c7",
                        "Name": "azul-service-abrahamsc"
                    }
                },
                {
                    "SequenceNumber": 10,
                    "Component": {
                        "Id": "eni-00ea1974cae084df4",
                        "Arn": "arn:aws:ec2:us-east-1:122796619775:network-interface/eni-00ea1974cae084df4"
                    },
                    "InboundHeader": {
                        "DestinationAddresses": [
                            "172.71.2.109/32"
                        ],
                        "DestinationPortRanges": [
                            {
                                "From": 0,
                                "To": 65535
                            }
                        ],
                        "Protocol": "6",
                        "SourceAddresses": [
                            "8.8.8.8/32"
                        ],
                        "SourcePortRanges": [
                            {
                                "From": 0,
                                "To": 65535
                            }
                        ]
                    },
                    "Subnet": {
                        "Id": "subnet-03e66f953199f819b",
                        "Arn": "arn:aws:ec2:us-east-1:122796619775:subnet/subnet-03e66f953199f819b",
                        "Name": "azul-gitlab_private_1"
                    },
                    "Vpc": {
                        "Id": "vpc-03aeb4a3542c7a5d3",
                        "Arn": "arn:aws:ec2:us-east-1:122796619775:vpc/vpc-03aeb4a3542c7a5d3",
                        "Name": "azul-gitlab"
                    }
                }
            ],
            "Tags": []
        }
    ]
}
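The analysis output above is long, so a condensed per-hop view can make the NAT routing easier to spot. The following is only a minimal sketch, not something used in this investigation: the helper names `summarize_path` and `nat_gateways` are hypothetical, and `SAMPLE` is an abbreviated excerpt (hops 4 and 7 only) of the `ForwardPathComponents` shown above.

```python
import json

# Abbreviated excerpt of the ForwardPathComponents shown above (hops 4 and 7 only).
SAMPLE = json.loads("""
{
  "NetworkInsightsAnalyses": [
    {
      "NetworkPathFound": true,
      "ForwardPathComponents": [
        {
          "SequenceNumber": 4,
          "Component": {"Id": "rtb-03d5e85affa440cbe", "Name": "azul-gitlab_1"},
          "RouteTableRoute": {
            "DestinationCidr": "0.0.0.0/0",
            "NatGatewayId": "nat-065371190014e1037"
          }
        },
        {
          "SequenceNumber": 7,
          "Component": {"Id": "nat-065371190014e1037"}
        }
      ]
    }
  ]
}
""")


def summarize_path(components):
    """Return one short line per hop of a path component list (hypothetical helper)."""
    lines = []
    for hop in components:
        component = hop.get("Component", {})
        line = f"{hop['SequenceNumber']:>2} {component.get('Id', '?')}"
        if "Name" in component:
            line += f" ({component['Name']})"
        route = hop.get("RouteTableRoute")
        if route:
            # A route targets either a NAT gateway or another gateway (IGW, local).
            target = route.get("NatGatewayId") or route.get("GatewayId", "?")
            line += f" route {route['DestinationCidr']} -> {target}"
        lines.append(line)
    return lines


def nat_gateways(components):
    """Return the sorted IDs of NAT gateways that appear as hops on the path."""
    ids = {hop.get("Component", {}).get("Id", "") for hop in components}
    return sorted(i for i in ids if i.startswith("nat-"))


analysis = SAMPLE["NetworkInsightsAnalyses"][0]
forward = analysis["ForwardPathComponents"]
print("\n".join(summarize_path(forward)))
print("NAT gateways on path:", nat_gateways(forward))
```

To run this against a full analysis, the `aws ec2 describe-network-insights-analyses` output could be saved to a file and loaded with `json.load` in place of `SAMPLE`. The same helpers should work on `ReturnPathComponents`, since both lists share the same hop structure.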

@achave11-ucsc
Member Author

Screenshot 2024-02-20 at 2 45 50 PM
