Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Connection reset when health check makes HEAD request to service endpoint #5702

Closed
14 tasks
achave11-ucsc opened this issue Nov 16, 2023 · 9 comments
Closed
14 tasks
Assignees
Labels
- [priority] Medium bug [type] A defect preventing use of the system as specified debt [type] A defect incurring continued engineering cost infra [subject] Project infrastructure like CI/CD, build and deployment scripts orange [process] Done by the Azul team spike:1 [process] Spike estimate of one point wontfix [process] Issue is valid but resolution is not deemed necessary

Comments

@achave11-ucsc
Copy link
Member

achave11-ucsc commented Nov 16, 2023

… with incidents in dev and anvildev, where the last execution log is the HEAD request by servicecachehealth to Azul service endpoints that never returns and results in a silent execution timeout. Additionally, API Gateway and service Lambda execution logs in these two deployments lack evidence of these HEAD requests ever being received.

API Gateway, Service and Service Cache Health Lambda execution logs for `dev` :

And note that there are no entries for API Gateway or Service dev.

CloudWatch Logs Insights
region: us-east-1
log-group-names: /aws/lambda/azul-service-dev-servicecachehealth, /aws/lambda/azul-service-dev, /aws/apigateway/azul-service-dev, API-Gateway-Execution-Logs_8qodesspsa/dev, API-Gateway-Execution-Logs_ann5yskrli/dev, API-Gateway-Execution-Logs_b8ddywgc9a/dev, API-Gateway-Execution-Logs_q4pfvvk389/dev
start-time: 2023-11-16T15:07:45.312Z
end-time: 2023-11-16T15:08:06.410Z
query-string:

fields @timestamp, @message
# | filter @requestId like /784a2496/
| sort @timestamp asc
| limit 200

@timestamp @message
2023-11-16 15:07:49.703 START RequestId: 784a2496-e31d-44b3-aedc-06463323d6c5 Version: $LATEST
2023-11-16 15:07:49.703 [INFO] 2023-11-16T15:07:49.703Z 784a2496-e31d-44b3-aedc-06463323d6c5 azul.health Getting health property 'api_endpoints'
2023-11-16 15:07:49.704 [INFO] 2023-11-16T15:07:49.704Z 784a2496-e31d-44b3-aedc-06463323d6c5 azul.health Making HEAD request to https://service.dev.singlecell.gi.ucsc.edu/index/files?size=1
2023-11-16 15:07:59.776 2023-11-16T15:07:59.776Z 784a2496-e31d-44b3-aedc-06463323d6c5 Task timed out after 10.07 seconds
2023-11-16 15:07:59.776 END RequestId: 784a2496-e31d-44b3-aedc-06463323d6c5
2023-11-16 15:07:59.776 REPORT RequestId: 784a2496-e31d-44b3-aedc-06463323d6c5 Duration: 10072.78 ms Billed Duration: 10000 ms Memory Size: 128 MB Max Memory Used: 122 MB
2023-11-16 15:07:59.820 INIT_START Runtime Version: python:3.11.v18 Runtime Version ARN: arn:aws:lambda:us-east-1::runtime:6ebff6b58cf714d30879a40cc31554cd1bbc242da7bf75a000bc0c3052c6ebbc

A more informative version of this incident happened in anvilprod, where it didn't fail silently by timing out but actually produced error logs in the servicecachehealth execution.

API Gateway, Service and Service Cache Health Lambda execution logs for `anvilprod` :

@timestamp @message
2023-11-16 13:53:38.003 START RequestId: fe35b3b3-3ffa-45a7-8804-11659b044469 Version: $LATEST
2023-11-16 13:53:38.003 [INFO] 2023-11-16T13:53:38.003Z fe35b3b3-3ffa-45a7-8804-11659b044469 azul.health Getting health property 'elasticsearch'
2023-11-16 13:53:38.003 [INFO] 2023-11-16T13:53:38.003Z fe35b3b3-3ffa-45a7-8804-11659b044469 elasticsearch Making HEAD request to https://vpc-azul-index-anvilprod-ggipah4skn2ftt47u4xgvydzqm.us-east-1.es.amazonaws.com:443/
2023-11-16 13:53:38.003 [INFO] 2023-11-16T13:53:38.003Z fe35b3b3-3ffa-45a7-8804-11659b044469 elasticsearch … without request body
2023-11-16 13:53:38.125 [INFO] 2023-11-16T13:53:38.125Z fe35b3b3-3ffa-45a7-8804-11659b044469 elasticsearch Got 200 response after 0.121s from HEAD to https://vpc-azul-index-anvilprod-ggipah4skn2ftt47u4xgvydzqm.us-east-1.es.amazonaws.com:443/
2023-11-16 13:53:38.125 [INFO] 2023-11-16T13:53:38.125Z fe35b3b3-3ffa-45a7-8804-11659b044469 elasticsearch … with response body ''
2023-11-16 13:53:38.125 [INFO] 2023-11-16T13:53:38.125Z fe35b3b3-3ffa-45a7-8804-11659b044469 azul.health Getting health property 'api_endpoints'
2023-11-16 13:53:38.126 [INFO] 2023-11-16T13:53:38.126Z fe35b3b3-3ffa-45a7-8804-11659b044469 azul.health Making HEAD request to https://service.prod.anvil.gi.ucsc.edu/index/activities?size=1
2023-11-16 13:53:38.369 [ERROR] ConnectionError: ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))
Traceback (most recent call last):
  File "/var/task/azul/chalice.py", line 166, in patched_event_source_handler
    return old_handler(self_, event, context)
  File "/var/task/chalice/app.py", line 1753, in __call__
    return self.handler(event_obj)
  File "/var/task/app.py", line 549, in update_health_cache
    app.health_controller.update_cache()
  File "/var/task/azul/health.py", line 138, in update_cache
    health_object = dict(time=time.time(), health=self._health.as_json_fast())
  File "/var/task/azul/health.py", line 308, in as_json_fast
    return self.as_json(p.key for p in self.fast_properties[self.lambda_name])
  File "/var/task/azul/health.py", line 181, in as_json
    json = {k: getattr(self, k) for k in keys}
  File "/var/task/azul/health.py", line 181, in <dictcomp>
    json = {k: getattr(self, k) for k in keys}
  File "/var/task/azul/health.py", line 73, in __get__
    return super().__get__(obj, objtype=objtype)
  File "/var/task/azul/caching.py", line 189, in __get__
    value = obj.__dict__[self.fget.__name__] = self.fget(obj)
  File "/var/task/azul/health.py", line 265, in api_endpoints
    return self._api_endpoint(entity_type)
  File "/var/task/azul/health.py", line 245, in _api_endpoint
    response = requests.head(url)
  File "/opt/python/requests/api.py", line 100, in head
    return request("head", url, **kwargs)
  File "/opt/python/requests/api.py", line 59, in request
    return session.request(method=method, url=url, **kwargs)
  File "/opt/python/requests/sessions.py", line 589, in request
    resp = self.send(prep, **send_kwargs)
  File "/opt/python/requests/sessions.py", line 703, in send
    r = adapter.send(request, **kwargs)
  File "/opt/python/requests/adapters.py", line 501, in send
    raise ConnectionError(err, request=request)
2023-11-16 13:53:38.387 END RequestId: fe35b3b3-3ffa-45a7-8804-11659b044469
2023-11-16 13:53:38.387 REPORT RequestId: fe35b3b3-3ffa-45a7-8804-11659b044469 Duration: 384.35 ms Billed Duration: 385 ms Memory Size: 128 MB Max Memory Used: 121 MB

  • Security design review completed; the Resolution of this issue does not
    • … affect authentication; for example:
      • OAuth 2.0 with the application (API or Swagger UI)
      • Authentication of developers with Google Cloud APIs
      • Authentication of developers with AWS APIs
      • Authentication with a GitLab instance in the system
      • Password and 2FA authentication with GitHub
      • API access token authentication with GitHub
      • Authentication with
    • … affect the permissions of internal users like access to
      • Cloud resources on AWS and GCP
      • GitLab repositories, projects and groups, administration
      • an EC2 instance via SSH
      • GitHub issues, pull requests, commits, commit statuses, wikis, repositories, organizations
    • … affect the permissions of external users like access to
      • TDR snapshots
    • … affect permissions of service or bot accounts
      • Cloud resources on AWS and GCP
    • … affect audit logging in the system, like
      • adding, removing or changing a log message that represents an auditable event
      • changing the routing of log messages through the system
    • … affect monitoring of the system
    • … introduce a new software dependency like
      • Python packages on PYPI
      • Command-line utilities
      • Docker images
      • Terraform providers
    • … add an interface that exposes sensitive or confidential data at the security boundary
    • … affect the encryption of data at rest
    • … require persistence of sensitive or confidential data that might require encryption at rest
    • … require unencrypted transmission of data within the security boundary
    • … affect the network security layer; for example by
      • modifying, adding or removing firewall rules
      • modifying, adding or removing security groups
      • changing or adding a port a service, proxy or load balancer listens on
  • Documentation on any unchecked boxes is provided in comments below
@achave11-ucsc achave11-ucsc added the orange [process] Done by the Azul team label Nov 16, 2023
@dsotirho-ucsc dsotirho-ucsc added bug [type] A defect preventing use of the system as specified infra [subject] Project infrastructure like CI/CD, build and deployment scripts debt [type] A defect incurring continued engineering cost - [priority] Medium labels Nov 20, 2023
@achave11-ucsc
Copy link
Member Author

Assignee to consider next steps.

Note:
Stale connection, retry

@hannes-ucsc
Copy link
Member

hannes-ucsc commented Dec 21, 2023

The two occurrences in the description have very different symptoms, and therefore different causes. We should break this up into two different issues and track every occurrence so we get an idea of the prevalence.

Furthermore, while the second occurrence (the "Connection reset by peer" one) look like a stale connection—the stack trace shows that requests.head() is used—and that method allocates a fresh urllib3 connection pool every time it is called. IOW, it can't be a stale connection.

@hannes-ucsc hannes-ucsc removed their assignment Dec 21, 2023
@dsotirho-ucsc
Copy link
Contributor

@hannes-ucsc: "We edited the description to remove the Lambda timeout which is already tracked as #5467. Looking at the alarms, the prevalence of the connection error (which we'll continue to track here) is fairly low (only one occurrence on Nov 16)."

@dsotirho-ucsc dsotirho-ucsc changed the title Service requests from servicecachehealth check are being dropped Connection reset when health check makes HEAD request to service endpoint Dec 21, 2023
@achave11-ucsc
Copy link
Member Author

Another occurrence,

ConnectionError: ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer')) raised while the api_endpoints health property was being populated, the service /bundles endpoint request was in progress.

[
    {
        "@timestamp": "2023-12-26 11:04:37.795",
        "@message": "START RequestId: 3005638f-8fd1-44ee-bd3b-6fd5feecccd3 Version: $LATEST\n",
        "@logStream": "2023/12/26/[$LATEST]b35979c8e40344888bc033d8709b9ee9",
        "@log": "465330168186:/aws/lambda/azul-service-anvilprod-servicecachehealth"
    },
    {
        "@timestamp": "2023-12-26 11:04:37.796",
        "@message": "[INFO]\t2023-12-26T11:04:37.796Z\t3005638f-8fd1-44ee-bd3b-6fd5feecccd3\tazul.health\tGetting health property 'api_endpoints'",
        "@logStream": "2023/12/26/[$LATEST]b35979c8e40344888bc033d8709b9ee9",
        "@log": "465330168186:/aws/lambda/azul-service-anvilprod-servicecachehealth"
    },
    {
        "@timestamp": "2023-12-26 11:04:37.796",
        "@message": "[INFO]\t2023-12-26T11:04:37.796Z\t3005638f-8fd1-44ee-bd3b-6fd5feecccd3\tazul.health\tMaking HEAD request to https://service.prod.anvil.gi.ucsc.edu/index/bundles?size=1",
        "@logStream": "2023/12/26/[$LATEST]b35979c8e40344888bc033d8709b9ee9",
        "@log": "465330168186:/aws/lambda/azul-service-anvilprod-servicecachehealth"
    },
    {
        "@timestamp": "2023-12-26 11:04:38.101",
        "@message": "[ERROR] ConnectionError: ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))\nTraceback (most recent call last):\n  File \"/var/task/azul/chalice.py\", line 166, in patched_event_source_handler\n    return old_handler(self_, event, context)\n  File \"/var/task/chalice/app.py\", line 1753, in __call__\n    return self.handler(event_obj)\n  File \"/var/task/app.py\", line 549, in update_health_cache\n    app.health_controller.update_cache()\n  File \"/var/task/azul/health.py\", line 138, in update_cache\n    health_object = dict(time=time.time(), health=self._health.as_json_fast())\n  File \"/var/task/azul/health.py\", line 308, in as_json_fast\n    return self.as_json(p.key for p in self.fast_properties[self.lambda_name])\n  File \"/var/task/azul/health.py\", line 181, in as_json\n    json = {k: getattr(self, k) for k in sorted(keys)}\n  File \"/var/task/azul/health.py\", line 181, in <dictcomp>\n    json = {k: getattr(self, k) for k in sorted(keys)}\n  File \"/var/task/azul/health.py\", line 73, in __get__\n    return super().__get__(obj, objtype=objtype)\n  File \"/var/task/azul/caching.py\", line 189, in __get__\n    value = obj.__dict__[self.fget.__name__] = self.fget(obj)\n  File \"/var/task/azul/health.py\", line 265, in api_endpoints\n    return self._api_endpoint(entity_type)\n  File \"/var/task/azul/health.py\", line 245, in _api_endpoint\n    response = requests.head(url)\n  File \"/opt/python/requests/api.py\", line 100, in head\n    return request(\"head\", url, **kwargs)\n  File \"/opt/python/requests/api.py\", line 59, in request\n    return session.request(method=method, url=url, **kwargs)\n  File \"/opt/python/requests/sessions.py\", line 589, in request\n    resp = self.send(prep, **send_kwargs)\n  File \"/opt/python/requests/sessions.py\", line 703, in send\n    r = adapter.send(request, **kwargs)\n  File \"/opt/python/requests/adapters.py\", line 501, in send\n    raise ConnectionError(err, request=request)",
        "@logStream": "2023/12/26/[$LATEST]b35979c8e40344888bc033d8709b9ee9",
        "@log": "465330168186:/aws/lambda/azul-service-anvilprod-servicecachehealth"
    },
    {
        "@timestamp": "2023-12-26 11:04:38.140",
        "@message": "END RequestId: 3005638f-8fd1-44ee-bd3b-6fd5feecccd3\n",
        "@logStream": "2023/12/26/[$LATEST]b35979c8e40344888bc033d8709b9ee9",
        "@log": "465330168186:/aws/lambda/azul-service-anvilprod-servicecachehealth"
    },
    {
        "@timestamp": "2023-12-26 11:04:38.140",
        "@message": "REPORT RequestId: 3005638f-8fd1-44ee-bd3b-6fd5feecccd3\tDuration: 344.14 ms\tBilled Duration: 345 ms\tMemory Size: 128 MB\tMax Memory Used: 123 MB\t\n",
        "@logStream": "2023/12/26/[$LATEST]b35979c8e40344888bc033d8709b9ee9",
        "@log": "465330168186:/aws/lambda/azul-service-anvilprod-servicecachehealth"
    }
]

@dsotirho-ucsc
Copy link
Contributor

anvilprod

[
    {
        "@timestamp": "2024-01-28 10:27:37.775",
        "@message": "START RequestId: ff0be84e-8653-486d-b0ac-a3aaa02ea76e Version: $LATEST\n"
    },
    {
        "@timestamp": "2024-01-28 10:27:37.775",
        "@message": "[INFO]\t2024-01-28T10:27:37.775Z\tff0be84e-8653-486d-b0ac-a3aaa02ea76e\tazul.health\tGetting health property 'api_endpoints'"
    },
    {
        "@timestamp": "2024-01-28 10:27:37.776",
        "@message": "[INFO]\t2024-01-28T10:27:37.776Z\tff0be84e-8653-486d-b0ac-a3aaa02ea76e\tazul.health\tMaking HEAD request to https://service.prod.anvil.gi.ucsc.edu/index/activities?size=1"
    },
    {
        "@timestamp": "2024-01-28 10:27:38.017",
        "@message": "[ERROR] ConnectionError: ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))
Traceback (most recent call last):
  File "/var/task/azul/chalice.py", line 166, in patched_event_source_handler
    return old_handler(self_, event, context)
  File "/var/task/chalice/app.py", line 1756, in __call__
    return self.handler(event_obj)
  File "/var/task/app.py", line 550, in update_health_cache
    app.health_controller.update_cache()
  File "/var/task/azul/health.py", line 138, in update_cache
    health_object = dict(time=time.time(), health=self._health.as_json_fast())
  File "/var/task/azul/health.py", line 308, in as_json_fast
    return self.as_json(p.key for p in self.fast_properties[self.lambda_name])
  File "/var/task/azul/health.py", line 181, in as_json
    json = {k: getattr(self, k) for k in sorted(keys)}
  File "/var/task/azul/health.py", line 181, in <dictcomp>
    json = {k: getattr(self, k) for k in sorted(keys)}
  File "/var/task/azul/health.py", line 73, in __get__
    return super().__get__(obj, objtype=objtype)
  File "/var/task/azul/caching.py", line 189, in __get__
    value = obj.__dict__[self.fget.__name__] = self.fget(obj)
  File "/var/task/azul/health.py", line 265, in api_endpoints
    return self._api_endpoint(entity_type)
  File "/var/task/azul/health.py", line 245, in _api_endpoint
    response = requests.head(url)
  File "/opt/python/requests/api.py", line 100, in head
    return request("head", url, **kwargs)
  File "/opt/python/requests/api.py", line 59, in request
    return session.request(method=method, url=url, **kwargs)
  File "/opt/python/requests/sessions.py", line 589, in request
    resp = self.send(prep, **send_kwargs)
  File "/opt/python/requests/sessions.py", line 703, in send
    r = adapter.send(request, **kwargs)
  File "/opt/python/requests/adapters.py", line 501, in send
    raise ConnectionError(err, request=request)"
    },
    {
        "@timestamp": "2024-01-28 10:27:38.020",
        "@message": "END RequestId: ff0be84e-8653-486d-b0ac-a3aaa02ea76e\n"
    },
    {
        "@timestamp": "2024-01-28 10:27:38.020",
        "@message": "REPORT RequestId: ff0be84e-8653-486d-b0ac-a3aaa02ea76e\tDuration: 245.44 ms\tBilled Duration: 246 ms\tMemory Size: 128 MB\tMax Memory Used: 121 MB\t\n"
    },
]

@dsotirho-ucsc
Copy link
Contributor

@hannes-ucsc: "We thought this might correlate with deploy jobs but it doesn't:"
image

@dsotirho-ucsc
Copy link
Contributor

@hannes-ucsc: "Considering that there are no log entries by API Gateway or the receiving lambda function (see description), it is probably a network issue that prevents the health lambda from sending the outbound request. We should consider setting up an interface endpoint for API Gateway, if that is even an option. Assignee to investigate that."

@dsotirho-ucsc dsotirho-ucsc added the spike:1 [process] Spike estimate of one point label Jan 31, 2024
@hannes-ucsc
Copy link
Member

Not now.

@achave11-ucsc achave11-ucsc added the wontfix [process] Issue is valid but resolution is not deemed necessary label Apr 24, 2024
@achave11-ucsc
Copy link
Member Author

@hannes-ucsc: "#6097 will workaround this by increasing the threshold for alarms about this. So we'll essentially ignore a certain low rate of incidents of this type."

@achave11-ucsc achave11-ucsc closed this as not planned Won't fix, can't repro, duplicate, stale Apr 24, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
- [priority] Medium bug [type] A defect preventing use of the system as specified debt [type] A defect incurring continued engineering cost infra [subject] Project infrastructure like CI/CD, build and deployment scripts orange [process] Done by the Azul team spike:1 [process] Spike estimate of one point wontfix [process] Issue is valid but resolution is not deemed necessary
Projects
None yet
Development

No branches or pull requests

3 participants