Connection reset when health check makes HEAD request to service endpoint #5702
Comments
Assignee to consider next steps. Note:
The two occurrences in the description have very different symptoms, and therefore different causes. We should break this up into two different issues and track every occurrence so we get an idea of the prevalence. Furthermore, while the second occurrence (the "Connection reset by peer" one) looks like a stale connection, the stack trace shows that requests.head() is used, and that method allocates a fresh urllib3 connection pool every time it is called. In other words, it can't be a stale connection.
@hannes-ucsc: "We edited the description to remove the Lambda timeout which is already tracked as #5467. Looking at the alarms, the prevalence of the connection error (which we'll continue to track here) is fairly low (only one occurrence on Nov 16)."
Another occurrence, anvilprod
@hannes-ucsc: "We thought this might correlate with deploy jobs but it doesn't:"
@hannes-ucsc: "Considering that there are no log entries by API Gateway or the receiving lambda function (see description), it is probably a network issue that prevents the health lambda from sending the outbound request. We should consider setting up an interface endpoint for API Gateway, if that is even an option. Assignee to investigate that."
Not now.
@hannes-ucsc: "#6097 will work around this by increasing the threshold for alarms about this. So we'll essentially ignore a certain low rate of incidents of this type."
… with incidents in dev and anvildev, where the last execution log entry is the HEAD request by servicecachehealth to Azul service endpoints, which never returns and results in a silent execution timeout. Additionally, the API Gateway and service Lambda execution logs in these two deployments lack evidence of these HEAD requests ever being received.
API Gateway, Service and Service Cache Health Lambda execution logs for `dev`:
Note that there are no entries for API Gateway or Service in dev.
CloudWatch Logs Insights
region: us-east-1
log-group-names: /aws/lambda/azul-service-dev-servicecachehealth, /aws/lambda/azul-service-dev, /aws/apigateway/azul-service-dev, API-Gateway-Execution-Logs_8qodesspsa/dev, API-Gateway-Execution-Logs_ann5yskrli/dev, API-Gateway-Execution-Logs_b8ddywgc9a/dev, API-Gateway-Execution-Logs_q4pfvvk389/dev
start-time: 2023-11-16T15:07:45.312Z
end-time: 2023-11-16T15:08:06.410Z
query-string:
A more informative version of this incident happened in anvilprod, where it didn't fail silently by timing out but actually produced error logs in the servicecachehealth execution.
API Gateway, Service and Service Cache Health Lambda execution logs for `anvilprod`:
Traceback (most recent call last):
  File "/var/task/azul/chalice.py", line 166, in patched_event_source_handler
    return old_handler(self_, event, context)
  File "/var/task/chalice/app.py", line 1753, in __call__
    return self.handler(event_obj)
  File "/var/task/app.py", line 549, in update_health_cache
    app.health_controller.update_cache()
  File "/var/task/azul/health.py", line 138, in update_cache
    health_object = dict(time=time.time(), health=self._health.as_json_fast())
  File "/var/task/azul/health.py", line 308, in as_json_fast
    return self.as_json(p.key for p in self.fast_properties[self.lambda_name])
  File "/var/task/azul/health.py", line 181, in as_json
    json = {k: getattr(self, k) for k in keys}
  File "/var/task/azul/health.py", line 181, in <dictcomp>
    json = {k: getattr(self, k) for k in keys}
  File "/var/task/azul/health.py", line 73, in __get__
    return super().__get__(obj, objtype=objtype)
  File "/var/task/azul/caching.py", line 189, in __get__
    value = obj.__dict__[self.fget.__name__] = self.fget(obj)
  File "/var/task/azul/health.py", line 265, in api_endpoints
    return self._api_endpoint(entity_type)
  File "/var/task/azul/health.py", line 245, in _api_endpoint
    response = requests.head(url)
  File "/opt/python/requests/api.py", line 100, in head
    return request("head", url, **kwargs)
  File "/opt/python/requests/api.py", line 59, in request
    return session.request(method=method, url=url, **kwargs)
  File "/opt/python/requests/sessions.py", line 589, in request
    resp = self.send(prep, **send_kwargs)
  File "/opt/python/requests/sessions.py", line 703, in send
    r = adapter.send(request, **kwargs)
  File "/opt/python/requests/adapters.py", line 501, in send
    raise ConnectionError(err, request=request)
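The traceback shows the health check calling `requests.head(url)` without a `timeout` argument, and requests applies no timeout unless one is passed. That may be why the dev and anvildev incidents end in a silent Lambda timeout instead of a logged error, though this thread does not confirm it. Below is a minimal sketch of a bounded HEAD check that logs every failure. It uses only the standard library (`urllib.request`) as a stand-in for requests so that it runs without dependencies. The helper name `head_ok` and the five-second default are assumptions for illustration, not Azul code.

```python
import http.client
import logging
import urllib.error
import urllib.request

log = logging.getLogger(__name__)


def head_ok(url: str, timeout: float = 5.0) -> bool:
    """
    Send a HEAD request and return True only if the response status is 200.

    Hypothetical helper: `timeout` (in seconds) bounds the connection attempt
    and each blocking socket read, so a stalled request fails and is logged
    instead of blocking until the caller itself times out.
    """
    request = urllib.request.Request(url, method='HEAD')
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status == 200
    except urllib.error.HTTPError as e:
        # The server answered, but with a 4xx or 5xx status
        log.warning('HEAD %s returned HTTP %d', url, e.code)
        return False
    except (OSError, http.client.HTTPException) as e:
        # Covers timeouts, connection resets, DNS failures and malformed
        # responses; URLError and TimeoutError are subclasses of OSError
        log.warning('HEAD %s failed: %r', url, e)
        return False
```

With requests, the equivalent change would be passing `timeout=` to `requests.head()`. Whether a timeout, a retry, or both is appropriate for the health check is a design decision left to the maintainers.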