Description
What happened (please include outputs or screenshots):
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/urllib3/response.py", line 696, in _update_chunk_length
    self.chunk_left = int(line, 16)
ValueError: invalid literal for int() with base 16: b''

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/urllib3/response.py", line 436, in _error_catcher
    yield
  File "/usr/local/lib/python3.9/site-packages/urllib3/response.py", line 763, in read_chunked
    self._update_chunk_length()
  File "/usr/local/lib/python3.9/site-packages/urllib3/response.py", line 700, in _update_chunk_length
    raise httplib.IncompleteRead(line)
http.client.IncompleteRead: IncompleteRead(0 bytes read)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/autoscaler/ski_utils/pod_watcher.py", line 126, in watch_pods
    main(podwatcher, control_event)
  File "/opt/autoscaler/ski_utils/pod_watcher.py", line 112, in main
    for event_type, pod in watch_all_pods(control_event):
  File "/opt/autoscaler/ski_utils/pod_watcher.py", line 63, in watch_all_pods
    for event in w.stream(v1.list_pod_for_all_namespaces):  # type: ignore
  File "/usr/local/lib/python3.9/site-packages/kubernetes/watch/watch.py", line 165, in stream
    for line in iter_resp_lines(resp):
  File "/usr/local/lib/python3.9/site-packages/kubernetes/watch/watch.py", line 56, in iter_resp_lines
    for seg in resp.stream(amt=None, decode_content=False):
  File "/usr/local/lib/python3.9/site-packages/urllib3/response.py", line 571, in stream
    for line in self.read_chunked(amt, decode_content=decode_content):
  File "/usr/local/lib/python3.9/site-packages/urllib3/response.py", line 792, in read_chunked
    self._original_response.close()
  File "/usr/local/lib/python3.9/contextlib.py", line 137, in __exit__
    self.gen.throw(typ, value, traceback)
  File "/usr/local/lib/python3.9/site-packages/urllib3/response.py", line 454, in _error_catcher
    raise ProtocolError("Connection broken: %r" % e, e)
urllib3.exceptions.ProtocolError: ('Connection broken: IncompleteRead(0 bytes read)', IncompleteRead(0 bytes read))
Code that generated it:
import logging
import time

from kubernetes import watch
from kubernetes.client import V1Pod
from kubernetes.client.rest import ApiException as RestApiException
from kubernetes.client.exceptions import ApiException as KubeApiException
from urllib3.exceptions import ProtocolError as Urllib3ProtocolError
...
while True:
    v1 = kube_filter.get_core_kube_client()
    w = watch.Watch()
    try:
        for event in w.stream(v1.list_pod_for_all_namespaces):  # type: ignore
            yield event['type'], event['object']
            if control_event.is_set():
                break
            if failure_count != 0 and int(time.time()) - last_failure_time > clear_errors_after_seconds:
                # clear failure count if more than 20 minutes has passed
                failure_count = 0
    except (KubeApiException, RestApiException) as e:
        # ApiException.status is an int, so compare against 410, not "410"
        if e.status == 410:
            logging.info("resource expired. Will restart the watch")
        else:
            failure_count += 1
            last_failure_time = int(time.time())
            if failure_count >= max_failure_count:
                logging.error(
                    f"hit {max_failure_count} errors in less than {clear_errors_after_seconds} seconds. Will exit ugly."
                )
                raise
            logging.exception("kubernetes hit api error. Pausing and will reconnect")
            time.sleep(3)
    except Exception:
        # a bare "except:" would also swallow GeneratorExit and KeyboardInterrupt
        failure_count += 4  # more aggressively increment failure count.
        last_failure_time = int(time.time())
        if failure_count >= max_failure_count:
            logging.error(
                f"hit {max_failure_count} errors in less than {clear_errors_after_seconds} seconds. Will exit ugly."
            )
            raise
        logging.exception("hit unhandled exception. Will pause and retry, but not hopeful.")
        time.sleep(3)
    finally:
        w.stop()
Code I need to add to the above (before the catch-all handler) to handle it:
    except Urllib3ProtocolError as e:
        failure_count += 1
        last_failure_time = int(time.time())
        if failure_count >= max_failure_count:
            logging.error(
                f"hit {max_failure_count} errors in less than {clear_errors_after_seconds} seconds. Will exit ugly."
            )
            raise
        logging.exception("kubernetes hit urllib3 error. Pausing and will reconnect")
        time.sleep(3)
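Each handler above repeats the same bookkeeping (bump a counter, stamp the time, give up past a limit). As a rough sketch only, that bookkeeping could live in one small helper. `FailureBudget` and its method names are hypothetical, not part of the kubernetes client, and it uses `time.monotonic` instead of `int(time.time())`, which is my substitution:

```python
import time


class FailureBudget:
    """Counts recent failures and forgets them after a quiet period.

    Hypothetical helper for illustration; not part of any library.
    """

    def __init__(self, max_failures, clear_after_seconds, clock=time.monotonic):
        self.max_failures = max_failures
        self.clear_after_seconds = clear_after_seconds
        self._clock = clock  # injectable so the logic can be tested without waiting
        self._count = 0
        self._last_failure = None

    def record_success(self):
        """Reset the count once the last failure is older than the window."""
        if self._count and self._clock() - self._last_failure > self.clear_after_seconds:
            self._count = 0

    def record_failure(self, weight=1):
        """Add a failure; return True when the caller should give up."""
        self._count += weight
        self._last_failure = self._clock()
        return self._count >= self.max_failures
```

With something like this, each `except` branch would shrink to `if budget.record_failure(): raise`, with `weight=4` in the catch-all branch, and the event loop would call `budget.record_success()` after each yielded event.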
What you expected to happen:
The Python client shouldn't surface a urllib3 error, but rather a KubeApiException or similar. Ideally, the watch code would retry internally.
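To make "retry internally" concrete, here is a minimal sketch of what I mean, written against the standard library only so it runs without a cluster. `retrying_stream`, `open_stream`, and `retryable` are hypothetical names, not client API. Note that it resets the failure count on every delivered item, which is simpler than the time-window logic in my code above:

```python
import logging
import time


def retrying_stream(open_stream, retryable, max_failures=5, pause_seconds=3.0,
                    sleep=time.sleep):
    """Yield items from open_stream(), reopening it after retryable errors.

    open_stream: zero-argument callable returning a fresh iterable (assumed).
    retryable:   tuple of exception types that should trigger a reconnect.
    """
    failures = 0
    while True:
        try:
            for item in open_stream():
                failures = 0  # a delivered item counts as a healthy connection
                yield item
            return  # the stream ended normally, so stop
        except retryable:
            failures += 1
            if failures >= max_failures:
                raise  # too many consecutive failures: surface the error
            logging.exception("stream broke; pausing before reconnecting")
            sleep(pause_seconds)
```

In my case `open_stream` would be something like `lambda: watch.Watch().stream(v1.list_pod_for_all_namespaces)` and `retryable` would be `(Urllib3ProtocolError, KubeApiException)`. I believe a reopened watch may deliver events the consumer has already seen, so the consumer should tolerate repeats.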
How to reproduce it (as minimally and precisely as possible):
Happens regularly (several times a day) on our EKS clusters running Kubernetes 1.19, which is why we have the elaborate error handling. The error handling above should be in the documentation (or there should be a document on "long-lived watches").
Anything else we need to know?:
Environment: n/a
- Kubernetes version:
Server Version: version.Info{Major:"1", Minor:"19+", GitVersion:"v1.19.15-eks-9c63c4", GitCommit:"9c63c4037a56f9cad887ee76d55142abd4155179", GitTreeState:"clean", BuildDate:"2021-10-20T00:21:03Z", GoVersion:"go1.15.15", Compiler:"gc", Platform:"linux/amd64"}
- OS: Host OS Amazon Linux 2. Container image: python:3.9-slim-buster (aka Debian)
- Python version: Python 3.9.10
- Python client version: kubernetes 19.15.0