urllib3.exceptions.ProtocolError from long lived watch #1693

Closed
@zapman449

What happened (please include outputs or screenshots):

Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/urllib3/response.py", line 696, in _update_chunk_length
    self.chunk_left = int(line, 16)
ValueError: invalid literal for int() with base 16: b''

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/urllib3/response.py", line 436, in _error_catcher
    yield
  File "/usr/local/lib/python3.9/site-packages/urllib3/response.py", line 763, in read_chunked
    self._update_chunk_length()
  File "/usr/local/lib/python3.9/site-packages/urllib3/response.py", line 700, in _update_chunk_length
    raise httplib.IncompleteRead(line)
http.client.IncompleteRead: IncompleteRead(0 bytes read)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/autoscaler/ski_utils/pod_watcher.py", line 126, in watch_pods
    main(podwatcher, control_event)
  File "/opt/autoscaler/ski_utils/pod_watcher.py", line 112, in main
    for event_type, pod in watch_all_pods(control_event):
  File "/opt/autoscaler/ski_utils/pod_watcher.py", line 63, in watch_all_pods
    for event in w.stream(v1.list_pod_for_all_namespaces):     # type: ignore
  File "/usr/local/lib/python3.9/site-packages/kubernetes/watch/watch.py", line 165, in stream
    for line in iter_resp_lines(resp):
  File "/usr/local/lib/python3.9/site-packages/kubernetes/watch/watch.py", line 56, in iter_resp_lines
    for seg in resp.stream(amt=None, decode_content=False):
  File "/usr/local/lib/python3.9/site-packages/urllib3/response.py", line 571, in stream
    for line in self.read_chunked(amt, decode_content=decode_content):
  File "/usr/local/lib/python3.9/site-packages/urllib3/response.py", line 792, in read_chunked
    self._original_response.close()
  File "/usr/local/lib/python3.9/contextlib.py", line 137, in __exit__
    self.gen.throw(typ, value, traceback)
  File "/usr/local/lib/python3.9/site-packages/urllib3/response.py", line 454, in _error_catcher
    raise ProtocolError("Connection broken: %r" % e, e)
urllib3.exceptions.ProtocolError: ('Connection broken: IncompleteRead(0 bytes read)', IncompleteRead(0 bytes read))
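Read from the first exception down, the traceback says the chunked-transfer reader expected a hexadecimal chunk-size line and got an empty one (`b''`), presumably because the connection was closed mid-stream; urllib3 then wraps that as `IncompleteRead` and finally as `ProtocolError`. A minimal, standard-library-only sketch of that parsing step follows. The names `read_chunk_length` and `IncompleteStream` are hypothetical stand-ins for illustration, not urllib3's API.

```python
import io


class IncompleteStream(Exception):
    """Hypothetical stand-in for http.client.IncompleteRead."""


def read_chunk_length(fp):
    """Parse one chunk-size line of an HTTP/1.1 chunked body."""
    line = fp.readline()
    # Drop any chunk extension after ';' and the trailing CRLF.
    size_field = line.split(b";", 1)[0].strip()
    try:
        return int(size_field, 16)
    except ValueError:
        # An empty line here means the stream ended where a chunk header was due.
        raise IncompleteStream(f"expected a chunk size, got {line!r}") from None


print(read_chunk_length(io.BytesIO(b"1a\r\n")))  # 26

try:
    read_chunk_length(io.BytesIO(b""))  # closed stream: readline() returns b''
except IncompleteStream as exc:
    print(exc)
```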

Code which generated:

import logging
import time

from kubernetes import watch
from kubernetes.client import V1Pod
from kubernetes.client.rest import ApiException as RestApiException
from kubernetes.client.exceptions import ApiException as KubeApiException

from urllib3.exceptions import ProtocolError as Urllib3ProtocolError

...
    while True:
        v1 = kube_filter.get_core_kube_client()
        w = watch.Watch()
        try:
            for event in w.stream(v1.list_pod_for_all_namespaces):     # type: ignore
                yield event['type'], event['object']
                if control_event.is_set():
                    break
                if failure_count != 0 and int(time.time()) - last_failure_time > clear_errors_after_seconds:
                    # clear failure count if more than 20 minutes has passed
                    failure_count = 0
        except (KubeApiException, RestApiException) as e:
            if e.status == 410:     # status is an int; comparing to "410" never matched
                logging.info("resource expired. Will restart the watch")
            else:
                failure_count += 1
                last_failure_time = int(time.time())
                if failure_count >= max_failure_count:
                    logging.error(
                        f"hit {max_failure_count} errors in less than {clear_errors_after_seconds} seconds. Will exit ugly."
                    )
                    raise
                logging.exception("kubernetes hit api error. Pausing and will reconnect")
                time.sleep(3)
        except Exception:   # a bare except would also swallow GeneratorExit and KeyboardInterrupt
            failure_count += 4      # more aggressively increment failure count.
            last_failure_time = int(time.time())
            if failure_count >= max_failure_count:
                logging.error(
                    f"hit {max_failure_count} errors in less than {clear_errors_after_seconds} seconds. Will exit ugly."
                )
                raise
            logging.exception("hit unhandled exception.  Will pause and retry, but not hopeful.")
            time.sleep(3)
        finally:
            w.stop()

Code I need to add to the above (before the catch-all except clause) to handle this error:

        except Urllib3ProtocolError as e:
            failure_count += 1
            last_failure_time = int(time.time())
            if failure_count >= max_failure_count:
                logging.error(
                    f"hit {max_failure_count} errors in less than {clear_errors_after_seconds} seconds. Will exit ugly."
                )
                raise
            logging.exception("kubernetes hit urllib3 error. Pausing and will reconnect")
            time.sleep(3)
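The three near-identical except branches could be folded into one helper. The following is a minimal sketch, not the client's API: `retrying_stream`, `TransientStreamError`, and every parameter name are hypothetical. It uses only the standard library so it runs without a cluster. Unlike the `while True` loop above, it returns when the stream ends normally; a long-lived watch would reopen the stream instead.

```python
import logging
import time


class TransientStreamError(Exception):
    """Hypothetical stand-in for urllib3.exceptions.ProtocolError."""


def retrying_stream(make_stream, retry_on, max_failures=5, reset_after=1200.0,
                    pause=3.0, sleep=time.sleep, clock=time.monotonic):
    """Yield items from make_stream(), reopening it after a retryable error.

    Re-raises once max_failures errors have accumulated without a quiet
    period of reset_after seconds between two consecutive failures.
    """
    failures = 0
    last_failure = None
    while True:
        try:
            for item in make_stream():
                yield item
            return  # stream ended without an error
        except retry_on:
            now = clock()
            if last_failure is not None and now - last_failure > reset_after:
                failures = 0  # the previous failure is old enough to forget
            failures += 1
            last_failure = now
            if failures >= max_failures:
                logging.error("hit %d stream errors; giving up", failures)
                raise
            logging.warning("stream broke (%d/%d); reconnecting",
                            failures, max_failures)
            sleep(pause)
```

With the real client, `make_stream` might be `lambda: watch.Watch().stream(v1.list_pod_for_all_namespaces)` and `retry_on` might be `(Urllib3ProtocolError, KubeApiException)`; that wiring is an assumption and has not been run against a cluster.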

What you expected to happen:

The Python client shouldn't surface a raw urllib3 error; it should raise a KubeApiException or similar. Ideally, the watch code would retry internally.
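One way such internal retrying could look is sketched here: remember the last `resourceVersion` seen, reopen the watch from that point after a broken connection, and start over from scratch after a 410. This is a hedged sketch under assumptions, not how the client behaves. `resuming_watch`, `open_watch`, `Gone`, and `Broken` are hypothetical names, and events are assumed to be dicts whose `"object"` exposes `metadata.resource_version`, as deserialized objects such as `V1Pod` do.

```python
from types import SimpleNamespace


class Gone(Exception):
    """Hypothetical stand-in for an ApiException with status 410."""


class Broken(Exception):
    """Hypothetical stand-in for urllib3.exceptions.ProtocolError."""


def resuming_watch(open_watch, max_restarts=10):
    """Yield events from open_watch(resource_version), resuming after errors."""
    resource_version = None  # None means "start from the current state"
    restarts = 0
    while True:
        try:
            for event in open_watch(resource_version):
                # Remember where we are so a reconnect can resume from here.
                resource_version = event["object"].metadata.resource_version
                yield event
            return  # stream ended without an error
        except Gone:
            resource_version = None  # our version expired; start over
        except Broken:
            pass  # connection dropped; keep the last seen version
        restarts += 1
        if restarts > max_restarts:
            raise RuntimeError("watch restarted too many times")


# Tiny demonstration with fake events instead of a cluster.
def fake_event(version):
    meta = SimpleNamespace(resource_version=version)
    return {"type": "MODIFIED", "object": SimpleNamespace(metadata=meta)}
```

With the real client, `open_watch` might wrap `watch.Watch().stream(v1.list_pod_for_all_namespaces, resource_version=rv)`, omitting the keyword when `rv` is `None`; treat that wiring as an untested assumption.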

How to reproduce it (as minimally and precisely as possible):

This happens several times a day on our EKS clusters running Kubernetes 1.19, which is why we have the elaborate error handling. The error handling above should be in the documentation (or there should be a dedicated document on "long-lived watches").

Anything else we need to know?:

Environment: n/a

  • Kubernetes version:
Server Version: version.Info{Major:"1", Minor:"19+", GitVersion:"v1.19.15-eks-9c63c4", GitCommit:"9c63c4037a56f9cad887ee76d55142abd4155179", GitTreeState:"clean", BuildDate:"2021-10-20T00:21:03Z", GoVersion:"go1.15.15", Compiler:"gc", Platform:"linux/amd64"}
  • OS: Host OS Amazon Linux 2. Container image: python:3.9-slim-buster aka debian
  • Python version: Python 3.9.10
  • Python client version: kubernetes 19.15.0

Labels: kind/feature, lifecycle/rotten