urllib3.exceptions.ProtocolError from long lived watch #1693

Closed
zapman449 opened this issue Feb 6, 2022 · 5 comments
Labels
kind/feature: Categorizes issue or PR as related to a new feature.
lifecycle/rotten: Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@zapman449
Contributor

What happened (please include outputs or screenshots):

Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/urllib3/response.py", line 696, in _update_chunk_length
    self.chunk_left = int(line, 16)
ValueError: invalid literal for int() with base 16: b''

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/urllib3/response.py", line 436, in _error_catcher
    yield
  File "/usr/local/lib/python3.9/site-packages/urllib3/response.py", line 763, in read_chunked
    self._update_chunk_length()
  File "/usr/local/lib/python3.9/site-packages/urllib3/response.py", line 700, in _update_chunk_length
    raise httplib.IncompleteRead(line)
http.client.IncompleteRead: IncompleteRead(0 bytes read)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/autoscaler/ski_utils/pod_watcher.py", line 126, in watch_pods
    main(podwatcher, control_event)
  File "/opt/autoscaler/ski_utils/pod_watcher.py", line 112, in main
    for event_type, pod in watch_all_pods(control_event):
  File "/opt/autoscaler/ski_utils/pod_watcher.py", line 63, in watch_all_pods
    for event in w.stream(v1.list_pod_for_all_namespaces):     # type: ignore
  File "/usr/local/lib/python3.9/site-packages/kubernetes/watch/watch.py", line 165, in stream
    for line in iter_resp_lines(resp):
  File "/usr/local/lib/python3.9/site-packages/kubernetes/watch/watch.py", line 56, in iter_resp_lines
    for seg in resp.stream(amt=None, decode_content=False):
  File "/usr/local/lib/python3.9/site-packages/urllib3/response.py", line 571, in stream
    for line in self.read_chunked(amt, decode_content=decode_content):
  File "/usr/local/lib/python3.9/site-packages/urllib3/response.py", line 792, in read_chunked
    self._original_response.close()
  File "/usr/local/lib/python3.9/contextlib.py", line 137, in __exit__
    self.gen.throw(typ, value, traceback)
  File "/usr/local/lib/python3.9/site-packages/urllib3/response.py", line 454, in _error_catcher
    raise ProtocolError("Connection broken: %r" % e, e)
urllib3.exceptions.ProtocolError: ('Connection broken: IncompleteRead(0 bytes read)', IncompleteRead(0 bytes read))

Code which generated the error:

import logging
import time

from kubernetes import watch
from kubernetes.client import V1Pod
from kubernetes.client.rest import ApiException as RestApiException
from kubernetes.client.exceptions import ApiException as KubeApiException

from urllib3.exceptions import ProtocolError as Urllib3ProtocolError

...
    while True:
        v1 = kube_filter.get_core_kube_client()
        w = watch.Watch()
        try:
            for event in w.stream(v1.list_pod_for_all_namespaces):     # type: ignore
                yield event['type'], event['object']
                if control_event.is_set():
                    break
                if failure_count != 0 and int(time.time()) - last_failure_time > clear_errors_after_seconds:
                    # clear failure count if more than 20 minutes has passed
                    failure_count = 0
        except (KubeApiException, RestApiException) as e:
            if e.status == 410:  # ApiException.status is an int, not a string
                logging.info("resource expired. Will restart the watch")
            else:
                failure_count += 1
                last_failure_time = int(time.time())
                if failure_count >= max_failure_count:
                    logging.error(
                        f"hit {max_failure_count} errors in less than {clear_errors_after_seconds} seconds. Will exit ugly."
                    )
                    raise
                logging.exception("kuberentes hit api error. Pausing and will reconnect")
                time.sleep(3)
        except Exception:
            failure_count += 4      # more aggressively increment failure count.
            last_failure_time = int(time.time())
            if failure_count >= max_failure_count:
                logging.error(
                    f"hit {max_failure_count} errors in less than {clear_errors_after_seconds} seconds. Will exit ugly."
                )
                raise
            logging.exception("hit unhandled exception.  Will pause and retry, but not hopeful.")
            time.sleep(3)
        finally:
            w.stop()

Code I need to add to the above to handle this error (an alternative that folds it into the existing handler is sketched after the block):

        except Urllib3ProtocolError as e:
            failure_count += 1
            last_failure_time = int(time.time())
            if failure_count >= max_failure_count:
                logging.error(
                    f"hit {max_failure_count} errors in less than {clear_errors_after_seconds} seconds. Will exit ugly."
                )
                raise
            logging.exception("kuberentes hit urllib3 error. Pausing and will reconnect")
            time.sleep(3)
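
An alternative I may go with instead is to fold the urllib3 error into the existing handler rather than duplicating the bookkeeping. This is a rough sketch only; the 410 check applies solely to the API exceptions, since ProtocolError carries no status attribute:

        except (KubeApiException, RestApiException, Urllib3ProtocolError) as e:
            # Only the API exceptions carry a status code; ProtocolError does not.
            if getattr(e, "status", None) == 410:
                logging.info("resource expired. Will restart the watch")
            else:
                failure_count += 1
                last_failure_time = int(time.time())
                if failure_count >= max_failure_count:
                    logging.error(
                        f"hit {max_failure_count} errors in less than {clear_errors_after_seconds} seconds. Will exit ugly."
                    )
                    raise
                logging.exception("watch hit an error. Pausing and will reconnect")
                time.sleep(3)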

What you expected to happen:

The Python client shouldn't surface a urllib3 error, but rather a KubeApiException or similar. Ideally, the watch code would retry internally.
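
For illustration, this is roughly the internal retry behavior I have in mind. It is only a sketch: the helper name watch_pods_forever is made up, and resuming from the last seen resourceVersion is my assumption about how the library could do it, not an existing feature.

import logging
import time

from kubernetes import client, config, watch
from urllib3.exceptions import ProtocolError


def watch_pods_forever():
    """Hypothetical helper: keep a pod watch alive across dropped connections
    by resuming from the last seen resourceVersion instead of raising."""
    config.load_incluster_config()  # or config.load_kube_config() outside a cluster
    v1 = client.CoreV1Api()
    resource_version = None
    while True:
        w = watch.Watch()
        try:
            kwargs = {"resource_version": resource_version} if resource_version else {}
            for event in w.stream(v1.list_pod_for_all_namespaces, **kwargs):
                # Remember where we are so a reconnect can pick up from here.
                resource_version = event["object"].metadata.resource_version
                yield event["type"], event["object"]
        except ProtocolError:
            # The failure mode from the traceback above: back off briefly and resume.
            logging.warning("watch connection broken, reconnecting")
            time.sleep(1)
        except client.exceptions.ApiException as e:
            if e.status == 410:
                # resourceVersion expired: fall back to a fresh list and watch.
                resource_version = None
            else:
                raise
        finally:
            w.stop()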

How to reproduce it (as minimally and precisely as possible):

Happens "regularly" on our EKS clusters running kube 1.19. Regularly meaning several times a day, which is why we have the elaborate error handling. The error handling above should be in the documentation (or have a document for "long lived watches")

Anything else we need to know?: n/a

Environment:

  • Kubernetes version:
Server Version: version.Info{Major:"1", Minor:"19+", GitVersion:"v1.19.15-eks-9c63c4", GitCommit:"9c63c4037a56f9cad887ee76d55142abd4155179", GitTreeState:"clean", BuildDate:"2021-10-20T00:21:03Z", GoVersion:"go1.15.15", Compiler:"gc", Platform:"linux/amd64"}
  • OS: Host OS Amazon Linux 2. Container image: python:3.9-slim-buster aka debian
  • Python version: Python 3.9.10
  • Python client version: kubernetes 19.15.0
zapman449 added the kind/bug label on Feb 6, 2022
@roycaihw
Member

The watch client itself doesn't handle retries (neither does the watch in client-go).

The right solution is to implement an informer in this client (#868), or something smaller like a retrywatcher.

cc @yliaog
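
For the sake of discussion, something smaller could look roughly like this. It is only a sketch, under the assumption that the wrapper tracks the last seen resourceVersion and re-opens the stream; retry_stream is a made-up name, not an existing API.

import time

from kubernetes import watch
from urllib3.exceptions import ProtocolError


def retry_stream(list_func, backoff_seconds=1, **kwargs):
    """Hypothetical retry wrapper: yield watch events and transparently
    re-open the stream when the HTTP connection drops."""
    resource_version = kwargs.pop("resource_version", None)
    while True:
        w = watch.Watch()
        try:
            if resource_version is not None:
                kwargs["resource_version"] = resource_version
            for event in w.stream(list_func, **kwargs):
                resource_version = event["object"].metadata.resource_version
                yield event
        except ProtocolError:
            # The failure mode reported in this issue; back off and reconnect.
            time.sleep(backoff_seconds)
        finally:
            w.stop()

An informer (#868) would go further by also keeping a local cache of the watched objects and relisting on 410.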

roycaihw added the kind/feature label and removed the kind/bug label on Feb 14, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-ci-robot added the lifecycle/stale label on May 15, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label on Jun 14, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

@k8s-ci-robot
Contributor

@k8s-triage-robot: Closing this issue.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

caboteria added a commit to epic-gateway/ansible-playbook that referenced this issue Sep 5, 2023
These pop up frequently, and while systemctl restarts the daemon, the chaff in the log is annoying.

It looks like the kubernetes-client maintainers won't fix the bug, so we've got to work around it.

kubernetes-client/python#1693