Skip to content

Conversation

@AutomationDev85
Copy link
Contributor

@AutomationDev85 AutomationDev85 commented Jan 8, 2026

Overview

We observed intermittent hangs in Kubernetes API communication from KubernetesPodOperator, with calls stalling for over a day until tasks were manually stopped. Investigation showed the Kubernetes Python client wasn’t using a client-side timeout, so stalled connections could block indefinitely. To improve robustness, we add a client-side timeout to API calls so they raise a clear exception instead of leaving tasks hanging. This does not fix the underlying cluster/API issue, but it makes failures detectable and recoverable.

We chose a 60-second timeout: long enough to tolerate load, short enough to prevent indefinite hangs.

Change Summary

  • Set a 60-second client-side timeout for Kubernetes API requests.
  • Apply the timeout to sync and async wrapper API client to enable all API calls with the timeout.

@boring-cyborg boring-cyborg bot added area:providers provider:cncf-kubernetes Kubernetes (k8s) provider related issues labels Jan 8, 2026
@AutomationDev85 AutomationDev85 force-pushed the bugfix/kubernetes-api-calls-use-timeout branch from 275d65b to 25ec50c Compare January 9, 2026 14:01
@SameerMesiah97
Copy link
Contributor

SameerMesiah97 commented Jan 9, 2026

Overall solid PR. The wrapper class is a great approach.

But there is one potential interaction I wanted to flag based on reading through the code path.. If you look at def watch_pod_events (should be lines 1005-1054 in your branch), you will see that there is a parameter timeout_seconds which has a default argument of 30s (let's call this the 'watch timeout') . As your new wrapper class _TimeoutK8sApiClient (which sets the 'API timeout') is injected when def get_conn is called by def watch_pod_events.

In idle or stalled watch scenarios (i.e. no events received for a period), this effectively means the API timeout can act as a hard upper bound on how long the watch remains open if it is lower than timeout_seconds. In those cases, the watch timeout becomes a best-case limit rather than a guarantee.

Normally, this would not be too concerning but as def watch_pod_events is a public API, and this 'API timeout' is an internal default setting, it would not hurt to document this behavior.

Copy link
Contributor

@jscheffl jscheffl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some typing hint nits, else LGTM!

Also might be good to consider the comments from @SameerMesiah97 - mainy documenting this (maybe inline in code?) I know that watches are limited to 30s which was the reason for the timeout and adding the 60s hard limit still I assume is beneficial to make it fail-safe.

@AutomationDev85
Copy link
Contributor Author

@jscheffl Thank you for your input. This is a good suggestion, and I've implemented the changes accordingly.

@SameerMesiah97 Thank you for the feedback. I've implemented a dynamic approach where the client-side timeout now depends on the server-side timeout. If the server-side timeout exceeds the default API_TIMEOUT, the client-side timeout is set to the server-side timeout plus an offset. This ensures the server has adequate time to respond before the client times out. I believe this is a more robust solution than simply documenting the behavior.

@AutomationDev85 AutomationDev85 force-pushed the bugfix/kubernetes-api-calls-use-timeout branch from a3f90ab to 718a59d Compare January 12, 2026 10:57
@SameerMesiah97
Copy link
Contributor

@jscheffl Thank you for your input. This is a good suggestion, and I've implemented the changes accordingly.

@SameerMesiah97 Thank you for the feedback. I've implemented a dynamic approach where the client-side timeout now depends on the server-side timeout. If the server-side timeout exceeds the default API_TIMEOUT, the client-side timeout is set to the server-side timeout plus an offset. This ensures the server has adequate time to respond before the client times out. I believe this is a more robust solution than simply documenting the behavior.

Great. That is even better as it makes the timeout behave more in line with user expectations.

I do not want to sound pedantic here but I think this dynamic approach towards handling conflicting timeouts should be mentioned in the docstring for def watch_pod_events. Maybe something like:

timeout_seconds – Timeout in seconds for the watch stream. A small additional buffer may be applied internally.

Though the 5 second default is quite short, users scouring the logs might notice the discrepancy and think this is unintentional.

Copy link
Contributor

@jscheffl jscheffl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like it as it is already. But adding a small note does not hurt - @AutomationDev85 can you apply this nit? Then I am able to merge to the next providers wave by tomorrow.

@AutomationDev85
Copy link
Contributor Author

@SameerMesiah97 I've adopted your text—thanks for the great improvement!

@jscheffl jscheffl merged commit 3f3d0f5 into apache:main Jan 13, 2026
242 of 247 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area:providers provider:cncf-kubernetes Kubernetes (k8s) provider related issues

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants