Optimize K8s API usage for watching events #59080

wolfdn · 2025-12-05T06:53:04Z

Description

This PR optimizes how the KubernetesPodOperator interacts with the Kubernetes API when retrieving events.

Previously, the operator did not pass the resourceVersion parameter when listing events for a pod. This forced Kubernetes to perform a quorum read for every request—an expensive operation. Combined with frequent polling for new events, this created significant load on the Kubernetes API and etcd, especially when many pods were started in parallel.

Best practice is for clients to store the resourceVersion from each response and provide it in subsequent requests. This allows Kubernetes to serve the event list far more efficiently. As stated in the Kubernetes documentation:

Unless you have strong consistency requirements, using resourceVersionMatch=NotOlderThan and a known resourceVersion is preferable since it can achieve better performance and scalability of your cluster than leaving resourceVersion and resourceVersionMatch unset, which requires quorum read to be served.

Reference: https://kubernetes.io/docs/reference/using-api/api-concepts/#semantics-for-get-and-list

With this change, the operator performs one initial event listing without a resourceVersion, and all subsequent requests include the last known resourceVersion.
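A minimal sketch of this pattern, not the code from this PR. `list_events` is a hypothetical stand-in for the real list call, and the plain dict it returns only mimics the `metadata.resourceVersion` field of a Kubernetes list response. The keyword names `resource_version` and `resource_version_match` follow the Python client's naming; whether the operator also sets `resourceVersionMatch` is an assumption taken from the documentation quote above.

```python
from typing import Callable, Optional


class EventLister:
    """Remembers the last resourceVersion between event list calls."""

    def __init__(self, list_events: Callable[..., dict]) -> None:
        # list_events is a hypothetical callable wrapping the real API call.
        self._list_events = list_events
        self._resource_version: Optional[str] = None

    def fetch(self) -> list:
        kwargs = {}
        if self._resource_version is not None:
            # Subsequent calls: pass the known version so the API server
            # does not need a quorum read to answer.
            kwargs["resource_version"] = self._resource_version
            kwargs["resource_version_match"] = "NotOlderThan"
        response = self._list_events(**kwargs)
        # Store the version returned by the server for the next request.
        self._resource_version = response["metadata"]["resourceVersion"]
        return response["items"]
```

The first `fetch()` call sends no version, and every later call reuses the version from the previous response.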

Additionally, this PR introduces usage of the Kubernetes watch API in deferred (asynchronous) mode. Instead of polling every few seconds, the operator can now watch for new events. This provides two major benefits:

- New events become visible almost immediately.
- The number of requests sent to the Kubernetes API is reduced because the watch connection remains active for a longer period.
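To illustrate how a watch differs from polling, here is a rough sketch of consuming one watch connection. It is not the operator's implementation. The `stream` argument is assumed to yield dicts with `type` and `object` keys, similar to what `kubernetes.watch.Watch().stream(...)` produces in the Python client; the real client delivers model objects rather than the plain dicts used here, and `follow_events` is a hypothetical name.

```python
from typing import Iterable, Iterator


def follow_events(stream: Iterable[dict]) -> Iterator[str]:
    """Yield a log line for each event delivered over one watch connection."""
    for item in stream:
        # A watch delivers change notifications as they happen;
        # skip anything that is not a new or updated event.
        if item["type"] not in ("ADDED", "MODIFIED"):
            continue
        event = item["object"]
        yield f'{event["reason"]}: {event["message"]}'
```

A single long-lived stream replaces the repeated list requests that polling would need over the same period.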

We implemented this change after observing a high number of HTTP 429 (rate-limited) responses from our cluster’s API server. One contributing factor was the large volume of GET requests for event listings, which placed heavy load on etcd. After deploying a patched version of the operator with these improvements, the number of 429 responses dropped from several thousand per minute to nearly zero.

Changes

- Remember resourceVersion when retrieving events from the K8s API
- Use the Kubernetes watch API to watch events when running in deferred (asynchronous) mode
  - There is also a fallback to polling events in deferred mode in case the Airflow triggerer does not have permission to watch events (to stay compatible with older versions of the Helm chart)
- Improve the mechanism that avoids printing duplicate events (it now remembers seen event UIDs instead of counting events)
- Add the watch verb for events to the pod launcher role in the Helm chart (required so that the Airflow triggerer has permission to watch events)
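The fallback and the UID-based de-duplication can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in `pod_manager.py`: `WatchForbidden`, `event_source`, and `new_events` are hypothetical names, and events are plain dicts carrying `metadata.uid`.

```python
from typing import Callable, Iterable, Iterator, Set


class WatchForbidden(Exception):
    """Hypothetical error for a rejected watch request (for example HTTP 403)."""


def event_source(
    watch: Callable[[], Iterable[dict]],
    poll: Callable[[], Iterable[dict]],
) -> Iterator[dict]:
    """Prefer the watch API; fall back to a polled list if watching is forbidden."""
    try:
        yield from watch()
    except WatchForbidden:
        yield from poll()


def new_events(events: Iterable[dict], seen_uids: Set[str]) -> Iterator[dict]:
    """Yield only events whose UID has not been reported before."""
    for event in events:
        uid = event["metadata"]["uid"]
        if uid in seen_uids:
            continue
        seen_uids.add(uid)
        yield event
```

Tracking UIDs keeps the output free of duplicates even when the same event arrives through both a list call and a watch, which a simple event counter cannot distinguish.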


jscheffl

Thanks for this - it looks like an impressive improvement!

One mini nit. Before merging, I would leave the PR open for a few days so that another pair of eyes can review it. LGTM in my view.

providers/cncf/kubernetes/src/airflow/providers/cncf/kubernetes/utils/pod_manager.py

jscheffl · 2025-12-07T18:07:29Z

@jedcunningham It would be great to have your opinion, and to have this in the chart 1.19 release as well.

jscheffl · 2025-12-08T20:06:54Z

I heard from @AutomationDev85 about some problems with the async event polling, so we might need to triage tomorrow to double-check that this is not adding more problems than benefits. Please do not merge until this is clarified tomorrow (which might be 10:00 CET on Tuesday the 9th).

AutomationDev85 · 2025-12-11T09:07:04Z

@jscheffl @wolfdn I added a commit which improves pod start handling by awaiting start completion and cancelling the parallel event watcher. This resolves a sporadic hang in the event stream; after testing, the new approach proved more stable.
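For readers following along, the pattern described above (await start completion, then cancel the parallel event watcher) might look roughly like the sketch below. It is only an illustration and not the code from the commit; `wait_for_start`, `start_check`, and `watch_events` are hypothetical names.

```python
import asyncio


async def wait_for_start(start_check, watch_events):
    """Run the event watcher in parallel, await pod start, then cancel the watcher."""
    watcher = asyncio.create_task(watch_events())
    try:
        # Block only on the start check; the watcher runs alongside it.
        return await start_check()
    finally:
        watcher.cancel()
        # Wait for the cancellation to complete so the task does not linger.
        await asyncio.gather(watcher, return_exceptions=True)
```

Cancelling in a `finally` block means the watcher is stopped whether the start check succeeds, fails, or times out.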

jscheffl

Looks good, and even a bit better like this. I would still wait 1-2 days (there is no pressure to merge), hoping for feedback from others prior to merging. Some more eyes might be good.

potiuk · 2025-12-11T23:04:12Z

This looks like a fantastic improvement.

Optimize K8s API usage for watching events

c9d228c

wolfdn requested review from hussein-awala, jedcunningham and jscheffl as code owners December 5, 2025 06:53

boring-cyborg bot added the area:helm-chart, area:providers, and provider:cncf-kubernetes labels Dec 5, 2025

wolfdn added 2 commits December 5, 2025 09:50

Fix mypy errors

c2992c1

Merge branch 'main' into feature/optimize-kubernetes-api-usage

5dec1af

jscheffl approved these changes Dec 7, 2025

providers/cncf/kubernetes/src/airflow/providers/cncf/kubernetes/utils/pod_manager.py (outdated, resolved)

jscheffl added this to the Airflow Helm Chart 1.19.0 milestone Dec 7, 2025

Update providers/cncf/kubernetes/src/airflow/providers/cncf/kubernete…

6568b0c

…s/utils/pod_manager.py Co-authored-by: Jens Scheffler <95105677+jscheffl@users.noreply.github.com>

Fix hanging API communication during pod event watching

8e24bc2

jscheffl approved these changes Dec 11, 2025

potiuk approved these changes Dec 11, 2025

potiuk merged commit 81dee2f into apache:main Dec 11, 2025
125 checks passed

shahar1 mentioned this pull request Dec 30, 2025

Status of testing Providers that were prepared on December 30, 2025 #59952

Closed

54 tasks

4 participants
