[GkeStartPodOperator] - Kubernetes client request can hang indefinitely #36802
Labels
area:providers
good first issue
kind:bug
provider:google
Apache Airflow Provider(s)
google
Versions of Apache Airflow Providers
apache-airflow-providers-google==10.12.0
Apache Airflow version
2.6.3
Operating System
Ubuntu 20.04.6
Deployment
Other
Deployment details
N/A
What happened
In a DAG with ~500 GkeStartPodOperator tasks (running pods on another cluster, hosted on GKE), we discovered that operator execution hangs while polling logs in ~0.2% of task instances. Based on the logs, execution halts inside a kubernetes client call (read_namespaced_pod_log, to be exact). Only after the DAG run timeout (hours later), when SIGTERM is dispatched to the task run process, does execution resume; it then attempts to retry fetching the logs and pod status, but those have already been garbage collected. This looks exactly like kubernetes-client/python#1234 (comment). After running the same deployment in deferrable mode, one task also ended up locked in a similar way, this time in a different call (pod creation).
I believe this is specific to GkeStartPodOperator: KubernetesHook already has a mechanism that configures TCP keepalive by default, while GKEPodHook builds its client without it:
airflow/airflow/providers/cncf/kubernetes/hooks/kubernetes.py
Line 216 in 1d5d502
airflow/airflow/providers/google/cloud/hooks/kubernetes_engine.py
Line 390 in 1d5d502
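
For context, a minimal sketch of the keepalive technique the CNCF hook relies on, as I understand it: urllib3's default socket options are extended so that every connection the kubernetes client opens sends TCP keepalive probes. The helper name and option values below are illustrative, not the provider's actual defaults:

```python
import socket

from urllib3.connection import HTTPConnection, HTTPSConnection


def enable_tcp_keepalive(idle: int = 120, interval: int = 30, count: int = 6) -> None:
    """Patch urllib3 so every new socket sends TCP keepalive probes."""
    socket_options = [(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)]
    # The fine-grained options are Linux-specific; guard for portability.
    if hasattr(socket, "TCP_KEEPIDLE"):
        socket_options.append((socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle))
    if hasattr(socket, "TCP_KEEPINTVL"):
        socket_options.append((socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval))
    if hasattr(socket, "TCP_KEEPCNT"):
        socket_options.append((socket.IPPROTO_TCP, socket.TCP_KEEPCNT, count))

    HTTPConnection.default_socket_options = HTTPConnection.default_socket_options + socket_options
    HTTPSConnection.default_socket_options = HTTPSConnection.default_socket_options + socket_options
```

With this in place, a call like read_namespaced_pod_log that is stuck on a dead peer eventually fails once the keepalive probes go unanswered, instead of blocking forever.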
What you think should happen instead
GKEPodHook should reuse the same socket configuration as KubernetesHook and configure TCP keepalive by default (unless explicitly disabled).
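
A hypothetical sketch (not the actual provider code) of how GKEPodHook.get_conn could apply the same patch before constructing the API client; the constructor arguments and the enable_tcp_keepalive flag are assumptions, and token/auth handling is omitted:

```python
import socket

from kubernetes import client
from urllib3.connection import HTTPConnection, HTTPSConnection

# Minimal option set; the full TCP_KEEPIDLE/KEEPINTVL/KEEPCNT tuning is shown in the sketch above.
_KEEPALIVE_OPTS = [(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)]


class GKEPodHookSketch:
    def __init__(self, cluster_url: str, ssl_ca_cert: str, enable_tcp_keepalive: bool = True):
        self._cluster_url = cluster_url
        self._ssl_ca_cert = ssl_ca_cert
        self._enable_tcp_keepalive = enable_tcp_keepalive

    def get_conn(self) -> client.ApiClient:
        if self._enable_tcp_keepalive:
            # Patch urllib3 defaults so the kubernetes client's sockets send keepalive probes.
            HTTPConnection.default_socket_options = HTTPConnection.default_socket_options + _KEEPALIVE_OPTS
            HTTPSConnection.default_socket_options = HTTPSConnection.default_socket_options + _KEEPALIVE_OPTS
        configuration = client.Configuration()
        configuration.host = self._cluster_url
        configuration.ssl_ca_cert = self._ssl_ca_cert
        return client.ApiClient(configuration)
```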
How to reproduce
Run ~500 tasks on GKE with spot VMs. There is no reliable repro, but the problem has been clearly documented before and was fixed for the CNCF Kubernetes provider: #11406.
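
Illustrative only: a minimal DAG of the shape described above, fanning out many GKEStartPodOperator tasks against a GKE cluster. The project, location, cluster, and image values are placeholders:

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.kubernetes_engine import GKEStartPodOperator

with DAG(
    dag_id="gke_fanout_repro",
    start_date=datetime(2024, 1, 1),
    schedule=None,
    catchup=False,
) as dag:
    for i in range(500):
        GKEStartPodOperator(
            task_id=f"pod_task_{i}",
            project_id="my-gcp-project",      # placeholder
            location="us-central1",           # placeholder
            cluster_name="my-gke-cluster",    # placeholder (spot/preemptible node pool)
            namespace="default",
            name=f"repro-pod-{i}",
            image="busybox",
            cmds=["sh", "-c", "sleep 60"],
            get_logs=True,
        )
```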
Anything else
No response
Are you willing to submit PR?
Code of Conduct