
GKECreateClusterOperator may leak clusters on PermissionDenied during operation polling #62301

@SameerMesiah97

Description


Apache Airflow Provider(s)

google

Versions of Apache Airflow Providers

apache-airflow-providers-google>=20.0.0rc1

Apache Airflow version

main

Operating System

Debian GNU/Linux 12 (bookworm)

Deployment

Other

Deployment details

No response

What happened

When using GKECreateClusterOperator, a GKE cluster may be successfully created even when the GCP service account has partial GKE permissions, for example lacking container.operations.get.

In this scenario, the operator successfully calls create_cluster and the GKE cluster begins provisioning in GCP. However, subsequent steps—such as polling the operation in non-deferrable mode—fail due to insufficient permissions.

The Airflow task then fails, but the GKE cluster continues provisioning or remains active in GCP, resulting in leaked infrastructure and ongoing cost.

This can occur, for example, when the service account allows container.clusters.create but explicitly denies container.operations.get, which is required to monitor the long-running operation.

What you think should happen instead

If the operator fails after successfully initiating cluster creation (for example due to missing container.operations.get or other follow-up permissions), it should make a best-effort attempt to clean up the partially created resource by deleting the cluster.

Cleanup should be attempted opportunistically (i.e. only if the cluster name is known and deletion permissions are available), and failure to clean up should not mask or replace the original exception.
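The cleanup contract described above (opportunistic, never masking the original error) can be sketched roughly as follows. This is illustrative only: the function name and the hook methods `create_cluster`, `wait_for_operation`, and `delete_cluster` are hypothetical stand-ins, not the operator's actual code.

```python
import logging

log = logging.getLogger(__name__)


def create_cluster_with_cleanup(hook, body, project_id, location):
    """Create a cluster, poll it, and best-effort delete on post-create failure.

    ``hook`` is assumed to expose ``create_cluster``, ``wait_for_operation``
    and ``delete_cluster`` (hypothetical method names for illustration).
    """
    operation = hook.create_cluster(body=body, project_id=project_id, location=location)
    try:
        # Polling is where a partially scoped service account fails
        # (e.g. missing container.operations.get).
        return hook.wait_for_operation(operation)
    except Exception:
        try:
            # Opportunistic cleanup: the cluster name is known from the body.
            hook.delete_cluster(name=body["name"], project_id=project_id, location=location)
        except Exception:
            # Cleanup failure must not mask or replace the original exception.
            log.exception("Best-effort cleanup of cluster %s failed", body["name"])
        raise  # re-raise the original polling error
```

The key design point is the nested try/except: the inner block swallows any deletion error so that the exception surfaced to the task is always the original one.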

How to reproduce

  1. Create a custom IAM role that allows container.clusters.create and denies/omits container.operations.get

  2. Create a service account and attach this custom role.

  3. Create a GCP connection in Airflow using this service account.
    (For example: gcp_cloud_default.)

  4. Use the following DAG:
    (Please replace <PROJECT_ID> and <REGION>
    with your GCP project ID and a valid region, respectively.)

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.kubernetes_engine import GKECreateClusterOperator

with DAG(
    dag_id="gke_partial_auth_cluster_leak_repro",
    start_date=datetime(2025, 1, 1),
    schedule=None,
    catchup=False,
) as dag:

    create_cluster = GKECreateClusterOperator(
        task_id="create_gke_cluster",
        project_id="<PROJECT_ID>",
        location="<REGION>",
        body={
            "name": "leaky-gke-cluster",
            "initial_node_count": 1,
        },
        gcp_conn_id="gcp_cloud_default",
        deferrable=False,  # triggers polling via operations.get
    )
```
  5. Trigger the DAG.
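Steps 1–3 above can be sketched with `gcloud`. The role ID `gkePartialRepro` and service-account name `gke-partial-sa` are arbitrary example names, and `<PROJECT_ID>` is a placeholder as in the DAG:

```shell
# Step 1: custom role that allows cluster creation but omits
# container.operations.get (role ID is an arbitrary example).
gcloud iam roles create gkePartialRepro \
    --project="<PROJECT_ID>" \
    --title="GKE partial repro" \
    --permissions="container.clusters.create"

# Step 2: service account with the custom role attached.
gcloud iam service-accounts create gke-partial-sa \
    --project="<PROJECT_ID>"

gcloud projects add-iam-policy-binding "<PROJECT_ID>" \
    --member="serviceAccount:gke-partial-sa@<PROJECT_ID>.iam.gserviceaccount.com" \
    --role="projects/<PROJECT_ID>/roles/gkePartialRepro"

# Step 3: key file to back the Airflow GCP connection (gcp_cloud_default).
gcloud iam service-accounts keys create key.json \
    --iam-account="gke-partial-sa@<PROJECT_ID>.iam.gserviceaccount.com"
```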

Observed Behaviour

The task fails with:

PermissionDenied: Required "container.operations.get" permission(s)

However, the GKE cluster continues to provision in the background.

Anything else

GKE clusters begin provisioning immediately once creation is initiated. Even if the Airflow task fails shortly after, the cluster may continue creating and eventually become active.

When failures occur after a successful create call (for example, due to partially scoped IAM permissions), leaked clusters result in unnecessary cost and manual cleanup effort.

This pattern is not novel in Airflow. Similar behaviour has been accepted in AWS resource-creation operators, for example Amazon Redshift cluster creation (see PR #61333), where infrastructure can be created successfully but leak if subsequent steps fail. Aligning the GKE operator with a best-effort cleanup approach would therefore not introduce a new behavioural precedent; it would bring the operator in line with existing provider patterns.

Relying solely on teardown tasks is not sufficient, as that shifts responsibility for preventing resource leaks onto DAG authors. Operators that create infrastructure should make reasonable best-effort attempts to clean up resources they successfully create, even if later steps fail.

While the GKE API does not always accept deletion requests during PROVISIONING, that limitation does not preclude best-effort cleanup logic (e.g. retrying deletion or attempting deletion once the cluster becomes deletable).
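One possible shape for that retry logic, as a rough sketch: `delete_when_deletable` and its `delete_fn` callable are hypothetical names, and a generic exception stands in for the "not yet deletable" state (the real API rejects deletion of a PROVISIONING cluster with a precondition-failure error).

```python
import time


def delete_when_deletable(delete_fn, attempts=5, base_delay=1.0):
    """Retry ``delete_fn`` until it succeeds or ``attempts`` are exhausted.

    Sketch under assumptions: ``delete_fn`` raises while the cluster is still
    PROVISIONING and returns normally once deletion is accepted.
    Returns True if deletion was eventually accepted, False otherwise.
    """
    for attempt in range(attempts):
        try:
            delete_fn()
            return True  # deletion accepted
        except Exception:
            # Not yet deletable; exponential backoff before the next try.
            time.sleep(base_delay * (2 ** attempt))
    return False  # caller should log the leak and surface the original error
```

A bounded retry count keeps the cleanup best-effort: if the cluster never becomes deletable within the budget, the helper gives up rather than blocking the failing task indefinitely.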

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

    Labels

    area:providers, kind:bug, needs-triage, provider:google
