
GKECreateClusterOperator may leak clusters on PermissionDenied during operation polling #62301

@SameerMesiah97

Description


Apache Airflow Provider(s)

google

Versions of Apache Airflow Providers

apache-airflow-providers-google>=20.0.0rc1

Apache Airflow version

main

Operating System

Debian GNU/Linux 12 (bookworm)

Deployment

Other

Deployment details

No response

What happened

When using GKECreateClusterOperator, a GKE cluster may be successfully created even when the GCP service account has partial GKE permissions, for example lacking container.operations.get.

In this scenario, the operator successfully calls create_cluster and the GKE cluster begins provisioning in GCP. However, subsequent steps—such as polling the operation in non-deferrable mode—fail due to insufficient permissions.

The Airflow task then fails, but the GKE cluster continues provisioning or remains active in GCP, resulting in leaked infrastructure and ongoing cost.

This can occur, for example, when the service account allows container.clusters.create but explicitly denies container.operations.get, which is required to monitor the long-running operation.

What you think should happen instead

If the operator fails after successfully initiating cluster creation (for example due to missing container.operations.get or other follow-up permissions), it should make a best-effort attempt to clean up the partially created resource by deleting the cluster.

Cleanup should be attempted opportunistically (i.e. only if the cluster name is known and deletion permissions are available), and failure to clean up should not mask or replace the original exception.
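The cleanup contract described above (opportunistic, never masking the original error) can be sketched roughly as follows. This is illustrative only: the function name and the hook methods `create_cluster`, `wait_for_operation`, and `delete_cluster` are hypothetical stand-ins, not the operator's actual code.

```python
import logging

log = logging.getLogger(__name__)


def create_cluster_with_cleanup(hook, body, project_id, location):
    """Create a cluster, poll it, and best-effort delete on post-create failure.

    ``hook`` is assumed to expose ``create_cluster``, ``wait_for_operation``
    and ``delete_cluster`` (hypothetical method names for illustration).
    """
    operation = hook.create_cluster(body=body, project_id=project_id, location=location)
    try:
        # Polling is where a partially scoped service account fails
        # (e.g. missing container.operations.get).
        return hook.wait_for_operation(operation)
    except Exception:
        try:
            # Opportunistic cleanup: the cluster name is known from the body.
            hook.delete_cluster(name=body["name"], project_id=project_id, location=location)
        except Exception:
            # Cleanup failure must not mask or replace the original exception.
            log.exception("Best-effort cleanup of cluster %s failed", body["name"])
        raise  # re-raise the original polling error
```

The key design point is the nested try/except: the inner block swallows any deletion error so that the exception surfaced to the task is always the original one.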

How to reproduce

  1. Create a custom IAM role that allows container.clusters.create and denies/omits container.operations.get

  2. Create a service account and attach this custom role.

  3. Create a GCP connection in Airflow using this service account.
    (For example: gcp_cloud_default.)

  4. Use the following DAG:
    (Please replace <PROJECT_ID> and <REGION>
    with your GCP project ID and a valid region, respectively.)

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.kubernetes_engine import GKECreateClusterOperator

with DAG(
    dag_id="gke_partial_auth_cluster_leak_repro",
    start_date=datetime(2025, 1, 1),
    schedule=None,
    catchup=False,
) as dag:

    create_cluster = GKECreateClusterOperator(
        task_id="create_gke_cluster",
        project_id="<PROJECT_ID>",
        location="<REGION>",
        body={
            "name": "leaky-gke-cluster",
            "initial_node_count": 1,
        },
        gcp_conn_id="gcp_cloud_default",
        deferrable=False,  # triggers polling via operations.get
    )
```
  5. Trigger the DAG.
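Steps 1–3 above can be sketched with `gcloud`. The role ID `gkePartialRepro` and service-account name `gke-partial-sa` are arbitrary example names, and `<PROJECT_ID>` is a placeholder as in the DAG:

```shell
# Step 1: custom role that allows cluster creation but omits
# container.operations.get (role ID is an arbitrary example).
gcloud iam roles create gkePartialRepro \
    --project="<PROJECT_ID>" \
    --title="GKE partial repro" \
    --permissions="container.clusters.create"

# Step 2: service account with the custom role attached.
gcloud iam service-accounts create gke-partial-sa \
    --project="<PROJECT_ID>"

gcloud projects add-iam-policy-binding "<PROJECT_ID>" \
    --member="serviceAccount:gke-partial-sa@<PROJECT_ID>.iam.gserviceaccount.com" \
    --role="projects/<PROJECT_ID>/roles/gkePartialRepro"

# Step 3: key file to back the Airflow GCP connection (gcp_cloud_default).
gcloud iam service-accounts keys create key.json \
    --iam-account="gke-partial-sa@<PROJECT_ID>.iam.gserviceaccount.com"
```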

Observed Behaviour

The task fails with:

PermissionDenied: Required "container.operations.get" permission(s)

However, the GKE cluster continues to provision in the background.

Anything else

GKE clusters begin provisioning immediately once creation is initiated. Even if the Airflow task fails shortly after, the cluster may continue creating and eventually become active.

When failures occur after a successful create call (for example, due to partially scoped IAM permissions), leaked clusters result in unnecessary cost and manual cleanup effort.

This pattern is not novel in Airflow. Similar behaviour has been accepted in AWS resource-creation operators, for example Amazon Redshift cluster creation (see PR #61333), where infrastructure can be created successfully but leak if subsequent steps fail. Aligning the GKE operator with a best-effort cleanup approach would therefore not introduce a new behavioural precedent; it would bring the operator in line with existing provider patterns.

Relying solely on teardown tasks is not sufficient, as that shifts responsibility for preventing resource leaks onto DAG authors. Operators that create infrastructure should make reasonable best-effort attempts to clean up resources they successfully create, even if later steps fail.

While the GKE API does not always accept deletion requests during PROVISIONING, that limitation does not preclude best-effort cleanup logic (e.g. retrying deletion or attempting deletion once the cluster becomes deletable).
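One possible shape for that retry logic, as a rough sketch: `delete_when_deletable` and its `delete_fn` callable are hypothetical names, and a generic exception stands in for the "not yet deletable" state (the real API rejects deletion of a PROVISIONING cluster with a precondition-failure error).

```python
import time


def delete_when_deletable(delete_fn, attempts=5, base_delay=1.0):
    """Retry ``delete_fn`` until it succeeds or ``attempts`` are exhausted.

    Sketch under assumptions: ``delete_fn`` raises while the cluster is still
    PROVISIONING and returns normally once deletion is accepted.
    Returns True if deletion was eventually accepted, False otherwise.
    """
    for attempt in range(attempts):
        try:
            delete_fn()
            return True  # deletion accepted
        except Exception:
            # Not yet deletable; exponential backoff before the next try.
            time.sleep(base_delay * (2 ** attempt))
    return False  # caller should log the leak and surface the original error
```

A bounded retry count keeps the cleanup best-effort: if the cluster never becomes deletable within the budget, the helper gives up rather than blocking the failing task indefinitely.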

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

    Labels

    area:providers, kind:bug, needs-triage, provider:google
