RedshiftCreateClusterOperator could leave clusters running after failure#61333

Merged
potiuk merged 1 commit into apache:main from
SameerMesiah97:61324-RedshiftCreateClusterOperator-Cleanup
Feb 15, 2026

Conversation

@SameerMesiah97
Contributor

@SameerMesiah97 SameerMesiah97 commented Feb 1, 2026

Description

Added best-effort cleanup for Redshift cluster creation to ensure clusters are deleted when failures occur after a cluster has been successfully created. Cleanup is controlled by a flag and is enabled by default.

Previously, Redshift cluster creation could succeed via create_cluster, but the operator could then fail during post-creation steps when wait_for_completion=True and the IAM role lacked the redshift:DescribeClusters permission. In these cases, the Airflow task failed while the Redshift cluster continued provisioning or remained active in AWS, resulting in leaked infrastructure.

Cleanup has now been implemented for RedshiftCreateClusterOperator. If a WaiterError is raised after cluster creation has been initiated, the operator attempts a best-effort deletion of the cluster. Cleanup failures are logged but do not mask or replace the original exception.
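
The control flow described above can be sketched roughly as follows. This is a minimal illustration, not the operator's actual code: `create_cluster_with_cleanup` is a hypothetical helper, `WaiterError` is a local stand-in for `botocore.exceptions.WaiterError` so the snippet runs without AWS libraries, and `client` is assumed to behave like a boto3 Redshift client.

```python
import logging

log = logging.getLogger(__name__)


class WaiterError(Exception):
    """Local stand-in for botocore.exceptions.WaiterError (assumption for this sketch)."""


def create_cluster_with_cleanup(client, cluster_identifier, delete_cluster_on_failure=True, **create_kwargs):
    """Hypothetical helper: create a cluster, wait for it, and clean up if the wait fails."""
    client.create_cluster(ClusterIdentifier=cluster_identifier, **create_kwargs)
    try:
        # Mirrors the boto3 waiter call; this step needs redshift:DescribeClusters.
        client.get_waiter("cluster_available").wait(ClusterIdentifier=cluster_identifier)
    except WaiterError:
        if delete_cluster_on_failure:
            try:
                client.delete_cluster(
                    ClusterIdentifier=cluster_identifier,
                    SkipFinalClusterSnapshot=True,
                )
            except Exception:
                # Log and swallow, so the waiter error re-raised below is what the caller sees.
                log.exception("Best-effort deletion of cluster %s failed", cluster_identifier)
        # Re-raise the original WaiterError whether or not cleanup succeeded.
        raise
```

The bare `raise` sits outside the inner `try`, so the original waiter error propagates unchanged even when the delete call itself fails.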

Rationale

Redshift cluster creation can succeed while post-creation steps fail. This commonly occurs with partially scoped IAM roles, for example, allowing redshift:CreateCluster but denying redshift:DescribeClusters, which is required by the availability waiter.
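
For illustration, a partially scoped policy of the kind described above might look like the following. The policy document is a hypothetical example (written as a Python dict and printed as JSON), not one taken from the linked issue; the wildcard `Resource` is an assumption made for brevity.

```python
import json

# Hypothetical policy reproducing the failure mode:
# cluster creation is allowed, but describing clusters is denied.
partially_scoped_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow", "Action": "redshift:CreateCluster", "Resource": "*"},
        {"Effect": "Deny", "Action": "redshift:DescribeClusters", "Resource": "*"},
    ],
}

print(json.dumps(partially_scoped_policy, indent=2))
```

Note that this example grants no redshift:DeleteCluster permission, so with such a role the cleanup attempt would itself be denied and only logged. A role would need that permission for the best-effort deletion to succeed.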

In these scenarios, the Airflow task fails while the cluster continues provisioning or running in AWS, leading to leaked infrastructure and ongoing cost. This change ensures that when a cluster has been started by the operator, failures during post-creation steps trigger a best-effort cleanup without altering error semantics or impacting unrelated resources.

It is also plausible for a cluster to reach an available state before cleanup is attempted. Cluster creation proceeds asynchronously in AWS and may complete independently of the waiter outcome or permission failures. In such cases, the cluster is immediately deletable, and attempting cleanup can successfully reclaim resources that would otherwise be left running.

Tests

  • Added a unit test verifying that cluster deletion is attempted when a WaiterError occurs during the wait phase after successful cluster creation.
  • Added a unit test ensuring that failures during cleanup do not mask or override the original exception raised by the waiter.
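
The two tests listed above could take roughly the following shape. This is a sketch under stated assumptions, not the tests from the PR: `run_create` is a hypothetical, simplified stand-in for the operator's execute path, `WaiterError` is a local stand-in for the botocore exception, and the Redshift client is a `MagicMock`.

```python
import unittest
from unittest import mock


class WaiterError(Exception):
    """Local stand-in for botocore.exceptions.WaiterError (assumption for this sketch)."""


def run_create(client, cluster_identifier):
    """Hypothetical, simplified stand-in for the operator's execute path."""
    client.create_cluster(ClusterIdentifier=cluster_identifier)
    try:
        client.get_waiter("cluster_available").wait(ClusterIdentifier=cluster_identifier)
    except WaiterError:
        try:
            client.delete_cluster(ClusterIdentifier=cluster_identifier, SkipFinalClusterSnapshot=True)
        except Exception:
            pass  # a real implementation would log the cleanup failure here
        raise


class TestCleanupOnWaiterFailure(unittest.TestCase):
    def setUp(self):
        # Creation succeeds (default MagicMock behavior); the waiter then fails.
        self.client = mock.MagicMock()
        self.client.get_waiter.return_value.wait.side_effect = WaiterError("AccessDenied")

    def test_delete_attempted_after_waiter_error(self):
        with self.assertRaises(WaiterError):
            run_create(self.client, "demo-cluster")
        self.client.delete_cluster.assert_called_once_with(
            ClusterIdentifier="demo-cluster", SkipFinalClusterSnapshot=True
        )

    def test_cleanup_failure_does_not_mask_waiter_error(self):
        # Even if deletion fails, the caller should still see the WaiterError.
        self.client.delete_cluster.side_effect = RuntimeError("delete failed")
        with self.assertRaises(WaiterError):
            run_create(self.client, "demo-cluster")
```

The second test asserts on the exception type rather than on logging, since the property of interest is that the cleanup error never replaces the waiter error.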

Documentation

The docstring for RedshiftCreateClusterOperator has been updated to document the new flag delete_cluster_on_failure and its default behavior.

Backwards Compatibility

A new flag called delete_cluster_on_failure has been added to RedshiftCreateClusterOperator with a default value of True. Best-effort cleanup will now be attempted if a post-creation failure (including WaiterError) occurs after the cluster has been successfully created.

Closes: #61324

occur after successful creation (e.g. waiter failures due to missing
DescribeClusters permissions).

This change adds best-effort cleanup when post-create steps fail by attempting
to delete the cluster that was successfully created. Cleanup errors are logged
but do not mask the original exception. This behavior is enabled by default.

Tests cover successful cleanup on waiter failure and ensure cleanup failures
do not override the original error.
@boring-cyborg boring-cyborg bot added area:providers provider:amazon AWS/Amazon - related issues labels Feb 1, 2026
@eladkal eladkal requested a review from vincbeck February 5, 2026 18:39
@potiuk
Member

potiuk commented Feb 15, 2026

Nice!

@potiuk potiuk merged commit 39b914b into apache:main Feb 15, 2026
90 checks passed
choo121600 pushed a commit to choo121600/airflow that referenced this pull request Feb 22, 2026
…res (apache#61333)

Co-authored-by: Sameer Mesiah <smesiah971@gmail.com>
Labels

area:providers provider:amazon AWS/Amazon - related issues

4 participants