DataprocDeleteClusterOperator fails if cluster was already deleted by DataprocCreateClusterOperator(delete_on_error=True) #59812

@shivannakarthik

Apache Airflow Provider(s)

google

Versions of Apache Airflow Providers

No response

Apache Airflow version

main

Operating System

ubuntu

Deployment

Astronomer

Deployment details

No response

What happened

When implementing the Ephemeral Dataproc Cluster pattern:
Create Cluster -> Run Jobs -> Delete Cluster (TriggerRule.ALL_DONE)

There is a conflict between the default behavior of DataprocCreateClusterOperator and the downstream DataprocDeleteClusterOperator.

  1. DataprocCreateClusterOperator has delete_on_error=True by default. If the cluster creation fails and ends up in an ERROR state, the operator automatically deletes the cluster.
  2. The downstream DataprocDeleteClusterOperator triggers (due to TriggerRule.ALL_DONE).
  3. It attempts to delete a cluster that no longer exists.
  4. The DataprocDeleteClusterOperator fails with a NotFound (404) error from the Google Cloud API.

This marks the cleanup task as failed, which adds noise and can mask the actual upstream failure in monitoring views.

What you think should happen instead

DataprocDeleteClusterOperator should ideally be idempotent. If the cluster no longer exists (the API returns 404 NotFound), the operator should treat the task as successful (or skipped) rather than failed.

Currently, the deferrable mode implementation checks for existence:

    try:
        hook.get_cluster(...)
    except NotFound:
        self.log.info("Cluster deleted.")
        return

However, the standard synchronous execute path does not seem to catch NotFound exceptions during the delete operation.

How to reproduce

  1. Create a DAG with DataprocCreateClusterOperator -> DataprocDeleteClusterOperator (with trigger_rule=TriggerRule.ALL_DONE).
  2. Force the cluster creation to enter an ERROR state (e.g., by providing invalid configuration that passes validation but fails provisioning).
  3. DataprocCreateClusterOperator will delete the cluster and fail.
  4. DataprocDeleteClusterOperator will run, attempt to delete the missing cluster, and fail with NotFound.

Anything else

Proposed behaviour:

  1. Update DataprocDeleteClusterOperator to catch NotFound exceptions during the delete operation and log a message instead of raising an error.
  2. Alternatively, update documentation to explicitly recommend setting delete_on_error=False in DataprocCreateClusterOperator when an explicit delete task is used.
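
To make option 1 concrete, here is a minimal, self-contained sketch of the idea. It is not the provider's actual code: `FakeDataprocHook` and `delete_cluster_idempotent` are hypothetical names invented for illustration, and the local `NotFound` class is only a stand-in for `google.api_core.exceptions.NotFound` so the snippet runs without any Google or Airflow packages installed.

```python
class NotFound(Exception):
    """Stand-in for google.api_core.exceptions.NotFound (HTTP 404). Assumption for this sketch."""


class FakeDataprocHook:
    """Hypothetical in-memory substitute for the Dataproc hook; not the provider's API."""

    def __init__(self, clusters=()):
        self.clusters = set(clusters)

    def delete_cluster(self, cluster_name):
        # Mimic the API: deleting a cluster that does not exist raises NotFound.
        if cluster_name not in self.clusters:
            raise NotFound(f"Cluster {cluster_name!r} not found")
        self.clusters.remove(cluster_name)


def delete_cluster_idempotent(hook, cluster_name, log=print):
    """Delete a cluster, treating "already gone" as success.

    Returns True if this call deleted the cluster, False if it was already absent.
    """
    try:
        hook.delete_cluster(cluster_name)
    except NotFound:
        # The cluster may have been removed upstream (e.g. by delete_on_error=True).
        log(f"Cluster {cluster_name} not found; it may already have been deleted.")
        return False
    log(f"Cluster {cluster_name} deleted.")
    return True
```

With this shape, calling the helper twice on the same cluster name succeeds both times: the first call deletes the cluster, and the second only logs a message. Whether the real operator should report success or mark the task as skipped in that case is left open here.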

Are you willing to submit PR?

  • Yes, I am willing to submit a PR!
