EC2CreateInstanceOperator leaks EC2 instance on failure with partial IAM permissions #60903

@SameerMesiah97


Apache Airflow Provider(s)

amazon

Versions of Apache Airflow Providers

apache-airflow-providers-amazon==9.20.0

Apache Airflow version

main

Operating System

Debian GNU/Linux 12 (bookworm)

Deployment

Other

Deployment details

No response

What happened

When using EC2CreateInstanceOperator, an EC2 instance can be created successfully even when the task execution role has only partial EC2 permissions, for example when it lacks ec2:DescribeInstances.

In this scenario, the operator successfully calls RunInstances and creates the EC2 instance. However, subsequent calls (such as describing or waiting for the instance when wait_for_completion=True) fail due to insufficient permissions. The task then fails, but the EC2 instance continues to exist and remains running in AWS, resulting in leaked infrastructure.

What you think should happen instead

If the operator fails after successfully creating an EC2 instance (for example due to missing DescribeInstances or other follow-up permissions), it should make a best-effort attempt to clean up the partially created resource by terminating the instance.

Cleanup should be attempted opportunistically (i.e. only if the instance ID is known and the necessary permissions are available), and failure to clean up should not mask or replace the original exception.
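To make the proposed behavior concrete, here is a minimal sketch of the cleanup pattern, not the operator's actual implementation. The helper name `create_instances_with_cleanup` and its signature are hypothetical; the client is assumed to be a boto3 EC2 client (or anything exposing `run_instances`, `get_waiter` and `terminate_instances`).

```python
import logging

log = logging.getLogger(__name__)


def create_instances_with_cleanup(ec2_client, image_id, min_count=1, max_count=1, config=None):
    """Hypothetical helper: launch instances, wait for them, terminate them if the wait fails."""
    response = ec2_client.run_instances(
        ImageId=image_id, MinCount=min_count, MaxCount=max_count, **(config or {})
    )
    # From this point on the instances exist, so their IDs are known.
    instance_ids = [instance["InstanceId"] for instance in response["Instances"]]

    try:
        # The waiter polls DescribeInstances, which is the call that fails
        # when the role lacks ec2:DescribeInstances.
        ec2_client.get_waiter("instance_running").wait(InstanceIds=instance_ids)
    except Exception:
        log.warning("Waiting failed; attempting to terminate %s", instance_ids)
        try:
            ec2_client.terminate_instances(InstanceIds=instance_ids)
        except Exception:
            # Best-effort only: log the cleanup failure and carry on so that
            # the original exception is the one that propagates.
            log.exception("Best-effort cleanup failed for %s", instance_ids)
        raise  # re-raises the original waiter exception

    return instance_ids
```

The bare `raise` sits outside the inner `try`, so a failed termination is logged but never replaces the error that caused the task to fail.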

How to reproduce

  1. Create an IAM role that allows ec2:RunInstances but denies ec2:DescribeInstances.
  2. Configure an AWS connection in Airflow using this role.
  3. Use the following DAG:
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.ec2 import EC2CreateInstanceOperator


with DAG(
    dag_id="ec2_partial_auth_leak_repro",
    start_date=datetime(2025, 1, 1),
    schedule=None,
    catchup=False,
) as dag:
    create_instance = EC2CreateInstanceOperator(
        task_id="create_instance",
        aws_conn_id="aws_test_conn",
        image_id="ami-xxxxxxxxxxxxxxxxx",
        min_count=1,
        max_count=1,
        config={
            "SubnetId": "subnet-xxxxxxxxxxxxxxxxx",  # public subnet
            "SecurityGroupIds": ["sg-xxxxxxxxxxxxxxxxx"],
            "InstanceType": "t3.micro",
        },
        wait_for_completion=True,  # triggers DescribeInstances via waiter
    )
  4. Trigger the DAG.
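For step 1, a policy along the following lines should produce the partial-permission role. This is a sketch only: the `Sid` values and the variable names are made up for illustration, and a real launch may need further permissions depending on the configuration. The script prints the JSON document to attach to the role.

```python
import json

# Hypothetical policy for reproduction only: allows launching instances
# but explicitly denies describing them.
partial_ec2_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowLaunch",
            "Effect": "Allow",
            "Action": "ec2:RunInstances",
            "Resource": "*",
        },
        {
            "Sid": "DenyDescribe",
            "Effect": "Deny",
            "Action": "ec2:DescribeInstances",
            "Resource": "*",
        },
    ],
}

policy_json = json.dumps(partial_ec2_policy, indent=2)
print(policy_json)
```

Note that this policy does not grant ec2:TerminateInstances, so with this exact role a best-effort cleanup would also be denied; add that action to observe cleanup succeeding.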

Expected Result
The task fails because the ec2:DescribeInstances permission is missing, but the EC2 instance remains running in AWS and is not terminated automatically.

Anything else

This behavior can be surprising and potentially costly, as infrastructure is created even though the Airflow task fails. Other Airflow operators that manage external resources typically attempt best-effort cleanup on failure to avoid leaking infrastructure.

Are you willing to submit PR?

  • Yes I am willing to submit a PR!
