Add resume_glue_job_on_retry to GlueJobOperator #59392

Merged
shahar1 merged 1 commit into apache:main from henry3260:fix-glueop on Feb 10, 2026

Conversation

henry3260 (Contributor) commented Dec 13, 2025

closes: #59075

Description

Add resume_glue_job_on_retry parameter to GlueJobOperator to prevent duplicate AWS Glue job runs during task retries.

Problem

When a GlueJobOperator task is retried after failure, the operator would always create a new AWS Glue job run, leading to:

  • Multiple concurrent job runs for the same task execution
  • Wasted resources and costs
  • Confusing job history and tracking

Solution

Introduce a resume_glue_job_on_retry parameter that enables idempotent retry behavior:

  1. When enabled, the operator checks if a previous job run is still in progress (RUNNING, STARTING, or STOPPING states)
  2. If in progress, reuses the existing job_run_id instead of creating a new one
  3. If the previous job is finished (SUCCEEDED, FAILED, etc.), creates a new job run as normal
  4. Previous job state is tracked via XCom across retries
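The steps above can be sketched as follows. This is a minimal illustration of the decision logic only; the function and parameter names are assumptions, not the actual GlueJobOperator code:

```python
# Illustrative sketch of the retry-resume decision described above;
# resolve_job_run_id, get_run_state, and start_new_run are hypothetical
# names, not the real operator's API.

IN_PROGRESS_STATES = {"RUNNING", "STARTING", "STOPPING"}


def resolve_job_run_id(previous_run_id, get_run_state, start_new_run):
    """Reuse a previous Glue job run if it is still in progress,
    otherwise start a new run."""
    if previous_run_id is not None:
        try:
            state = get_run_state(previous_run_id)
        except Exception:
            state = None  # graceful fallback: treat as "no reusable run"
        if state in IN_PROGRESS_STATES:
            return previous_run_id  # resume the in-flight run
    return start_new_run()  # previous run finished (SUCCEEDED, FAILED, ...)
```

For example, a retry that finds its previous run still RUNNING keeps the old run id, while a retry after SUCCEEDED or FAILED starts a fresh run.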

Changes Made

GlueJobOperator (glue.py):

  • Added resume_glue_job_on_retry: bool = False parameter to __init__
  • Enhanced execute() method to check previous job state from XCom when enabled
  • Queries AWS Glue API (get_job_run()) to verify job state before deciding to create new run
  • Proper exception handling for graceful fallback if XCom or Glue API calls fail
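The graceful-fallback behaviour in the last bullet can be sketched like this. The helpers pull_xcom and get_job_run are stand-ins for the real XCom pull and boto3 Glue API call, and the function name is an assumption, not the operator's internals:

```python
# Hypothetical sketch of the fallback lookup: any failure while reading
# XCom or querying the Glue API falls back to creating a new run.
# pull_xcom and get_job_run are illustrative stand-ins.

IN_PROGRESS_STATES = {"RUNNING", "STARTING", "STOPPING"}


def reusable_run_id(pull_xcom, get_job_run):
    """Return a job_run_id to reuse, or None to create a new run."""
    try:
        run_id = pull_xcom("job_run_id")
        if run_id is None:
            return None  # no previous run recorded
        state = get_job_run(run_id)["JobRun"]["JobRunState"]
    except Exception:
        return None  # XCom or Glue API failed: fall back to a new run
    return run_id if state in IN_PROGRESS_STATES else None
```

Returning None on any exception is what keeps the feature safe: at worst the operator behaves exactly as it did before the change and starts a new run.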

Unit Tests (test_glue.py):

  • test_check_previous_job_id_run_reuse_in_progress: Verifies previous job_run_id is reused when job is RUNNING
  • test_check_previous_job_id_run_new_on_finished: Verifies new job is created when previous job is SUCCEEDED

Backward Compatibility

Fully backward compatible: the parameter defaults to False, preserving the existing behavior.



potiuk (Member) commented Dec 14, 2025

Looks good - but likely @vincbeck @o-nikolas @ferruzzi @ramitkataria should take a look

@henry3260 henry3260 requested a review from o-nikolas December 21, 2025 16:22
wilsonhooi86 commented Jan 8, 2026

Good day @henry3260,

Happy New Year and thank you so much for taking the initiative to add this feature. It will be helpful.

I would like to clarify a specific scenario regarding a Glue job named glue_job_database_name_1. This job is designed to handle a single schema but uses a tbl_name argument to process different tables dynamically. The script logic adapts based on the table name passed during execution.

Assume one DAG with three GlueJobOperator tasks running in parallel, all calling the same Glue job, glue_job_database_name_1.

Suppose task_id="table_1" and task_id="table_2" still have Glue jobs running. If task_id="table_3" suddenly fails due to some internal error and retries, will it find the same previous_glue_job_id and avoid creating a new Glue job?

table_1 = GlueJobOperator(
    task_id="table_1",
    job_name="glue_job_database_name_1",
    verbose=False,
    script_args={"--tbl_name": "table_1"},
    resume_glue_job_on_retry=True,
    retry_limit=3,
)

table_2 = GlueJobOperator(
    task_id="table_2",
    job_name="glue_job_database_name_1",
    verbose=False,
    script_args={"--tbl_name": "table_2"},
    resume_glue_job_on_retry=True,
    retry_limit=3,
)

table_3 = GlueJobOperator(
    task_id="table_3",
    job_name="glue_job_database_name_1",
    verbose=False,
    script_args={"--tbl_name": "table_3"},
    resume_glue_job_on_retry=True,
    retry_limit=3,
)

Thanks, and let me know if you need further clarification.

shahar1 (Contributor) commented Feb 10, 2026

@henry3260 Could you please address the open issues?

henry3260 (Contributor, Author) commented

> @henry3260 Could you please address the open issues?

Sorry for the late update. I'll address them shortly.

shahar1 (Contributor) left a review comment

LGTM! I'll merge if and when the CI is green.
While the CI is running, please try to avoid making additional changes so I can merge it right after it (hopefully) ends successfully.

henry3260 (Contributor, Author) commented

Hi @wilsonhooi86! Yes, it will find the same previous_glue_job_id and avoid creating a new Glue job, because on retry each GlueJobOperator only looks up the glue_job_run_id stored for its own task_id.
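A toy illustration of the scoping described here: Airflow XCom values are stored per task instance (keyed by dag_id, task_id, and run_id), so parallel operators invoking the same Glue job name never see each other's stored run id. The dict-based store below mimics that keying and is not Airflow's actual XCom backend:

```python
# Toy stand-in for per-task XCom scoping; real XComs are stored by
# Airflow keyed on (dag_id, task_id, run_id). This dict just mimics
# the task_id isolation that makes parallel retries safe.

store = {}


def xcom_push(dag_id, task_id, key, value):
    store[(dag_id, task_id, key)] = value


def xcom_pull(dag_id, task_id, key):
    return store.get((dag_id, task_id, key))


# table_1 and table_3 each record their own Glue run id:
xcom_push("my_dag", "table_1", "glue_job_run_id", "jr_aaa")
xcom_push("my_dag", "table_3", "glue_job_run_id", "jr_ccc")

# On retry, table_3 only ever sees its own run id, never table_1's:
assert xcom_pull("my_dag", "table_3", "glue_job_run_id") == "jr_ccc"
```

Because the lookup key includes task_id, table_3's retry cannot accidentally resume a run started by table_1 or table_2, even though all three tasks target the same Glue job name.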


henry3260 (Contributor, Author) commented

> LGTM! I'll merge if and when the CI is green. While the CI is running, please try to avoid making additional changes so I could merge it right after it (hopefully) ends successfully.

Thanks <3

@shahar1 shahar1 changed the title feat: Add resume_glue_job_on_retry to GlueJobOperator Add resume_glue_job_on_retry to GlueJobOperator Feb 10, 2026
@shahar1 shahar1 merged commit 8396957 into apache:main Feb 10, 2026
90 checks passed
@henry3260 henry3260 deleted the fix-glueop branch February 10, 2026 16:56
Alok-kumar-priyadarshi pushed a commit to Alok-kumar-priyadarshi/airflow that referenced this pull request Feb 11, 2026
Ratasa143 pushed a commit to Ratasa143/airflow that referenced this pull request Feb 15, 2026
choo121600 pushed a commit to choo121600/airflow that referenced this pull request Feb 22, 2026

Labels

area:providers, provider:amazon (AWS/Amazon-related issues)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Deferred Glue operators generate a new glue job ID upon task failure even existing glue job ID still running

6 participants