Add resume_glue_job_on_retry to GlueJobOperator #59392

Merged
shahar1 merged 1 commit into apache:main from henry3260:fix-glueop on Feb 10, 2026

Conversation

henry3260 (Contributor) commented Dec 13, 2025

closes: #59075

Description

Add resume_glue_job_on_retry parameter to GlueJobOperator to prevent duplicate AWS Glue job runs during task retries.

Problem

When a GlueJobOperator task is retried after failure, the operator would always create a new AWS Glue job run, leading to:

  • Multiple concurrent job runs for the same task execution
  • Wasted resources and costs
  • Confusing job history and tracking

Solution

Introduce a resume_glue_job_on_retry parameter that enables idempotent retry behavior:

  1. When enabled, the operator checks if a previous job run is still in progress (RUNNING, STARTING, or STOPPING states)
  2. If in progress, reuses the existing job_run_id instead of creating a new one
  3. If the previous job is finished (SUCCEEDED, FAILED, etc.), creates a new job run as normal
  4. Previous job state is tracked via XCom across retries
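The steps above can be sketched as follows. This is a minimal illustration of the decision logic only; the function and parameter names are assumptions, not the actual GlueJobOperator code:

```python
# Illustrative sketch of the retry-resume decision described above;
# resolve_job_run_id, get_run_state, and start_new_run are hypothetical
# names, not the real operator's API.

IN_PROGRESS_STATES = {"RUNNING", "STARTING", "STOPPING"}


def resolve_job_run_id(previous_run_id, get_run_state, start_new_run):
    """Reuse a previous Glue job run if it is still in progress,
    otherwise start a new run."""
    if previous_run_id is not None:
        try:
            state = get_run_state(previous_run_id)
        except Exception:
            state = None  # graceful fallback: treat as "no reusable run"
        if state in IN_PROGRESS_STATES:
            return previous_run_id  # resume the in-flight run
    return start_new_run()  # previous run finished (SUCCEEDED, FAILED, ...)
```

For example, a retry that finds its previous run still RUNNING keeps the old run id, while a retry after SUCCEEDED or FAILED starts a fresh run.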

Changes Made

GlueJobOperator (glue.py):

  • Added resume_glue_job_on_retry: bool = False parameter to __init__
  • Enhanced execute() method to check previous job state from XCom when enabled
  • Queries AWS Glue API (get_job_run()) to verify job state before deciding to create new run
  • Proper exception handling for graceful fallback if XCom or Glue API calls fail
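The graceful-fallback behaviour in the last bullet can be sketched like this. The helpers pull_xcom and get_job_run are stand-ins for the real XCom pull and boto3 Glue API call, and the function name is an assumption, not the operator's internals:

```python
# Hypothetical sketch of the fallback lookup: any failure while reading
# XCom or querying the Glue API falls back to creating a new run.
# pull_xcom and get_job_run are illustrative stand-ins.

IN_PROGRESS_STATES = {"RUNNING", "STARTING", "STOPPING"}


def reusable_run_id(pull_xcom, get_job_run):
    """Return a job_run_id to reuse, or None to create a new run."""
    try:
        run_id = pull_xcom("job_run_id")
        if run_id is None:
            return None  # no previous run recorded
        state = get_job_run(run_id)["JobRun"]["JobRunState"]
    except Exception:
        return None  # XCom or Glue API failed: fall back to a new run
    return run_id if state in IN_PROGRESS_STATES else None
```

Returning None on any exception is what keeps the feature safe: at worst the operator behaves exactly as it did before the change and starts a new run.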

Unit Tests (test_glue.py):

  • test_check_previous_job_id_run_reuse_in_progress: Verifies previous job_run_id is reused when job is RUNNING
  • test_check_previous_job_id_run_new_on_finished: Verifies new job is created when previous job is SUCCEEDED

Backward Compatibility

Fully backward compatible: the parameter defaults to False, preserving the existing behavior.



potiuk (Member) commented Dec 14, 2025

Looks good - but likely @vincbeck @o-nikolas @ferruzzi @ramitkataria should take a look

@henry3260 henry3260 requested a review from o-nikolas December 21, 2025 16:22
wilsonhooi86 commented Jan 8, 2026

Good day @henry3260,

Happy New Year and thank you so much for taking the initiative to add this feature. It will be helpful.

I would like to clarify a specific scenario regarding a Glue job named glue_job_database_name_1. This job is designed to handle a single schema but uses a tbl_name argument to process different tables dynamically. The script logic adapts based on the table name passed during execution.

Assume one DAG with three GlueJobOperator tasks running in parallel, all calling the same Glue job, glue_job_database_name_1.

Suppose task_id="table_1" and task_id="table_2" still have Glue jobs running. If task_id="table_3" suddenly fails due to some internal error and retries, will it find the same previous_glue_job_id and avoid creating a new Glue job?

table_1 = GlueJobOperator(
    task_id="table_1",
    job_name="glue_job_database_name_1",
    verbose=False,
    script_args={"--tbl_name": "table_1"},
    resume_glue_job_on_retry=True,
    retry_limit=3,
)

table_2 = GlueJobOperator(
    task_id="table_2",
    job_name="glue_job_database_name_1",
    verbose=False,
    script_args={"--tbl_name": "table_2"},
    resume_glue_job_on_retry=True,
    retry_limit=3,
)

table_3 = GlueJobOperator(
    task_id="table_3",
    job_name="glue_job_database_name_1",
    verbose=False,
    script_args={"--tbl_name": "table_3"},
    resume_glue_job_on_retry=True,
    retry_limit=3,
)

Thanks, and let me know if you need further clarification.

shahar1 (Contributor) commented Feb 10, 2026

@henry3260 Could you please address the open issues?

henry3260 (Contributor, Author) commented

> @henry3260 Could you please address the open issues?

Sorry for the late update. I'll address them shortly.

shahar1 (Contributor) left a review comment

LGTM! I'll merge if and when the CI is green.
While the CI is running, please try to avoid making additional changes so I can merge it right after it (hopefully) ends successfully.

henry3260 (Contributor, Author) commented

Hi @wilsonhooi86! Yes, it will find the same previous_glue_job_id and avoid creating a new Glue job, because on retry each GlueJobOperator only looks up the glue_job_run_id stored for its own task_id.
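A toy illustration of the scoping described here: Airflow XCom values are stored per task instance (keyed by dag_id, task_id, and run_id), so parallel operators invoking the same Glue job name never see each other's stored run id. The dict-based store below mimics that keying and is not Airflow's actual XCom backend:

```python
# Toy stand-in for per-task XCom scoping; real XComs are stored by
# Airflow keyed on (dag_id, task_id, run_id). This dict just mimics
# the task_id isolation that makes parallel retries safe.

store = {}


def xcom_push(dag_id, task_id, key, value):
    store[(dag_id, task_id, key)] = value


def xcom_pull(dag_id, task_id, key):
    return store.get((dag_id, task_id, key))


# table_1 and table_3 each record their own Glue run id:
xcom_push("my_dag", "table_1", "glue_job_run_id", "jr_aaa")
xcom_push("my_dag", "table_3", "glue_job_run_id", "jr_ccc")

# On retry, table_3 only ever sees its own run id, never table_1's:
assert xcom_pull("my_dag", "table_3", "glue_job_run_id") == "jr_ccc"
```

Because the lookup key includes task_id, table_3's retry cannot accidentally resume a run started by table_1 or table_2, even though all three tasks target the same Glue job name.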


henry3260 (Contributor, Author) commented

> LGTM! I'll merge if and when the CI is green. While the CI is running, please try to avoid making additional changes so I could merge it right after it (hopefully) ends successfully.

Thanks <3

@shahar1 shahar1 changed the title feat: Add resume_glue_job_on_retry to GlueJobOperator Add resume_glue_job_on_retry to GlueJobOperator Feb 10, 2026
@shahar1 shahar1 merged commit 8396957 into apache:main Feb 10, 2026
90 checks passed
@henry3260 henry3260 deleted the fix-glueop branch February 10, 2026 16:56
Alok-kumar-priyadarshi pushed a commit to Alok-kumar-priyadarshi/airflow that referenced this pull request Feb 11, 2026
Ratasa143 pushed a commit to Ratasa143/airflow that referenced this pull request Feb 15, 2026
choo121600 pushed a commit to choo121600/airflow that referenced this pull request Feb 22, 2026

Labels

area:providers, provider:amazon (AWS/Amazon-related issues)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Deferred Glue operators generate a new glue job ID upon task failure even existing glue job ID still running

6 participants