GlueJobHook.get_job_state doesn't handle exceptions when fetching status #52152

@rawwar

Apache Airflow Provider(s)

amazon

Versions of Apache Airflow Providers

main

Apache Airflow version

main

Operating System

mac

Deployment

Other

Deployment details

No response

What happened

Currently, GlueJobHook's async_get_job_state and get_job_state do not handle exceptions raised by get_job_run in botocore and aiobotocore. A customer has been hitting an intermittent issue where the GlueJobOperator fails in Airflow even though the job succeeded on AWS:

[2025-06-21, 07:06:21 UTC] {baseoperator.py:1787} ERROR - Trigger failed:
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/site-packages/airflow/jobs/triggerer_job_runner.py", line 558, in cleanup_finished_triggers
    result = details["task"].result()
             ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/airflow/jobs/triggerer_job_runner.py", line 630, in run_trigger
    async for event in trigger.run():
  File "/usr/local/lib/python3.12/site-packages/airflow/providers/amazon/aws/triggers/glue.py", line 73, in run
    await hook.async_job_completion(self.job_name, self.run_id, self.verbose)
  File "/usr/local/lib/python3.12/site-packages/airflow/providers/amazon/aws/hooks/glue.py", line 314, in async_job_completion
    job_run_state = await self.async_get_job_state(job_name, run_id)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/airflow/providers/amazon/aws/hooks/glue.py", line 215, in async_get_job_state
    job_run = await client.get_job_run(JobName=job_name, RunId=run_id)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/aiobotocore/client.py", line 412, in _make_api_call
    raise error_class(parsed_response, operation_name)
botocore.exceptions.ClientError: An error occurred (HttpTimeoutException) when calling the GetJobRun operation: Could not write request before timeout
[2025-06-21, 07:06:21 UTC] {taskinstance.py:3310} ERROR - Task failed with exception

What you think should happen instead

get_job_state and async_get_job_state should handle these exceptions gracefully and retry the get_job_run call instead of failing the task.
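
One possible shape for the retry, as a rough sketch rather than the provider's actual implementation: wrap the call in a helper that retries only on error codes considered transient and backs off exponentially. The helper name `call_with_retries`, the `RETRYABLE_CODES` set, and the default attempt count and delay are all assumptions made for illustration; only `HttpTimeoutException` comes from the traceback above. The helper reads the error code from the exception's `response` attribute, which is where botocore's `ClientError` stores it.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")

# Assumption: codes treated as transient. Only HttpTimeoutException is taken
# from the traceback in this issue; a real fix may want a longer list.
RETRYABLE_CODES = {"HttpTimeoutException"}


def error_code(exc: Exception) -> Optional[str]:
    """Return the AWS error code carried by a ClientError-style exception."""
    response = getattr(exc, "response", None) or {}
    return response.get("Error", {}).get("Code")


def call_with_retries(
    func: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func(), retrying transient errors with exponential backoff."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except Exception as exc:
            # Re-raise anything that is not transient, or the final failure.
            if error_code(exc) not in RETRYABLE_CODES or attempt == attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")
```

In the synchronous hook this could wrap the existing call, for example `call_with_retries(lambda: self.conn.get_job_run(JobName=job_name, RunId=run_id))`. The async path would need an equivalent that awaits the call and uses `asyncio.sleep` instead.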

How to reproduce

It's intermittent, and I'm not sure how to reproduce it reliably. However, we can reproduce it during development by adding a test that mocks self.conn.get_job_run to raise an exception.
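
A minimal sketch of such a test, kept dependency-free so it runs without Airflow or AWS installed. `FakeClientError` and the module-level `get_job_state` are stand-ins (assumptions) for `botocore.exceptions.ClientError` and the hook method; a real test would patch `GlueJobHook.conn` and raise the real `ClientError`. The job name and run ID are placeholders.

```python
from unittest import mock


class FakeClientError(Exception):
    """Stand-in for botocore.exceptions.ClientError (assumption)."""

    def __init__(self, code, message):
        super().__init__(f"An error occurred ({code}): {message}")
        self.response = {"Error": {"Code": code, "Message": message}}


def get_job_state(conn, job_name, run_id):
    """Simplified stand-in for the hook method: no error handling at all."""
    job_run = conn.get_job_run(JobName=job_name, RunId=run_id)
    return job_run["JobRun"]["JobRunState"]


def reproduce():
    """Make the mocked client raise and return the error code that escapes."""
    conn = mock.MagicMock()
    conn.get_job_run.side_effect = FakeClientError(
        "HttpTimeoutException", "Could not write request before timeout"
    )
    try:
        get_job_state(conn, "my-job", "jr_placeholder")
    except FakeClientError as exc:
        # The exception propagates unhandled, which is the reported behavior.
        return exc.response["Error"]["Code"]
    return None
```

Calling `reproduce()` shows the error escaping the unguarded call. Once retries are added, the same mock can use a `side_effect` list (an exception followed by a normal response) to check that the state is eventually returned.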

Anything else

No response

Are you willing to submit PR?

  • Yes I am willing to submit a PR!
