Closed
Labels
area:providers, kind:bug, provider:amazon
Description
Apache Airflow Provider(s)
amazon
Versions of Apache Airflow Providers
main
Apache Airflow version
main
Operating System
mac
Deployment
Other
Deployment details
No response
What happened
Currently, GlueJobHook's async_get_job_state and get_job_state do not handle exceptions raised by get_job_run in botocore and aiobotocore. A customer has been hitting an intermittent issue where the GlueJobOperator fails in Airflow even though the job succeeded on AWS:
[2025-06-21, 07:06:21 UTC] {baseoperator.py:1787} ERROR - Trigger failed:
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/site-packages/airflow/jobs/triggerer_job_runner.py", line 558, in cleanup_finished_triggers
    result = details["task"].result()
             ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/airflow/jobs/triggerer_job_runner.py", line 630, in run_trigger
    async for event in trigger.run():
  File "/usr/local/lib/python3.12/site-packages/airflow/providers/amazon/aws/triggers/glue.py", line 73, in run
    await hook.async_job_completion(self.job_name, self.run_id, self.verbose)
  File "/usr/local/lib/python3.12/site-packages/airflow/providers/amazon/aws/hooks/glue.py", line 314, in async_job_completion
    job_run_state = await self.async_get_job_state(job_name, run_id)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/airflow/providers/amazon/aws/hooks/glue.py", line 215, in async_get_job_state
    job_run = await client.get_job_run(JobName=job_name, RunId=run_id)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/aiobotocore/client.py", line 412, in _make_api_call
    raise error_class(parsed_response, operation_name)
botocore.exceptions.ClientError: An error occurred (HttpTimeoutException) when calling the GetJobRun operation: Could not write request before timeout
[2025-06-21, 07:06:21 UTC] {taskinstance.py:3310} ERROR - Task failed with exception
What you think should happen instead
The hook should handle exceptions from get_job_run gracefully and retry transient failures instead of failing the task on the first error.
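One possible shape for the retry is sketched below. It is an illustration only, not the provider's implementation: call_with_retries, TransientError, and flaky_get_job_run are hypothetical names, and TransientError stands in for botocore.exceptions.ClientError so that the snippet runs without AWS dependencies. The response shape ({"JobRun": {"JobRunState": ...}}) matches what GetJobRun returns.

```python
import time


class TransientError(Exception):
    """Hypothetical stand-in for botocore.exceptions.ClientError."""


def call_with_retries(func, *, attempts=3, base_delay=1.0,
                      retry_on=(TransientError,), sleep=time.sleep):
    """Call func(); on a listed exception, back off exponentially and retry."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: surface the last error to the caller
            sleep(base_delay * 2 ** (attempt - 1))


# Simulate a get_job_run call that times out twice and then succeeds.
calls = {"count": 0}


def flaky_get_job_run():
    calls["count"] += 1
    if calls["count"] < 3:
        raise TransientError("Could not write request before timeout")
    return {"JobRun": {"JobRunState": "SUCCEEDED"}}


delays = []  # record the backoff delays instead of actually sleeping
response = call_with_retries(flaky_get_job_run, sleep=delays.append)
state = response["JobRun"]["JobRunState"]
print(state, delays)
```

An async variant for async_get_job_state would presumably follow the same structure with await on the client call and asyncio.sleep for the backoff. Which error codes count as retryable (for example HttpTimeoutException) is a design decision this sketch leaves open.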
How to reproduce
The failure is intermittent, so I am not sure how to trigger it on demand. It can be reproduced during development with a test that mocks self.conn.get_job_run to raise an exception.
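A minimal sketch of such a test setup follows. StandInHook is a simplified, hypothetical stand-in for GlueJobHook, and FakeClientError stands in for botocore.exceptions.ClientError; a real test would patch the actual hook's conn instead. The job name and run ID are made-up values.

```python
from unittest import mock


class FakeClientError(Exception):
    """Hypothetical stand-in for botocore.exceptions.ClientError."""


class StandInHook:
    """Simplified stand-in for GlueJobHook; not the provider's code."""

    def __init__(self, conn):
        self.conn = conn

    def get_job_state(self, job_name, run_id):
        job_run = self.conn.get_job_run(JobName=job_name, RunId=run_id)
        return job_run["JobRun"]["JobRunState"]


# Make the mocked client raise on every get_job_run call.
conn = mock.MagicMock()
conn.get_job_run.side_effect = FakeClientError(
    "HttpTimeoutException: Could not write request before timeout"
)
hook = StandInHook(conn)

try:
    hook.get_job_state("example_job", "jr_example")
    propagated = False
except FakeClientError:
    propagated = True  # without handling, the error reaches the caller

# A list side_effect models "fail once, then succeed", which is useful
# for asserting retry behaviour once it exists.
conn.get_job_run.side_effect = [
    FakeClientError("timeout"),
    {"JobRun": {"JobRunState": "SUCCEEDED"}},
]
try:
    hook.get_job_state("example_job", "jr_example")
except FakeClientError:
    pass  # first call raises
second_state = hook.get_job_state("example_job", "jr_example")
```

With an unhandled exception, propagated ends up True, which mirrors the reported failure. Once retries are added, the list form of side_effect lets a test assert that a single transient error no longer fails the call.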
Anything else
No response
Are you willing to submit PR?
- Yes I am willing to submit a PR!
Code of Conduct
- I agree to follow this project's Code of Conduct