Apache Airflow version: 1.10.8
Environment:
Kernel (uname -a): Linux 3.10.0-957.el7.x86_64 #1 SMP Thu Oct 4 20:48:51 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
Others: Note that the AWS DataSync Operator is not available in this version; we added it manually via Plugins.
What happened:
The AWS DataSync service had a problem, resulting in the Task Execution being stuck in the LAUNCHING state for a long period of time.
The DataSync Operator encountered a timeout exception (not an Airflow timeout exception, but one caused by token expiry in the underlying boto3 service).
This exception caused the operator to terminate, but the Task Execution on AWS remained stuck in LAUNCHING.
Other Airflow DataSync Operator tasks started to pile up in QUEUED status and eventually timed out, also leaving their Task Executions in the QUEUED state on AWS, blocked by the LAUNCHING task execution.
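For reference, the stuck execution can be observed directly with boto3's DataSync client; a minimal check looks like this (the ARN below is a placeholder, not from the actual incident):

```python
import boto3

# Placeholder ARN - substitute the task execution started by the stuck operator.
TASK_EXECUTION_ARN = (
    "arn:aws:datasync:us-east-1:111122223333:"
    "task/task-0123456789abcdef0/execution/exec-0123456789abcdef0"
)

client = boto3.client("datasync")

# describe_task_execution returns the execution status, e.g. QUEUED, LAUNCHING,
# PREPARING, TRANSFERRING, VERIFYING, SUCCESS or ERROR.
response = client.describe_task_execution(TaskExecutionArn=TASK_EXECUTION_ARN)
print(response["Status"])  # in this incident it reported LAUNCHING indefinitely
```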
What you expected to happen:
The DataSync operator should, by default, cancel an in-progress task execution if the operator terminates for any reason.
The AWS DataSync service can only run one DataSync task at a time (even when a task uses multiple DataSync agents), so a stuck task puts all other DataSync tasks at risk: any tasks submitted afterwards will not run.
The operator should therefore catch exceptions from wait_for_task_execution and cancel the task execution before re-raising the exception.
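A minimal sketch of what this could look like inside the operator, assuming the hook's existing wait_for_task_execution and cancel_task_execution methods (the surrounding operator code is simplified):

```python
# Sketch only: the real operator has more state handling around this.
def _wait_for_task_execution(self):
    hook = self.get_hook()
    try:
        # Polls the task execution until it finishes or max_iterations is exceeded.
        return hook.wait_for_task_execution(self.task_execution_arn)
    except Exception:
        # The AWS-side execution may still be QUEUED/LAUNCHING even though the
        # Airflow task is failing, so cancel it before re-raising.
        hook.cancel_task_execution(task_execution_arn=self.task_execution_arn)
        raise
```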
How to reproduce it:
This is very difficult to reproduce without an AWS account, a DataSync appliance, and the uncommon error conditions that cause a task to get irrecoverably stuck.
Anything else we need to know:
I authored the DataSync operator and have a working AWS Account to test in. This issue can be assigned to me.
Small improvements to DataSync operator.
Most notable is the ability of the operator to cancel an in-progress task execution, e.g. if the Airflow task times out or is killed. This avoids a zombie scenario where the AWS DataSync service keeps a task execution running even though the Airflow task has already failed.
Also made some small changes to the polling values. DataSync is a batch-based upload service that takes several minutes to run, so I changed the polling interval from 5 seconds to 30 seconds and adjusted max_iterations to what I think is a more reasonable default.
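For the timeout/kill path, a rough sketch of the cancel-on-kill behaviour (attribute and method names are illustrative, based on the existing hook):

```python
# Illustrative on_kill handler: cancel the AWS-side execution when Airflow
# kills the task, e.g. on execution_timeout or when the task is cleared.
def on_kill(self):
    hook = self.get_hook()
    if self.task_execution_arn:
        self.log.info("Cancelling TaskExecution %s", self.task_execution_arn)
        hook.cancel_task_execution(task_execution_arn=self.task_execution_arn)
```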
closes: #11011