
AWS DataSync Operator does not cancel task on Exception #11011

Closed
baolsen opened this issue Sep 18, 2020 · 2 comments · Fixed by #16589
Labels: area:providers, kind:bug, provider:amazon-aws
baolsen commented Sep 18, 2020

Apache Airflow version: 1.10.8

Environment:

  • Cloud provider or hardware configuration: 4 VCPU 8GB RAM VM
  • OS (e.g. from /etc/os-release): RHEL 7.7
  • Kernel (e.g. uname -a): Linux 3.10.0-957.el7.x86_64 #1 SMP Thu Oct 4 20:48:51 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
  • Install tools:
  • Others:
    Note that the AWS DataSync Operator is not available in this version, we manually added it via Plugins.

What happened:

The AWS DataSync service had a problem, resulting in the Task Execution being stuck in LAUNCHING for a long period of time.
The DataSync Operator encountered a timeout exception (not an Airflow timeout exception, but one caused by token expiry in the underlying boto3 service).
This exception caused the operator to terminate, but the Task Execution on AWS was still stuck in LAUNCHING.

Other Airflow Datasync Operator tasks started to pile up in QUEUED status and eventually timed out, also leaving their Task Executions in QUEUED state in AWS, blocked by the LAUNCHING task execution.

What you expected to happen:

The DataSync operator should, by default, cancel an in-progress task execution if the operator terminates for any reason.

The AWS DataSync service can only run one execution of a DataSync task at a time (even when a task uses multiple DataSync agents). So if one task execution gets stuck, any executions submitted in the future will not run, which puts all other DataSync tasks at risk.

So the operator should catch exceptions from wait_for_task_execution and cancel the task execution before re-raising the exception.
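A minimal sketch of the proposed fix might look like the following. The helper name `run_and_cancel_on_error` and the injectable `wait_fn` are hypothetical (stand-ins for the operator's internal wait_for_task_execution loop); the boto3 DataSync client calls `start_task_execution` and `cancel_task_execution` are real API operations.

```python
def run_and_cancel_on_error(client, task_arn, wait_fn):
    """Start a DataSync task execution; cancel it if waiting raises (sketch).

    `client` is a boto3 DataSync client (or a test double); `wait_fn` stands
    in for the operator's wait_for_task_execution polling loop.
    Returns the task execution ARN on success.
    """
    arn = client.start_task_execution(TaskArn=task_arn)["TaskExecutionArn"]
    try:
        wait_fn(arn)
    except Exception:
        # Cancel so the stuck execution does not block future DataSync
        # task executions, then re-raise so the Airflow task still fails.
        client.cancel_task_execution(TaskExecutionArn=arn)
        raise
    return arn
```

The key point is the bare `raise` after cancelling: the cleanup must not swallow the original exception, or Airflow would mark the task as successful.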

How to reproduce it:

Very difficult to reproduce without an AWS account and DataSync appliance, and the uncommon error conditions which cause a task to get irrecoverably stuck.

Anything else we need to know:

I authored the DataSync operator and have a working AWS Account to test in. This issue can be assigned to me.

@baolsen baolsen added the kind:bug This is a clearly a bug label Sep 18, 2020

baolsen commented Sep 18, 2020

@kaxil please assign to me


kaxil commented Sep 18, 2020

Assigned :)

Btw, let me know if you want to work on #10985 too (the new version of moto added support for DataSync, and tests are failing).

@baolsen baolsen closed this as completed Sep 18, 2020
@baolsen baolsen reopened this Sep 18, 2020
@mik-laj mik-laj added area:providers provider:amazon-aws AWS/Amazon - related issues labels Sep 30, 2020
kaxil pushed a commit that referenced this issue Jun 24, 2021
Small improvements to DataSync operator.

Most notable is the ability of the operator to cancel an in-progress task execution, e.g. if the Airflow task times out or is killed. This avoids the zombie issue where the AWS DataSync service is left with a running task execution even though Airflow's task has failed.

Also made some small changes to polling values. DataSync is a batch-based uploading service that takes several minutes to operate, so I changed the polling interval from 5 seconds to 30 seconds and adjusted max_iterations to what I think is a more reasonable default.
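The polling loop the commit describes could be sketched roughly as below. The function name, the `describe_fn` callback, and the `max_iterations=60` default are assumptions for illustration (the commit does not state the exact new default); only the 5-to-30-second interval change is taken from the commit message.

```python
import time

def wait_for_task_execution(describe_fn, arn,
                            wait_interval_seconds=30, max_iterations=60):
    """Poll a DataSync task execution until it finishes (sketch).

    `describe_fn(arn)` should return the execution status string, e.g. by
    wrapping boto3's describe_task_execution. The 30-second default mirrors
    the commit's change from 5s polling; max_iterations=60 is an assumed
    value, not the actual default from the PR.
    """
    for _ in range(max_iterations):
        status = describe_fn(arn)
        if status == "SUCCESS":
            return True
        if status == "ERROR":
            return False
        time.sleep(wait_interval_seconds)
    # Ran out of iterations: the caller should cancel the execution.
    raise RuntimeError("DataSync task execution did not complete in time")
```

Raising (rather than returning) on exhausted iterations matters here, because it routes the stuck-execution case through the same cancel-and-re-raise path described in this issue.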

closes: #11011