
AWS ECS Executor #34381

Merged
merged 62 commits into from
Oct 25, 2023

Conversation

o-nikolas
Contributor

Overview

Introducing the AWS ECS executor!

Over the past couple of months, @ferruzzi, @syedahsn, and I have been hard at work on an AWS ECS executor (if you haven't seen us as much in the community, this is why 😅), based on an initial contribution from @aelzeiny.

From the headline sentence in the README:

This is an Airflow executor powered by Amazon Elastic Container Service (ECS). Each task that Airflow schedules for execution is run within its own ECS container.

This is an initial release with most of the basic functionality in place. We have many future upgrades, CLI commands, and other features planned, so stay tuned!

Review Notes

This is certainly a large PR, which in some ways is not ideal. However, it can be difficult to release portions of a component like this. We've tried to scope all the code here to the minimum required for folks to begin using this executor.

When reviewing, I don't think it's required to read every single line of code. I have annotated most modules in the diff view with comments explaining the changes. Please take a look at the portions you think are relevant or that you have some experience in.

Ultimately, (almost) all the code is scoped to the Amazon provider package and it is a net new component, so there is a very limited blast radius; few existing user workflows or code paths should be affected. Included in the list below are the areas of existing code that have been updated, which have the possibility of affecting user workflows.

Some specific changes to pay particular attention to:

  • The new logging environment variable that enables task logs to work correctly in containerized executors. K8s has an existing approach for this, which we followed quite closely while making it generic. However, we did not convert K8s to this new mechanism, to keep the change set minimal and reduce blast radius.
  • Updates to boto user agent tagging in the Base Aws hook.
  • A config yaml is now present in the AWS provider, leveraging @potiuk's changes. This is not only new code but also the first time we're using this system, so it's worth reviewing carefully.
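The logging flag mentioned in the first bullet can be sketched as follows. This is an illustrative sketch only: the environment variable name and function names below are assumptions, not the actual identifiers introduced in this PR.

```python
import os

# Assumed name for illustration -- the PR's actual variable may differ. The idea:
# a generic flag (analogous to AIRFLOW_IS_K8S_EXECUTOR_POD) that any
# container-based executor can set so the task-runner process knows to ship
# logs to stdout, where the container runtime (e.g. CloudWatch) collects them.
EXECUTOR_CONTAINER_ENV_VAR = "AIRFLOW_IS_EXECUTOR_CONTAINER"

def running_in_executor_container() -> bool:
    """Return True when the task runs inside an executor-managed container."""
    return os.environ.get(EXECUTOR_CONTAINER_ENV_VAR, "").lower() in ("true", "1")

def configure_task_logging() -> str:
    # In a container, emit logs to stdout so the container log driver can
    # collect them; otherwise fall back to the usual file-based handler.
    return "stdout" if running_in_executor_container() else "file"
```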

Testing

There is extensive unit testing which has near 100% line coverage in most cases:
[Screenshot: unit test coverage report, 2023-09-14]

We also performed a lot of manual UAT for things such as:

  • Load testing (500 concurrent tasks, reaching the limit imposed by ECS, see the performance and tuning section of the README)
  • Different platform versions of ECS Fargate
  • Multiple ways to provide AWS credentials to the executor (built into the image, using Airflow connection, etc)
  • Deferrable operators
  • Dynamic task mapping
  • Data Driven Scheduling


aelzeiny and others added 30 commits September 13, 2023 16:48
…eExecutor and BatchExecutor launches Airflow tasks on AWS ECS/Fargate service and the AWS Batch service, respectively.
Run static checks and make all the obvious fixes.   Some will still fail until we have a better understanding of the code.
I have not yet checked for coverage or quality. I did the minimum required to: convert the tests, make sure they still pass, and get static checks happy.
Also moved the Executor logic into __init__ for shorter and more logical import statements and updated relevant unit tests.
…references to ECS which meant EC2

Also made some minor docstring tweaks
We do not validate the format of the returned json anywhere else, removing for consistency.  It may not be a terrible idea to implement that universally at a later date.
There is an existing env var AIRFLOW_IS_K8S_EXECUTOR_POD which the
kubernetes executor and pod operator use to enable logging, among other
things. This CR adds a similar, generic, env var that new container
based executors can use.

Note: This change intentionally does _not_ deprecate AIRFLOW_IS_K8S_EXECUTOR_POD
since it is leveraged for other usecases.
Added conn_id and alphabetized the entries for easier use.
Boto3 is a particularly slow import, so import that locally.

Also move helper classes out of main executor module to utils. This
isn't strictly necessary for import speed optimization, but I prefer
this logical separation.
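The lazy-import pattern described in this commit looks roughly like the sketch below. `get_ecs_client` is an illustrative name, not the executor's actual code; the second function demonstrates the same idea with a stdlib module so it is easy to verify.

```python
import sys

def get_ecs_client(region_name: str):
    # boto3 is imported inside the function rather than at module top level:
    # importing boto3 is slow, and executor modules get imported by the
    # Airflow CLI even for commands that never talk to AWS.
    import boto3  # deferred on purpose

    return boto3.client("ecs", region_name=region_name)

def parse_payload(raw: str):
    # Same lazy-import idea with a stdlib module: the import cost is only
    # paid when parsing actually happens.
    import json
    return json.loads(raw)
```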
The logging section is the only thing fleshed out so far
Also implemented AWS_CONN_ID config setting as it is used for the Hook.
Also include a bullet for updating the airflow.cfg
The subnet config test will fail when run with the other tests in the
module (e.g. just running pytest on the whole test module). This is
because Python only loads a module once: the ecs config module runs
its code and sets its values at the module level, so its behaviour is
unchanged for the remainder of the Python session. The module must
therefore be reloaded to test any changes.
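The reload behaviour described above can be demonstrated with a throwaway module (the module name and env var below are made up for illustration):

```python
import importlib
import os
import sys
import tempfile

# A stand-in for the real ecs config module: it reads the environment once,
# at import time, and freezes the result at module level.
tmp = tempfile.mkdtemp()
with open(os.path.join(tmp, "ecs_cfg.py"), "w") as f:
    f.write("import os\nSUBNETS = os.environ.get('ECS_SUBNETS', 'none')\n")
sys.path.insert(0, tmp)

os.environ["ECS_SUBNETS"] = "subnet-a"
import ecs_cfg
first = ecs_cfg.SUBNETS   # 'subnet-a'

os.environ["ECS_SUBNETS"] = "subnet-b"
second = ecs_cfg.SUBNETS  # still 'subnet-a': module code ran only once

importlib.reload(ecs_cfg)  # re-executes module-level code
third = ecs_cfg.SUBNETS    # 'subnet-b'
```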
The attribute is expected to be a boto client, but was recently
changed to be a hook, which caused all api requests to fail. Updated the
attribute to be the Hook's conn (which is all we care about).

While I was there I moved all the initialization from start() to
__init__(). I don't see any reason we need to keep them separate.
Also fixes a wonky dependency on Region to make it more generic and adds an extensive docstring explaining why it is done this way.
If we don't catch exceptions in those interface methods, they bubble up
and kill the scheduler process itself.

Also catch any exception during the attempt to actually run tasks: if
there is any failure, catch and log it, and add the task back to the
pending queue.
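The defensive pattern described above might look like this sketch (the class, method names, and simulated failure are illustrative, not the executor's actual code):

```python
import logging
from collections import deque

logger = logging.getLogger(__name__)

class EcsExecutorSketch:
    """Illustration of catching launch failures so they never reach the scheduler."""

    def __init__(self):
        self.pending_tasks = deque()

    def attempt_task_runs(self):
        # Drain the queue; any failure to launch must not escape this method,
        # or it would bubble up and kill the scheduler process itself.
        for _ in range(len(self.pending_tasks)):
            task = self.pending_tasks.popleft()
            try:
                self._run_task(task)
            except Exception:
                logger.exception("Failed to run task %s; re-queueing", task)
                self.pending_tasks.append(task)  # retried on the next sync

    def _run_task(self, task):
        # Placeholder for the boto3 run_task call.
        raise RuntimeError("simulated ECS API failure")
```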
Some issues included:
- Some config options which should be nested in embedded dicts/lists
  were left at the root of the run task kwargs dict (for example subnets)
- The run task kwargs "override" was being overridden by us, since name
  and command must be a specific value.
- Unit tests were not catching the above.
@ferruzzi
Contributor

working on the merge conflict now

@o-nikolas
Contributor Author

@jedcunningham @Taragolis @eladkal,

We've addressed the feedback and gotten the build green once again. If you folks have time to take another look, and perhaps approve if all looks good, that would be greatly appreciated.

We've noted the executor as experimental in the docs (thanks again for that suggestion @eladkal), so we can still iterate on it for a while, and we'd like to get the bulk of it merged soon.

Thanks!

@eladkal (Contributor) left a comment


LGTM!

@eladkal eladkal merged commit 5f4d2b5 into apache:main Oct 25, 2023
1 check passed
@eladkal
Contributor

eladkal commented Oct 25, 2023

🎉 🎉 🎉 🎉 🎉 🎉

@Taragolis
Contributor

Nice.

Still try to find time to actually try it 🤣

@potiuk
Member

potiuk commented Oct 25, 2023

Nice!

@mshober

mshober commented Nov 6, 2023

Hi @o-nikolas. Just noticed this PR today. I'd like to give my feedback on this.

I've been using a fork of @aelzeiny's ECS executor for my company which runs about 180k tasks per day on our Airflow environment. We don't use Fargate; it is far too slow and expensive for the amount of tasks that we run. The original ECS Executor was very coupled to Fargate so much of the work I did was related to removing the Fargate specifics.

I'd love to use Airflow's official ECS Executor instead of having to support my own, but the current implementation is not suitable for my use case (and likely others).

Here's a few features that are must-haves for me:

Supporting Capacity Providers

The ECS Executor in this PR does not support Capacity Providers. It requires that users specify a LaunchType, and if you're specifying a LaunchType then you cannot specify a Capacity Provider.

Using Capacity Providers is essential for organizations that use EC2-hosted ECS Clusters. If I recall correctly, running tasks with the EC2 LaunchType will result in launch failures if there is not enough capacity to run them, whereas tasks run with Capacity Providers will go into a Provisioning state until there is enough capacity available to run the task. Capacity Providers are also very cost efficient, since you can use them to dynamically scale your EC2 instances based on demand.
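The mutual exclusivity described above could be handled by a small kwargs builder like the sketch below. The function name is made up for illustration, but `launchType` and `capacityProviderStrategy` are real, mutually exclusive parameters of the ECS `RunTask` API.

```python
from typing import Optional

def build_run_task_kwargs(cluster: str, task_def: str,
                          launch_type: Optional[str] = None,
                          capacity_provider_strategy: Optional[list] = None) -> dict:
    # ECS rejects requests that specify both, so fail fast in config validation.
    if launch_type and capacity_provider_strategy:
        raise ValueError("launchType and capacityProviderStrategy are mutually exclusive")
    kwargs = {"cluster": cluster, "taskDefinition": task_def}
    if capacity_provider_strategy:
        # Tasks queued via a capacity provider can sit in PROVISIONING until
        # auto-scaled EC2 capacity is available, instead of failing to launch.
        kwargs["capacityProviderStrategy"] = capacity_provider_strategy
    elif launch_type:
        kwargs["launchType"] = launch_type  # e.g. "FARGATE" or "EC2"
    return kwargs
```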

Overriding Additional ECS Task Properties

The executor config is scoped to overrides.containerOverrides. However there are relevant properties outside of overrides.containerOverrides that users may want to change.

For example, our ECS Cluster is actually composed of 3 capacity providers: A General-Purpose Capacity Provider (which is our cluster's default provider and runs on M7g instances), Memory-Optimized (R7g instances) and Compute-Optimized (C7g instances). My version of the ECS Executor allows users to set the appropriate Capacity Provider via the operator's executor_config param so that we can run our jobs in the most cost-efficient environment.

There are several other properties which Airflow users may want to set at the task level.
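One hypothetical way to support this (all names below are illustrative, not the executor's actual API) is to deep-merge the operator's `executor_config` into the default run_task kwargs, so any top-level property can be patched rather than only `overrides.containerOverrides`:

```python
from copy import deepcopy

def deep_merge(base: dict, patch: dict) -> dict:
    """Recursively overlay patch onto base without mutating either dict."""
    merged = deepcopy(base)
    for key, value in patch.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged

default_kwargs = {
    "cluster": "airflow",
    "overrides": {"containerOverrides": [{"name": "airflow-task"}]},
}
# A task that should land on the memory-optimized capacity provider:
executor_config = {
    "capacityProviderStrategy": [{"capacityProvider": "mem-optimized", "weight": 1}]
}
run_task_kwargs = deep_merge(default_kwargs, executor_config)
```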

Adopting Task Instances

This is a must-have for us, as our deployments replace our scheduler instances.

There is a PR for that feature in @aelzeiny's executor. I had to make some changes to get that working properly. I can assist on a PR for this feature.

Increasing Throughput

The ECS Executor calls the ECS RunTask API sequentially. In our current environment, this leads to a maximum throughput of roughly 4 tasks launched per second per scheduler instance. This can cause issues for larger Airflow environments like my own, for example:

  • During peak times tasks often spend a long period in the scheduled state despite there being available capacity in the environment.
  • Larger values for max_tis_per_query can lead to missed heartbeats due to the length of time the Executor spends calling the RunTask API.

I haven't had a chance to implement an improvement for this in my own executor yet, but my thinking was to incorporate the same sync_parallelism logic that is currently used for the CeleryExecutor.
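The sync_parallelism idea could be sketched with a thread pool, since each RunTask call is a blocking HTTP round trip. This is a sketch of the proposal, not the executor's implementation; `run_task_stub` stands in for the real boto3 call.

```python
from concurrent.futures import ThreadPoolExecutor

def run_task_stub(task_id: str) -> str:
    # Stand-in for ecs_client.run_task(...); returns an identifier on success.
    return f"arn:aws:ecs:::task/{task_id}"

def launch_all(task_ids, sync_parallelism: int = 8):
    # Launch up to sync_parallelism RunTask calls concurrently instead of
    # sequentially; pool.map preserves the input order of results.
    with ThreadPoolExecutor(max_workers=sync_parallelism) as pool:
        return list(pool.map(run_task_stub, task_ids))
```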

@Taragolis
Contributor

@mshober

This executor is experimental and will work with Airflow 2.8 (not released yet).

I'd love to use Airflow's official ECS Executor instead of having to support my own, but the current implementation is not suitable for my use case (and likely others).

The nice part is that this executor is part of a community provider, so everyone can propose changes and directly contribute improvements by making a PR.

@o-nikolas
Contributor Author

I'd love to use Airflow's official ECS Executor instead of having to support my own, but the current implementation is not suitable for my use case (and likely others).

Hey thanks for the feedback! We're working on adding more features to this executor and we welcome any PRs for code changes that you've made which you find are working well for you and your organization :)

Supporting Capacity Providers

The ECS Executor in this PR does not support Capacity Providers. It requires that users specify a LaunchType, and if you're specifying a LaunchType then you cannot specify a Capacity Provider.

This is a great request and should be a good first issue; I'll cut a GitHub issue for it. If you have code for it, feel free to submit a PR.

The executor config is scoped to overrides.containerOverrides. However there are relevant properties outside of overrides.containerOverrides that users may want to change.

For example, our ECS Cluster is actually composed of 3 capacity providers: A General-Purpose Capacity Provider (which is our cluster's default provider and runs on M7g instances), Memory-Optimized (R7g instances) and Compute-Optimized (C7g instances). My version of the ECS Executor allows users to set the appropriate Capacity Provider via the operator's executor_config param so that we can run our jobs in the most cost-efficient environment.

This is also a good request, and we should make the change while the executor is experimental and we can still adjust that behaviour easily. I'll cut a GitHub issue for it. If you have code for it, feel free to submit a PR.

Adopting Task Instances

This is a must-have for us, as our deployments replace our scheduler instances.

There is a PR for that feature in @aelzeiny's executor. I had to make some changes to get that working properly. I can assist on a PR for this feature.

Would definitely appreciate a PR!

Increasing Throughput

The ECS Executor calls the ECS RunTask API sequentially. In our current environment, this leads to a maximum throughput of roughly 4 tasks launched per second per scheduler instance. This can cause issues for larger Airflow environments like my own, for example:

* During peak times tasks often spend a long period in the scheduled state despite there being available capacity in the environment.

* Larger values for `max_tis_per_query` can lead to missed heartbeats due to the length of time the Executor spends calling the RunTask API.

I haven't had a chance to implement an improvement for this in my own executor yet, but my thinking was to incorporate the same sync_parallelism logic that is currently used for the CeleryExecutor.

Indeed, some performance tuning can be done with some of Airflow's knobs. I personally have gotten good results scheduling 500 tasks in a few minutes by increasing max_tis_per_query and relaxing the scheduler heartbeat a little, along with some other configs. But getting double-digit tasks scheduled per second for those very, very large scale deployments will indeed likely require some code changes to the executor (again, PRs welcome 😀).

@mshober

mshober commented Nov 10, 2023

Thanks so much for your feedback @o-nikolas.

I'll hopefully have some time this weekend to tackle some of those tasks.
