@Shlomit-B
Contributor

@Shlomit-B Shlomit-B commented Jul 3, 2025

Summary

Adds full support for AWS Systems Manager (SSM) Run Command in Apache Airflow, including:

  • SsmRunCommandOperator — sends commands to resources via SSM

  • SsmRunCommandCompletedSensor — waits for command completion using list_command_invocations.

    • Checks the status of every target resource for the given command.
    • Waits until all resources have completed the command successfully.
    • Fails immediately if any resource reports a failure state, giving faster and clearer feedback on execution issues.
  • SsmRunCommandTrigger — enables deferrable execution

  • SsmCommandWaiter — internal waiter for command completion logic

  • Unit tests for each component: operator, sensor, trigger, and waiter

  • System test that exercises the full flow on a live EC2 instance

Also included

  • Refactored an EC2 task function into ec2_utils.py for reuse between tests

Sensor design choices and rationale

When implementing SsmRunCommandCompletedSensor, I considered several approaches for handling command status across multiple target resources:

  • Checking only one resource’s status was dismissed early on. In most cases, users expect the sensor to reflect the overall success of the command. If even one instance fails, it's likely the user wouldn’t want downstream tasks to continue.
  • (Chosen) Fail immediately if any resource fails: This approach provides faster feedback and aligns with the principle of failing fast. It prevents downstream tasks from executing when even a single resource didn’t succeed, which reflects real-world expectations in multi-instance workflows.

This logic improves reliability and better aligns with common use cases for SSM Run Command in production environments.
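
The fail-fast rule described above can be sketched as a pure function over invocation records. This is a minimal illustration under stated assumptions, not the merged implementation: the names `all_invocations_succeeded` and `CommandInvocationFailed` are hypothetical, the grouping of failure states is an assumption, and treating an empty result as "keep waiting" is also an assumption. The status strings and the `InstanceId`/`Status` keys follow the shape of SSM command invocation records.

```python
from typing import Iterable, Mapping

# Status values documented for SSM command invocations.
# Assumption: these three are the ones treated as failures.
FAILURE_STATES = frozenset({"Failed", "Cancelled", "TimedOut"})


class CommandInvocationFailed(Exception):
    """Raised when at least one target reports a failure state (hypothetical name)."""


def all_invocations_succeeded(invocations: Iterable[Mapping[str, str]]) -> bool:
    """Return True when every invocation succeeded, False to keep waiting.

    Raises CommandInvocationFailed as soon as any invocation has failed.
    """
    invocations = list(invocations)
    for invocation in invocations:
        if invocation["Status"] in FAILURE_STATES:
            raise CommandInvocationFailed(
                f"{invocation['InstanceId']} reported {invocation['Status']}"
            )
    # Assumption: an empty list means nothing has registered yet, so keep waiting.
    return bool(invocations) and all(i["Status"] == "Success" for i in invocations)
```

In a sensor, the input would presumably be the `CommandInvocations` list from a `list_command_invocations(CommandId=...)` response, with the boolean result used as the poke outcome.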

Open question for reviewers

Would it make sense to make the sensor behavior configurable?
For example, by adding a parameter like fail_on_any_failure=True to let users choose whether to fail immediately or wait for all command invocations to complete before deciding.

Currently, the sensor always fails as soon as one resource reports failure.
This seems like the safer default, but I’m open to making it user-configurable if that would better serve broader use cases.
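
To make the question concrete, here is a rough sketch of how a `fail_on_any_failure` flag could change the decision. Everything in it is hypothetical: the function name `poke_result`, the use of `RuntimeError`, and the choice of which statuses count as failures or as terminal are assumptions for illustration only.

```python
# Assumption: these statuses are treated as failures; "Success" is the only
# non-failure terminal state.
FAILURE_STATES = frozenset({"Failed", "Cancelled", "TimedOut"})
TERMINAL_STATES = FAILURE_STATES | {"Success"}


def poke_result(statuses: list[str], fail_on_any_failure: bool = True) -> bool:
    """Return True when done, False to keep waiting, or raise on failure.

    With fail_on_any_failure=True, the first failed status raises at once.
    With fail_on_any_failure=False, failures are only raised after every
    invocation has reached a terminal state.
    """
    failed = [s for s in statuses if s in FAILURE_STATES]
    all_terminal = bool(statuses) and all(s in TERMINAL_STATES for s in statuses)
    if failed and (fail_on_any_failure or all_terminal):
        raise RuntimeError(f"{len(failed)} invocation(s) failed: {failed}")
    return all_terminal
```

With the flag set to `False`, a mixed list such as `["Failed", "InProgress"]` keeps the sensor waiting instead of failing right away; the failure is still reported once the remaining invocations finish.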

closes: #42619


Read the Pull Request Guidelines for more information.
In case of fundamental code changes, an Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in a newsfragment file, named {pr_number}.significant.rst or {issue_number}.significant.rst, in airflow-core/newsfragments.

@boring-cyborg

boring-cyborg bot commented Jul 3, 2025

Congratulations on your first Pull Request and welcome to the Apache Airflow community! If you have any issues or are unsure about anything, please check our Contributors' Guide (https://github.com/apache/airflow/blob/main/contributing-docs/README.rst).
Here are some useful points:

  • Pay attention to the quality of your code (ruff, mypy and type annotations). Our pre-commits will help you with that.
  • In case of a new feature, add useful documentation (in docstrings or in the docs/ directory). Adding a new operator? Check this short guide. Consider adding an example DAG that shows how users should use it.
  • Consider using Breeze environment for testing locally, it's a heavy docker but it ships with a working Airflow and a lot of integrations.
  • Be patient and persistent. It might take some time to get a review or get the final approval from Committers.
  • Please follow ASF Code of Conduct for all communication including (but not limited to) comments on Pull Requests, Mailing list and Slack.
  • Be sure to read the Airflow Coding style.
  • Always keep your Pull Requests rebased, otherwise your build might fail due to changes not related to your commits.

Apache Airflow is a community-driven project and together we are making it better 🚀.
In case of doubts contact the developers at:
Mailing List: dev@airflow.apache.org
Slack: https://s.apache.org/airflow-slack

@Shlomit-B Shlomit-B mentioned this pull request Jul 3, 2025
@eladkal eladkal requested review from ferruzzi and vincbeck July 3, 2025 12:07
@Shlomit-B
Contributor Author

Hey @vincbeck, just a gentle reminder on this one — let me know if there's anything else you'd like me to adjust.

@vincbeck
Contributor

> Hey @vincbeck, just a gentle reminder on this one — let me know if there's anything else you'd like me to adjust.

Sorry I missed that! LGTM :)

@vincbeck vincbeck merged commit 1237634 into apache:main Jul 25, 2025
3 checks passed
@boring-cyborg

boring-cyborg bot commented Jul 25, 2025

Awesome work, congrats on your first merged pull request! You are invited to check our Issue Tracker for additional contributions.

@seanghaeli
Contributor

Hey @Shlomit-B, any context on how I can run this test? What resources do I need?

@Shlomit-B
Contributor Author

> Hey @Shlomit-B, any context on how I can run this test? What resources do I need?

Hey, to run this test you’ll need:

  1. An IAM role with the AmazonSSMManagedInstanceCore policy attached and a trust policy allowing ec2.amazonaws.com to assume the role
  2. An AWS user with permissions for EC2, IAM (instance profiles), and SSM

Then pass the role ARN via the system test context variable ROLE_ARN
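
For reference, the trust policy from item 1 can be sketched as a small Python helper that builds the policy document. This is only an illustration of the standard shape for letting EC2 assume a role; the helper name `ec2_trust_policy` is hypothetical and is not part of the system test.

```python
import json

# ARN of the AWS managed policy named in item 1.
SSM_MANAGED_POLICY_ARN = "arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore"


def ec2_trust_policy() -> dict:
    """Build a trust policy that lets the EC2 service assume the role.

    Hypothetical helper for illustration only.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": "ec2.amazonaws.com"},
                "Action": "sts:AssumeRole",
            }
        ],
    }


if __name__ == "__main__":
    # Print the JSON document, e.g. to paste into the IAM console.
    print(json.dumps(ec2_trust_policy(), indent=2))
```

The printed JSON would be supplied as the role's trust (assume-role) policy, with the managed policy above attached to the same role.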

@seanghaeli
Contributor

Thanks @Shlomit-B, in case you're interested, we regularly run system tests on our testing infrastructure and you can see the status of the ssm test updated daily here

Development

Successfully merging this pull request may close these issues.

Add SSMRunCommandOperator

4 participants