@ksharlandjiev
Contributor

Problem

SSM operators currently fail when commands return non-zero exit codes, making it impossible to:

  • Route workflows based on different exit codes
  • Handle commands where non-zero exit codes represent valid business states (e.g., partial success, warnings)
  • Implement conditional retry logic based on specific exit codes
  • Migrate from traditional schedulers like Autosys that support exit code routing

Users have been forced to implement manual polling workarounds with custom Python tasks to handle these scenarios.

Proposal

Add a fail_on_nonzero_exit parameter (default: True) to SsmRunCommandOperator, SsmRunCommandCompletedSensor, and SsmRunCommandTrigger.

When set to False:

  • Tasks complete successfully regardless of command exit codes
  • Exit codes can be retrieved with SsmGetCommandInvocationOperator for routing decisions
  • AWS-level failures (TimedOut, Cancelled) still raise exceptions
  • Command-level failures (non-zero exit codes) are tolerated

The default value of True maintains existing behavior for backward compatibility.
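
The intended semantics can be sketched in plain Python. This is only an illustration of the behavior described above, not the provider's implementation: `resolve_outcome`, `CommandFailedError`, and the returned labels are hypothetical names, and the sketch covers only the terminal statuses named in this description.

```python
# Sketch only: models the behavior described above, not the provider's code.

# Statuses the description names as AWS-level failures.
AWS_LEVEL_FAILURES = frozenset({"TimedOut", "Cancelled"})


class CommandFailedError(RuntimeError):
    """Hypothetical stand-in for the exception the task would raise."""


def resolve_outcome(status: str, fail_on_nonzero_exit: bool = True) -> str:
    """Map a terminal SSM command status to a task outcome, or raise."""
    if status == "Success":
        return "success"
    if status in AWS_LEVEL_FAILURES:
        # AWS-level failures raise regardless of the flag.
        raise CommandFailedError(f"AWS-level failure: {status}")
    if status == "Failed":
        if fail_on_nonzero_exit:
            # Default: existing behavior, the task fails.
            raise CommandFailedError("Command returned a non-zero exit code")
        # Tolerated: the task completes and the exit code can be inspected later.
        return "tolerated"
    raise ValueError(f"Status not covered by this sketch: {status}")
```

With the default, `resolve_outcome("Failed")` raises. With `fail_on_nonzero_exit=False` it returns `"tolerated"`, while `"TimedOut"` and `"Cancelled"` raise in both modes.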

Kamen Sharlandjiev added 6 commits November 3, 2025 12:04
Add a new fail_on_nonzero_exit parameter to SsmRunCommandOperator
and SsmRunCommandTrigger to allow workflows to continue when SSM
commands return non-zero exit codes, enabling exit-code-based
workflow routing.

When fail_on_nonzero_exit=False:
- Command-level failures (non-zero exit codes) are tolerated
- AWS-level failures (Cancelled, TimedOut) still raise exceptions
- Helpful log messages indicate the command status and exit code

The parameter defaults to True to maintain backward compatibility
with existing DAGs. This change supports both traditional (sync)
and deferrable (async) execution modes.

Enhance SSM sensor and trigger to support exit code routing patterns
by adding a fail_on_nonzero_exit parameter that allows DAGs to handle
command-level failures gracefully while still failing on AWS-level errors.

Changes:
- Add fail_on_nonzero_exit parameter to SsmSensor with default True
- Add fail_on_nonzero_exit parameter to SsmRunCommandTrigger with default True
- Add parameter to trigger's serialized_fields for proper serialization
- Implement inline status checks to distinguish AWS-level vs command-level failures
- AWS-level failures (Cancelled, TimedOut) always raise exceptions
- Command-level failures (Failed with exit code) are tolerated when fail_on_nonzero_exit=False
- Add comprehensive logging for both error and success scenarios in enhanced mode
- Maintain 100% backward compatibility (default behavior unchanged)

Add documentation and tests to clarify that
SsmGetCommandInvocationOperator provides all functionality
needed for exit code-based workflow routing.
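
As a rough illustration of exit code-based routing, the sketch below maps exit codes to downstream task ids. It assumes each invocation is a dict with `instance_id` and `response_code` keys (the `response_code` field is named in this commit message; the rest of the shape is an assumption). The route table and task ids are hypothetical and do not come from the PR.

```python
from __future__ import annotations

# Hypothetical mapping from exit code to downstream task id.
ROUTES = {0: "on_success", 1: "on_warning", 2: "on_partial_success"}
DEFAULT_ROUTE = "on_unexpected_exit_code"


def choose_branch(response_code: int) -> str:
    """Pick the downstream task id for a single exit code."""
    return ROUTES.get(response_code, DEFAULT_ROUTE)


def route_invocations(invocations: list[dict]) -> dict[str, str]:
    """Pick a branch per instance for a multi-instance command."""
    return {
        inv["instance_id"]: choose_branch(inv["response_code"])
        for inv in invocations
    }
```

In a DAG, a function like `choose_branch` could serve as the callable of a branching task that reads the invocation result from the upstream task's return value.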

Changes:
- Enhanced operator docstring with exit code routing use case examples
- Added workflow pattern documentation showing integration with enhanced mode
- Documented return value structure highlighting response_code field
- Added test_exit_code_routing_use_case to demonstrate custom exit code retrieval
- Added test_multiple_exit_codes_for_routing for multi-instance scenarios

Add detailed documentation for the enhanced exit code handling feature
in SSM operators and sensors, including usage patterns, migration guide,
and best practices.

The new documentation covers:
- Overview of enhanced vs traditional operational modes
- Four usage patterns with code examples (async, sync, routing, traditional)
- Complete parameter reference with behavior tables
- Migration guide from manual polling anti-patterns
- Best practices for different use cases
- Common use cases and troubleshooting guidance
Contributor

@vincbeck vincbeck left a comment

My general comments:

  • I think the idea is great; this is indeed a feature users might want. Thanks for creating a PR for it.
  • Creating documentation is great; however, I think the document is way too long. This is only my personal opinion, so I would wait to see what others think, but if we create documentation that big for a single new parameter, it will be impossible to maintain. Using AI to create code and/or documentation is great, but we should keep in mind that longer is NOT better. Again, I support the documentation, but this is way too big to me. As a user I probably won't read it all, and as a developer I am scared that we will need to maintain it.
  • The system test is a great idea. Could you please move these 3 examples into the current system test?

@ksharlandjiev
Contributor Author

> My general comments:
>
> * I think the idea is great; this is indeed a feature users might want. Thanks for creating a PR for it.
> * Creating documentation is great; however, I think the document is way too long. This is only my personal opinion, so I would wait to see what others think, but if we create documentation that big for a single new parameter, it will be impossible to maintain. Using AI to create code and/or documentation is great, but we should keep in mind that longer is NOT better. Again, I support the documentation, but this is way too big to me. As a user I probably won't read it all, and as a developer I am scared that we will need to maintain it.
> * The system test is a great idea. Could you please move these 3 examples into the current system test?

Thanks for your feedback. I was on the fence myself on the extra docs, and I understand the concern. I'm happy to move all documented patterns to an external article.

@ksharlandjiev ksharlandjiev marked this pull request as draft November 4, 2025 16:45
Contributor

@o-nikolas o-nikolas left a comment

I've just looked through the docs and source code so far (not tests yet). I like the shortened docs.

Just one question: Can this be done today with trigger rules?

Kamen Sharlandjiev and others added 9 commits November 5, 2025 09:04
…hook method

- Add `is_aws_level_failure()` static method to SsmHook to centralize AWS-level failure detection logic
- Update SsmRunCommandOperator to use the new hook method instead of inline status checks
- Update SsmRunCommandCompletedSensor to use the new hook method for consistency
- Update SsmRunCommandTrigger to use the new hook method for consistency
- Add comprehensive unit tests for `is_aws_level_failure()` covering all SSM command statuses
- Improve code maintainability by eliminating duplicate failure detection logic across operators, sensors, and triggers
- Clarify distinction between AWS-level failures (Cancelled, TimedOut, Cancelling) and command-level failures (Failed)
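
A minimal sketch of what such a centralized check could look like, assuming the status set listed in this commit message (Cancelled, TimedOut, Cancelling). `SsmHookSketch` and the constant name are hypothetical; the real `SsmHook` method may differ in signature and details.

```python
class SsmHookSketch:
    """Hypothetical stand-in for SsmHook, showing only the failure check."""

    # Statuses this commit message classifies as AWS-level failures.
    AWS_LEVEL_FAILURE_STATUSES = frozenset({"Cancelled", "TimedOut", "Cancelling"})

    @staticmethod
    def is_aws_level_failure(status: str) -> bool:
        """Return True for AWS-level failures, False for anything else.

        A "Failed" status is a command-level failure, so it returns False.
        """
        return status in SsmHookSketch.AWS_LEVEL_FAILURE_STATUSES
```

Keeping the status set in one place means the operator, sensor, and trigger can all call the same check instead of repeating inline comparisons.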
… AWS SSM status value as per AWS documentation.
@ksharlandjiev ksharlandjiev marked this pull request as ready for review December 8, 2025 15:10
Contributor

@o-nikolas o-nikolas left a comment

Have you run the system test to ensure that it's working correctly?

@ksharlandjiev
Contributor Author

ksharlandjiev commented Dec 10, 2025

> Have you run the system test to ensure that it's working correctly?

Thanks for the approval! I’ve added a few additional tests to the system test to cover this change, following @vincbeck’s feedback, and I can confirm that everything executes successfully.

Contributor

@vincbeck vincbeck left a comment

One nit but overall looks good!

Co-authored-by: Vincent <97131062+vincbeck@users.noreply.github.com>
@vincbeck vincbeck merged commit f622936 into apache:main Jan 7, 2026
127 checks passed
@ksharlandjiev ksharlandjiev deleted the ssm-exit-code-handling branch January 7, 2026 16:57
dstandish pushed a commit to astronomer/airflow that referenced this pull request Jan 7, 2026
…ting (apache#57753)

* Add fail_on_nonzero_exit parameter to SsmRunCommandOperator

Add a new fail_on_nonzero_exit parameter to SsmRunCommandOperator
and SsmRunCommandTrigger to allow workflows to continue when SSM
commands return non-zero exit codes, enabling exit-code-based
workflow routing.

When fail_on_nonzero_exit=False:
- Command-level failures (non-zero exit codes) are tolerated
- AWS-level failures (Cancelled, TimedOut) still raise exceptions
- Helpful log messages indicate the command status and exit code

The parameter defaults to True to maintain backward compatibility
with existing DAGs. This change supports both traditional (sync)
and deferrable (async) execution modes.

* Add fail_on_nonzero_exit parameter to SSM sensor and trigger

Enhance SSM sensor and trigger to support exit code routing patterns
by adding a fail_on_nonzero_exit parameter that allows DAGs to handle
command-level failures gracefully while still failing on AWS-level errors.

Changes:
- Add fail_on_nonzero_exit parameter to SsmSensor with default True
- Add fail_on_nonzero_exit parameter to SsmRunCommandTrigger with default True
- Add parameter to trigger's serialized_fields for proper serialization
- Implement inline status checks to distinguish AWS-level vs command-level failures
- AWS-level failures (Cancelled, TimedOut) always raise exceptions
- Command-level failures (Failed with exit code) are tolerated when fail_on_nonzero_exit=False
- Add comprehensive logging for both error and success scenarios in enhanced mode
- Maintain 100% backward compatibility (default behavior unchanged)

* Document SsmGetCommandInvocationOperator for exit code routing

Add documentation and tests to clarify that
SsmGetCommandInvocationOperator provides all functionality
needed for exit code-based workflow routing.

Changes:
- Enhanced operator docstring with exit code routing use case examples
- Added workflow pattern documentation showing integration with enhanced mode
- Documented return value structure highlighting response_code field
- Added test_exit_code_routing_use_case to demonstrate custom exit code retrieval
- Added test_multiple_exit_codes_for_routing for multi-instance scenarios

* Add documentation for SSM exit code handling

Add detailed documentation for the enhanced exit code handling feature
in SSM operators and sensors, including usage patterns, migration guide,
and best practices.

The new documentation covers:
- Overview of enhanced vs traditional operational modes
- Four usage patterns with code examples (async, sync, routing, traditional)
- Complete parameter reference with behavior tables
- Migration guide from manual polling anti-patterns
- Best practices for different use cases
- Common use cases and troubleshooting guidance

* Add unit tests for SSM exit code handling enhancements

* ruff fix

* Add documentation to provider.yaml ; Fix spelling mistakes.

* Consolidate SSM exit code documentation into main SSM doc

* Consolidate SSM exit code tests into main system test

* Remove ssm_exit_codes.rst reference from provider.yaml

* reducing volume of docs and structure in docs

* fix empty line with trailing whitespace.

* refactor(aws/ssm): Extract AWS-level failure detection into reusable hook method

- Add `is_aws_level_failure()` static method to SsmHook to centralize AWS-level failure detection logic
- Update SsmRunCommandOperator to use the new hook method instead of inline status checks
- Update SsmRunCommandCompletedSensor to use the new hook method for consistency
- Update SsmRunCommandTrigger to use the new hook method for consistency
- Add comprehensive unit tests for `is_aws_level_failure()` covering all SSM command statuses
- Improve code maintainability by eliminating duplicate failure detection logic across operators, sensors, and triggers
- Clarify distinction between AWS-level failures (Cancelled, TimedOut, Cancelling) and command-level failures (Failed)

* fix MyPy checks

* Added "Cancelling" to the spelling wordlist. "Cancelling" is official AWS SSM status value as per AWS documentation.

* Update providers/amazon/docs/operators/ssm.rst

Co-authored-by: Vincent <97131062+vincbeck@users.noreply.github.com>

* ruff fix

---------

Co-authored-by: Kamen Sharlandjiev <awskamen@amazon.co.uk>
Co-authored-by: Vincent <97131062+vincbeck@users.noreply.github.com>
chirodip98 pushed a commit to chirodip98/airflow-contrib that referenced this pull request Jan 9, 2026
…ting (apache#57753)

stegololz pushed a commit to stegololz/airflow that referenced this pull request Jan 9, 2026
…ting (apache#57753)
