Add fail_on_nonzero_exit parameter to SSM operators for exit code routing #57753

ksharlandjiev · 2025-11-03T15:54:01Z

Problem

SSM operators currently fail when commands return non-zero exit codes, making it impossible to:

* Route workflows based on different exit codes
* Handle commands where non-zero exit codes represent valid business states (e.g., partial success, warnings)
* Implement conditional retry logic based on specific exit codes
* Migrate from traditional schedulers like Autosys that support exit code routing

Users have been forced to implement manual polling workarounds with custom Python tasks to handle these scenarios.

Proposal

Add a fail_on_nonzero_exit parameter (default: True) to SsmRunCommandOperator, SsmRunCommandCompletedSensor, and SsmRunCommandTrigger.

When set to False:

* Tasks complete successfully regardless of command exit codes
* Exit codes can be retrieved with SsmGetCommandInvocationOperator for routing decisions
* AWS-level failures (TimedOut, Cancelled) still raise exceptions
* Command-level failures (non-zero exit codes) are tolerated

The default value of True maintains existing behavior for backward compatibility.
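
The rule described above can be sketched in plain Python. This is an illustration only, not the provider's code: `AWS_LEVEL_FAILURES`, `CommandFailed`, and `check_status` are hypothetical names, and the status strings are the ones named in this description.

```python
# Hypothetical sketch of the proposed decision rule; not the provider's actual code.
AWS_LEVEL_FAILURES = {"TimedOut", "Cancelled"}  # statuses named in the proposal


class CommandFailed(Exception):
    """Raised when a command outcome should fail the task (illustrative name)."""


def check_status(status: str, response_code: int, fail_on_nonzero_exit: bool = True) -> int:
    """Return the exit code if the task may succeed, otherwise raise."""
    if status in AWS_LEVEL_FAILURES:
        # AWS-level failures always raise, whatever the flag says.
        raise CommandFailed(f"AWS-level failure: {status}")
    if status == "Failed" and fail_on_nonzero_exit:
        # Default behaviour: a non-zero exit code fails the task.
        raise CommandFailed(f"Command failed with exit code {response_code}")
    # Success, or a command-level failure that the caller chose to tolerate.
    return response_code
```

With the default flag, a `Failed` status raises as before; with `fail_on_nonzero_exit=False`, the same status returns the exit code so a later task can act on it.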

Add a new fail_on_nonzero_exit parameter to SsmRunCommandOperator and SsmRunCommandTrigger to allow workflows to continue when SSM commands return non-zero exit codes, enabling exit-code-based workflow routing.

When fail_on_nonzero_exit=False:
- Command-level failures (non-zero exit codes) are tolerated
- AWS-level failures (Cancelled, TimedOut) still raise exceptions
- Helpful log messages indicate the command status and exit code

The parameter defaults to True to maintain backward compatibility with existing DAGs. This change supports both traditional (sync) and deferrable (async) execution modes.

Enhance SSM sensor and trigger to support exit code routing patterns by adding a fail_on_nonzero_exit parameter that allows DAGs to handle command-level failures gracefully while still failing on AWS-level errors.

Changes:
- Add fail_on_nonzero_exit parameter to SsmSensor with default True
- Add fail_on_nonzero_exit parameter to SsmRunCommandTrigger with default True
- Add parameter to trigger's serialized_fields for proper serialization
- Implement inline status checks to distinguish AWS-level vs command-level failures
- AWS-level failures (Cancelled, TimedOut) always raise exceptions
- Command-level failures (Failed with exit code) are tolerated when fail_on_nonzero_exit=False
- Add comprehensive logging for both error and success scenarios in enhanced mode
- Maintain 100% backward compatibility (default behavior unchanged)

Add documentation and tests to clarify that SsmGetCommandInvocationOperator provides all functionality needed for exit code-based workflow routing.

Changes:
- Enhanced operator docstring with exit code routing use case examples
- Added workflow pattern documentation showing integration with enhanced mode
- Documented return value structure highlighting response_code field
- Added test_exit_code_routing_use_case to demonstrate custom exit code retrieval
- Added test_multiple_exit_codes_for_routing for multi-instance scenarios
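
As a rough illustration of exit code routing, the sketch below maps a retrieved exit code to the id of the next task. It assumes each invocation is a dict carrying the `response_code` field mentioned above plus an `instance_id` key; that shape, the task ids, and the code-to-task mapping are all hypothetical and not taken from the provider.

```python
# Hypothetical routing helper; the task ids and the exit-code mapping are made up.
ROUTES = {0: "on_success", 1: "on_warning", 2: "on_partial_success"}


def choose_branch(invocation: dict, default: str = "on_unexpected_code") -> str:
    """Map one invocation's response_code to the id of the task to run next."""
    return ROUTES.get(invocation["response_code"], default)


def choose_branches(invocations: list[dict]) -> dict[str, str]:
    """Multi-instance case: one routing decision per instance."""
    return {inv["instance_id"]: choose_branch(inv) for inv in invocations}
```

A function like `choose_branch` could serve as the callable of a branching task placed after the invocation lookup; unknown codes fall through to a default branch instead of raising.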

Add detailed documentation for the enhanced exit code handling feature in SSM operators and sensors, including usage patterns, migration guide, and best practices.

The new documentation covers:
- Overview of enhanced vs traditional operational modes
- Four usage patterns with code examples (async, sync, routing, traditional)
- Complete parameter reference with behavior tables
- Migration guide from manual polling anti-patterns
- Best practices for different use cases
- Common use cases and troubleshooting guidance

vincbeck

My general comments:

* I think the idea is great; that is indeed a feature users might want. Thanks for creating a PR for that.
* Creating documentation is great; however, I think the document is way too long. This is only my personal opinion, so I would wait to see what others think, but if we create documentation that big for a new parameter, it will be impossible to maintain. Using AI to create code and/or documentation is great, but we should also keep in mind that longer is NOT better. Again, I support the documentation, but this is way too big to me. As a user I probably won't read it all, and as a developer I am scared that we need to maintain it.
* The system test is a great idea. Could you please move these 3 examples into the current system test?

ksharlandjiev · 2025-11-03T22:10:24Z

My general comments:

* I think the idea is great; that is indeed a feature users might want. Thanks for creating a PR for that.

* Creating documentation is great; however, I think the document is way too long. This is only my personal opinion, so I would wait to see what others think, but if we create documentation that big for a new parameter, it will be impossible to maintain. Using AI to create code and/or documentation is great, but we should also keep in mind that longer is NOT better. Again, I support the documentation, but this is way too big to me. As a user I probably won't read it all, and as a developer I am scared that we need to maintain it.

* The system test is a great idea. Could you please move these 3 examples into the current system test?

Thanks for your feedback. I was on the fence myself on the extra docs, and I understand the concern. I'm happy to move all documented patterns to an external article.

o-nikolas

I've just looked through the docs and source code so far (not tests yet). I like the shortened docs.

Just one question: Can this be done today with trigger rules?

providers/amazon/src/airflow/providers/amazon/aws/operators/ssm.py

providers/amazon/src/airflow/providers/amazon/aws/triggers/ssm.py

…hook method
- Add `is_aws_level_failure()` static method to SsmHook to centralize AWS-level failure detection logic
- Update SsmRunCommandOperator to use the new hook method instead of inline status checks
- Update SsmRunCommandCompletedSensor to use the new hook method for consistency
- Update SsmRunCommandTrigger to use the new hook method for consistency
- Add comprehensive unit tests for `is_aws_level_failure()` covering all SSM command statuses
- Improve code maintainability by eliminating duplicate failure detection logic across operators, sensors, and triggers
- Clarify distinction between AWS-level failures (Cancelled, TimedOut, Cancelling) and command-level failures (Failed)


o-nikolas

Have you run the system test to ensure that it's working correctly?

ksharlandjiev · 2025-12-10T23:55:54Z

Have you run the system test to ensure that it's working correctly?

Thanks for the approval! I’ve added a few additional tests to the system test to cover this change, following @vincbeck’s feedback, and I can confirm that everything executes successfully.

vincbeck

One nit but overall looks good!

providers/amazon/docs/operators/ssm.rst

Co-authored-by: Vincent <97131062+vincbeck@users.noreply.github.com>

…ting (apache#57753)

* Add fail_on_nonzero_exit parameter to SsmRunCommandOperator

Add a new fail_on_nonzero_exit parameter to SsmRunCommandOperator and SsmRunCommandTrigger to allow workflows to continue when SSM commands return non-zero exit codes, enabling exit-code-based workflow routing.

When fail_on_nonzero_exit=False:
- Command-level failures (non-zero exit codes) are tolerated
- AWS-level failures (Cancelled, TimedOut) still raise exceptions
- Helpful log messages indicate the command status and exit code

The parameter defaults to True to maintain backward compatibility with existing DAGs. This change supports both traditional (sync) and deferrable (async) execution modes.

* Add fail_on_nonzero_exit parameter to SSM sensor and trigger

Enhance SSM sensor and trigger to support exit code routing patterns by adding a fail_on_nonzero_exit parameter that allows DAGs to handle command-level failures gracefully while still failing on AWS-level errors.

Changes:
- Add fail_on_nonzero_exit parameter to SsmSensor with default True
- Add fail_on_nonzero_exit parameter to SsmRunCommandTrigger with default True
- Add parameter to trigger's serialized_fields for proper serialization
- Implement inline status checks to distinguish AWS-level vs command-level failures
- AWS-level failures (Cancelled, TimedOut) always raise exceptions
- Command-level failures (Failed with exit code) are tolerated when fail_on_nonzero_exit=False
- Add comprehensive logging for both error and success scenarios in enhanced mode
- Maintain 100% backward compatibility (default behavior unchanged)

* Document SsmGetCommandInvocationOperator for exit code routing

Add documentation and tests to clarify that SsmGetCommandInvocationOperator provides all functionality needed for exit code-based workflow routing.

Changes:
- Enhanced operator docstring with exit code routing use case examples
- Added workflow pattern documentation showing integration with enhanced mode
- Documented return value structure highlighting response_code field
- Added test_exit_code_routing_use_case to demonstrate custom exit code retrieval
- Added test_multiple_exit_codes_for_routing for multi-instance scenarios

* Add documentation for SSM exit code handling

Add detailed documentation for the enhanced exit code handling feature in SSM operators and sensors, including usage patterns, migration guide, and best practices.

The new documentation covers:
- Overview of enhanced vs traditional operational modes
- Four usage patterns with code examples (async, sync, routing, traditional)
- Complete parameter reference with behavior tables
- Migration guide from manual polling anti-patterns
- Best practices for different use cases
- Common use cases and troubleshooting guidance

* Add unit tests for SSM exit code handling enhancements

* ruff fix

* Add documentation to provider.yaml ; Fix spelling mistakes.

* Consolidate SSM exit code documentation into main SSM doc

* Consolidate SSM exit code tests into main system test

* Remove ssm_exit_codes.rst reference from provider.yaml

* reducing volume of docs and structure in docs

* fix empty line with trailing whitespace.

* refactor(aws/ssm): Extract AWS-level failure detection into reusable hook method

- Add `is_aws_level_failure()` static method to SsmHook to centralize AWS-level failure detection logic
- Update SsmRunCommandOperator to use the new hook method instead of inline status checks
- Update SsmRunCommandCompletedSensor to use the new hook method for consistency
- Update SsmRunCommandTrigger to use the new hook method for consistency
- Add comprehensive unit tests for `is_aws_level_failure()` covering all SSM command statuses
- Improve code maintainability by eliminating duplicate failure detection logic across operators, sensors, and triggers
- Clarify distinction between AWS-level failures (Cancelled, TimedOut, Cancelling) and command-level failures (Failed)

* fix MyPy checks

* Added "Cancelling" to the spelling wordlist. "Cancelling" is official AWS SSM status value as per AWS documentation.

* Update providers/amazon/docs/operators/ssm.rst

Co-authored-by: Vincent <97131062+vincbeck@users.noreply.github.com>

* ruff fix

---------

Co-authored-by: Kamen Sharlandjiev <awskamen@amazon.co.uk>
Co-authored-by: Vincent <97131062+vincbeck@users.noreply.github.com>

Kamen Sharlandjiev added 6 commits November 3, 2025 12:04

Add unit tests for SSM exit code handling enhancements

84ba52c

ruff fix

696008a

ksharlandjiev requested review from eladkal and o-nikolas as code owners November 3, 2025 15:54

boring-cyborg bot added area:providers kind:documentation provider:amazon AWS/Amazon - related issues labels Nov 3, 2025

Add documentation to provider.yaml ; Fix spelling mistakes.

c9e1a8f

vincbeck reviewed Nov 3, 2025


ksharlandjiev marked this pull request as draft November 4, 2025 16:45

Kamen Sharlandjiev added 3 commits November 4, 2025 22:13

Consolidate SSM exit code documentation into main SSM doc

cc9d014

Consolidate SSM exit code tests into main system test

2dc8108

Remove ssm_exit_codes.rst reference from provider.yaml

2e6b795

o-nikolas reviewed Nov 5, 2025


Kamen Sharlandjiev and others added 9 commits November 5, 2025 09:04

reducing volume of docs and structure in docs

ee7a4c8

fix empty line with trailing whitespace.

f492432

Merge branch 'main' into ssm-exit-code-handling

6f082bb

Merge branch 'apache:main' into ssm-exit-code-handling

f8450e1

Merge branch 'main' into ssm-exit-code-handling

2ef5597

fix MyPy checks

b018788

Merge remote-tracking branch 'refs/remotes/origin/ssm-exit-code-handl…

43dbb7a

…ing' into ssm-exit-code-handling

Added "Cancelling" to the spelling wordlist. "Cancelling" is official…

a6db3ac

… AWS SSM status value as per AWS documentation.

ksharlandjiev marked this pull request as ready for review December 8, 2025 15:10

o-nikolas approved these changes Dec 10, 2025


Merge branch 'main' into ssm-exit-code-handling

4483861

vincbeck reviewed Jan 7, 2026


providers/amazon/docs/operators/ssm.rst (outdated, resolved)

Update providers/amazon/docs/operators/ssm.rst

939a289

Co-authored-by: Vincent <97131062+vincbeck@users.noreply.github.com>

vincbeck approved these changes Jan 7, 2026


ruff fix

38aa9cb

vincbeck merged commit f622936 into apache:main Jan 7, 2026
127 checks passed

ksharlandjiev deleted the ssm-exit-code-handling branch January 7, 2026 16:57

jscheffl mentioned this pull request Jan 14, 2026

Status of testing Providers that were prepared on January 13th, 2026 #60496

Closed
