Add fail_on_nonzero_exit parameter to SSM operators for exit code routing #57753
Add a new fail_on_nonzero_exit parameter to SsmRunCommandOperator and SsmRunCommandTrigger to allow workflows to continue when SSM commands return non-zero exit codes, enabling exit-code-based workflow routing.

When fail_on_nonzero_exit=False:
- Command-level failures (non-zero exit codes) are tolerated
- AWS-level failures (Cancelled, TimedOut) still raise exceptions
- Helpful log messages indicate the command status and exit code

The parameter defaults to True to maintain backward compatibility with existing DAGs. This change supports both traditional (sync) and deferrable (async) execution modes.
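The decision described above can be sketched in plain Python. This is only an illustration of the stated behavior, not the provider's code: `should_raise` and `AWS_LEVEL_FAILURES` are hypothetical names, and treating any unrecognized terminal status as a failure is an assumption made here for safety.

```python
# Hypothetical stand-in for the failure decision; not the provider's implementation.
AWS_LEVEL_FAILURES = {"Cancelled", "TimedOut"}  # statuses named in this description


def should_raise(status: str, fail_on_nonzero_exit: bool = True) -> bool:
    """Return True if a finished command with this status should fail the task."""
    if status == "Success":
        return False
    if status in AWS_LEVEL_FAILURES:
        # AWS-level failures always raise, regardless of the flag.
        return True
    if status == "Failed":
        # Command-level failure (non-zero exit code): the flag decides.
        return fail_on_nonzero_exit
    # Assumption: fail conservatively on any status not covered above.
    return True


print(should_raise("Failed"))                                  # default behavior
print(should_raise("Failed", fail_on_nonzero_exit=False))      # tolerated
print(should_raise("TimedOut", fail_on_nonzero_exit=False))    # still raises
```

With the default of True, a "Failed" status behaves exactly as before, which is what keeps existing DAGs unaffected.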
Enhance SSM sensor and trigger to support exit code routing patterns by adding a fail_on_nonzero_exit parameter that allows DAGs to handle command-level failures gracefully while still failing on AWS-level errors.

Changes:
- Add fail_on_nonzero_exit parameter to SsmSensor with default True
- Add fail_on_nonzero_exit parameter to SsmRunCommandTrigger with default True
- Add parameter to trigger's serialized_fields for proper serialization
- Implement inline status checks to distinguish AWS-level vs command-level failures
- AWS-level failures (Cancelled, TimedOut) always raise exceptions
- Command-level failures (Failed with exit code) are tolerated when fail_on_nonzero_exit=False
- Add comprehensive logging for both error and success scenarios in enhanced mode
- Maintain 100% backward compatibility (default behavior unchanged)
Add documentation and tests to clarify that SsmGetCommandInvocationOperator provides all functionality needed for exit code-based workflow routing.

Changes:
- Enhanced operator docstring with exit code routing use case examples
- Added workflow pattern documentation showing integration with enhanced mode
- Documented return value structure highlighting response_code field
- Added test_exit_code_routing_use_case to demonstrate custom exit code retrieval
- Added test_multiple_exit_codes_for_routing for multi-instance scenarios
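As a rough illustration of routing on the response_code field, the sketch below maps each instance to a downstream task ID. The result layout (an "invocations" list with "instance_id" and "response_code" keys), the function `route_by_exit_code`, and all task IDs are assumptions made for this example; only the response_code field name comes from the description above.

```python
from typing import Any


def route_by_exit_code(
    result: dict[str, Any], routes: dict[int, str], default: str
) -> dict[str, str]:
    """Map each instance ID to a downstream task ID based on its exit code."""
    return {
        inv["instance_id"]: routes.get(inv["response_code"], default)
        for inv in result["invocations"]
    }


# Assumed shape of the operator's return value (hypothetical sample data).
sample = {
    "command_id": "cmd-123",
    "invocations": [
        {"instance_id": "i-aaa", "status": "Success", "response_code": 0},
        {"instance_id": "i-bbb", "status": "Failed", "response_code": 2},
        {"instance_id": "i-ccc", "status": "Failed", "response_code": 42},
    ],
}

decisions = route_by_exit_code(
    sample,
    routes={0: "continue_pipeline", 2: "retry_with_backoff"},
    default="notify_on_call",
)
print(decisions)
```

In a DAG, a function like this would typically sit inside a branching task that reads the operator's result from XCom; that wiring is left out to keep the sketch self-contained.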
Add detailed documentation for the enhanced exit code handling feature in SSM operators and sensors, including usage patterns, migration guide, and best practices.

The new documentation covers:
- Overview of enhanced vs traditional operational modes
- Four usage patterns with code examples (async, sync, routing, traditional)
- Complete parameter reference with behavior tables
- Migration guide from manual polling anti-patterns
- Best practices for different use cases
- Common use cases and troubleshooting guidance
vincbeck
left a comment
My general comments:
- I think the idea is great; this is indeed a feature users might want. Thanks for creating a PR for that.
- Creating documentation is great. However, I think the document is way too long. This is only my personal opinion, so I would wait to see what others think, but if we create documentation this big for every new parameter, it will be impossible to maintain. Using AI to create code and/or documentation is great, but we should also keep in mind that longer is NOT better. Again, I support the documentation, but this is way too big for me. As a user I probably won't read it all, and as a developer I am scared that we need to maintain it.
- The system test is a great idea. Could you please move these 3 examples into the current system test?
Thanks for your feedback. I was on the fence myself on the extra docs, and I understand the concern. I'm happy to move all documented patterns to an external article.
o-nikolas
left a comment
I've just looked through the docs and source code so far (not tests yet). I like the shortened docs.
Just one question: Can this be done today with trigger rules?
providers/amazon/src/airflow/providers/amazon/aws/operators/ssm.py
…hook method

- Add `is_aws_level_failure()` static method to SsmHook to centralize AWS-level failure detection logic
- Update SsmRunCommandOperator to use the new hook method instead of inline status checks
- Update SsmRunCommandCompletedSensor to use the new hook method for consistency
- Update SsmRunCommandTrigger to use the new hook method for consistency
- Add comprehensive unit tests for `is_aws_level_failure()` covering all SSM command statuses
- Improve code maintainability by eliminating duplicate failure detection logic across operators, sensors, and triggers
- Clarify distinction between AWS-level failures (Cancelled, TimedOut, Cancelling) and command-level failures (Failed)
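A minimal sketch of what such a helper might look like, assuming it is a simple membership check over the three statuses listed above. `SsmHookSketch` is a hypothetical stand-in class, not the real SsmHook, and the list of command statuses used to exercise it reflects my reading of the AWS documentation rather than anything stated in this PR.

```python
class SsmHookSketch:
    """Hypothetical stand-in; the real SsmHook lives in the Amazon provider."""

    # Statuses this refactor classifies as AWS-level failures.
    AWS_LEVEL_FAILURE_STATUSES = frozenset({"Cancelled", "TimedOut", "Cancelling"})

    @staticmethod
    def is_aws_level_failure(status: str) -> bool:
        """Return True for AWS-level failures, False for everything else."""
        return status in SsmHookSketch.AWS_LEVEL_FAILURE_STATUSES


# Assumed set of SSM command statuses, used only to exercise the helper.
ALL_STATUSES = [
    "Pending", "InProgress", "Delayed", "Success",
    "Cancelled", "TimedOut", "Failed", "Cancelling",
]
aws_level = [s for s in ALL_STATUSES if SsmHookSketch.is_aws_level_failure(s)]
print(aws_level)
```

Keeping the check in one place means the operator, sensor, and trigger cannot drift apart on which statuses always raise; "Failed" stays outside the set so that fail_on_nonzero_exit can govern it.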
…ing' into ssm-exit-code-handling
… AWS SSM status value as per AWS documentation.
o-nikolas
left a comment
Have you run the system test to ensure that it's working correctly?
Thanks for the approval! I’ve added a few additional tests to the system test to cover this change, following @vincbeck’s feedback, and I can confirm that everything executes successfully.
vincbeck
left a comment
One nit but overall looks good!
Co-authored-by: Vincent <97131062+vincbeck@users.noreply.github.com>
…ting (apache#57753)

* Add fail_on_nonzero_exit parameter to SsmRunCommandOperator
* Add fail_on_nonzero_exit parameter to SSM sensor and trigger
* Document SsmGetCommandInvocationOperator for exit code routing
* Add documentation for SSM exit code handling
* Add unit tests for SSM exit code handling enhancements
* ruff fix
* Add documentation to provider.yaml ; Fix spelling mistakes.
* Consolidate SSM exit code documentation into main SSM doc
* Consolidate SSM exit code tests into main system test
* Remove ssm_exit_codes.rst reference from provider.yaml
* reducing volume of docs and structure in docs
* fix empty line with trailing whitespace.
* refactor(aws/ssm): Extract AWS-level failure detection into reusable hook method
* fix MyPy checks
* Added "Cancelling" to the spelling wordlist. "Cancelling" is official AWS SSM status value as per AWS documentation.
* Update providers/amazon/docs/operators/ssm.rst
* ruff fix

Co-authored-by: Kamen Sharlandjiev <awskamen@amazon.co.uk>
Co-authored-by: Vincent <97131062+vincbeck@users.noreply.github.com>
Problem
SSM operators currently fail when commands return non-zero exit codes, making it impossible to let a workflow continue after such a command or to route it based on the exit code.
Users have been forced to implement manual polling workarounds with custom Python tasks to handle these scenarios.
Proposal
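Add a `fail_on_nonzero_exit` parameter (default: `True`) to `SsmRunCommandOperator`, `SsmRunCommandCompletedSensor`, and `SsmRunCommandTrigger`.

When set to `False`:
- Command-level failures (non-zero exit codes) are tolerated
- AWS-level failures (Cancelled, TimedOut) still raise exceptions
- The exit code can be retrieved with `SsmGetCommandInvocationOperator` for routing decisions

The default value of `True` maintains existing behavior for backward compatibility.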