Contributor

@HsiuChuanHsu HsiuChuanHsu commented Aug 4, 2025

Description

Add detailed logging of pod and container failure information when tasks fail, including pod phase, reasons, messages, and container states for better debugging.

Problem Statement

When tasks fail in KubernetesExecutor, operators often lack sufficient information to quickly diagnose the root cause of failures. The current implementation provides minimal failure context, making troubleshooting time-consuming and inefficient.

Solution

Introduced comprehensive logging for pod and container failure states, capturing detailed error information to streamline debugging.

Changes Made

Core Implementation

  • Enhanced _change_state method in KubernetesExecutor
  • Added pod status extraction using kube_client.read_namespaced_pod()
  • Implemented container state analysis for terminated and waiting states
  • Added graceful exception handling for Kubernetes API failures
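The container state analysis listed above might look roughly like the following. This is a minimal sketch, not the code merged in this PR: the helper name `summarize_container_states` is hypothetical, and `SimpleNamespace` objects stand in for the `V1Pod` that `kube_client.read_namespaced_pod()` would return, so the example runs without a cluster. The attribute names (`status.container_statuses`, `state.terminated`, `state.waiting`) follow the Kubernetes Python client's pod model.

```python
from types import SimpleNamespace


def summarize_container_states(pod):
    """Return one summary line per container that is terminated or waiting.

    Hypothetical helper for illustration; not part of the Airflow code base.
    """
    lines = []
    # container_statuses can be None, e.g. when the pod was never scheduled.
    for cs in pod.status.container_statuses or []:
        state = cs.state
        if state is None:
            continue
        if state.terminated is not None:
            t = state.terminated
            lines.append(
                f"{cs.name}: terminated reason={t.reason} "
                f"exit_code={t.exit_code} message={t.message}"
            )
        elif state.waiting is not None:
            w = state.waiting
            lines.append(f"{cs.name}: waiting reason={w.reason} message={w.message}")
    return lines


# Stand-in for a V1Pod; all values below are made-up sample data.
fake_pod = SimpleNamespace(
    status=SimpleNamespace(
        phase="Failed",
        reason=None,
        message=None,
        container_statuses=[
            SimpleNamespace(
                name="base",
                state=SimpleNamespace(
                    terminated=SimpleNamespace(
                        reason="OOMKilled", exit_code=137, message=None
                    ),
                    waiting=None,
                ),
            ),
            SimpleNamespace(
                name="sidecar",
                state=SimpleNamespace(
                    terminated=None,
                    waiting=SimpleNamespace(
                        reason="ImagePullBackOff", message="back-off pulling image"
                    ),
                ),
            ),
        ],
    )
)

for line in summarize_container_states(fake_pod):
    print(line)
```

In a real executor the pod would come from the API call, which can itself fail, so the lookup would presumably be wrapped in a `try`/`except` so that a logging failure never masks the original task failure.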

Closes: #37548



Contributor

@hterik hterik left a comment


Nice! 👍

Contributor

@hterik hterik left a comment


Some minor improvements suggested. Otherwise looks ok to me.

Someone else with more knowledge and authority should take a look as well.

@HsiuChuanHsu HsiuChuanHsu force-pushed the feature/kubernetes-executor-pod-failed-logging branch from 2d2e741 to 1725912 Compare August 10, 2025 23:33
Contributor Author

@hterik Thanks for the review!
Appreciate all the great suggestions - these are things I hadn't thought of. I've updated the code accordingly.

Contributor

@hterik hterik left a comment


LGTM! 👍

@HsiuChuanHsu HsiuChuanHsu force-pushed the feature/kubernetes-executor-pod-failed-logging branch from 1725912 to 6a153c0 Compare August 11, 2025 13:16
Contributor

@hterik hterik left a comment


Minor nitpicks from me only, otherwise looks good.
Someone else needs to approve.

Contributor Author

Thanks again for the review! 🤩

Member

@jason810496 jason810496 left a comment


Nice! Thanks for the PR!
Just one question about where to call collect_pod_failure_details(pod).

@HsiuChuanHsu HsiuChuanHsu force-pushed the feature/kubernetes-executor-pod-failed-logging branch from 6da99f2 to 9fad3af Compare August 20, 2025 23:23
@jason810496 jason810496 merged commit 89ea580 into apache:main Aug 22, 2025
84 checks passed
Contributor Author

Thanks for the reviews! 🤩

mangal-vairalkar pushed a commit to mangal-vairalkar/airflow that referenced this pull request Aug 30, 2025
* feat: Enhanced pod failure logging in KubernetesExecutor

Add detailed logging of pod and container failure information when tasks fail,
including pod phase, reasons, messages, and container states for better debugging.

- Extract pod status (phase, reason, message) on task failure
- Extract container state info (terminated/waiting reasons & messages)
- Add exception handling for Kubernetes API failures
- Only execute additional logging for FAILED task states

* feat(k8s-executor): improve pod failure diagnostics and reduce API calls

Move failure analysis to watcher thread, add detailed container status logging,
and include task keys for better log searchability.

* refactor: Extract pod failure analysis to dedicated TypedDict functions

- Add FailureDetails TypedDict for type-safe failure information
- Extract collect_pod_failure_details() function for cleaner separation of concerns
- Create _analyze_init_containers() and _analyze_main_containers() helper functions

* feat(kubernetes): enhance pod failure diagnostics with TypedDict and modular design
1. Introduce FailureDetails TypedDict
2. Implement collect_pod_failure_details Function
3. Enhanced Failure Logging

* fix(kubernetes): improve type safety with Literal types for container analysis

* feat: enhance pod failure logging in KubernetesExecutor with detailed failure analysis
- Pass logger as parameter to collect_pod_failure_details for consistent logging context

- Move failure details collection logic to only execute for Failed status pods for better performance
- Update test imports to handle timezone module location changes between Airflow versions
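
As a rough illustration of the refactor described in the commit message above, a typed failure record and a collector that takes a logger and only inspects pods in the `Failed` phase could be sketched as below. The field names in this `FailureDetails`, the `ContainerType` alias, the function name `collect_failure_details_sketch`, and its signature are all assumptions for illustration; the merged code may differ. `SimpleNamespace` objects stand in for a `V1Pod` so the example runs without a cluster.

```python
import logging
from types import SimpleNamespace
from typing import List, Literal, Optional, Tuple, TypedDict

# Assumed alias: the commit log mentions Literal types for container analysis.
ContainerType = Literal["init", "main"]


class FailureDetails(TypedDict, total=False):
    # Illustrative fields only; not the schema merged in the PR.
    pod_phase: Optional[str]
    pod_reason: Optional[str]
    pod_message: Optional[str]
    container_name: Optional[str]
    container_type: ContainerType
    container_reason: Optional[str]
    exit_code: Optional[int]


def collect_failure_details_sketch(
    pod, logger: logging.Logger
) -> Optional[FailureDetails]:
    """Return details for a Failed pod, or None for any other phase."""
    status = pod.status
    # Skip the analysis entirely unless the pod actually failed.
    if status is None or status.phase != "Failed":
        return None
    details: FailureDetails = {
        "pod_phase": status.phase,
        "pod_reason": status.reason,
        "pod_message": status.message,
    }
    sources: List[Tuple[ContainerType, Optional[list]]] = [
        ("init", status.init_container_statuses),
        ("main", status.container_statuses),
    ]
    for kind, statuses in sources:
        for cs in statuses or []:
            terminated = cs.state.terminated if cs.state else None
            # Record the first container that exited with a non-zero code.
            if terminated is not None and terminated.exit_code != 0:
                details["container_type"] = kind
                details["container_name"] = cs.name
                details["container_reason"] = terminated.reason
                details["exit_code"] = terminated.exit_code
                logger.warning("Pod failed: %s", details)
                return details
    logger.warning("Pod failed without a failed container: %s", details)
    return details


# Made-up sample pod for demonstration.
failed_pod = SimpleNamespace(
    status=SimpleNamespace(
        phase="Failed",
        reason="Error",
        message=None,
        init_container_statuses=None,
        container_statuses=[
            SimpleNamespace(
                name="base",
                state=SimpleNamespace(
                    terminated=SimpleNamespace(reason="Error", exit_code=1)
                ),
            )
        ],
    )
)

details = collect_failure_details_sketch(failed_pod, logging.getLogger("sketch"))
```

Passing the logger in, rather than creating one inside the function, keeps the messages under the caller's logging context, which is the motivation the commit message gives.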
@HsiuChuanHsu HsiuChuanHsu deleted the feature/kubernetes-executor-pod-failed-logging branch September 9, 2025 22:19

Development

Successfully merging this pull request may close these issues.

Print kubernetes failure Status and Reason on pod failures

3 participants