
Conversation

@ketozhang
Contributor


Resolves #54964

K8s pod events of type=Warning were reported to Airflow logs at the error level. That is too high a level for the intent of the event type. I chose to report all Warning events as warning and everything else as info, which I think follows the purpose of K8s events as documented:

Events should be treated as informative, best-effort, supplemental data.

Most, if not all, of these pod events are retryable within the K8s scheduler. I have seen logging them as errors trick pipeline users/operators into thinking the K8s scheduler is at fault, when it is more likely a pod TTL issue.

  • Warning events are logged as warning
  • Normal events are logged as info
  • Any other type is logged as info (the v1 Event API defines only Warning and Normal)
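
A minimal sketch of that mapping (the helper name log_pod_event and the logger.log call are illustrative assumptions, not the literal diff; the "Pod Event: <reason> - <message>" format and the event fields follow the existing log output and the Kubernetes V1Event model):

    from logging import INFO, WARNING

    def log_pod_event(logger, event) -> None:
        """Log a Kubernetes pod event at a level matching its type."""
        # V1Event.type is either "Warning" or "Normal"; anything else falls back to info.
        level = WARNING if event.type == "Warning" else INFO
        logger.log(level, "Pod Event: %s - %s", event.reason, event.message)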

Read the Pull Request Guidelines for more information.
In case of fundamental code changes, an Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in a newsfragment file, named {pr_number}.significant.rst or {issue_number}.significant.rst, in airflow-core/newsfragments.

@boring-cyborg boring-cyborg bot added area:providers provider:cncf-kubernetes Kubernetes (k8s) provider related issues labels Aug 27, 2025
@eladkal eladkal requested a review from romsharon98 August 27, 2025 18:23
Contributor

@romsharon98 romsharon98 left a comment


According to the PR that introduced this change, #37944, and the Kubernetes docs, there are only Normal and Warning event types, so Warning is logged as error on purpose.
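
For reference, the event types can be checked with the official kubernetes Python client; this is just a sketch, and the namespace and pod name below are made up:

    from kubernetes import client, config

    config.load_kube_config()  # assumes a local kubeconfig
    v1 = client.CoreV1Api()
    events = v1.list_namespaced_event(
        namespace="airflow",                          # hypothetical namespace
        field_selector="involvedObject.name=my-pod",  # hypothetical pod name
    )
    for e in events.items:
        print(e.type, e.reason, e.message)  # e.type is always "Normal" or "Warning"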

@ketozhang
Contributor Author

ketozhang commented Aug 29, 2025

@romsharon98 I don't think it was an intentional choice. It was just a carryover from the status quo before #37944 (every event was logged as error).

I don't think fundamentally any of the Pod Event logs should be logged as error.

I'm not sure how to put it more simply than "Warning type events should be logged as warning".

Maybe the PR author, @sudiptob2, can chime in, or perhaps the original issue reporter (#36077 and #54964 were both filed by me).

@ketozhang
Contributor Author

Let's put it this way: what is the real error message in this output of a KubernetesPodOperator task?

...
[2025-08-15, 18:16:49 UTC] {pod.py:1027} ERROR - Pod Event: FailedMount - MountVolume.MountDevice failed for volume "pvc-229d4d89-eb1e-45da-be2f-aa50d0799350" : rpc error: code = Internal desc = Failed to find device path /dev/xvdae. no device path for device "/dev/xvdae" volume "vol-0a93da870e4b7192e" found
[2025-08-15, 18:16:49 UTC] {pod.py:1027} ERROR - Pod Event: FailedCreatePodSandBox - Failed to create pod sandbox: rpc error: code = DeadlineExceeded desc = context deadline exceeded
[2025-08-15, 18:16:50 UTC] {pod.py:1025} INFO - Pod Event: Scheduled - Successfully assigned airflow/redacted to redacted.ec2.internal
[2025-08-15, 18:16:50 UTC] {pod.py:1025} INFO - Pod Event: SuccessfulAttachVolume - AttachVolume.Attach succeeded for volume "pvc-229d4d89-eb1e-45da-be2f-aa50d0799350"
[2025-08-15, 18:16:50 UTC] {pod.py:1027} ERROR - Pod Event: FailedMount - MountVolume.MountDevice failed for volume "pvc-229d4d89-eb1e-45da-be2f-aa50d0799350" : rpc error: code = Internal desc = Failed to find device path /dev/xvdae. no device path for device "/dev/xvdae" volume "vol-0a93da870e4b7192e" found
[2025-08-15, 18:16:50 UTC] {pod.py:1027} ERROR - Pod Event: FailedCreatePodSandBox - Failed to create pod sandbox: rpc error: code = DeadlineExceeded desc = context deadline exceeded
[2025-08-15, 18:16:50 UTC] {pod.py:1076} INFO - Deleting pod: redacted
[2025-08-15, 18:16:51 UTC] {taskinstance.py:3336} ERROR - Task failed with exception
Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.12/site-packages/airflow/providers/cncf/kubernetes/operators/pod.py", line 647, in execute_sync
    self.await_pod_start(pod=self.pod)
  File "/home/airflow/.local/lib/python3.12/site-packages/airflow/providers/cncf/kubernetes/operators/pod.py", line 582, in await_pod_start
    self.pod_manager.await_pod_start(
  File "/home/airflow/.local/lib/python3.12/site-packages/airflow/providers/cncf/kubernetes/utils/pod_manager.py", line 419, in await_pod_start
    raise PodLaunchFailedException(
airflow.providers.cncf.kubernetes.utils.pod_manager.PodLaunchFailedException: Pod took too long to start. More than 300s. Check the pod events in kubernetes.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.12/site-packages/airflow/models/taskinstance.py", line 776, in _execute_task
    result = _execute_callable(context=context, **execute_callable_kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/airflow/.local/lib/python3.12/site-packages/airflow/models/taskinstance.py", line 742, in _execute_callable
...

I have many devs getting confused and reporting to DevOps that the K8s cluster is broken because they see pod event errors like FailedMount, FailedScheduling, etc. However, if their KPO timeout had been longer, the cluster would have resolved these on its own.
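
For illustration, a longer startup timeout would look roughly like this (a sketch using the standard startup_timeout_seconds parameter of KubernetesPodOperator; the task id, image, and the 600s value are made up):

    from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator

    run_job = KubernetesPodOperator(
        task_id="run_job",            # hypothetical task id
        name="run-job",
        image="busybox",              # hypothetical image
        cmds=["sh", "-c", "echo done"],
        # Raise the startup timeout so transient FailedMount / FailedCreatePodSandBox
        # events can self-heal before the pod launch is declared failed.
        startup_timeout_seconds=600,
    )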

@ketozhang ketozhang force-pushed the gh-54964-k8s-pod-events-warnings branch from ae08100 to 1f9e19a Compare September 6, 2025 20:02
@ketozhang ketozhang force-pushed the gh-54964-k8s-pod-events-warnings branch from 1f9e19a to fc35cbc Compare September 6, 2025 20:04
@ketozhang
Contributor Author

@romsharon98 Done and rebased.

@romsharon98 romsharon98 merged commit f8f604a into apache:main Sep 7, 2025
87 checks passed
mangal-vairalkar pushed a commit to mangal-vairalkar/airflow that referenced this pull request Sep 7, 2025
RoyLee1224 pushed a commit to RoyLee1224/airflow that referenced this pull request Sep 8, 2025

Development

Successfully merging this pull request may close these issues.

Log level for KubernetesPodOperator's Pod Event FailedScheduling should be warning not error
