Fix KeyError when KPO exits too soon #37508

Merged (1 commit) on Feb 17, 2024

vchiapaikeo
Contributor

@vchiapaikeo vchiapaikeo commented Feb 17, 2024

When a pod fails and terminates quickly, the task log shows a KeyError instead of only the expected AirflowException:

[2024-02-17, 08:35:59 EST] {base.py:83} INFO - Using connection ID 'google_cloud_default' for task execution.
[2024-02-17, 08:35:59 EST] {credentials_provider.py:353} INFO - Getting connection using `google.auth.default()` since no explicit credentials are provided.
[2024-02-17, 08:36:00 EST] {pod_manager.py:798} INFO - Running command... if [ -s /***/xcom/return.json ]; then cat /***/xcom/return.json; else echo __***_xcom_result_empty__; fi
[2024-02-17, 08:36:00 EST] {pod_manager.py:798} INFO - Running command... kill -s SIGINT 1
[2024-02-17, 08:36:00 EST] {pod.py:559} INFO - xcom result file is empty.
[2024-02-17, 08:36:00 EST] {pod.py:709} INFO - Got event: {'status': 'failed', 'namespace': '***-default', 'name': 'fail-quzr387j'}
[2024-02-17, 08:36:00 EST] {pod.py:773} INFO - Container logs: + sleep 2
[2024-02-17, 08:36:00 EST] {pod.py:773} INFO - Container logs: + exit 1
[2024-02-17, 08:36:00 EST] {pod.py:773} INFO - Container logs: 
[2024-02-17, 08:36:00 EST] {pod_manager.py:616} INFO - Pod fail-quzr387j has phase Running
[2024-02-17, 08:36:03 EST] {pod.py:907} INFO - Skipping deleting pod: fail-quzr387j
[2024-02-17, 08:36:03 EST] {taskinstance.py:2751} ERROR - Task failed with exception
Traceback (most recent call last):
  File "/opt/airflow/airflow/providers/cncf/kubernetes/operators/pod.py", line 711, in trigger_reentry
    message = event.get("stack_trace", event["message"])
KeyError: 'message'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/opt/airflow/airflow/models/taskinstance.py", line 446, in _execute_task
    result = _execute_callable(context=context, **execute_callable_kwargs)
  File "/opt/airflow/airflow/models/taskinstance.py", line 416, in _execute_callable
    return execute_callable(context=context, **execute_callable_kwargs)
  File "/opt/airflow/airflow/models/baseoperator.py", line 1623, in resume_execution
    return execute_callable(context)
  File "/opt/airflow/airflow/providers/google/cloud/operators/kubernetes_engine.py", line 789, in execute_complete
    return super().execute_complete(context, event, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/deprecated/classic.py", line 285, in wrapper_function
    return wrapped_(*args_, **kwargs_)
  File "/opt/airflow/airflow/providers/cncf/kubernetes/operators/pod.py", line 762, in execute_complete
    self.trigger_reentry(context=context, event=event)
  File "/opt/airflow/airflow/providers/cncf/kubernetes/operators/pod.py", line 740, in trigger_reentry
    self._clean(event)
  File "/opt/airflow/airflow/providers/cncf/kubernetes/operators/pod.py", line 755, in _clean
    self.post_complete_action(
  File "/opt/airflow/airflow/providers/cncf/kubernetes/operators/pod.py", line 783, in post_complete_action
    self.cleanup(
  File "/opt/airflow/airflow/providers/cncf/kubernetes/operators/pod.py", line 834, in cleanup
    raise AirflowException(
airflow.exceptions.AirflowException: Pod fail-quzr387j returned a failure.

This happens because the TriggerEvent payload in this case does not include a message field, which is expected here.
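A minimal sketch of why the missing field surfaces as a KeyError: in `event.get("stack_trace", event["message"])`, Python evaluates the default argument `event["message"]` before `.get()` is called, so the lookup fails even though `.get()` itself never raises. The event dict below is copied from the log above. The helper names and the fallback string are hypothetical, and the defensive variant is only one possible way to avoid the error, not necessarily the change made in this PR.

```python
# Event payload as it appears in the task log above (no "message" key).
event = {"status": "failed", "namespace": "***-default", "name": "fail-quzr387j"}


def read_message_eager(event):
    # Mirrors the line from the traceback. The default argument
    # event["message"] is evaluated before .get() runs, so a missing
    # "message" key raises KeyError regardless of "stack_trace".
    return event.get("stack_trace", event["message"])


def read_message_defensive(event, fallback="<no message in event>"):
    # Hypothetical alternative: only look up "message" when needed,
    # and fall back to a placeholder if it is absent too.
    if "stack_trace" in event:
        return event["stack_trace"]
    return event.get("message", fallback)


try:
    read_message_eager(event)
except KeyError as exc:
    print(f"KeyError: {exc}")

print(read_message_defensive(event))
```

With a payload that does carry `message` or `stack_trace`, both helpers return the same value; they differ only when both keys are missing.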

Granted, this is not a big problem, since the next line raises anyway. It is just confusing to the user.
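The "During handling of the above exception, another exception occurred" line in the log is standard Python behavior when a second exception is raised while the first is still propagating. The sketch below reproduces that shape with hypothetical stand-ins (`PodFailure`, `cleanup`, `reentry`). It assumes the cleanup step runs in a `finally` block, which the traceback suggests but does not prove.

```python
class PodFailure(Exception):
    """Hypothetical stand-in for AirflowException."""


def cleanup():
    # Stand-in for the cleanup step that reports the pod failure.
    raise PodFailure("Pod returned a failure.")


def reentry(event):
    try:
        # Raises KeyError when the payload has no "message" key.
        return event["message"]
    finally:
        # Runs while the KeyError is still in flight, so the new
        # exception records the KeyError as its __context__.
        cleanup()


def run(event):
    try:
        reentry(event)
    except PodFailure as exc:
        return exc


exc = run({"status": "failed"})
print(type(exc).__name__, "raised while handling", type(exc.__context__).__name__)
```

The final exception is the intended one, but the traceback prints the KeyError first, which is what makes the log confusing to read.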

This fixes the missing message KeyError so that we see the proper exception getting raised:

[Screenshot: the proper exception being raised]

@boring-cyborg boring-cyborg bot added area:providers provider:cncf-kubernetes Kubernetes provider related issues labels Feb 17, 2024
Member

@pankajastro pankajastro left a comment

LGTM

@jscheffl jscheffl merged commit d50a25b into apache:main Feb 17, 2024
63 checks passed