Bugfix: fix Spark K8s pod duplicate issue #61110
Conversation
SameerMesiah97
left a comment
Excluding terminating pods is a valid filter, but I'm not sure we should remove the existing sorting keys.
providers/cncf/kubernetes/src/airflow/providers/cncf/kubernetes/operators/spark_kubernetes.py
Hello @SameerMesiah97, I have replied to your comments; I would love to hear if you have any other feedback.
Force-pushed the …fix-spark-k8s-pod-duplicate-issue branch from b20fed6 to 6686be7
jscheffl
left a comment
If okay with @SameerMesiah97, I'd be OK to merge. Not an expert in SparkK8s, but the changes seem reasonable.
I personally feel we should have another pair of expert eyes on this. @Nataneljpwd raised a point about quorum reads which I'm not knowledgeable enough to verify.
providers/cncf/kubernetes/src/airflow/providers/cncf/kubernetes/operators/spark_kubernetes.py
providers/cncf/kubernetes/tests/unit/cncf/kubernetes/operators/test_spark_kubernetes.py
Force-pushed the …fix-spark-k8s-pod-duplicate-issue branch from 77456fc to 792f229
I still have two blocking concerns here:
Because of that, removing the existing phase-based prioritization is risky. I see that you are now prioritizing 'Succeeded' pods, but since you have removed the ordering, I would advise implementing something like this: This handles the edge cases revealed by this PR without accidentally returning multiple pods. I would advise removing the field selector as well, since we are already filtering in the lambda expression.
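The suggestion above can be sketched as follows. This is an illustrative reconstruction, not the actual operator code: the function name `find_driver_pod`, the dict-shaped pods, and the exact phase ranking are assumptions. The idea is to keep a phase-based prioritization alongside the terminating-pod filter so the lookup is deterministic and returns at most one pod.

```python
# Lower rank = more preferred driver phase (assumed ordering for illustration).
PHASE_PRIORITY = {"Succeeded": 0, "Running": 1, "Pending": 2, "Failed": 3}


def find_driver_pod(pods):
    """Return the single best candidate driver pod, or None."""
    # Terminating pods still report phase "Running", so they must be
    # excluded via the deletion timestamp, not via status.phase.
    live = [p for p in pods if p["metadata"].get("deletionTimestamp") is None]
    if not live:
        return None
    # Newest first, then a stable sort by phase preference: within the
    # same phase, the most recently created pod wins the tie-break.
    live.sort(key=lambda p: p["metadata"]["creationTimestamp"], reverse=True)
    live.sort(
        key=lambda p: PHASE_PRIORITY.get(p["status"]["phase"], len(PHASE_PRIORITY))
    )
    return live[0]
```

Because the selection always reduces to a single element (or `None`), the concern about accidentally returning multiple pods does not arise in this shape.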
Hello, thank you for the comment!
This is a valid concern, though from reading the k8s API documentation, not setting a
Yes, though the Spark Kubernetes operator, which handles the Spark CRD, ensures there is only one driver up at any given time; the transition does not happen, because the old pods are deleted before a new pod is created, as written here.
Don't we want to prioritize Succeeded pods? There is no case in which a driver for an application will be running while there is a pod in
We can leave out the field selector, as we do not need to implicitly choose Running pods: a Running pod can only occur if it is terminating, meaning its deletion timestamp is not None. We also avoid the Running preference, as it might make us select a terminating pod before a newly created pod.
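The selection behaviour this reply describes can be sketched roughly as below, under assumed names (`select_driver_pod` and the dict-shaped pods are illustrative, not the operator's real API): terminating pods (non-null deletion timestamp) are filtered out, a Succeeded driver is preferred if present, and otherwise the newest pod is taken, so a lingering terminating pod is never chosen over a freshly created one.

```python
def select_driver_pod(pods):
    # Drop terminating pods: a terminating driver can still report
    # phase "Running", so the deletion timestamp is the reliable signal.
    live = [p for p in pods if p["metadata"].get("deletionTimestamp") is None]
    if not live:
        return None
    # Per the discussion, the Spark operator deletes old driver pods before
    # creating a new one, so at most one Succeeded driver is expected here.
    for pod in live:
        if pod["status"]["phase"] == "Succeeded":
            return pod
    # No explicit "Running" preference: fall back to the newest pod so a
    # freshly created (possibly still Pending) driver wins.
    return max(live, key=lambda p: p["metadata"]["creationTimestamp"])
```

Note the design choice relative to the earlier suggestion: instead of ranking every phase, only Succeeded is special-cased, and recency breaks all other ties.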
In my view just one minor static check, then LGTM!
providers/cncf/kubernetes/src/airflow/providers/cncf/kubernetes/operators/spark_kubernetes.py
providers/cncf/kubernetes/tests/unit/cncf/kubernetes/operators/test_spark_kubernetes.py
I rest my case. I am not fully convinced that there cannot be multiple pods returned in your new implementation. I appreciate all the thought and effort you put into this, but whilst going through the diff, I did see quite a few areas in need of refinement. I have made comments where I saw specific areas which could be improved, but I think it would not hurt to review the new implementation once more.
@SameerMesiah97 Thank you for the review!
It is fine to block this PR if you think multiple pods could still be returned; maybe I missed something. I do not want a PR with a supposed bugfix merged when there is even a slight concern that the bug was not really fixed.
Let's just focus on getting this PR into a mergeable state.
providers/cncf/kubernetes/tests/unit/cncf/kubernetes/operators/test_spark_kubernetes.py
I would remove the test
@SameerMesiah97 I have finished answering all of your comments; I would appreciate another review.
Thanks for all the effort. The PR is now in a mergeable state. Apparently, there was an unrelated CI failure on main (which has been fixed in #61469) the last time you pushed your changes. Please rebase and force-push again so that the CI workflow can re-run and hopefully complete successfully. If the CI workflow has already been re-triggered by the time you see this comment, there is no need to do that. Once CI is all green, I will approve this PR.
I am okay with this PR being merged now.
jscheffl
left a comment
Thanks for the continued reviews @SameerMesiah97, and many more thanks to @Nataneljpwd for the patience and repeated rework. I think this is very, very good now, and I hope this gives you a reliable state once merged.
Very positive constructive discussion folks!
* fix an edge case where if a pod was pending in SparkKubernetes operator, the task won't fail and will recover
* formatting
* fixed tests and removed redundant tests
* fixed pod status phase
* address comment
* added another test for the success case
* resolve CR comments
* fixed mypy issue
* address cr comments
* fix last test
* remove deletion timestamp

Co-authored-by: Natanel Rudyuklakir <natanelrudyuklakir@gmail.com>
Fixes an issue where, if a worker failed and the K8s Spark driver failed, restarted, and was in Pending state, the operator would fail even though recovery was possible; this PR allows recovery from that state.
{pr_number}.significant.rst or {issue_number}.significant.rst, in airflow-core/newsfragments.