What feature you would like to be added?
Currently, if the driver pod still reports not found for too long after a successful spark-submit, the spark application fails with "applicationState":{"errorMessage":"driver pod not found","state":"FAILED"}.
The following logs show an example where this occurred on our prod k8s cluster: spark-opeartor-logs.txt
The spark-submit command runs successfully, and we see a log from the webhook mutating the driver pod at 2024-10-30T11:07:28.331Z. The first reconcile after the spark application is created puts it in the SUBMITTED state.
The next reconcile, at 2024-10-30T11:07:28.624Z, finds that the driver pod doesn't exist yet and fails the spark application with app.Status.AppState.ErrorMessage="driver pod not found".
300ms later, at 2024-10-30T11:07:28.924Z, the spark operator detects that the driver pod has been created, but it's too late: the SparkApplication is already in the FAILED state.
Why is this needed?
To improve reliability
Describe the solution you would like
Make the spark-operator robust to this internally and never fail the spark application. One way to implement this would be to not change the application state at spark-operator/internal/controller/sparkapplication/controller.go, lines 768 to 769 (commit 5507800).
Describe alternatives you have considered
Set onFailureRetries to a non-zero value. The problem with this is that most errors are ones we don't want to retry.
Additional context
No response
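For reference, the onFailureRetries knob mentioned under alternatives is set on the SparkApplication's restartPolicy (field names as in the spark-operator v1beta2 CRD; the values here are only illustrative):

```yaml
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: example
spec:
  restartPolicy:
    type: OnFailure
    onFailureRetries: 3          # retry up to 3 times on failure
    onFailureRetryInterval: 10   # seconds to wait between retries
```

As noted above, this retries every failure, including errors that should not be retried, which is why it is only a workaround rather than a fix.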