Robustness to the driver pod taking some time to create #2302

Open
Tom-Newton opened this issue Oct 31, 2024 · 1 comment · May be fixed by #2315 or Tom-Newton/spark-operator#2

Comments

@Tom-Newton
Contributor

What feature you would like to be added?

Currently, if the driver pod is still not found when the operator next reconciles after a successful spark-submit, the spark application fails with "applicationState":{"errorMessage":"driver pod not found","state":"FAILED"}

The following logs show an example where this occurred on our prod k8s cluster.
spark-opeartor-logs.txt

  1. The spark-submit command runs successfully and we see a log from the webhook for it mutating the driver pod at 2024-10-30T11:07:28.331Z. The first reconcile after creating the spark application puts it in SUBMITTED state.
  2. The next reconcile at 2024-10-30 11:07:28.624 finds that the driver pod doesn't exist yet, and the spark application fails:
    app.Status.AppState.State = v1beta2.ApplicationStateFailing
    app.Status.AppState.ErrorMessage = "driver pod not found"
  3. 300ms later at 2024-10-30T11:07:28.924Z the spark operator detects that the driver pod has been created, but it's too late. The SparkApplication is already in FAILED state.

Why is this needed?

To improve reliability

Describe the solution you would like

Make the spark-operator robust to this internally, so that it never fails the spark application for this reason. One way to implement this would be to not change the application state at

app.Status.AppState.State = v1beta2.ApplicationStateFailing
app.Status.AppState.ErrorMessage = "driver pod not found"
unless the time since submission is greater than some threshold (maybe 10 seconds).

Describe alternatives you have considered

Set onFailureRetries to a non-zero value. The problem with this is that we don't want to retry most errors.

Additional context

No response

Love this feature?

Give it a 👍 We prioritize the features with most 👍

@Tom-Newton
Contributor Author

I opened PR #2315, and we are now using something very similar internally.
