Robustness to the driver pod taking some time to create #2302

Tom-Newton · 2024-10-31T11:45:51Z

What feature you would like to be added?

Currently, if the driver pod is still reported as not found when the operator reconciles after a successful spark-submit, the Spark application fails with "applicationState":{"errorMessage":"driver pod not found","state":"FAILED"}

The following logs show an example where this occurred on our prod k8s cluster.
spark-opeartor-logs.txt

The spark-submit command runs successfully and we see a log from the webhook for it mutating the driver pod at 2024-10-30T11:07:28.331Z. The first reconcile after creating the spark application puts it in SUBMITTED state.

The next reconcile at 2024-10-30 11:07:28.624 finds that the driver pod doesn't exist yet, and the Spark application fails:

spark-operator/internal/controller/sparkapplication/controller.go

Lines 768 to 769 in 5507800

    app.Status.AppState.State = v1beta2.ApplicationStateFailing
    app.Status.AppState.ErrorMessage = "driver pod not found"

300ms later at 2024-10-30T11:07:28.924Z the spark operator detects that the driver pod has been created, but it's too late. The SparkApplication is already in FAILED state.

Why is this needed?

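To improve reliability. A driver pod that takes a few hundred milliseconds to appear should not cause the Spark application to fail.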

Describe the solution you would like

Make the spark-operator robust to this internally and never fail the spark application. One way to implement this would be to not change the application state at

spark-operator/internal/controller/sparkapplication/controller.go

Lines 768 to 769 in 5507800

    app.Status.AppState.State = v1beta2.ApplicationStateFailing
    app.Status.AppState.ErrorMessage = "driver pod not found"

unless the time since submission is greater than some threshold (maybe 10 seconds).

Describe alternatives you have considered

Set onFailureRetries to a non-zero value. The problem with this is that we don't want to retry most errors.

Additional context

No response



Tom-Newton · 2024-11-10T17:26:08Z

I made a PR #2315 and we are now using something very similar internally.

Tom-Newton added the kind/feature label Oct 31, 2024

This was referenced Nov 7, 2024

Tomnewton/robustness to driver pod taking time to create Tom-Newton/spark-operator#2

Draft

Robustness to driver pod taking time to create #2315

Open
