[Bug] Spark App may hang forever if FinalStageResourceManager killed all executors #5136

@zhouyifan279

Description

Code of Conduct

  • I agree to follow this project's Code of Conduct

Search before asking

  • I have searched in the issues and found no similar issues.

Describe the bug

We found a Spark application hung at the final stage.
[screenshot: the application stuck at its final stage]

Rerunning the application produced the same result.

Affects Version(s)

master

Kyuubi Server Log Output

No response

Kyuubi Engine Log Output

2023-08-02 19:54:59 CST DAGScheduler INFO - ShuffleMapStage 4 (sql at SparkSQLExecute.java:17) finished in 279.363 s
2023-08-02 19:54:59 CST YarnClusterScheduler INFO - Removed TaskSet 4.0, whose tasks have all completed, from pool default
2023-08-02 19:54:59 CST DAGScheduler INFO - looking for newly runnable stages
2023-08-02 19:54:59 CST DAGScheduler INFO - running: Set()
2023-08-02 19:54:59 CST DAGScheduler INFO - waiting: Set()
2023-08-02 19:54:59 CST DAGScheduler INFO - failed: Set()
2023-08-02 19:54:59 CST YarnAllocator INFO - Resource profile 0 doesn't exist, adding it
2023-08-02 19:54:59 CST YarnAllocator INFO - Driver requested a total number of 1 executor(s) for resource profile id: 0.
2023-08-02 19:54:59 CST YarnClusterSchedulerBackend INFO - Requesting to kill executor(s) 99, 90, 84, 57, 63, 39, 30, 45, 66, 2, 72, 5, 48, 33, 69, 27, 54, 60, 15, 42, 21, 71, 92, 86, 24, 74, 89, 95, 53, 41, 83, 56, 17, 1, 44, 50, 23, 38, 4, 26, 11, 32, 82, 97, 29, 20, 85, 79, 70, 64, 91, 46, 94, 73, 67, 88, 34, 28, 6, 40, 55, 76, 49, 61, 43, 9, 22, 58, 3, 10, 25, 93, 81, 75, 13
2023-08-02 19:54:59 CST YarnClusterSchedulerBackend INFO - Actual list of executor(s) to be killed is 99, 90, 84, 57, 63, 39, 30, 45, 66, 2, 72, 5, 48, 33, 69, 27, 54, 60, 15, 42, 21, 71, 92, 86, 24, 74, 89, 95, 53, 41, 83, 56, 17, 1, 44, 50, 23, 38, 4, 26, 11, 32, 82, 97, 29, 20, 85, 79, 70, 64, 91, 46, 94, 73, 67, 88, 34, 28, 6, 40, 55, 76, 49, 61, 43, 9, 22, 58, 3, 10, 25, 93, 81, 75, 13
2023-08-02 19:54:59 CST ApplicationMaster$AMEndpoint INFO - Driver requested to kill executor(s) 99, 90, 84, 57, 63, 39, 30, 45, 66, 2, 72, 5, 48, 33, 69, 27, 54, 60, 15, 42, 21, 71, 92, 86, 24, 74, 89, 95, 53, 41, 83, 56, 17, 1, 44, 50, 23, 38, 4, 26, 11, 32, 82, 97, 29, 20, 85, 79, 70, 64, 91, 46, 94, 73, 67, 88, 34, 28, 6, 40, 55, 76, 49, 61, 43, 9, 22, 58, 3, 10, 25, 93, 81, 75, 13.
2023-08-02 19:54:59 CST YarnAllocator INFO - Resource profile 0 doesn't exist, adding it
2023-08-02 19:54:59 CST ExecutorAllocationManager INFO - Executors 99,90,84,57,63,39,30,45,66,2,72,5,48,33,69,27,54,60,15,42,21,71,92,86,24,74,89,95,53,41,83,56,17,1,44,50,23,38,4,26,11,32,82,97,29,20,85,79,70,64,91,46,94,73,67,88,34,28,6,40,55,76,49,61,43,9,22,58,3,10,25,93,81,75,13 removed due to idle timeout.
2023-08-02 19:55:00 CST YarnClusterSchedulerBackend INFO - Requesting to kill executor(s) 65
2023-08-02 19:55:00 CST YarnClusterSchedulerBackend INFO - Actual list of executor(s) to be killed is 65
2023-08-02 19:55:00 CST ApplicationMaster$AMEndpoint INFO - Driver requested to kill executor(s) 65.
2023-08-02 19:55:00 CST YarnAllocator INFO - Resource profile 0 doesn't exist, adding it
2023-08-02 19:55:00 CST ExecutorAllocationManager INFO - Executors 65 removed due to idle timeout.
2023-08-02 19:55:00 CST FinalStageConfigIsolation INFO - Store config: spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes to previousStage, original value: 128M 
2023-08-02 19:55:00 CST FinalStageConfigIsolation INFO - For final stage: set spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes = 256M.
2023-08-02 19:55:00 CST FinalStageConfigIsolation INFO - Store config: spark.sql.adaptive.skewJoin.skewedPartitionFactor to previousStage, original value: 4 
2023-08-02 19:55:00 CST FinalStageConfigIsolation INFO - For final stage: set spark.sql.adaptive.skewJoin.skewedPartitionFactor = 5.
2023-08-02 19:55:00 CST FinalStageConfigIsolation INFO - Store config: spark.sql.adaptive.advisoryPartitionSizeInBytes to previousStage, original value: 8MB 
2023-08-02 19:55:00 CST FinalStageConfigIsolation INFO - For final stage: set spark.sql.adaptive.advisoryPartitionSizeInBytes = 384MB.
2023-08-02 19:55:00 CST FinalStageConfigIsolation INFO - Store config: spark.sql.adaptive.coalescePartitions.minPartitionNum to previousStage, original value: __INTERNAL_UNSET_CONFIG_TAG__ 
2023-08-02 19:55:00 CST FinalStageConfigIsolation INFO - For final stage: set spark.sql.adaptive.coalescePartitions.minPartitionNum = 1.
2023-08-02 19:55:00 CST ShufflePartitionsUtil INFO - For shuffle(2), advisory target size: 402653184, actual target size 21915537.
2023-08-02 19:55:00 CST FinalStageResourceManager INFO - The snapshot of current executors view, active executors: 100, min executor: 1, target executors: 1, has benefits: true
2023-08-02 19:55:01 CST YarnClusterSchedulerBackend INFO - Requesting to kill executor(s) 51, 19
2023-08-02 19:55:01 CST YarnClusterSchedulerBackend INFO - Actual list of executor(s) to be killed is 51, 19
2023-08-02 19:55:01 CST ApplicationMaster$AMEndpoint INFO - Driver requested to kill executor(s) 51, 19.
2023-08-02 19:55:01 CST YarnAllocator INFO - Resource profile 0 doesn't exist, adding it
2023-08-02 19:55:01 CST ExecutorAllocationManager INFO - Executors 51,19 removed due to idle timeout.
2023-08-02 19:55:02 CST FinalStageResourceManager INFO - Request to kill executors, total count 99, [88, 42, 77, 79, 2, 75, 81, 6, 15, 90, 28, 43, 63, 64, 14, 93, 70, 21, 56, 34, 10, 33, 11, 65, 61, 57, 35, 18, 3, 7, 20, 17, 32, 30, 68, 29, 86, 24, 47, 52, 38, 54, 41, 8, 9, 60, 40, 74, 4, 82, 100, 72, 45, 69, 36, 12, 46, 58, 95, 80, 44, 87, 55, 53, 5, 23, 26, 22, 97, 85, 96, 66, 59, 16, 84, 37, 48, 50, 51, 67, 39, 78, 62, 49, 71, 25, 13, 83, 89, 73, 31, 91, 19, 1, 99, 92, 94, 98, 27].
2023-08-02 19:55:02 CST YarnClusterSchedulerBackend INFO - Requesting to kill executor(s) 88, 42, 77, 79, 2, 75, 81, 6, 15, 90, 28, 43, 63, 64, 14, 93, 70, 21, 56, 34, 10, 33, 11, 65, 61, 57, 35, 18, 3, 7, 20, 17, 32, 30, 68, 29, 86, 24, 47, 52, 38, 54, 41, 8, 9, 60, 40, 74, 4, 82, 100, 72, 45, 69, 36, 12, 46, 58, 95, 80, 44, 87, 55, 53, 5, 23, 26, 22, 97, 85, 96, 66, 59, 16, 84, 37, 48, 50, 51, 67, 39, 78, 62, 49, 71, 25, 13, 83, 89, 73, 31, 91, 19, 1, 99, 92, 94, 98, 27
2023-08-02 19:55:02 CST YarnClusterSchedulerBackend INFO - Actual list of executor(s) to be killed is 77, 14, 35, 18, 7, 68, 47, 52, 8, 100, 36, 12, 80, 87, 96, 59, 16, 37, 78, 62, 31, 98
2023-08-02 19:55:02 CST YarnAllocator INFO - Resource profile 0 doesn't exist, adding it
2023-08-02 19:55:02 CST YarnAllocator INFO - Driver requested a total number of 0 executor(s) for resource profile id: 0.
2023-08-02 19:55:02 CST ApplicationMaster$AMEndpoint INFO - Driver requested to kill executor(s) 77, 14, 35, 18, 7, 68, 47, 52, 8, 100, 36, 12, 80, 87, 96, 59, 16, 37, 78, 62, 31, 98.
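
The timeline above suggests two executor-removal paths racing. At 19:54:59–19:55:01, ExecutorAllocationManager removes 78 executors due to idle timeout, which lowers the DRA target. At 19:55:02, FinalStageResourceManager independently asks to kill 99 of the 100 executors, intending to retain one, but that survivor has apparently already been removed by the idle-timeout path: the "Actual list" covers all 22 executors still alive, and the driver's requested total lands on 0. With no live executor and no outstanding request, the final stage can never be scheduled. Below is a minimal sketch of the suspected interaction, assuming the kill goes through Spark's ExecutorAllocationClient; the package, object, and method names here are illustrative, not Kyuubi's actual code.

    // Sketch of the suspected interaction only, not Kyuubi's implementation.
    // ExecutorAllocationClient is private[spark], so real code has to live
    // under the org.apache.spark package tree to call it.
    package org.apache.spark.sketch

    import org.apache.spark.ExecutorAllocationClient

    object FinalStageKillSketch {

      /** Kill every executor except `keep`, as FinalStageResourceManager does
       *  before the final stage. Returns the executors Spark actually killed
       *  (the "Actual list of executor(s) to be killed" log line). */
      def shrinkToOne(client: ExecutorAllocationClient,
                      allExecutors: Seq[String],
                      keep: String): Seq[String] = {
        val toKill = allExecutors.filterNot(_ == keep)
        // If DRA's idle-timeout path has already removed `keep` by the time
        // this runs, the kill below covers every remaining live executor, and
        // with adjustTargetNumExecutors = true the driver's requested total is
        // lowered to 0 -- matching "Driver requested a total number of 0
        // executor(s)". Nothing is left to run the final stage, and no new
        // allocation request is ever issued, so the application hangs.
        client.killExecutors(
          executorIds = toKill,
          adjustTargetNumExecutors = true, // assumption: the dangerous setting
          countFailures = false,
          force = false)
      }
    }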

Kyuubi Server Configurations

No response

Kyuubi Engine Configurations

No response

Additional context

Spark DRA (dynamic resource allocation) was enabled, and spark.dynamicAllocation.minExecutors was set to 1.
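
For anyone trying to reproduce, a minimal session setup with the settings from this report. The extension class name is the documented entry point for the kyuubi-extension-spark jar; the eagerlyKillExecutors key is quoted from memory and should be treated as an assumption (check KyuubiSQLConf for the exact name in your Kyuubi version), as should everything else a real repro needs (cluster size, input data large enough to allocate ~100 executors).

    // Minimal repro setup sketch, assuming the settings stated in this issue.
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("final-stage-hang-repro")
      // Settings stated in this report:
      .config("spark.dynamicAllocation.enabled", "true")
      .config("spark.dynamicAllocation.minExecutors", "1")
      // Kyuubi's Spark SQL extension, which provides FinalStageResourceManager:
      .config("spark.sql.extensions", "org.apache.kyuubi.sql.KyuubiSparkSQLExtension")
      // Assumed config key enabling the final-stage executor kill:
      .config("spark.sql.finalWriteStage.eagerlyKillExecutors.enabled", "true")
      .getOrCreate()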

Are you willing to submit a PR?

  • Yes. I would be willing to submit a PR with guidance from the Kyuubi community to fix.
  • No. I cannot submit a PR at this time.
