[Bug] Spark App may hang forever if FinalStageResourceManager killed all executors #5136

@zhouyifan279

Description

Code of Conduct

  • I agree to follow this project's Code of Conduct

Search before asking

  • I have searched in the issues and found no similar issues.

Describe the bug

We found a Spark application hung at the final stage.
[screenshot: the application stuck at its final stage]

Rerunning the application produced the same result.

Affects Version(s)

master

Kyuubi Server Log Output

No response

Kyuubi Engine Log Output

2023-08-02 19:54:59 CST DAGScheduler INFO - ShuffleMapStage 4 (sql at SparkSQLExecute.java:17) finished in 279.363 s
2023-08-02 19:54:59 CST YarnClusterScheduler INFO - Removed TaskSet 4.0, whose tasks have all completed, from pool default
2023-08-02 19:54:59 CST DAGScheduler INFO - looking for newly runnable stages
2023-08-02 19:54:59 CST DAGScheduler INFO - running: Set()
2023-08-02 19:54:59 CST DAGScheduler INFO - waiting: Set()
2023-08-02 19:54:59 CST DAGScheduler INFO - failed: Set()
2023-08-02 19:54:59 CST YarnAllocator INFO - Resource profile 0 doesn't exist, adding it
2023-08-02 19:54:59 CST YarnAllocator INFO - Driver requested a total number of 1 executor(s) for resource profile id: 0.
2023-08-02 19:54:59 CST YarnClusterSchedulerBackend INFO - Requesting to kill executor(s) 99, 90, 84, 57, 63, 39, 30, 45, 66, 2, 72, 5, 48, 33, 69, 27, 54, 60, 15, 42, 21, 71, 92, 86, 24, 74, 89, 95, 53, 41, 83, 56, 17, 1, 44, 50, 23, 38, 4, 26, 11, 32, 82, 97, 29, 20, 85, 79, 70, 64, 91, 46, 94, 73, 67, 88, 34, 28, 6, 40, 55, 76, 49, 61, 43, 9, 22, 58, 3, 10, 25, 93, 81, 75, 13
2023-08-02 19:54:59 CST YarnClusterSchedulerBackend INFO - Actual list of executor(s) to be killed is 99, 90, 84, 57, 63, 39, 30, 45, 66, 2, 72, 5, 48, 33, 69, 27, 54, 60, 15, 42, 21, 71, 92, 86, 24, 74, 89, 95, 53, 41, 83, 56, 17, 1, 44, 50, 23, 38, 4, 26, 11, 32, 82, 97, 29, 20, 85, 79, 70, 64, 91, 46, 94, 73, 67, 88, 34, 28, 6, 40, 55, 76, 49, 61, 43, 9, 22, 58, 3, 10, 25, 93, 81, 75, 13
2023-08-02 19:54:59 CST ApplicationMaster$AMEndpoint INFO - Driver requested to kill executor(s) 99, 90, 84, 57, 63, 39, 30, 45, 66, 2, 72, 5, 48, 33, 69, 27, 54, 60, 15, 42, 21, 71, 92, 86, 24, 74, 89, 95, 53, 41, 83, 56, 17, 1, 44, 50, 23, 38, 4, 26, 11, 32, 82, 97, 29, 20, 85, 79, 70, 64, 91, 46, 94, 73, 67, 88, 34, 28, 6, 40, 55, 76, 49, 61, 43, 9, 22, 58, 3, 10, 25, 93, 81, 75, 13.
2023-08-02 19:54:59 CST YarnAllocator INFO - Resource profile 0 doesn't exist, adding it
2023-08-02 19:54:59 CST ExecutorAllocationManager INFO - Executors 99,90,84,57,63,39,30,45,66,2,72,5,48,33,69,27,54,60,15,42,21,71,92,86,24,74,89,95,53,41,83,56,17,1,44,50,23,38,4,26,11,32,82,97,29,20,85,79,70,64,91,46,94,73,67,88,34,28,6,40,55,76,49,61,43,9,22,58,3,10,25,93,81,75,13 removed due to idle timeout.
2023-08-02 19:55:00 CST YarnClusterSchedulerBackend INFO - Requesting to kill executor(s) 65
2023-08-02 19:55:00 CST YarnClusterSchedulerBackend INFO - Actual list of executor(s) to be killed is 65
2023-08-02 19:55:00 CST ApplicationMaster$AMEndpoint INFO - Driver requested to kill executor(s) 65.
2023-08-02 19:55:00 CST YarnAllocator INFO - Resource profile 0 doesn't exist, adding it
2023-08-02 19:55:00 CST ExecutorAllocationManager INFO - Executors 65 removed due to idle timeout.
2023-08-02 19:55:00 CST FinalStageConfigIsolation INFO - Store config: spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes to previousStage, original value: 128M 
2023-08-02 19:55:00 CST FinalStageConfigIsolation INFO - For final stage: set spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes = 256M.
2023-08-02 19:55:00 CST FinalStageConfigIsolation INFO - Store config: spark.sql.adaptive.skewJoin.skewedPartitionFactor to previousStage, original value: 4 
2023-08-02 19:55:00 CST FinalStageConfigIsolation INFO - For final stage: set spark.sql.adaptive.skewJoin.skewedPartitionFactor = 5.
2023-08-02 19:55:00 CST FinalStageConfigIsolation INFO - Store config: spark.sql.adaptive.advisoryPartitionSizeInBytes to previousStage, original value: 8MB 
2023-08-02 19:55:00 CST FinalStageConfigIsolation INFO - For final stage: set spark.sql.adaptive.advisoryPartitionSizeInBytes = 384MB.
2023-08-02 19:55:00 CST FinalStageConfigIsolation INFO - Store config: spark.sql.adaptive.coalescePartitions.minPartitionNum to previousStage, original value: __INTERNAL_UNSET_CONFIG_TAG__ 
2023-08-02 19:55:00 CST FinalStageConfigIsolation INFO - For final stage: set spark.sql.adaptive.coalescePartitions.minPartitionNum = 1.
2023-08-02 19:55:00 CST ShufflePartitionsUtil INFO - For shuffle(2), advisory target size: 402653184, actual target size 21915537.
2023-08-02 19:55:00 CST FinalStageResourceManager INFO - The snapshot of current executors view, active executors: 100, min executor: 1, target executors: 1, has benefits: true
2023-08-02 19:55:01 CST YarnClusterSchedulerBackend INFO - Requesting to kill executor(s) 51, 19
2023-08-02 19:55:01 CST YarnClusterSchedulerBackend INFO - Actual list of executor(s) to be killed is 51, 19
2023-08-02 19:55:01 CST ApplicationMaster$AMEndpoint INFO - Driver requested to kill executor(s) 51, 19.
2023-08-02 19:55:01 CST YarnAllocator INFO - Resource profile 0 doesn't exist, adding it
2023-08-02 19:55:01 CST ExecutorAllocationManager INFO - Executors 51,19 removed due to idle timeout.
2023-08-02 19:55:02 CST FinalStageResourceManager INFO - Request to kill executors, total count 99, [88, 42, 77, 79, 2, 75, 81, 6, 15, 90, 28, 43, 63, 64, 14, 93, 70, 21, 56, 34, 10, 33, 11, 65, 61, 57, 35, 18, 3, 7, 20, 17, 32, 30, 68, 29, 86, 24, 47, 52, 38, 54, 41, 8, 9, 60, 40, 74, 4, 82, 100, 72, 45, 69, 36, 12, 46, 58, 95, 80, 44, 87, 55, 53, 5, 23, 26, 22, 97, 85, 96, 66, 59, 16, 84, 37, 48, 50, 51, 67, 39, 78, 62, 49, 71, 25, 13, 83, 89, 73, 31, 91, 19, 1, 99, 92, 94, 98, 27].
2023-08-02 19:55:02 CST YarnClusterSchedulerBackend INFO - Requesting to kill executor(s) 88, 42, 77, 79, 2, 75, 81, 6, 15, 90, 28, 43, 63, 64, 14, 93, 70, 21, 56, 34, 10, 33, 11, 65, 61, 57, 35, 18, 3, 7, 20, 17, 32, 30, 68, 29, 86, 24, 47, 52, 38, 54, 41, 8, 9, 60, 40, 74, 4, 82, 100, 72, 45, 69, 36, 12, 46, 58, 95, 80, 44, 87, 55, 53, 5, 23, 26, 22, 97, 85, 96, 66, 59, 16, 84, 37, 48, 50, 51, 67, 39, 78, 62, 49, 71, 25, 13, 83, 89, 73, 31, 91, 19, 1, 99, 92, 94, 98, 27
2023-08-02 19:55:02 CST YarnClusterSchedulerBackend INFO - Actual list of executor(s) to be killed is 77, 14, 35, 18, 7, 68, 47, 52, 8, 100, 36, 12, 80, 87, 96, 59, 16, 37, 78, 62, 31, 98
2023-08-02 19:55:02 CST YarnAllocator INFO - Resource profile 0 doesn't exist, adding it
2023-08-02 19:55:02 CST YarnAllocator INFO - Driver requested a total number of 0 executor(s) for resource profile id: 0.
2023-08-02 19:55:02 CST ApplicationMaster$AMEndpoint INFO - Driver requested to kill executor(s) 77, 14, 35, 18, 7, 68, 47, 52, 8, 100, 36, 12, 80, 87, 96, 59, 16, 37, 78, 62, 31, 98.
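
The timeline above suggests two executor-removal paths racing. At 19:54:59–19:55:01, ExecutorAllocationManager removes 78 executors due to idle timeout, which lowers the DRA target. At 19:55:02, FinalStageResourceManager independently asks to kill 99 of the 100 executors, intending to retain one, but that survivor has apparently already been removed by the idle-timeout path: the "Actual list" covers all 22 executors still alive, and the driver's requested total lands on 0. With no live executor and no outstanding request, the final stage can never be scheduled. Below is a minimal sketch of the suspected interaction, assuming the kill goes through Spark's ExecutorAllocationClient; the package, object, and method names here are illustrative, not Kyuubi's actual code.

    // Sketch of the suspected interaction only, not Kyuubi's implementation.
    // ExecutorAllocationClient is private[spark], so real code has to live
    // under the org.apache.spark package tree to call it.
    package org.apache.spark.sketch

    import org.apache.spark.ExecutorAllocationClient

    object FinalStageKillSketch {

      /** Kill every executor except `keep`, as FinalStageResourceManager does
       *  before the final stage. Returns the executors Spark actually killed
       *  (the "Actual list of executor(s) to be killed" log line). */
      def shrinkToOne(client: ExecutorAllocationClient,
                      allExecutors: Seq[String],
                      keep: String): Seq[String] = {
        val toKill = allExecutors.filterNot(_ == keep)
        // If DRA's idle-timeout path has already removed `keep` by the time
        // this runs, the kill below covers every remaining live executor, and
        // with adjustTargetNumExecutors = true the driver's requested total is
        // lowered to 0 -- matching "Driver requested a total number of 0
        // executor(s)". Nothing is left to run the final stage, and no new
        // allocation request is ever issued, so the application hangs.
        client.killExecutors(
          executorIds = toKill,
          adjustTargetNumExecutors = true, // assumption: the dangerous setting
          countFailures = false,
          force = false)
      }
    }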

Kyuubi Server Configurations

No response

Kyuubi Engine Configurations

No response

Additional context

Spark DRA (dynamic resource allocation) was enabled, and spark.dynamicAllocation.minExecutors was set to 1.
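
For anyone trying to reproduce, a minimal session setup with the settings from this report. The extension class name is the documented entry point for the kyuubi-extension-spark jar; the eagerlyKillExecutors key is quoted from memory and should be treated as an assumption (check KyuubiSQLConf for the exact name in your Kyuubi version), as should everything else a real repro needs (cluster size, input data large enough to allocate ~100 executors).

    // Minimal repro setup sketch, assuming the settings stated in this issue.
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("final-stage-hang-repro")
      // Settings stated in this report:
      .config("spark.dynamicAllocation.enabled", "true")
      .config("spark.dynamicAllocation.minExecutors", "1")
      // Kyuubi's Spark SQL extension, which provides FinalStageResourceManager:
      .config("spark.sql.extensions", "org.apache.kyuubi.sql.KyuubiSparkSQLExtension")
      // Assumed config key enabling the final-stage executor kill:
      .config("spark.sql.finalWriteStage.eagerlyKillExecutors.enabled", "true")
      .getOrCreate()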

Are you willing to submit a PR?

  • Yes. I would be willing to submit a PR with guidance from the Kyuubi community to fix.
  • No. I cannot submit a PR at this time.
