ResultTask

ResultTask is created with a broadcast variable that holds the serialized RDD and the function to execute on it, together with the partition to compute.

Table 1. ResultTask’s Internal Registries and Counters
Name Description

preferredLocs

Collection of TaskLocations.

Holds the unique entries of locs. When locs is not defined, preferredLocs is empty and the task has no location preferences.

Initialized when ResultTask is created.

Used exclusively when ResultTask is requested for preferred locations.
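
The rule for preferredLocs can be illustrated with a minimal sketch. SketchLocation and PreferredLocsSketch are hypothetical stand-ins introduced only for this example (they are not Spark classes), and "not defined" is assumed here to mean null.

```scala
// Hypothetical stand-in for a task location, used only for illustration.
final case class SketchLocation(host: String)

object PreferredLocsSketch {
  // Assumption: an undefined locs is represented as null.
  // Undefined locs gives no preferences; otherwise each location is kept once.
  def preferredLocs(locs: Seq[SketchLocation]): Seq[SketchLocation] =
    if (locs == null) Nil else locs.distinct
}
```

The sketch uses distinct so that the order of the first occurrences is preserved; any approach that yields the unique entries would match the description above.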

Creating ResultTask Instance

ResultTask takes the following when created:

  • stageId — the stage the task is executed for

  • stageAttemptId — the stage attempt id

  • Broadcast variable with the serialized task (as Array[Byte]). The broadcast contains a serialized pair of the RDD and the function to execute.

  • Partition to compute

  • Collection of TaskLocations, i.e. preferred locations (executors) to execute the task on

  • outputId

  • local Properties

  • The stage’s serialized TaskMetrics (as Array[Byte])

  • (optional) Job id

  • (optional) Application id

  • (optional) Application attempt id

ResultTask initializes the internal registries and counters.
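
To make the shape of these inputs easier to see, here is a simplified sketch of a parameter holder. It is not Spark's class: ResultTaskParams, SketchPartition and SketchTaskLocation are hypothetical names, a plain Array[Byte] stands in for the broadcast variable, and the types of the ids are assumptions made for illustration.

```scala
import java.util.Properties

object ResultTaskParamsSketch {
  // Hypothetical stand-ins for the partition and task location types.
  final case class SketchPartition(index: Int)
  final case class SketchTaskLocation(host: String)

  // Simplified holder mirroring the list above; id types are assumed.
  final class ResultTaskParams(
      val stageId: Int,
      val stageAttemptId: Int,
      val taskBinary: Array[Byte], // stands in for the broadcast of the serialized (RDD, function) pair
      val partition: SketchPartition,
      val locs: Seq[SketchTaskLocation],
      val outputId: Int,
      val localProperties: Properties,
      val serializedTaskMetrics: Array[Byte],
      val jobId: Option[Int] = None, // optional
      val appId: Option[String] = None, // optional
      val appAttemptId: Option[String] = None // optional
  )
}
```

The three optional inputs are modelled as Option values that default to None, so a caller that does not know the job or application can leave them out.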

preferredLocations Method

preferredLocations: Seq[TaskLocation]
Note
preferredLocations is part of the Task contract.

preferredLocations simply returns the preferredLocs internal property.

Deserialize RDD and Function (From Broadcast) and Execute Function (on RDD Partition) — runTask Method

runTask(context: TaskContext): U
Note
U is the type of a result as defined when ResultTask is created.

runTask deserializes an RDD and a function from the broadcast and then executes the function (on the records from the RDD partition).

Note
runTask is part of the Task contract to run a task.

Internally, runTask starts by tracking the time required to deserialize an RDD and a function to execute.

Note
The taskBinary broadcast is defined when ResultTask is created.

runTask records the _executorDeserializeTime and _executorDeserializeCpuTime properties.

In the end, runTask executes the function (passing in the input context and the records from the partition of the RDD).

Note
The partition used to access the records in the deserialized RDD is defined when ResultTask is created.
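
The flow of runTask described above (deserialize the pair, record the deserialization times, run the function on the partition's records) can be sketched without Spark. Everything in this example is an assumption made for illustration: SketchContext, SketchRdd and SumRecords are hypothetical stand-ins, plain Java serialization replaces whatever serializer Spark uses, and the byte array plays the role of the taskBinary broadcast value.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}
import java.lang.management.ManagementFactory

object RunTaskSketch {
  // Hypothetical stand-in for the task context.
  final case class SketchContext(partitionId: Int)

  // Hypothetical stand-in for an RDD: records of each partition kept in memory.
  final case class SketchRdd(partitions: Vector[Vector[Int]]) {
    def iterator(partition: Int): Iterator[Int] = partitions(partition).iterator
  }

  // The function to execute, written as a named serializable class.
  final class SumRecords extends ((SketchContext, Iterator[Int]) => Int) with Serializable {
    def apply(context: SketchContext, records: Iterator[Int]): Int = records.sum
  }

  // Serializes the (RDD, function) pair, playing the role of the task binary.
  def serialize(rdd: SketchRdd, func: (SketchContext, Iterator[Int]) => Int): Array[Byte] = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    try out.writeObject((rdd, func)) finally out.close()
    bytes.toByteArray
  }

  // Returns (result, deserialize time in ms, deserialize CPU time in ns).
  def runTask(taskBinary: Array[Byte], partition: Int, context: SketchContext): (Int, Long, Long) = {
    val threadMXBean = ManagementFactory.getThreadMXBean
    val cpuSupported = threadMXBean.isCurrentThreadCpuTimeSupported
    val startTime = System.currentTimeMillis()
    val startCpuTime = if (cpuSupported) threadMXBean.getCurrentThreadCpuTime else 0L

    // Step 1: deserialize the RDD and the function from the task binary.
    val in = new ObjectInputStream(new ByteArrayInputStream(taskBinary))
    val (rdd, func) =
      try in.readObject().asInstanceOf[(SketchRdd, (SketchContext, Iterator[Int]) => Int)]
      finally in.close()

    // Step 2: record how long deserialization took (wall clock and CPU).
    val deserializeTime = System.currentTimeMillis() - startTime
    val deserializeCpuTime =
      if (cpuSupported) threadMXBean.getCurrentThreadCpuTime - startCpuTime else 0L

    // Step 3: execute the function on the records of the given partition.
    val result = func(context, rdd.iterator(partition))
    (result, deserializeTime, deserializeCpuTime)
  }
}
```

For example, serializing a SketchRdd with partitions Vector(1, 2, 3) and Vector(10, 20) together with a SumRecords instance, and then calling runTask for partition 1, sums only the records of that partition.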