@vanzin vanzin commented Aug 10, 2017

This change is a little larger because there's a whole lot of logic
behind these pages, all really tied to internal types and listeners.
There's also a lot of code that was moved to the new module.

  • Added missing StageData and ExecutorStageSummary fields which are
    used by the UI. Some json golden files needed to be updated to account
    for new fields.

  • Save RDD graph data in the store. This re-uses existing types as much as
    possible so that the code doesn't need to be rewritten; as a result, it's
    probably not optimal.

  • Some old classes (e.g. JobProgressListener) still remain, since they're used
    in other parts of the code; they're not used by the UI anymore, though, and
    will be cleaned up in a separate change.

  • Save information about active pools in the store. This data is not really used
    in the SHS, but it's not a lot of data so it's still recorded when replaying
    applications.

  • Because the new store sorts things slightly differently from the previous
    code, some json golden files had some elements within them shuffled around.

  • The retention unit test in UISeleniumSuite was disabled because the code
    to throw away old stages / tasks hasn't been added yet.

  • The job description field in the API tries to follow the old behavior, which
    leaves it empty most of the time, even though there's information to fill it
    in. For stages, a new field was added to hold the description (which is basically
    the job description), so that the UI can be rendered in the old way.

  • A new stage status ("SKIPPED") was added to account for the fact that the API
    couldn't represent that state before. Without this, the stage would show up as
    "PENDING" in the UI, which is now based on API types.

  • The API used to expose "executorRunTime" as the value of the task's duration,
    which wasn't correct (and was redundant, since that value is already available
    from the metrics object). This change stores the actual task duration instead,
    so a few expectation files needed to be updated to account for the new
    durations and for sorting differences caused by the changed values.
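
The last two items can be illustrated with a small sketch. It is not the code from this change: `StageStatus`, `TaskTimes`, and `ApiSketch.stageStatus` are hypothetical names, the duration is assumed to be finish time minus launch time, and the rule for when a stage counts as skipped is an assumption made for the example.

```scala
// Hypothetical stand-ins for the API types; not the real Spark classes.
object StageStatus extends Enumeration {
  val ACTIVE, COMPLETE, FAILED, PENDING, SKIPPED = Value
}

// launchTime and finishTime are wall-clock milliseconds. executorRunTime is
// the metric reported by the executor and is kept as a separate value.
final case class TaskTimes(launchTime: Long, finishTime: Long, executorRunTime: Long) {
  // Assumption: the task duration is the elapsed time between launch and finish.
  def duration: Long = finishTime - launchTime
}

object ApiSketch {
  // Assumption for this sketch: a stage that was never submitted by the time
  // its job finished is reported as SKIPPED instead of staying PENDING.
  def stageStatus(submitted: Boolean, jobFinished: Boolean): StageStatus.Value =
    if (!submitted && jobFinished) StageStatus.SKIPPED
    else if (!submitted) StageStatus.PENDING
    else StageStatus.ACTIVE
}
```

In this sketch the duration and the executor run time are independent values, so an endpoint that reports one in place of the other would also sort tasks differently.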

@squito squito left a comment

ok this one is pretty big, will take a few passes.

@cloud-fan @jerryshao @ajbozarth Would also appreciate getting more eyes on this if you want to start looking at this already


// Create the graph data for all the job's stages.
event.stageInfos.foreach { stage =>
  val graph = RDDOperationGraph.makeOperationGraph(stage, Int.MaxValue)
Collaborator

it looks like the config for maxNodes was lost, even in M6
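
For illustration, restoring a configurable limit could look roughly like the sketch below. The key name `example.ui.graph.maxNodes` and the `GraphLimits` object are hypothetical, and a plain `Map` stands in for the real configuration object; only the `Int.MaxValue` fallback mirrors the excerpt above.

```scala
object GraphLimits {
  // Hypothetical key, used only for this sketch.
  val MaxNodesKey = "example.ui.graph.maxNodes"

  // Returns the configured limit, or Int.MaxValue (effectively no limit)
  // when the key is not set.
  def maxNodes(conf: Map[String, String]): Int =
    conf.get(MaxNodesKey).map(_.toInt).getOrElse(Int.MaxValue)
}
```

The returned value would then be passed as the second argument in place of the hard-coded `Int.MaxValue`.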

@@ -63,6 +64,10 @@ private class LiveJob(
  var activeTasks = 0
  var completedTasks = 0
  var failedTasks = 0
  val completedIndices = new OpenHashSet[Long]()
Collaborator

can you add a comment explaining this is stageId + taskIndex packed into one Long
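
A minimal sketch of what such packing might look like follows. The bit layout (stage ID in the high 32 bits, task index in the low 32 bits) is an assumption, since the excerpt does not show it; `TaskKey` and `TaskKeyDemo` are hypothetical names, and a standard `mutable.HashSet` stands in for Spark's `OpenHashSet`.

```scala
import scala.collection.mutable

object TaskKey {
  // Assumed layout: high 32 bits hold the stage ID, low 32 bits the task index.
  // The mask stops a negative index from sign-extending into the stage bits.
  def pack(stageId: Int, taskIndex: Int): Long =
    (stageId.toLong << 32) | (taskIndex.toLong & 0xFFFFFFFFL)

  def stageId(key: Long): Int = (key >>> 32).toInt
  def taskIndex(key: Long): Int = key.toInt // keeps only the low 32 bits
}

object TaskKeyDemo {
  // A plain mutable.HashSet stands in for OpenHashSet in this sketch.
  val completed = mutable.HashSet.empty[Long]
  completed += TaskKey.pack(3, 7)
  completed += TaskKey.pack(3, 7) // same (stage, index) pair: no new entry
  completed += TaskKey.pack(4, 7) // same index, different stage: new entry
}
```

Packing both values into one `Long` lets a single primitive set track completed task indices across all of a job's stages.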

@@ -209,6 +240,56 @@ private[spark] class AppStatusStore(store: KVStore) {
    indexed.skip(offset).max(length).asScala.map(_.info).toSeq
  }

  private def stageWithDetails(stage: v1.StageData): v1.StageData = {
    // TODO: limit tasks returned.
    val maxTasks = Int.MaxValue
Owner Author

Actually I should just remove this TODO. There's no way to change this without breaking the semantics of the current API endpoint.

What I did is add a new parameter to the API ("details") which controls whether the tasks are returned when you get the stage data.
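
The effect of such a flag can be sketched as follows. This is not the code from the change: `StageStoreSketch`, its constructor arguments, and the simplified `TaskData` and `StageData` case classes are hypothetical, and the real endpoint's default value for the parameter is not shown here.

```scala
// Simplified, hypothetical stand-ins for the API types.
final case class TaskData(taskId: Long, index: Int)
final case class StageData(
    stageId: Int,
    name: String,
    tasks: Option[Map[Long, TaskData]] = None)

class StageStoreSketch(
    stages: Map[Int, StageData],
    tasksByStage: Map[Int, Seq[TaskData]]) {

  // With details = false the stage is returned without its tasks, so the
  // caller does not pay for loading the full task list.
  def stageData(stageId: Int, details: Boolean): Option[StageData] =
    stages.get(stageId).map { stage =>
      if (details) {
        val tasks = tasksByStage.getOrElse(stageId, Seq.empty).map(t => t.taskId -> t).toMap
        stage.copy(tasks = Some(tasks))
      } else {
        stage
      }
    }
}
```

Because the caller decides whether tasks are included, the existing endpoint can keep its current semantics while pages that only need summary data skip the task list.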

{listener.schedulingMode.map(_.toString).getOrElse("Unknown")}
</li>
val completedJobs = _completedJobs.toSeq.reverse
val failedJobs = _failedJobs.toSeq.reverse
Collaborator

minor, but seems like you could create a Vector instead of a ListBuffer, and then just call reverseIterator.

Owner Author

Wouldn't that require changing all downstream calls to take an Iterator instead of a Seq?
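
The trade-off in this exchange can be shown with a small sketch. `ReverseSketch` and `render` are hypothetical; `render` stands in for any downstream method that currently accepts a `Seq`.

```scala
import scala.collection.mutable.ListBuffer

object ReverseSketch {
  // Hypothetical downstream method that, like the existing callers, expects a Seq.
  def render(jobs: Seq[String]): String = jobs.mkString(", ")

  // Current approach: append to a ListBuffer, then copy and reverse.
  val buffer = ListBuffer("job-1", "job-2", "job-3")
  val fromBuffer: String = render(buffer.toSeq.reverse)

  // Suggested approach: append to a Vector and walk it backwards lazily.
  val vector = Vector("job-1", "job-2", "job-3")
  val newestFirst: Iterator[String] = vector.reverseIterator

  // render(newestFirst) would not compile, because an Iterator is not a Seq.
  // Either the callee changes its parameter type or the caller materializes it.
  val fromVector: String = render(newestFirst.toSeq)
}
```

Materializing the iterator with `toSeq` keeps the downstream signatures unchanged but gives up most of the benefit of avoiding the reversed copy.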

@cloud-fan

hmmm, is it possible to split it into smaller PRs? It's really hard to review...

vanzin commented Oct 29, 2017

It is really hard to split this PR into smaller chunks. If I separate the API changes from the UI changes, which is really the only thing that could be done, I will have to write throw-away code to make things work until the second PR is pushed, which is something I'd like to avoid.

- Implement SPARK-20713 and SPARK-21922.