
Conversation

@JoshRosen
Contributor

The web UI's "task deserialization time" metric is slightly misleading because it does not capture the time taken to deserialize the broadcasted RDD.

@JoshRosen
Contributor Author

/cc @kayousterhout @rxin.

I noticed this in some benchmarking work that I'm doing (more details on the JIRA: https://issues.apache.org/jira/browse/SPARK-7058). Before this patch, we would almost always report 1 or 2 millisecond deserialization times. Now that this metric captures all of the deserialization costs, I'm seeing tasks that spend between 70 and 150ms in deserialization.

I have several ideas for optimizing this deserialization to reduce this time, and I'll address them in later patches.

@JoshRosen
Contributor Author

As written here, I guess that this double-counts some of the time spent in execution, so I probably need to move the setting of the task start time into Task. Let me make that change now.

Contributor

Why this change? Aren't these two things equivalent?

Contributor Author

Yep, it's not necessary; will roll back to minimize diff.

@kayousterhout
Contributor

Thanks for fixing this @JoshRosen! I've sometimes wondered whether it would be helpful to break out the broadcast time separately, to help folks with debugging. In any case, this is a fine intermediate solution, even if we do decide to break out broadcast time eventually.

@JoshRosen
Contributor Author

I've updated this patch to push the calculation of the task run time into the Task itself; this avoids double-counting of the deserialization time, which was breaking the calculation of scheduler delay.
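For illustration, here's a rough sketch (simplified, hypothetical names; not the actual UI code) of why the double-counting breaks scheduler delay: the UI derives scheduler delay by subtracting the reported metrics from the total task duration, so any time counted in both executor run time and deserialization time gets subtracted twice and can drive the result negative.

```scala
// Rough illustration only -- simplified field names, not Spark's actual UI code.
object SchedulerDelaySketch {
  // Scheduler delay is roughly whatever part of the task's wall-clock duration
  // is not accounted for by the reported metrics.
  def schedulerDelay(totalDurationMs: Long,
                     executorRunTimeMs: Long,
                     deserializeTimeMs: Long,
                     resultSerializeTimeMs: Long): Long =
    totalDurationMs - executorRunTimeMs - deserializeTimeMs - resultSerializeTimeMs

  def main(args: Array[String]): Unit = {
    // Suppose a 200 ms task spends 100 ms deserializing the broadcasted RDD.
    // If that 100 ms is reported both inside executorRunTime and inside
    // deserializeTime, it is subtracted twice:
    val doubleCounted = schedulerDelay(200, 190, 100, 5)  // => -95 (nonsensical)
    // With the deserialization time excluded from executorRunTime, the
    // remainder is a plausible scheduler delay:
    val corrected = schedulerDelay(200, 90, 100, 5)       // => 5
    println(s"double-counted: $doubleCounted ms, corrected: $corrected ms")
  }
}
```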

@kayousterhout
Contributor

It makes me a little nervous that there's now a time gap between deserializeEndTime and when taskStartTime gets calculated. This should be very small (there's just the intermediate call to updateEpoch) but sometimes things like this get bigger over time (as code changes etc.), and that will make the metrics very confusing. Can the task class expose executorDeserializeTime, and then Executor.scala can call that at the end to appropriately set all of the metrics? I also slightly prefer that approach because it consolidates the metric setting to be mostly in Executor.scala.
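As a rough sketch of that suggested shape (hypothetical, simplified classes; not the actual Task/Executor code): the task records how long its own internal deserialization took, and the executor reads that value after the task runs so that all of the metric-setting stays in one place.

```scala
// Hypothetical, simplified shapes -- not the real org.apache.spark classes.
abstract class TaskSketch[T] {
  // Milliseconds the task spent deserializing its RDD/closure inside runTask().
  @volatile protected var _executorDeserializeTime: Long = 0L
  def executorDeserializeTime: Long = _executorDeserializeTime
  def runTask(): T
}

// A stand-in task whose "deserialization" is simulated with a sleep.
class ExampleTask extends TaskSketch[Int] {
  override def runTask(): Int = {
    val deserializeStart = System.currentTimeMillis()
    Thread.sleep(50)  // pretend to deserialize the broadcasted RDD and closure
    _executorDeserializeTime = System.currentTimeMillis() - deserializeStart
    42                // the actual task body
  }
}

object ExecutorSketch {
  def main(args: Array[String]): Unit = {
    val taskObjectDeserializeStart = System.currentTimeMillis()
    val task: TaskSketch[Int] = new ExampleTask  // stands in for deserializing the Task object itself
    val taskObjectDeserializeTime = System.currentTimeMillis() - taskObjectDeserializeStart

    task.runTask()

    // The executor combines the two deserialization components after the task
    // finishes, rather than having the Task compute the overall run time itself.
    val totalDeserializeTime = taskObjectDeserializeTime + task.executorDeserializeTime
    println(s"executorDeserializeTime = $totalDeserializeTime ms")
  }
}
```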

@kayousterhout
Contributor

Also, is it prohibitively difficult to write a unit test for this? I suspect the answer is yes...

@JoshRosen
Contributor Author

I think the right way to unit test this would be to get the time via the Clock interface instead of calling System.currentTimeMillis() directly: create a static Clock instance somewhere, then rig objects to advance the time on that clock at various points (deserialization, task execution, etc.). This is totally doable, but it might involve a lot of changes to pass the Clock instance to the right places.
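For example, something along these lines (an illustrative sketch with made-up names, not necessarily Spark's Clock utilities): the code under test reads time from an injected clock, and the test advances a manual clock so the measured durations are deterministic.

```scala
// Illustrative only: hypothetical names, not necessarily Spark's util.Clock.
trait ClockSketch {
  def getTimeMillis(): Long
}

class SystemClockSketch extends ClockSketch {
  override def getTimeMillis(): Long = System.currentTimeMillis()
}

class ManualClockSketch(private var now: Long = 0L) extends ClockSketch {
  override def getTimeMillis(): Long = now
  def advance(ms: Long): Unit = { now += ms }
}

object ClockSketchDemo {
  // Code under test takes the clock as a parameter instead of calling
  // System.currentTimeMillis() directly.
  def timed[T](clock: ClockSketch)(body: => T): (T, Long) = {
    val start = clock.getTimeMillis()
    val value = body
    (value, clock.getTimeMillis() - start)
  }

  def main(args: Array[String]): Unit = {
    val clock = new ManualClockSketch()
    // The "deserialization" step advances the manual clock, so the measured
    // duration is exactly the amount we advanced it by.
    val (_, elapsed) = timed(clock) {
      clock.advance(42)
      "deserialized value"
    }
    assert(elapsed == 42)
    println(s"measured deserialization time: $elapsed ms")
  }
}
```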

@JoshRosen
Contributor Author

Exposing the time from Task seems like a better design; I've updated to incorporate this idea.

@SparkQA

SparkQA commented Apr 22, 2015

Test build #30771 has finished for PR 5635 at commit 1752f0e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
  • This patch does not change any dependencies.

@SparkQA

SparkQA commented Apr 22, 2015

Test build #30777 has finished for PR 5635 at commit 21f5b47.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
  • This patch does not change any dependencies.

@SparkQA

SparkQA commented Apr 22, 2015

Test build #30782 has finished for PR 5635 at commit 4f52910.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
  • This patch does not change any dependencies.

Contributor

Nit: It would be super helpful to have a comment here describing the two components of this, like: "Deserialization happens in two parts: first, we deserialize a Task object, which includes the Partition. Second, Task.run() deserializes the RDD and function to be run."

@kayousterhout
Contributor

One more nit: could you update the task deserialization time tooltip to explicitly say that it includes the time to read the broadcasted task?

Other than that and the other two small comments I had, LGTM!

@kayousterhout
Contributor

LGTM!

@SparkQA

SparkQA commented Apr 23, 2015

Test build #30866 has finished for PR 5635 at commit ed90f75.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
  • This patch does not change any dependencies.

@rxin
Contributor

rxin commented Apr 23, 2015

LGTM!

@asfgit closed this in 6afde2c on Apr 23, 2015
@JoshRosen deleted the SPARK-7058 branch on April 24, 2015 at 06:19
jeanlyn pushed a commit to jeanlyn/spark that referenced this pull request May 14, 2015
…n time" metric

The web UI's "task deserialization time" metric is slightly misleading because it does not capture the time taken to deserialize the broadcasted RDD.

Author: Josh Rosen <joshrosen@databricks.com>

Closes apache#5635 from JoshRosen/SPARK-7058 and squashes the following commits:

ed90f75 [Josh Rosen] Update UI tooltip
a3743b4 [Josh Rosen] Update comments.
4f52910 [Josh Rosen] Roll back whitespace change
e9cf9f4 [Josh Rosen] Remove unused variable
9f32e55 [Josh Rosen] Expose executorDeserializeTime on Task instead of pushing runtime calculation into Task.
21f5b47 [Josh Rosen] Don't double-count the broadcast deserialization time in task runtime
1752f0e [Josh Rosen] [SPARK-7058] Incorporate RDD deserialization time in task deserialization time metric
nemccarthy pushed a commit to nemccarthy/spark that referenced this pull request Jun 19, 2015
…n time" metric

The web UI's "task deserialization time" metric is slightly misleading because it does not capture the time taken to deserialize the broadcasted RDD.

Author: Josh Rosen <joshrosen@databricks.com>

Closes apache#5635 from JoshRosen/SPARK-7058 and squashes the following commits:

ed90f75 [Josh Rosen] Update UI tooltip
a3743b4 [Josh Rosen] Update comments.
4f52910 [Josh Rosen] Roll back whitespace change
e9cf9f4 [Josh Rosen] Remove unused variable
9f32e55 [Josh Rosen] Expose executorDeserializeTime on Task instead of pushing runtime calculation into Task.
21f5b47 [Josh Rosen] Don't double-count the broadcast deserialization time in task runtime
1752f0e [Josh Rosen] [SPARK-7058] Incorporate RDD deserialization time in task deserialization time metric