-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-7058] Include RDD deserialization time in "task deserialization time" metric #5635
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
/cc @kayousterhout @rxin. I noticed this in some benchmarking work that I'm doing (more details on the JIRA: https://issues.apache.org/jira/browse/SPARK-7058). Before this patch, we would almost always report 1 or 2 millisecond deserialization times. Now that this metric captures all of the deserialization costs, I'm seeing tasks that spend between 70 and 150ms in deserialization. I have several ideas of how to optimize this deserialization to reduce this time and I'll address them later patches. |
|
As written here, I guess that this double-counts some of the time spent in execution, so I probably need to move the setting of the task start time into Task. Let me make that change now. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why this change? Aren't these two things equivalent?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, it's not necessary; will roll back to minimize diff.
|
Thanks for fixing this @JoshRosen! I've sometimes wondered if it would be helpful to specifically break out the broadcast time to help folks with debugging? In any case, this is a fine intermediate solution, even if we do decide to break out broadcast time eventually. |
|
I've updated this patch to push the calculation of the task run time into the Task itself; this avoids double-counting of the deserialization time, which was breaking the calculation of scheduler delay. |
|
It makes me a little nervous that there's now a time gap between deserializeEndTime and when taskStartTime gets calculated. This should be very small (there's just the intermediate call to updateEpoch) but sometimes things like this get bigger over time (as code changes etc.), and that will make the metrics very confusing. Can the task class expose executorDeserializeTime, and then Executor.scala can call that at the end to appropriately set all of the metrics? I also slightly prefer that approach because it consolidates the metric setting to be mostly in Executor.scala. |
|
Also it is prohibitively difficult to write a unit test for this? I suspect the answer is yes... |
|
I think that the right way to unit test this would be to get the time via the |
|
Exposing the time from Task seems like a better design; I've updated to incorporate this idea. |
|
Test build #30771 has finished for PR 5635 at commit
|
|
Test build #30777 has finished for PR 5635 at commit
|
|
Test build #30782 has finished for PR 5635 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: It would be super helpful to have a comment here to say something to describe the two components of this, like "Deserialization happens in two parts: first, we deserialize a Task object, which includes the Partition. Second, Task.run() deserializes the RDD and function to be run."
|
One more nit: could you update the task deserialization time tooltip to explicitly say that it includes the time to read the broadcasted task? Other than that and the other two small comments I had, LGTM! |
|
LGTM! |
|
Test build #30866 has finished for PR 5635 at commit
|
|
LGTM! |
…n time" metric The web UI's "task deserialization time" metric is slightly misleading because it does not capture the time taken to deserialize the broadcasted RDD. Author: Josh Rosen <joshrosen@databricks.com> Closes apache#5635 from JoshRosen/SPARK-7058 and squashes the following commits: ed90f75 [Josh Rosen] Update UI tooltip a3743b4 [Josh Rosen] Update comments. 4f52910 [Josh Rosen] Roll back whitespace change e9cf9f4 [Josh Rosen] Remove unused variable 9f32e55 [Josh Rosen] Expose executorDeserializeTime on Task instead of pushing runtime calculation into Task. 21f5b47 [Josh Rosen] Don't double-count the broadcast deserialization time in task runtime 1752f0e [Josh Rosen] [SPARK-7058] Incorporate RDD deserialization time in task deserialization time metric
…n time" metric The web UI's "task deserialization time" metric is slightly misleading because it does not capture the time taken to deserialize the broadcasted RDD. Author: Josh Rosen <joshrosen@databricks.com> Closes apache#5635 from JoshRosen/SPARK-7058 and squashes the following commits: ed90f75 [Josh Rosen] Update UI tooltip a3743b4 [Josh Rosen] Update comments. 4f52910 [Josh Rosen] Roll back whitespace change e9cf9f4 [Josh Rosen] Remove unused variable 9f32e55 [Josh Rosen] Expose executorDeserializeTime on Task instead of pushing runtime calculation into Task. 21f5b47 [Josh Rosen] Don't double-count the broadcast deserialization time in task runtime 1752f0e [Josh Rosen] [SPARK-7058] Incorporate RDD deserialization time in task deserialization time metric
The web UI's "task deserialization time" metric is slightly misleading because it does not capture the time taken to deserialize the broadcasted RDD.