[SPARK-21690][ML] one-pass imputer #18902
Conversation
Test build #80477 has finished for PR 18902 at commit
Force-pushed from f6f166f to 61ccb5d
Test build #80478 has finished for PR 18902 at commit

Jenkins, retest this please

Test build #80479 has finished for PR 18902 at commit
Hi @zhengruifeng Thanks for the idea and implementation. Definitely something worth exploring. As I understand it, the new implementation improves locality, but it uses the RDD API instead of the Dataset API. Since this targets a performance improvement, I'd be interested to see a performance comparison.

@hhbyyh Yes, I will test the performance.
I tested the performance on a small dataset; the values in the following table are average durations in seconds:
We can see that, even on small data, the speedup is significant. The test code is here:
Style: use dot notation here, not infix.
Style: space between if and (
Force-pushed from 660c2db to 8283411
Test build #80653 has finished for PR 18902 at commit
Eh, I meant that it may be possible to get the mean values purely with the DataFrame API (converting missingValue/NaN to null) in one pass, so we may need to check the performance comparison. But I guess it looks a little hacky. For the median it may be harder, so we can use the RDD API (to be confirmed).
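The "convert to null" trick can be sketched in plain Python (illustrative only; Spark would do this with DataFrame operations, and the helper names here are made up):

```python
import math

def to_null(values, missing_value):
    # Hypothetical helper: map the configured missingValue sentinel and NaN
    # to None, the way the Spark trick maps them to SQL null.
    return [None if (isinstance(v, float) and math.isnan(v)) or v == missing_value
            else v
            for v in values]

def null_skipping_mean(values):
    # Spark SQL's avg() ignores nulls; emulate that by filtering out None.
    present = [v for v in values if v is not None]
    return sum(present) / len(present) if present else float("nan")

col = [1.0, float("nan"), 3.0, -1.0]  # suppose missingValue == -1.0
print(null_skipping_mean(to_null(col, -1.0)))  # 2.0
```

The point of the trick is that once missing entries are nulls, a single aggregation pass computes the surrogate without any per-column filtering pass.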
@hhbyyh Good idea! We can also use this trick to compute the median, because the method at spark/sql/core/src/main/scala/org/apache/spark/sql/execution/stat/StatFunctions.scala (line 65 in 0e80eca) already ignores null and NaN.
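For the median, the discussion below settles on Spark's approxQuantile, which skips nulls and takes a relative-error parameter (0.001 in this PR). A rough plain-Python illustration of a null-skipping median (exact here, unlike Spark's approximate sketch):

```python
def null_skipping_median(values):
    # Emulate a null-skipping median. Spark's approxQuantile is approximate
    # (bounded by the given relative error) and also ignores nulls;
    # here we compute the quantile exactly for illustration.
    present = sorted(v for v in values if v is not None)
    if not present:
        return float("nan")
    # Return an actual element of the data (lower middle), as quantile
    # sketches do, rather than interpolating between the two middle values.
    return present[(len(present) - 1) // 2]

print(null_skipping_median([7.0, None, 1.0, 3.0]))  # 3.0
```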
Test build #80660 has finished for PR 18902 at commit

Jenkins, retest this please

Test build #80663 has finished for PR 18902 at commit

Test build #80666 has finished for PR 18902 at commit

Test build #80667 has finished for PR 18902 at commit

Test build #80675 has finished for PR 18902 at commit
@hhbyyh I rewrote the implementation, and now all
Thanks for the quick update. The implementation may be improved in some details, but first I'd want to confirm the "convert to null" method does not have any defect. And we may need more unit tests to continuously monitor the SQL behavior (avg and stat) on null.
I tested on DataFrames containing
Add API annotation to clarify that the relative error of median is 0.001 if strategy == median.
BTW, add annotation here to clarify approxQuantile function will ignore null automatically.
@hhbyyh @zhengruifeng I'm ok with the convert-to-null method. I think there is no extra pass over the data if we handle it this way, and the DataFrame/RDD functions that compute the mean/median will ignore null. Thanks.
Add annotation here to clarify avg function will ignore null automatically.
BTW, add annotation here to clarify approxQuantile function will ignore null automatically.
Force-pushed from 5921f51 to 495d701
@zhengruifeng Could you verify & compare the performance of this new DF-based approach vs your original RDD-based one?

Test build #80780 has finished for PR 18902 at commit
@MLnick @yanboliang @hhbyyh I updated the performance comparison.
@zhengruifeng What does the RDD-based one mean? Is it the code on master, or the code in your former commit?
@yanboliang The RDD-based impl is the former commit.
@zhengruifeng It's a known issue that DataFrame-based operations are 2-3x slower than RDD-based ones, because of the deserialization cost. If we switch to an RDD-based method, we need to implement our own aggregator to calculate the mean and median, which needs much more code than calling the DataFrame API. BTW, DataFrames use a more compact structure that can reduce the memory footprint.
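Such a hand-rolled aggregator would look roughly like the following: a plain-Python stand-in for an RDD treeAggregate with a seqOp/combOp pair, accumulating (count, sum) per column in one pass (names and structure are illustrative, not the PR's actual code):

```python
from functools import reduce

def seq_op(acc, row):
    # Fold one row into per-column (count, sum) pairs, skipping missing
    # entries (None), as the seqOp of a treeAggregate would.
    return [(c + (v is not None), s + (v if v is not None else 0.0))
            for (c, s), v in zip(acc, row)]

def comb_op(a, b):
    # Merge two partial aggregates (the combOp of a treeAggregate).
    return [(c1 + c2, s1 + s2) for (c1, s1), (c2, s2) in zip(a, b)]

rows = [[1.0, None], [3.0, 4.0], [None, 8.0]]
zero = [(0, 0.0), (0, 0.0)]
counts_sums = reduce(seq_op, rows, zero)  # one pass over the data
print([s / c for c, s in counts_sums])  # [2.0, 6.0]
```

Mean is easy to fold this way; an exact median is not, which is one reason the thread leans toward the DataFrame API's approxQuantile despite the speed gap.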
@yanboliang Although disappointed by the DF's performance, I also approve of the choice of DF, just for less code.
+1 for using the DataFrame-based version. @zhengruifeng One thing I want to confirm: I checked your testing code, and both the RDD-based and DataFrame-based versions should pay the deserialization cost when running. If not, your testing has no problem, and I would guess there exists some other performance issue in the SQL layer; cc @cloud-fan to take a look. @zhengruifeng Could you paste both versions of the code (RDD and DataFrame) with their own testing code, so that I can check the perf issue more deeply?
@WeichenXu123 No, I only cache the DataFrame. And the RDD version is here.
Hmm... that's interesting. So there is a performance gap between DataFrame codegen aggregation and simple RDD aggregation. I will discuss this with the SQL team later. Thanks!
Seems fine to me to use the DF version even though it's slower. But we should open a JIRA issue to track where the gap is on the SQL side of things and try to improve the performance.
Sure. I will create a JIRA after this perf gap is confirmed.
Any more comments on this PR? It has been about one month since the last modification.
WeichenXu123
left a comment
LGTM. The performance issue can be put into a separate JIRA. Thanks! Pinging @yanboliang for a final review.
yanboliang
left a comment
The current DF-based solution is 5-10x faster than the original implementation and can reduce the memory footprint, so this looks good to me.
Regarding the performance gap between DF and RDD, I found the gap is larger when there are more columns, so let's dig into the DF code generation for multiple columns to check whether it makes a difference. Let's track the SQL issue in a separate JIRA.
Merged into master. Thanks, all.
@MLnick Thanks for pinging me. I went through this quickly. The basic idea is the same: performing the operations on multiple input columns in one single Dataset/DataFrame operation. Unlike Actually I'm noticed by
What changes were proposed in this pull request?
parallelize the computation of all columns
performance tests:
How was this patch tested?
existing tests
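The change summarized above, aggregating every input column in a single pass instead of one pass per column, can be sketched in plain Python (illustrative; the real code issues a single DataFrame select over all columns):

```python
def means_per_column_passes(rows, n_cols):
    # Old style: one full scan of the data per column.
    out = []
    for j in range(n_cols):
        vals = [r[j] for r in rows if r[j] is not None]
        out.append(sum(vals) / len(vals))
    return out

def means_one_pass(rows, n_cols):
    # New style: a single scan accumulating (sum, count) for every column.
    sums, counts = [0.0] * n_cols, [0] * n_cols
    for r in rows:
        for j, v in enumerate(r):
            if v is not None:
                sums[j] += v
                counts[j] += 1
    return [s / c for s, c in zip(sums, counts)]

rows = [[1.0, None, 5.0], [3.0, 4.0, None], [None, 8.0, 7.0]]
assert means_per_column_passes(rows, 3) == means_one_pass(rows, 3) == [2.0, 6.0, 6.0]
```

Both produce the same surrogates; the one-pass version simply reads the data once regardless of the number of input columns, which is where the reported speedup comes from.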