[SPARK-22001][ML][SQL] ImputerModel can do withColumn for all input columns at one pass #19229
Conversation
Ran a benchmark similar to #18902 (comment). The test code is basically the same, but it now measures transform time: …

FYI, the …

In the test code, should we use …

@zhengruifeng Yeah, it is better. Actually I think the difference between running multiple …

Test build #81755 has finished for PR 19229 at commit …

Test build #81756 has finished for PR 19229 at commit …
@viirya In your benchmark, for the case numCols == 100 the performance increases 15x; I suspect there is some mistake in the benchmark. It is very possible that the test code reuses the shuffle result (when recomputing in each loop), which makes the code path skip the stage that scans the input data. So I hope you can regenerate the test data in every test loop and then call model.fit. Can you update the test code and retest? Such a performance improvement seems almost impossible to me. I will test this myself later if I have time. Thanks.
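To make the request concrete, here is a minimal sketch of such a rerun; `genData`, the sizes, and the output-column naming are hypothetical stand-ins for the thread's actual benchmark code, and `spark` is assumed from a spark-shell session:

```scala
import org.apache.spark.ml.feature.Imputer

// Hypothetical rerun: regenerate the input data inside every loop so no
// shuffle or cached result can be reused across iterations.
(1 to 3).foreach { _ =>
  val df = genData(spark, numRows = 100000, numCols = 100) // fresh data each loop
  val imputer = new Imputer()
    .setInputCols(df.columns)
    .setOutputCols(df.columns.map(_ + "_out"))
  val model = imputer.fit(df)   // fit on the freshly generated data
  val start = System.nanoTime()
  model.transform(df).count()   // count() forces a full pass over the data
  println(s"transform: ${(System.nanoTime() - start) / 1e6} ms")
}
```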
@WeichenXu123 Sure. I should point out that I ran this benchmark in spark-shell under local mode. It would be great if you could run the benchmark too and verify the numbers.

@WeichenXu123 Btw, the test basically reuses the code from #18902 (comment). Is your concern specific to that?

@viirya I guess the reason is, the old PR version: …

Updated test code: …

Great! That's it. Thanks!

New numbers: …
I don't think reusing the shuffle is the reason behind the numbers. If you look at the previous comments, you will find that I ran the test before without …

Btw, I don't see any reason about …

Looks like that is not the reason; maybe the issue is somewhere else. Let me run the test later, thanks! Also, add a cache at the end of genData, and we'd better add warm-up code before recording running time.
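A sketch of those two suggestions; this `genData` is a hypothetical stand-in for the thread's test code, with `spark` again assumed from a spark-shell session:

```scala
import org.apache.spark.ml.feature.Imputer
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.rand

// Hypothetical generator: cache and materialize the data up front so the
// timed section measures only the transform, not data generation.
def genData(spark: SparkSession, numRows: Int, numCols: Int): DataFrame = {
  val df = spark.range(numRows).select(
    (0 until numCols).map(i => rand(seed = i).as(s"c$i")): _*)
  df.cache()
  df.count() // materialize the cache before any timing starts
  df
}

// Warm-up: exercise the fit/transform path once before recording timings,
// so JIT compilation and query codegen don't distort the first measurement.
val warmup = genData(spark, numRows = 1000, numCols = 10)
new Imputer()
  .setInputCols(warmup.columns)
  .setOutputCols(warmup.columns.map(_ + "_out"))
  .fit(warmup)
  .transform(warmup)
  .count()
```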
@WeichenXu123 Thanks for verifying that. Do you mean using ApproxQuantiles to compute the mean and median? But I think this change is not intended to improve that part, but the …

@viirya No, keep the DataFrame-version code. I only want to confirm how big the performance gap is between this and the RDD version (for possible improvements in the future, because in a similar test I found the DataFrame version is still slower than the RDD version).

@WeichenXu123 I'm not sure I understand correctly. This change only replaces the chain of …

Oh, that's what was done in the old PR #18902. (The RDD version (not in the master branch, only a personal impl; sorry for putting the wrong link, the code link is here: 8daffc9) will be faster than the DataFrame version on current Spark.) Now that your PR improves the perf, I would like to compare them again. We hope to track this performance gap and try to resolve it in the future. Based on my other similar case, the DataFrame version will currently be about 2-3x slower than the RDD version when numCols == 100. But if you have no time, I can help do it. Thanks!
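For readers without access to that personal impl, here is a rough, hypothetical idea of an RDD-style transform (this is not the code at 8daffc9); it assumes a DataFrame `df` of double columns and a `surrogates` array of fitted per-column means/medians:

```scala
import org.apache.spark.sql.Row

// Rough sketch: replace NaN/null values for every column in a single map
// over the rows, with no per-column projections.
val transformedRDD = df.rdd.map { row =>
  Row.fromSeq(row.toSeq.zipWithIndex.map {
    case (v: Double, i) if v.isNaN => surrogates(i)
    case (null, i)                 => surrogates(i)
    case (v, _)                    => v
  })
}
val transformedDF = spark.createDataFrame(transformedRDD, df.schema)
```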
I ran the test code to benchmark the RDD version against the DataFrame version with this …

@viirya Thanks very much! Although the perf gap exists (when numCols is large), it won't block this PR. I will create a JIRA to track it.

Test build #81964 has finished for PR 19229 at commit …

ping @zhengruifeng @WeichenXu123 Any more comments on this? Thanks.
```scala
/**
 * Returns a new Dataset by adding columns or replacing the existing columns that have
 * the same names.
 */
```
@cloud-fan should have looked at this withColumns before in #17819. cc @cloud-fan to see if you have more comments.
ping @cloud-fan or @gatorsmile Can you check the SQL part? Thanks.
Can withColumn be reimplemented by calling this function?
It can be. But even if we want to do that, I'd prefer a follow-up instead of doing it in this PR.
I still think we should do it to avoid duplicate code.
Ok. Then I will do it in this PR.
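A simplified sketch of that deduplication, with the single-column method delegating to the multi-column one (the real change lives in Dataset.scala, so this is only the shape of it):

```scala
// Sketch: withColumn becomes a thin wrapper, so the resolve-and-replace
// logic only exists in withColumns.
def withColumn(colName: String, col: Column): DataFrame =
  withColumns(Seq(colName), Seq(col))
```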
I am not familiar with the SQL source, but I think it's great to transform all columns at once.

For the performance gap issue (compared with the RDD version), I created a separate JIRA to track it: …

Yeah, I think that fix should work for the strategy …. For the strategy …

@viirya Yeah, for the perf gap I only focus on …
@WeichenXu123 Do you have any more comments on this? Thanks. I think the ML part is straightforward.
WeichenXu123 left a comment
The ML part looks good to me, except a minor style issue. Thanks!
```scala
val ic = col(inputCol)
outputDF = outputDF.withColumn(outputCol,
  when(ic.isNull, surrogate)
```
style: indent
This when is not a call on the previous line. I think it doesn't need to be indented?
Oh I misread. The style is ok.
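For context, a simplified sketch of the one-pass transform this PR implements; `inputCols`, `outputCols`, `surrogates` (the fitted means/medians), `missingValue`, and `dataset` are stand-ins for the model's actual fields and parameters:

```scala
import org.apache.spark.sql.functions.{col, when}

// Build one replacement expression per input column, then add them all in
// a single withColumns call (internal API) instead of chaining withColumn.
val newCols = inputCols.zip(surrogates).map { case (inputCol, surrogate) =>
  val ic = col(inputCol)
  when(ic.isNull, surrogate)
    .when(ic === missingValue, surrogate)
    .otherwise(ic)
}
val result = dataset.toDF().withColumns(outputCols, newCols)
```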
```scala
private[spark] def withColumns(colNames: Seq[String], cols: Seq[Column]): DataFrame = {
  require(colNames.size == cols.size,
    s"The size of column names: ${colNames.size} isn't equal to " +
      s"the size of columns: ${cols.size}")
```
We also need to consider the case sensitivity issue.
Good point.
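A hypothetical sketch of what honoring spark.sql.caseSensitive could look like inside the new withColumns, where sessionState is accessible (the exact check is in the PR diff):

```scala
// Normalize the target names before the duplicate check unless the
// session runs with case-sensitive analysis (spark.sql.caseSensitive=true).
val caseSensitive = sparkSession.sessionState.conf.caseSensitiveAnalysis
val normalized = if (caseSensitive) colNames else colNames.map(_.toLowerCase)
require(normalized.distinct.size == colNames.size,
  "Found duplicate column(s) in given column names: " +
    normalized.diff(normalized.distinct).distinct.mkString(", "))
```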
Test build #82141 has finished for PR 19229 at commit …

@gatorsmile Added the check for case sensitivity. Please take a look again. Thanks.

ping @gatorsmile for the SQL part.

ping @gatorsmile Can you take a quick look? Thanks.

also cc @jkbradley and @MLnick for a final check of the ML change. Thanks.

Test build #82344 has finished for PR 19229 at commit …

retest this please.

Test build #82347 has finished for PR 19229 at commit …

@gatorsmile …
```scala
  col.as(colName)
} else {
  Column(field)
}
```

Suggested rewrite as a pattern match:

```scala
columnMap.find { case (colName, _) =>
  resolver(field.name, colName)
} match {
  case Some((colName: String, col: Column)) => col.as(colName)
  case _ => Column(field)
}
```

```scala
/**
 * Returns a new Dataset by adding columns with metadata.
 */
private[spark] def withColumns(
```
This is not used or tested. Could we remove it?
Ok. We can add this when we need it.
Test build #82369 has finished for PR 19229 at commit …

LGTM

Thanks! Merged to master.
What changes were proposed in this pull request?

SPARK-21690 makes `Imputer` one-pass by parallelizing the computation over all input columns. When we transform a dataset with `ImputerModel`, we do `withColumn` on all input columns sequentially. We can instead do this for all input columns at once by adding a `withColumns` API to `Dataset`. The new `withColumns` API is for internal use only for now.

How was this patch tested?

Existing tests for the `ImputerModel` change. Added tests for the `withColumns` API.
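For illustration, a before/after sketch of the `ImputerModel.transform` change; `imputeExpr` is a hypothetical helper that builds the null/missing-value replacement expression for one column, and `inputCols`/`outputCols`/`dataset` stand in for the model's actual fields:

```scala
// Before: one withColumn per input column, each call adding a projection
// on top of the previous plan.
var outputDF = dataset.toDF()
inputCols.zip(outputCols).foreach { case (in, out) =>
  outputDF = outputDF.withColumn(out, imputeExpr(in))
}

// After: a single internal withColumns call adds every output column
// in one pass over the plan.
val result = dataset.toDF().withColumns(outputCols, inputCols.map(imputeExpr))
```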