[SPARK-30660][ML][PYSPARK] LinearRegression blockify input vectors #27396
testCode:
import org.apache.spark.ml.regression._
import org.apache.spark.storage.StorageLevel
val df = spark.read.format("libsvm").load("/data1/Datasets/a9a/a9a")
df.persist(StorageLevel.MEMORY_AND_DISK)
df.count
new LinearRegression().setMaxIter(10).fit(df)
val lr1 = new LinearRegression().setSolver("l-bfgs").setLoss("squaredError").setMaxIter(100)
val start = System.currentTimeMillis; val model1 = lr1.fit(df); val end = System.currentTimeMillis; end - start
val lr2 = new LinearRegression().setSolver("l-bfgs").setLoss("huber").setMaxIter(100)
val start = System.currentTimeMillis; val model2 = lr2.fit(df); val end = System.currentTimeMillis; end - start
Seq(model1, model2).map(_.summary.totalIterations)
Seq(model1, model2).map(_.summary.objectiveHistory.last)
Result:
Master:
Test build #117550 has finished for PR 27396 at commit
retest this please
Test build #117558 has finished for PR 27396 at commit
cc @srowen
srowen
left a comment
Generally looks OK pending tests, as it's consistent with other similar changes. I assume this doesn't change behavior or API? It doesn't look like it, but just checking. Likewise, perf is probably still fine on small data? Also CC @huaxingao
I know code freeze is coming tomorrow. I think we can get this in if no issues turn up.
@srowen The above test is also based on |
Merged to master
### What changes were proposed in this pull request?
Revert #27360 #27396 #27374 #27389

### Why are the changes needed?
BLAS needs more performance tests, especially on sparse datasets. A performance test of LogisticRegression (#27374) on a sparse dataset shows that blockifying vectors into matrices and using BLAS causes a performance regression. LinearSVC and LinearRegression were updated in the same way as LogisticRegression, so we need to revert them as well to make sure there is no regression.

### Does this PR introduce any user-facing change?
Removes the newly added param blockSize

### How was this patch tested?
reverted testsuites

Closes #27487 from zhengruifeng/revert_blockify_ii.

Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
What changes were proposed in this pull request?
1, use blocks instead of vectors for performance improvement
2, use Level-2 BLAS
3, move standardization of input vectors outside of gradient computation
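To make items 1 and 2 concrete, here is a minimal sketch in plain Scala of what "blockify, then use a matrix-vector product" can look like for the squared-error gradient. It is not the code from this PR: `BlockifySketch`, `Block`, `blockify`, `gemv`, `gemvT`, `blockGradient` and `rowGradient` are hypothetical names, the loops stand in for the native Level-2 BLAS calls, and it assumes dense rows with no intercept, regularization, or standardization.

```scala
// Hypothetical sketch: plain Scala, no Spark or native BLAS required.
object BlockifySketch {
  // A dense block of stacked rows (row-major) plus the labels of those rows.
  final case class Block(numRows: Int, numCols: Int,
                         values: Array[Double], labels: Array[Double])

  // Stack (label, features) rows into blocks of at most blockSize rows.
  def blockify(rows: Seq[(Double, Array[Double])], blockSize: Int): Seq[Block] = {
    require(blockSize > 0, "blockSize must be positive")
    rows.grouped(blockSize).map { group =>
      val numCols = group.head._2.length
      Block(group.length, numCols,
        group.flatMap(_._2.toSeq).toArray, group.map(_._1).toArray)
    }.toSeq
  }

  // y = A * x: the role a gemv call plays in a Level-2 BLAS implementation.
  def gemv(b: Block, x: Array[Double]): Array[Double] =
    Array.tabulate(b.numRows) { i =>
      var sum = 0.0
      var j = 0
      while (j < b.numCols) { sum += b.values(i * b.numCols + j) * x(j); j += 1 }
      sum
    }

  // y = A^T * x: the transposed matrix-vector product.
  def gemvT(b: Block, x: Array[Double]): Array[Double] = {
    val out = new Array[Double](b.numCols)
    var i = 0
    while (i < b.numRows) {
      var j = 0
      while (j < b.numCols) { out(j) += b.values(i * b.numCols + j) * x(i); j += 1 }
      i += 1
    }
    out
  }

  // Gradient of 0.5 * sum_i (w . x_i - y_i)^2, computed per block as A^T (A w - y).
  def blockGradient(blocks: Seq[Block], w: Array[Double]): Array[Double] = {
    val grad = new Array[Double](w.length)
    blocks.foreach { b =>
      val residual = gemv(b, w).zip(b.labels).map { case (p, y) => p - y }
      val g = gemvT(b, residual)
      var j = 0
      while (j < grad.length) { grad(j) += g(j); j += 1 }
    }
    grad
  }

  // Row-at-a-time reference: the same gradient, one vector per step.
  def rowGradient(rows: Seq[(Double, Array[Double])], w: Array[Double]): Array[Double] = {
    val grad = new Array[Double](w.length)
    rows.foreach { case (y, x) =>
      val err = x.zip(w).map { case (a, c) => a * c }.sum - y
      var j = 0
      while (j < grad.length) { grad(j) += err * x(j); j += 1 }
    }
    grad
  }
}
```

Both paths compute the same gradient; the blocked form replaces many per-row dot products with two matrix-vector products per block, which is where an optimized BLAS can help on dense data. As the later revert notes, this may not carry over to sparse datasets.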
Why are the changes needed?
1, less RAM to persist training data; (save ~40%)
2, faster than existing impl; (30% ~ 102%)
Does this PR introduce any user-facing change?
add a new expert param blockSize
How was this patch tested?
updated testsuites