Skip to content

Conversation

@zhengruifeng
Copy link
Contributor

@zhengruifeng zhengruifeng commented Jan 30, 2020

What changes were proposed in this pull request?

1, use blocks instead of vectors for performance improvement
2, use Level-2 BLAS
3, move standardization of input vectors outside of gradient computation

Why are the changes needed?

1, less RAM to persist training data; (save ~40%)
2, faster than existing impl; (30% ~ 102%)

Does this PR introduce any user-facing change?

add a new expert param blockSize

How was this patch tested?

updated testsuites

@zhengruifeng
Copy link
Contributor Author

testCode:

import org.apache.spark.ml.regression._
import org.apache.spark.storage.StorageLevel


val df = spark.read.format("libsvm").load("/data1/Datasets/a9a/a9a")
df.persist(StorageLevel.MEMORY_AND_DISK)
df.count

new LinearRegression().setMaxIter(10).fit(df)

val lr1 = new LinearRegression().setSolver("l-bfgs").setLoss("squaredError").setMaxIter(100)
val start = System.currentTimeMillis; val model1 = lr1.fit(df); val end = System.currentTimeMillis; end - start


val lr2 = new LinearRegression().setSolver("l-bfgs").setLoss("huber").setMaxIter(100)
val start = System.currentTimeMillis; val model2 = lr2.fit(df); val end = System.currentTimeMillis; end - start

Seq(model1, model2).map(_.summary.totalIterations)
Seq(model1, model2).map(_.summary.objectiveHistory.last)

Result:
This PR:
Duration: 1598, 2473
last objective: 0.30658886019384274, 0.5991272847846535
numIteration: 40, 101

Master:
Duration: 2060, 5015
last objective: 0.306588860193839, 0.5985319305006285
numIteration: 40, 101

@SparkQA
Copy link

SparkQA commented Jan 30, 2020

Test build #117550 has finished for PR 27396 at commit 2da0380.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zhengruifeng
Copy link
Contributor Author

retest this please

@SparkQA
Copy link

SparkQA commented Jan 30, 2020

Test build #117558 has finished for PR 27396 at commit 2da0380.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zhengruifeng
Copy link
Contributor Author

cc @srowen

Copy link
Member

@srowen srowen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally looks OK pending tests, as it's consistent with other similar changes. I assume this doesn't change behavior or API - doesn't look so but just checking? and likewise perf is probably still fine on small data? Also CC @huaxingao

I know code freeze is coming tomorrow. I think we can get this in if there don't turn out to be any issues.

@zhengruifeng
Copy link
Contributor Author

@srowen The above test is also based on a9a, a small dataset with numFeatures=123, numInstances=32,561

@srowen srowen closed this in d0c3e9f Feb 1, 2020
@srowen
Copy link
Member

srowen commented Feb 1, 2020

Merged to master

@zhengruifeng zhengruifeng deleted the blockify_lireg branch February 1, 2020 03:56
@zhengruifeng zhengruifeng restored the blockify_lireg branch February 6, 2020 08:03
zhengruifeng added a commit that referenced this pull request Feb 8, 2020
### What changes were proposed in this pull request?
Revert
#27360
#27396
#27374
#27389

### Why are the changes needed?
BLAS need more performace tests, specially on sparse datasets.
Perfermance test of LogisticRegression (#27374) on sparse dataset shows that blockify vectors to matrices and use BLAS will cause performance regression.
LinearSVC and LinearRegression were also updated in the same way as LogisticRegression, so we need to revert them to make sure no regression.

### Does this PR introduce any user-facing change?
remove newly added param blockSize

### How was this patch tested?
reverted testsuites

Closes #27487 from zhengruifeng/revert_blockify_ii.

Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
zhengruifeng added a commit that referenced this pull request Feb 25, 2020
### What changes were proposed in this pull request?
Revert
#27360
#27396
#27374
#27389

### Why are the changes needed?
BLAS need more performace tests, specially on sparse datasets.
Perfermance test of LogisticRegression (#27374) on sparse dataset shows that blockify vectors to matrices and use BLAS will cause performance regression.
LinearSVC and LinearRegression were also updated in the same way as LogisticRegression, so we need to revert them to make sure no regression.

### Does this PR introduce any user-facing change?
remove newly added param blockSize

### How was this patch tested?
reverted testsuites

Closes #27487 from zhengruifeng/revert_blockify_ii.

Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
sjincho pushed a commit to sjincho/spark that referenced this pull request Apr 15, 2020
### What changes were proposed in this pull request?
Revert
apache#27360
apache#27396
apache#27374
apache#27389

### Why are the changes needed?
BLAS need more performace tests, specially on sparse datasets.
Perfermance test of LogisticRegression (apache#27374) on sparse dataset shows that blockify vectors to matrices and use BLAS will cause performance regression.
LinearSVC and LinearRegression were also updated in the same way as LogisticRegression, so we need to revert them to make sure no regression.

### Does this PR introduce any user-facing change?
remove newly added param blockSize

### How was this patch tested?
reverted testsuites

Closes apache#27487 from zhengruifeng/revert_blockify_ii.

Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
@zhengruifeng zhengruifeng deleted the blockify_lireg branch February 25, 2021 10:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants