Conversation

@zhengruifeng
Contributor

@zhengruifeng zhengruifeng commented May 7, 2020

What changes were proposed in this pull request?

1. Add a new param blockSize;
2. Add a new class InstanceBlock;
3. If blockSize == 1, keep the original behavior; if blockSize > 1, stack input vectors into blocks (like ALS/MLP), as sketched below;
4. If blockSize > 1, standardize the input outside of the optimization procedure.
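
For illustration, here is a minimal sketch of the stacking idea, assuming a hypothetical stackToMatrix helper (the actual InstanceBlock in this PR also carries labels and weights alongside the matrix):

import org.apache.spark.ml.linalg.{DenseMatrix, Vector}

// Hypothetical helper: stack a group of feature vectors into one
// row-major dense matrix so a whole block can be processed at once.
def stackToMatrix(vectors: Seq[Vector]): DenseMatrix = {
  val numRows = vectors.length
  val numCols = vectors.head.size
  val values = Array.ofDim[Double](numRows * numCols)
  vectors.zipWithIndex.foreach { case (vec, i) =>
    vec.foreachActive { (j, v) => values(i * numCols + j) = v }
  }
  // isTransposed = true interprets the values array as row-major
  new DenseMatrix(numRows, numCols, values, true)
}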

Why are the changes needed?

It yields performance gains on dense datasets, such as epsilon:
1. Reduces the RAM needed to persist the training dataset (saves about 40% RAM);
2. Uses Level-2 BLAS routines (up to 6X (squaredError) ~ 12X (huber) speedup), as sketched below.
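
As a quick sketch of what the Level-2 routine buys (using the public Matrix.multiply here, rather than MLlib's internal BLAS calls): with a stacked block, all margins in the block come from a single matrix-vector multiply instead of one dot product per row.

import org.apache.spark.ml.linalg.{DenseMatrix, DenseVector}

// One stacked 3x2 block (column-major values) and a coefficient vector.
val block = new DenseMatrix(3, 2, Array(1.0, 2.0, 3.0, 4.0, 5.0, 6.0))
val coefficients = new DenseVector(Array(0.5, -0.5))

// blockSize == 1 would do three separate dot products; the block path
// does one matrix-vector multiply (a gemv) for the whole block.
val margins: DenseVector = block.multiply(coefficients)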

Does this PR introduce any user-facing change?

Yes, a new param blockSize is added.

How was this patch tested?

Existing and added test suites.

@zhengruifeng
Contributor Author

This is an update of #27396; this PR avoids the performance regression on sparse datasets by defaulting to blockSize=1.

@SparkQA

SparkQA commented May 7, 2020

Test build #122386 has finished for PR 28471 at commit 643470c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zhengruifeng
Contributor Author

Test code:


import org.apache.spark.ml.regression._
import org.apache.spark.storage.StorageLevel


val df = spark.read.option("numFeatures", "2000").format("libsvm").load("/data1/Datasets/epsilon/epsilon_normalized.t")
df.persist(StorageLevel.MEMORY_AND_DISK)
df.count

// Default SquaredError
val lir = new LinearRegression().setMaxIter(50).setSolver("l-bfgs")
lir.fit(df)

val results = Seq(1, 4, 16, 64, 256, 1024, 4096).map { size =>
  val start = System.currentTimeMillis
  val model = lir.setBlockSize(size).fit(df)
  val end = System.currentTimeMillis
  (size, model, end - start)
}


// Huber
val lir = new LinearRegression().setMaxIter(50).setSolver("l-bfgs").setLoss("huber")

val results = Seq(1, 4, 16, 64, 256, 1024, 4096).map { size =>
  val start = System.currentTimeMillis
  val model = lir.setBlockSize(size).fit(df)
  val end = System.currentTimeMillis
  (size, model, end - start)
}
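
To turn the raw timings in results into the speedup figures quoted below, one can post-process like this (illustrative only, not part of the original test code):

// Speedup of each blockSize relative to the blockSize == 1 baseline.
val baseline = results.head._3.toDouble
results.foreach { case (size, _, millis) =>
  println(f"blockSize=$size%5d  time=${millis}ms  speedup=${baseline / millis}%.1fx")
}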

Result:
Huber

scala> results.foreach(t => println(t._2.coefficients.toString.take(100)))
[0.6424729506929868,-1.2292544634895406,1.7757284244387601,0.2974549255492886,0.935593736145218,-0.7
[0.6425694454021473,-1.22927248948364,1.7757268011427252,0.2974675136619897,0.9357357899375097,-0.77
[0.6424853073405433,-1.2292568932224641,1.775728104037945,0.29745659508182537,0.9356121907344618,-0.
[0.6429959531002037,-1.229354196708174,1.7757175326389232,0.29752415047198205,0.9363681130113419,-0.
[0.6424842544336862,-1.2292566837366543,1.775728142782258,0.29745645263402737,0.9356106116206417,-0.
[0.6430623639968113,-1.2293667308116527,1.7757163246117786,0.29753287650971944,0.9364661504709426,-0
[0.6421067597074761,-1.229184819802838,1.7757358482626238,0.2974067609553463,0.9350521980706877,-0.7

scala> results.map(_._2.summary.totalIterations)
res22: Seq[Int] = List(51, 51, 51, 51, 51, 51, 51)

scala> results.map(_._3)
res23: Seq[Long] = List(135189, 12046, 11783, 14307, 14399, 14026, 14329)

Up to 11X speedup

SquaredError


scala> results.foreach(t => println(t._2.coefficients.toString.take(100)))
[-0.2652587613623121,-0.0707048016667831,0.420750805149307,0.09194452205365045,0.05059855709172461,0
[-0.26525865677193483,-0.07070480422610806,0.42075057109293873,0.09194450660409814,0.050598582385999
[-0.26526635878841553,-0.07070459501522909,0.42076764197826116,0.09194564264821924,0.050596757952969
[-0.26525612879310023,-0.07070486362689087,0.42074489424648864,0.09194413298339,0.050599198284442995
[-0.2652611612036013,-0.07070474076806575,0.42075615827182555,0.09194487637000792,0.0505979807182646
[-0.26526190169346425,-0.07070472114668785,0.42075780340712315,0.0919449856303737,0.0505978044045689
[-0.265262317030222,-0.07070470997227678,0.42075872481615506,0.09194504690071631,0.05059770582212817

scala> results.map(_._2.summary.totalIterations)
res26: Seq[Int] = List(51, 51, 51, 51, 51, 51, 51)

scala> results.map(_._3)
res27: Seq[Long] = List(71269, 11828, 12254, 15331, 14963, 14420, 14022)

Up to 6X speedup

@zhengruifeng zhengruifeng deleted the blockify_lir_II branch May 8, 2020 02:54
@zhengruifeng
Contributor Author

Merged to master.

@HyukjinKwon
Member

Let's not merge without review or approval, @zhengruifeng.
