Conversation

@zhengruifeng
Contributor

@zhengruifeng zhengruifeng commented May 7, 2020

What changes were proposed in this pull request?

1. Add a new param blockSize;
2. Add a new class InstanceBlock;
3. If blockSize == 1, keep the original behavior; if blockSize > 1, stack input vectors into blocks (like ALS/MLP), as sketched below;
4. If blockSize > 1, standardize the input outside of the optimization procedure.
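
For illustration, here is a minimal sketch of the stacking idea, assuming a hypothetical stackToMatrix helper (the actual InstanceBlock in this PR also carries labels and weights alongside the matrix):

import org.apache.spark.ml.linalg.{DenseMatrix, Vector}

// Hypothetical helper: stack a group of feature vectors into one
// row-major dense matrix so a whole block can be processed at once.
def stackToMatrix(vectors: Seq[Vector]): DenseMatrix = {
  val numRows = vectors.length
  val numCols = vectors.head.size
  val values = Array.ofDim[Double](numRows * numCols)
  vectors.zipWithIndex.foreach { case (vec, i) =>
    vec.foreachActive { (j, v) => values(i * numCols + j) = v }
  }
  // isTransposed = true interprets the values array as row-major
  new DenseMatrix(numRows, numCols, values, true)
}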

Why are the changes needed?

It yields performance gains on dense datasets, such as epsilon:
1. Reduces the RAM needed to persist the training dataset (saves about 40% RAM);
2. Uses Level-2 BLAS routines (up to 6X (squaredError) ~ 12X (huber) speedup), as sketched below.
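
As a quick sketch of what the Level-2 routine buys (using the public Matrix.multiply here, rather than MLlib's internal BLAS calls): with a stacked block, all margins in the block come from a single matrix-vector multiply instead of one dot product per row.

import org.apache.spark.ml.linalg.{DenseMatrix, DenseVector}

// One stacked 3x2 block (column-major values) and a coefficient vector.
val block = new DenseMatrix(3, 2, Array(1.0, 2.0, 3.0, 4.0, 5.0, 6.0))
val coefficients = new DenseVector(Array(0.5, -0.5))

// blockSize == 1 would do three separate dot products; the block path
// does one matrix-vector multiply (a gemv) for the whole block.
val margins: DenseVector = block.multiply(coefficients)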

Does this PR introduce any user-facing change?

Yes, a new param blockSize is added.

How was this patch tested?

Existing and added test suites.

@zhengruifeng
Contributor Author

This is an update of #27396; this PR avoids the performance regression on sparse datasets by defaulting to blockSize=1.

@SparkQA

SparkQA commented May 7, 2020

Test build #122386 has finished for PR 28471 at commit 643470c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zhengruifeng
Contributor Author

Test code:


import org.apache.spark.ml.regression._
import org.apache.spark.storage.StorageLevel


val df = spark.read.option("numFeatures", "2000").format("libsvm").load("/data1/Datasets/epsilon/epsilon_normalized.t")
df.persist(StorageLevel.MEMORY_AND_DISK)
df.count

// Default SquaredError
val lir = new LinearRegression().setMaxIter(50).setSolver("l-bfgs")
lir.fit(df)

val results = Seq(1, 4, 16, 64, 256, 1024, 4096).map { size =>
  val start = System.currentTimeMillis
  val model = lir.setBlockSize(size).fit(df)
  val end = System.currentTimeMillis
  (size, model, end - start)
}


// Huber
val lir = new LinearRegression().setMaxIter(50).setSolver("l-bfgs").setLoss("huber")

val results = Seq(1, 4, 16, 64, 256, 1024, 4096).map { size =>
  val start = System.currentTimeMillis
  val model = lir.setBlockSize(size).fit(df)
  val end = System.currentTimeMillis
  (size, model, end - start)
}
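
To turn the raw timings in results into the speedup figures quoted below, one can post-process like this (illustrative only, not part of the original test code):

// Speedup of each blockSize relative to the blockSize == 1 baseline.
val baseline = results.head._3.toDouble
results.foreach { case (size, _, millis) =>
  println(f"blockSize=$size%5d  time=${millis}ms  speedup=${baseline / millis}%.1fx")
}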

Result:
Huber

scala> results.foreach(t => println(t._2.coefficients.toString.take(100)))
[0.6424729506929868,-1.2292544634895406,1.7757284244387601,0.2974549255492886,0.935593736145218,-0.7
[0.6425694454021473,-1.22927248948364,1.7757268011427252,0.2974675136619897,0.9357357899375097,-0.77
[0.6424853073405433,-1.2292568932224641,1.775728104037945,0.29745659508182537,0.9356121907344618,-0.
[0.6429959531002037,-1.229354196708174,1.7757175326389232,0.29752415047198205,0.9363681130113419,-0.
[0.6424842544336862,-1.2292566837366543,1.775728142782258,0.29745645263402737,0.9356106116206417,-0.
[0.6430623639968113,-1.2293667308116527,1.7757163246117786,0.29753287650971944,0.9364661504709426,-0
[0.6421067597074761,-1.229184819802838,1.7757358482626238,0.2974067609553463,0.9350521980706877,-0.7

scala> results.map(_._2.summary.totalIterations)
res22: Seq[Int] = List(51, 51, 51, 51, 51, 51, 51)

scala> results.map(_._3)
res23: Seq[Long] = List(135189, 12046, 11783, 14307, 14399, 14026, 14329)

Up to 11X speedup

SquaredError


scala> results.foreach(t => println(t._2.coefficients.toString.take(100)))
[-0.2652587613623121,-0.0707048016667831,0.420750805149307,0.09194452205365045,0.05059855709172461,0
[-0.26525865677193483,-0.07070480422610806,0.42075057109293873,0.09194450660409814,0.050598582385999
[-0.26526635878841553,-0.07070459501522909,0.42076764197826116,0.09194564264821924,0.050596757952969
[-0.26525612879310023,-0.07070486362689087,0.42074489424648864,0.09194413298339,0.050599198284442995
[-0.2652611612036013,-0.07070474076806575,0.42075615827182555,0.09194487637000792,0.0505979807182646
[-0.26526190169346425,-0.07070472114668785,0.42075780340712315,0.0919449856303737,0.0505978044045689
[-0.265262317030222,-0.07070470997227678,0.42075872481615506,0.09194504690071631,0.05059770582212817

scala> results.map(_._2.summary.totalIterations)
res26: Seq[Int] = List(51, 51, 51, 51, 51, 51, 51)

scala> results.map(_._3)
res27: Seq[Long] = List(71269, 11828, 12254, 15331, 14963, 14420, 14022)

Up to 6X speedup

@zhengruifeng zhengruifeng deleted the blockify_lir_II branch May 8, 2020 02:54
@zhengruifeng
Contributor Author

Merged to master.

@HyukjinKwon
Member

Let's not merge without review or approval, @zhengruifeng.
