
Conversation

@viirya (Member) commented Sep 14, 2017

What changes were proposed in this pull request?

SPARK-21690 made Imputer one-pass by parallelizing the computation over all input columns. When we transform a dataset with ImputerModel, however, we still call withColumn on each input column sequentially. We can transform all input columns at once by adding a withColumns API to Dataset.

For now, the new withColumns API is for internal use only.
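For illustration, a rough sketch of the idea (the column names, surrogates sequence, and simplified missing-value handling here are hypothetical, not the actual ImputerModel code):

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, lit, when}

// Hypothetical inputs: one surrogate value per input/output column pair.
val inputCols  = Seq("c0", "c1", "c2")
val outputCols = Seq("o0", "o1", "o2")
val surrogates = Seq(0.1, 0.2, 0.3)

def transformOld(df: DataFrame): DataFrame = {
  // Old approach: one withColumn call (and one plan rewrite) per output column.
  inputCols.zip(outputCols).zip(surrogates).foldLeft(df) {
    case (cur, ((in, out), surrogate)) =>
      cur.withColumn(out, when(col(in).isNull, lit(surrogate)).otherwise(col(in)))
  }
}

def transformNew(df: DataFrame): DataFrame = {
  // New approach: build all output columns and add them in a single withColumns call.
  val newCols = inputCols.zip(surrogates).map { case (in, surrogate) =>
    when(col(in).isNull, lit(surrogate)).otherwise(col(in))
  }
  df.withColumns(outputCols, newCols)   // private[spark] API added by this PR
}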

How was this patch tested?

Existing tests cover the ImputerModel change. Added tests for the withColumns API.
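A minimal sketch of the kind of assertion added for withColumns (hypothetical data; the real tests live in the SQL test suites):

import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.col
import spark.implicits._

val df = Seq((1, 2)).toDF("a", "b")
// withColumns is private[spark], so this can only be called from within Spark's own packages.
val result = df.withColumns(Seq("c", "d"), Seq(col("a") + 1, col("b") * 2))
assert(result.columns === Array("a", "b", "c", "d"))
assert(result.head() === Row(1, 2, 2, 4))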

@viirya (Member Author) commented Sep 14, 2017

Ran a benchmark similar to #18902 (comment):

numColumns | Old Mean | Old Median | New Mean | New Median
1 | 0.3597440068 | 0.12441702019999998 | 0.2401416984 | 0.1535934578
10 | 0.5698301512999999 | 0.36436769210000003 | 0.2588411808 | 0.18644379360000002
100 | 6.6054379131 | 6.779679862500001 | 0.45659362859999997 | 0.4994426849

The test code is basically the same, but it now measures the transform time:

import org.apache.spark.ml.feature._
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
import spark.implicits._
import scala.util.Random

val seed = 123L
val random = new Random(seed)
val n = 10000
val m = 100
val rows = sc.parallelize(1 to n).map(i=> Row(Array.fill(m)(random.nextDouble): _*))
val struct = new StructType(Array.range(0,m,1).map(i => StructField(s"c$i",DoubleType,true)))
val df = spark.createDataFrame(rows, struct)
df.persist()
df.count()

for (strategy <- Seq("mean", "median"); k <- Seq(1,10,100)) {
  val imputer = new Imputer().setStrategy(strategy).setInputCols(Array.range(0,k,1).map(i=>s"c$i")).setOutputCols(Array.range(0,k,1).map(i=>s"o$i"))
  var duration = 0.0
  for (i<- 0 until 10) {
    val model = imputer.fit(df)
    val start = System.nanoTime()
    model.transform(df).count
    val end = System.nanoTime()
    duration += (end - start) / 1e9
  }
  println((strategy, k, duration/10))
}

@viirya (Member Author) commented Sep 14, 2017

cc @MLnick @zhengruifeng @yanboliang

@viirya (Member Author) commented Sep 14, 2017

FYI, the withColumns API was proposed in #17819.

@zhengruifeng (Contributor):
In the test code, should we use model.transform(df).count instead?

@viirya (Member Author) commented Sep 14, 2017

@zhengruifeng Yeah, that is better. Actually, I think the difference between running multiple withColumn calls and one withColumns call is mainly the cost of query analysis and plan/dataset initialization. I will re-run the benchmark.
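As a rough sketch, one way to see this is to time only the plan generation without triggering a job (queryExecution.executedPlan forces analysis, optimization, and physical planning), assuming model and df from the benchmark above:

// Hedged sketch: measure plan generation only, not execution.
val start = System.nanoTime()
val transformed = model.transform(df)       // builds the logical plan
transformed.queryExecution.executedPlan     // forces analysis + optimization + planning
val end = System.nanoTime()
println(s"Plan generation took ${(end - start) / 1e9} s")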

@SparkQA commented Sep 14, 2017

Test build #81755 has finished for PR 19229 at commit 4efb643.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Sep 14, 2017

Test build #81756 has finished for PR 19229 at commit 4b47709.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@WeichenXu123 (Contributor) left a comment

@viirya In your benchmark, for the case numCols == 100 the performance improves about 15x, so I suspect there is some mistake in the benchmark. It is very possible that the test code re-uses the shuffle result across loop iterations, which lets the code path skip the stage that scans the input data.

So I hope you can re-generate the test data in every test loop and then do model.fit. Can you update the test code and retest?

Such a performance improvement seems almost impossible to me. I will test this myself later, if I have time. Thanks.

@viirya (Member Author) commented Sep 18, 2017

@WeichenXu123 Sure. I should also point out that I ran this benchmark in spark-shell under local mode. It would be great if you could run the benchmark too to verify the numbers.

@viirya (Member Author) commented Sep 18, 2017

@WeichenXu123 Btw, the test basically re-uses the code from #18902 (comment). Is your concern specific to this PR?

@WeichenXu123 (Contributor):
@viirya I guess the reason is that in the old version of the PR, df.withColumn(..).withColumn(..).withColumn(..).... built a long chain of DataFrames that prevented shuffle re-use, but now you merge them into one step.

@viirya (Member Author) commented Sep 18, 2017

Updated test code:

import org.apache.spark.ml.feature._
import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.types._
import spark.implicits._
import scala.util.Random

def genData(): DataFrame = {
  val seed = 123L
  val random = new Random(seed)
  val n = 10000
  val m = 100
  val rows = sc.parallelize(1 to n).map(i=> Row(Array.fill(m)(random.nextDouble): _*))
  val struct = new StructType(Array.range(0,m,1).map(i => StructField(s"c$i",DoubleType,true)))
  spark.createDataFrame(rows, struct)
}

for (strategy <- Seq("mean", "median"); k <- Seq(1,10,100)) {
  val imputer = new Imputer().setStrategy(strategy).setInputCols(Array.range(0,k,1).map(i=>s"c$i")).setOutputCols(Array.range(0,k,1).map(i=>s"o$i"))
  var duration = 0.0
  for (i<- 0 until 10) {
    val df = genData()
    val model = imputer.fit(df)
    val start = System.nanoTime()
    val df2 = genData()
    model.transform(df2).count
    val end = System.nanoTime()
    duration += (end - start) / 1e9
  }
  println((strategy, k, duration/10))
}

@WeichenXu123 (Contributor):
Great! That's it. Thanks!

@viirya (Member Author) commented Sep 18, 2017

New numbers:

numColumns | Old Mean | Old Median | New Mean | New Median
1 | 0.17278329159999997 | 0.1537169693 | 0.16873250489999997 | 0.1521283075
10 | 0.3250422628 | 0.3086496881 | 0.16972776130000003 | 0.16117073769999998
100 | 6.3575860038 | 7.155657411799998 | 0.3299611978 | 0.3731574154

@viirya (Member Author) commented Sep 18, 2017

I don't think shuffle re-use is the reason behind the numbers. If you look at the previous comments, you will find that I originally ran the test without a count after model.transform, i.e., the DataFrame operations were never actually triggered.

@viirya (Member Author) commented Sep 18, 2017

Btw, I don't see any reason why df.withColumn(..).withColumn(..).withColumn(..) would prevent shuffle re-use.

@WeichenXu123 (Contributor):
Looks like that's not the reason; maybe the issue is somewhere else. Let me run the test later. Thanks!
But there are some small issues in the test:
Don't include the data-generation time:

    val start = System.nanoTime()
    val df2 = genData()
    model.transform(df2).count
    val end = System.nanoTime()

and add cache at the end of genData:

def genData() = {
  ...
  val df = spark.createDataFrame(...)
  df.cache()
  df.count() // force trigger cache
  df
}

and we'd better add warm-up code before recording the running time, as in the sketch below.
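Putting those suggestions together, a rough sketch of the adjusted timing loop might look like this (assuming genData now caches and materializes the DataFrame, and imputer is configured as in the earlier snippets):

val df = genData()                    // cached and materialized inside genData
val model = imputer.fit(df)
model.transform(df).count()           // warm-up run, not timed

var duration = 0.0
for (_ <- 0 until 10) {
  val start = System.nanoTime()
  model.transform(df).count()
  val end = System.nanoTime()
  duration += (end - start) / 1e9
}
println(duration / 10)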

@WeichenXu123 (Contributor) commented Sep 18, 2017

@viirya I ran the code, and you're right: most of the time is spent on executedPlan generation (in the old version of the code). Thanks!
But can you add a benchmark comparison with the RDD.aggregate version? 8daffc9

@viirya (Member Author) commented Sep 18, 2017

@WeichenXu123 Thanks for verifying that.

Do you mean using approxQuantile to compute the mean and median? But I think this change is not intended to improve that part; it targets model.transform. And we don't use RDD.aggregate to do model.transform, right?

@WeichenXu123 (Contributor):
@viirya No, keep the DataFrame version of the code. I only want to confirm how large the performance gap is between this and the RDD version (for possible future improvements, because in a similar test I found the DataFrame version is still slower than the RDD version).

@viirya (Member Author) commented Sep 19, 2017

@WeichenXu123 I'm not sure I understand correctly. This change only replaces the chain of withColumn calls with one pass of withColumns. We don't have an RDD version of this, so I'm not sure which version you want to compare against?

@WeichenXu123 (Contributor):
Oh, that's what was done in the old PR #18902. The RDD version (not in the master branch, only my personal implementation; sorry for posting the wrong link earlier, the code is here: 8daffc9) will be faster than the DataFrame version on current Spark. Now that your PR improves the performance, I would like to compare them again. We hope to track this performance gap and try to resolve it in the future. Based on another similar case of mine, the DataFrame version is currently about 2-3x slower than the RDD version when numCols == 100. But if you have no time, I can help do it. Thanks!

@viirya (Member Author) commented Sep 20, 2017

I ran the test code below to benchmark the RDD version against the DataFrame version with this ImputerModel change:

import org.apache.spark.ml.feature._
import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.types._
import spark.implicits._
import scala.util.Random

def genData(): DataFrame = {
  val seed = 123L
  val random = new Random(seed)
  val n = 10000
  val m = 100
  val rows = sc.parallelize(1 to n).map(i=> Row(Array.fill(m)(random.nextDouble): _*))
  val struct = new StructType(Array.range(0,m,1).map(i => StructField(s"c$i",DoubleType,true)))
  val df = spark.createDataFrame(rows, struct)
  df.cache()
  df.count()
  df
}

for (strategy <- Seq("mean", "median"); k <- Seq(1,10,100)) {
  val imputer = new Imputer().setStrategy(strategy).setInputCols(Array.range(0,k,1).map(i=>s"c$i")).setOutputCols(Array.range(0,k,1).map(i=>s"o$i"))
  var duration = 0.0
  for (i<- 0 until 10) {
    val df = genData()

    val start = System.nanoTime()
    val model = imputer.fit(df)
    val end = System.nanoTime()

    val df2 = genData()

    val start2 = System.nanoTime()
    model.transform(df2).count
    val end2 = System.nanoTime()

    duration += ((end - start) + (end2 - start2)) / 1e9
  }
  println((strategy, k, duration/10))
}

@viirya (Member Author) commented Sep 20, 2017

numColumns | RDD Mean | RDD Median | DataFrame Mean | DataFrame Median
1 | 0.1642173481 | 0.199774305 | 0.42601806710000006 | 0.2025112919
10 | 0.3713707549 | 0.5290104043000001 | 0.43626068409999996 | 0.49521778340000006
100 | 6.8645389335 | 8.838674982899999 | 1.6645560224 | 2.9213964243999997

@WeichenXu123 (Contributor):
@viirya Thanks very much! Although the performance gap exists (when numCols is large), it won't block this PR. I will create a JIRA to track it.

@SparkQA commented Sep 20, 2017

Test build #81964 has finished for PR 19229 at commit 2086900.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya (Member Author) commented Sep 22, 2017

ping @zhengruifeng @WeichenXu123 Any more comments on this? Thanks.

/**
* Returns a new Dataset by adding columns or replacing the existing columns that have
* the same names.
*/
@viirya (Member Author) commented Sep 22, 2017:
@cloud-fan should have looked at this withColumns before in #17819. cc @cloud-fan to see if you have more comments.

@viirya (Member Author):
ping @cloud-fan or @gatorsmile Can you check the SQL part? Thanks.

Member:
Can withColumn be reimplemented by calling this function?

@viirya (Member Author):
It can be. But even if we want to do that, I'd prefer a follow-up rather than doing it in this PR.

Member:
I still think we should do it to avoid duplicated code.

@viirya (Member Author):
Ok. Then I will do it in this PR.
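The reimplementation is essentially a one-liner; a minimal sketch, assuming the Seq-based withColumns signature shown later in this thread:

// Sketch: withColumn delegates to the new withColumns.
def withColumn(colName: String, col: Column): DataFrame =
  withColumns(Seq(colName), Seq(col))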

@zhengruifeng (Contributor):
I am not familiar with the SQL code, but I think it's great to transform all columns at once.

@WeichenXu123 (Contributor) commented Sep 22, 2017

For the performance gap issue (compared with the RDD version), I created a separate JIRA to track it:
https://issues.apache.org/jira/browse/SPARK-22105
As a result of an offline discussion with @cloud-fan, the cause should be that the generated code is too large, which makes the JVM fail to JIT it. This PR should fix it: #19082

@viirya (Member Author) commented Sep 23, 2017

Yeah, I think that fix should work for the Imputer.mean strategy, because Imputer.mean now aggregates many columns at once and that can produce generated aggregation code that is too large.

For the Imputer.median strategy, since it uses approxQuantile, which calls the RDD aggregate API, I think codegen doesn't affect that part.
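A rough sketch of the two code paths being discussed (column names are illustrative; the real Imputer logic also filters out the configured missing value):

import org.apache.spark.sql.functions.{avg, col}

val inputCols = Array("c0", "c1", "c2")   // hypothetical

// mean: one whole-stage-codegen aggregation over all input columns at once,
// so the generated code grows with the number of columns.
val means = df.select(inputCols.map(c => avg(col(c))): _*).head()

// median: approxQuantile goes through an RDD-based aggregate internally,
// so codegen size is not a factor here.
val medians = df.stat.approxQuantile(inputCols, Array(0.5), 0.001)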

@WeichenXu123 (Contributor):
@viirya Yeah, for the performance gap I focus only on mean, which can take advantage of codegen.

@viirya (Member Author) commented Sep 24, 2017

@WeichenXu123 Do you have any more comments on this? Thanks. I think the ML part is straightforward.

@WeichenXu123 (Contributor) left a comment
The ML part looks good to me, except a minor style issue. Thanks!

val ic = col(inputCol)
outputDF = outputDF.withColumn(outputCol,
when(ic.isNull, surrogate)
Contributor:
style: indent

@viirya (Member Author):
This when is not a call on the result of the previous line. I think it doesn't need to be indented?

Contributor:
Oh I misread. The style is ok.

private[spark] def withColumns(colNames: Seq[String], cols: Seq[Column]): DataFrame = {
  require(colNames.size == cols.size,
    s"The size of column names: ${colNames.size} isn't equal to " +
      s"the size of columns: ${cols.size}")
Member:
Also need to consider the case sensitivity issue.

@viirya (Member Author):
Good point.
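As a sketch, the duplicate-name check needs to respect spark.sql.caseSensitive; something along these lines (a simplified, standalone version of the idea, with a hypothetical helper name):

// Hypothetical helper: reject duplicate column names, honoring case sensitivity.
def checkDuplicateNames(colNames: Seq[String], caseSensitive: Boolean): Unit = {
  val normalized = if (caseSensitive) colNames else colNames.map(_.toLowerCase)
  require(normalized.distinct.length == normalized.length,
    s"Found duplicate column name(s) in given column names: " +
      normalized.diff(normalized.distinct).distinct.mkString(", "))
}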

@SparkQA commented Sep 25, 2017

Test build #82141 has finished for PR 19229 at commit 07dec0f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya (Member Author) commented Sep 25, 2017

@gatorsmile Added the check for case sensitivity. Please take a look again. Thanks.

@viirya (Member Author) commented Sep 27, 2017

ping @gatorsmile for the SQL part.

@viirya (Member Author) commented Sep 29, 2017

ping @gatorsmile Can you take a quick look? Thanks.

@viirya (Member Author) commented Sep 29, 2017

also cc @jkbradley and @MLnick for final check of the ML change. Thanks.

@SparkQA commented Sep 30, 2017

Test build #82344 has finished for PR 19229 at commit 21048a8.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya (Member Author) commented Sep 30, 2017

retest this please.

@SparkQA commented Sep 30, 2017

Test build #82347 has finished for PR 19229 at commit 21048a8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya (Member Author) commented Sep 30, 2017

@gatorsmile withColumn is reimplemented now. Please take a look when you have time. Thanks.

    col.as(colName)
  } else {
    Column(field)
  }
Member:
      columnMap.find { case (colName, _) =>
        resolver(field.name, colName)
      } match {
        case Some((colName: String, col: Column)) => col.as(colName)
        case _ => Column(field)
      }

/**
* Returns a new Dataset by adding columns with metadata.
*/
private[spark] def withColumns(
Member:
This is not being used or tested. Could we remove it?

@viirya (Member Author):
Ok. We can add this when we need it.

@SparkQA commented Oct 1, 2017

Test build #82369 has finished for PR 19229 at commit 1292ce0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile (Member):
LGTM

@asfgit asfgit closed this in 3ca3670 Oct 1, 2017
@gatorsmile (Member):
Thanks! Merged to master.

@viirya (Member Author) commented Oct 2, 2017

Thanks @gatorsmile @WeichenXu123 @zhengruifeng

@viirya viirya deleted the SPARK-22001 branch December 27, 2023 18:21