[SPARK-13742][Core] Add non-iterator interface to RandomSampler #11578
Conversation
Test build #52669 has finished for PR 11578 at commit

Test build #52710 has finished for PR 11578 at commit
    if (ub - lb <= 0.0) {
      if (complement) 1 else 0
    } else {
      if (complement) {
This could be simplified as:
val x = rng.nextDouble()
val n = if ((x >= lb) && (x < ub)) 1 else 0
if (complement) 1 - n else n
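Combined with the guard visible in the diff above, the whole method would presumably end up looking something like this (a sketch only, not the merged code; the fields lb, ub, complement, and rng are taken from the surrounding BernoulliCellSampler context):

```scala
// Sketch: return 1 if the next item should be kept, 0 otherwise,
// honoring the `complement` flag.
def sample(): Int = {
  if (ub - lb <= 0.0) {
    // Degenerate range: nothing falls inside [lb, ub).
    if (complement) 1 else 0
  } else {
    val x = rng.nextDouble()
    val n = if ((x >= lb) && (x < ub)) 1 else 0
    if (complement) 1 - n else n
  }
}
```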
Yea. Will change it.
@viirya I'm wondering whether we should generate the code for these
@davies Sure. I will benchmark that today.
Note: this benchmark is wrong. Updated below.

@davies I just benchmarked that using generated code for sample() with #11517. When

Without generated code for sample():

With generated code for sample():

Looks like it is faster with generated code. I will benchmark for
When

Without generated code for sample():

With generated code for sample():

So when
@davies What do you think? As the performance difference is insignificant, do we want to use generated code for sample() in the whole-stage codegen Sample operator?
Test build #52812 has finished for PR 11578 at commit
I made a mistake in the previous benchmark for

Updated benchmark here.

Without generated code for sample():

With generated code for sample():

The difference is insignificant too.
@viirya Thanks for the benchmark. It seems there is not much difference between using the Java version and generated Java code, right? Then we should go with the simpler approach (the current one).
cc @mengxr
    private val lnq = math.log1p(-f)

    /** Return true if the next item should be sampled. Otherwise, return false. */
    def sample(): Boolean = {
Should this return int (to be consistent with others)?
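For comparison, here is a rough sketch of what an Int-returning, gap-sampling sample() could look like, reusing the lnq = math.log1p(-f) field from the diff above. The names countForDropping, advance, and epsilon are illustrative assumptions, not necessarily the PR's identifiers, and f and rng are assumed to exist in the enclosing class:

```scala
// Sketch: gap sampling with a per-call interface. Instead of drawing one
// random number per input item, draw one geometric "gap" per accepted item
// and skip that many subsequent calls.
private val epsilon = 1e-10          // guard against log(0); assumed constant
private var countForDropping: Int = 0
advance()                            // initialize the first gap

def sample(): Int = {
  if (countForDropping > 0) {
    countForDropping -= 1            // still inside the current gap: skip
    0
  } else {
    advance()                        // draw the size of the next gap
    1                                // keep the current item
  }
}

private def advance(): Unit = {
  val u = math.max(rng.nextDouble(), epsilon)
  countForDropping = (math.log(u) / lnq).toInt
}
```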
Test build #53280 has finished for PR 11578 at commit
What is the sampling probability in your benchmark? Gap sampling is only useful when the sampling probability is small. Otherwise, we still need to generate many random numbers, which is probably more expensive than an iterator call.
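To make the trade-off concrete (a rough cost model for intuition, not a measurement from this thread): per-item Bernoulli sampling draws roughly one random number per input row, while gap sampling draws roughly one per accepted row, so the savings factor is about 1/f.

```scala
// Approximate number of random draws for n input rows at fraction f.
// At f = 0.8 gap sampling saves only ~20% of the draws; at f = 0.01 it
// saves ~100x, which is where it pays off.
def approxRandomDraws(n: Long, f: Double): (Double, Double) =
  (n.toDouble, f * n)   // (per-item draws, gap-sampling draws)
```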
Both are 0.8.
Do we generate more random numbers than with the iterator call? I think it should be the same. Besides, the non-iterator API should only be used in Sampler codegen #11517. It is expensive to create a new iterator from the values passed from the Sampler node's child operator.
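For illustration of why a per-call interface helps here (a sketch under assumed names, not the actual generated code): with sample(): Int, the per-row path can test each incoming row directly instead of wrapping the child operator's output in a fresh Iterator.

```scala
import scala.reflect.ClassTag
import org.apache.spark.util.random.BernoulliSampler

// Hypothetical per-row loop; `consume` stands in for whatever the downstream
// operator does with a kept row. No wrapper iterator is allocated around the
// child's output.
def sampleRows[T: ClassTag](childRows: Iterator[T], fraction: Double)
                           (consume: T => Unit): Unit = {
  val sampler = new BernoulliSampler[T](fraction)
  sampler.setSeed(42L)
  while (childRows.hasNext) {
    val row = childRows.next()
    if (sampler.sample() > 0) {
      consume(row)
    }
  }
}
```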
retest this please.

Test build #54014 has finished for PR 11578 at commit

Test build #54023 has finished for PR 11578 at commit

retest this please.

Test build #54047 has finished for PR 11578 at commit

retest this please.

1 similar comment

retest this please.

Test build #54129 has finished for PR 11578 at commit

Test build #54141 has finished for PR 11578 at commit

retest this please.

Test build #54153 has finished for PR 11578 at commit

this time it failed at pyspark GaussianMixtureModel...

retest this please.

1 similar comment

retest this please.

Test build #54163 has finished for PR 11578 at commit

Test build #54170 has finished for PR 11578 at commit

Test build #54179 has finished for PR 11578 at commit

Test build #54278 has finished for PR 11578 at commit
    override def sample(items: Iterator[T]): Iterator[T] = {
      if (ub - lb <= 0.0) {
        if (complement) items else Iterator.empty
I think we don't need to optimize this corner case; it's fine to use the default sample.
ok. Let me remove it.
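If the corner case is dropped, a default iterator-based sample built on the per-item interface is enough. A minimal sketch of that kind of default (assumed, not quoted from the PR):

```scala
// Sketch: express the iterator-based sample in terms of the per-item
// sample() call; each incoming item is emitted sample() times, which also
// covers PoissonSampler counts greater than 1.
def sample(items: Iterator[T]): Iterator[T] =
  items.flatMap { item =>
    val count = sample()
    if (count == 0) Iterator.empty else Iterator.fill(count)(item)
  }
```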
Test build #54305 has finished for PR 11578 at commit
LGTM

Merging into master, thanks!
JIRA: https://issues.apache.org/jira/browse/SPARK-13742
What changes were proposed in this pull request?
`RandomSampler.sample` currently accepts an iterator as input and outputs another iterator. This makes it inappropriate to use in whole-stage codegen of the `Sample` operator #11517. This change adds a non-iterator interface to `RandomSampler`.

It adds a new method `def sample(): Int` to the trait `RandomSampler`. Since we don't need to know the actual values of the sampled items, this new method takes no arguments. It decides whether to sample the next item or not, and returns how many times the next item will be sampled.
For `BernoulliSampler` and `BernoulliCellSampler`, the returned sampling count can only be 0 or 1; it simply indicates whether to sample the next item or not. For `PoissonSampler`, the returned value can be more than 1, meaning the next item will be sampled multiple times.

How was this patch tested?
Tests are added to `RandomSamplerSuite`.
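For illustration only (not part of the PR description), a small sketch of how a caller might drive the new per-item interface. The sampler type and setSeed come from org.apache.spark.util.random; the input data and the println stand-in for emitting rows are made up:

```scala
import org.apache.spark.util.random.PoissonSampler

object SampleSketch {
  def main(args: Array[String]): Unit = {
    val input = Iterator("a", "b", "c", "d", "e")

    // PoissonSampler can return counts greater than 1 (sampling with replacement).
    val sampler = new PoissonSampler[String](0.5)
    sampler.setSeed(7L)

    input.foreach { item =>
      var k = sampler.sample()   // 0, 1, or more copies of this item
      while (k > 0) {
        println(item)            // stand-in for emitting the row downstream
        k -= 1
      }
    }
  }
}
```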