Skip to content

Conversation

@viirya
Copy link
Member

@viirya viirya commented Mar 8, 2016

JIRA: https://issues.apache.org/jira/browse/SPARK-13742

What changes were proposed in this pull request?

RandomSampler.sample currently accepts iterator as input and output another iterator. This makes it inappropriate to use in wholestage codegen of Sampler operator #11517. This change is to add non-iterator interface to RandomSampler.

This change adds a new method def sample(): Int to the trait RandomSampler. As we don't need to know the actual values of the sampling items, so this new method takes no arguments.

This method will decide whether to sample the next item or not. It returns how many times the next item will be sampled.

For BernoulliSampler and BernoulliCellSampler, the returned sampling times can only be 0 or 1. It simply means whether to sample the next item or not.

For PoissonSampler, the returned value can be more than 1, meaning the next item will be sampled multiple times.

How was this patch tested?

Tests are added into RandomSamplerSuite.

@viirya
Copy link
Member Author

viirya commented Mar 8, 2016

cc @mengxr @rxin @holdenk

@SparkQA
Copy link

SparkQA commented Mar 8, 2016

Test build #52669 has finished for PR 11578 at commit 28fca54.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Mar 9, 2016

Test build #52710 has finished for PR 11578 at commit 992a356.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

if (ub - lb <= 0.0) {
if (complement) 1 else 0
} else {
if (complement) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could be simplified as:

val x = rng.nextDouble()
val n = if ((x >= lb) && (x < ub)) 1 else 0
if (complement) 1 - n else n

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea. Will change it.

@davies
Copy link
Contributor

davies commented Mar 10, 2016

@viirya I'm wondering that should we generate the code for these sample() to get rid of the function call and branches. Maybe the JIT compiler could do that or not, could you try to benchmark one to see the difference?

@viirya
Copy link
Member Author

viirya commented Mar 10, 2016

@davies Sure. I will benchmark that today.

@viirya
Copy link
Member Author

viirya commented Mar 10, 2016

Note: this benchmark is wrong. Updated below.

@davies I just benchmark that using generated code for sample() with #11517.

When withReplacement = false.

Without generated code for sample():

Intel(R) Core(TM) i7-5557U CPU @ 3.10GHz
range/sample/sum:                   Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
-------------------------------------------------------------------------------------------
range/sample/sum codegen=false         16460 / 17161         31.9          31.4       1.0X
range/sample/sum codegen=true            4081 / 5390        128.5           7.8       4.0X

With generated code for sample():

Intel(R) Core(TM) i7-5557U CPU @ 3.10GHz
range/sample/sum:                   Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
-------------------------------------------------------------------------------------------
range/sample/sum codegen=false         14018 / 16011         37.4          26.7       1.0X
range/sample/sum codegen=true            2476 / 2719        211.7           4.7       5.7X

Looks like it is faster with generated code.

I will benchmark for withReplacement = true later.

@viirya
Copy link
Member Author

viirya commented Mar 10, 2016

When withReplacement = true.

Without generated code for sample():

Intel(R) Core(TM) i7-5557U CPU @ 3.10GHz
range/sample/sum:                   Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
-------------------------------------------------------------------------------------------
range/sample/sum codegen=false         55656 / 56490          9.4         106.2       1.0X
range/sample/sum codegen=true          35423 / 35758         14.8          67.6       1.6X

With generated code for sample():

Intel(R) Core(TM) i7-5557U CPU @ 3.10GHz
range/sample/sum:                   Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
-------------------------------------------------------------------------------------------
range/sample/sum codegen=false         53058 / 54023          9.9         101.2       1.0X
range/sample/sum codegen=true          33153 / 33509         15.8          63.2       1.6X

So when withReplacement = true, the difference is insignificant.

@viirya
Copy link
Member Author

viirya commented Mar 10, 2016

From the benchmark, using generated codes of sample() much benefits to the case `withReplacement = false`. I think it is because there are many branches in the sample() of `BernoulliCellSampler`. The sample() of `PoissonSampler` is simpler.

@viirya
Copy link
Member Author

viirya commented Mar 10, 2016

@davies What do your think? As the performance difference is insignificant, do we want to use generated codes for sample() in wholestage codegen Sample operator?

@SparkQA
Copy link

SparkQA commented Mar 10, 2016

Test build #52812 has finished for PR 11578 at commit 4676940.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Copy link
Member Author

viirya commented Mar 10, 2016

I made a mistake in previous benchmark for withReplacement = false.

Update benchmark here.

Without generated code for sample():

Intel(R) Core(TM) i7-5557U CPU @ 3.10GHz
range/sample/sum:                   Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
-------------------------------------------------------------------------------------------
range/sample/sum codegen=false         16460 / 17161         31.9          31.4       1.0X
range/sample/sum codegen=true            4081 / 5390        128.5           7.8       4.0X

With generated code for sample():

Intel(R) Core(TM) i7-5557U CPU @ 3.10GHz
range/sample/sum:                   Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
-------------------------------------------------------------------------------------------
range/sample/sum codegen=false         13688 / 17769         38.3          26.1       1.0X
range/sample/sum codegen=true            3908 / 3970        134.2           7.5       3.5X

The difference is insignificant too.

@davies
Copy link
Contributor

davies commented Mar 10, 2016

@viirya Thanks for the benchmark, it seems that there is no much difference between using the Java version and generated Java code, right? Then, we should go with the simpler approach (the current one).

@davies
Copy link
Contributor

davies commented Mar 10, 2016

cc @mengxr

@viirya
Copy link
Member Author

viirya commented Mar 13, 2016

ping @mengxr @rxin

@viirya
Copy link
Member Author

viirya commented Mar 15, 2016

ping @rxin @mengxr Can you review this to see if it is ok? Thanks.

private val lnq = math.log1p(-f)

/** Return true if the next item should be sampled. Otherwise, return false. */
def sample(): Boolean = {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this return int (to be consistent with others)?

@SparkQA
Copy link

SparkQA commented Mar 16, 2016

Test build #53280 has finished for PR 11578 at commit 980a963.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mengxr
Copy link
Contributor

mengxr commented Mar 17, 2016

What is the sampling probability in your benchmark? Gap sampling is only useful when the sampling probability is small. Otherwise, we still need to generate many random numbers, which is probably more expensive than an iterator call.

@viirya
Copy link
Member Author

viirya commented Mar 17, 2016

Both are 0.8.

@viirya
Copy link
Member Author

viirya commented Mar 17, 2016

Do we generate more random numbers than iterator call? I think it should be the same. Besides, the non-iterator api should only be used in Sampler codegen #11517. It is expensive to create a new iterator from values passed from Sampler node's child operator.

@viirya
Copy link
Member Author

viirya commented Mar 24, 2016

retest this please.

@SparkQA
Copy link

SparkQA commented Mar 24, 2016

Test build #54014 has finished for PR 11578 at commit d51d553.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Mar 24, 2016

Test build #54023 has finished for PR 11578 at commit d51d553.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Copy link
Member Author

viirya commented Mar 24, 2016

retest this please.

@SparkQA
Copy link

SparkQA commented Mar 24, 2016

Test build #54047 has finished for PR 11578 at commit d51d553.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Copy link
Member Author

viirya commented Mar 25, 2016

retest this please.

1 similar comment
@viirya
Copy link
Member Author

viirya commented Mar 25, 2016

retest this please.

@SparkQA
Copy link

SparkQA commented Mar 25, 2016

Test build #54129 has finished for PR 11578 at commit d51d553.

  • This patch fails from timeout after a configured wait of 250m.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Mar 25, 2016

Test build #54141 has finished for PR 11578 at commit d51d553.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Copy link
Member Author

viirya commented Mar 25, 2016

retest this please.

@SparkQA
Copy link

SparkQA commented Mar 25, 2016

Test build #54153 has finished for PR 11578 at commit d51d553.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Copy link
Member Author

viirya commented Mar 25, 2016

this time it failed at pyspark GaussianMixtureModel...

@viirya
Copy link
Member Author

viirya commented Mar 25, 2016

retest this please.

1 similar comment
@viirya
Copy link
Member Author

viirya commented Mar 25, 2016

retest this please.

@SparkQA
Copy link

SparkQA commented Mar 25, 2016

Test build #54163 has finished for PR 11578 at commit d51d553.

  • This patch fails from timeout after a configured wait of 250m.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Mar 25, 2016

Test build #54170 has finished for PR 11578 at commit d51d553.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Mar 25, 2016

Test build #54179 has finished for PR 11578 at commit 5319ced.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Mar 27, 2016

Test build #54278 has finished for PR 11578 at commit ef1be44.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class GapSamplingReplacement(

@viirya
Copy link
Member Author

viirya commented Mar 27, 2016

@mengxr @rxin @nongli @holdenk The tests are passed. Iterator and non-iterator methods are now using same codes. Please take a look this. Thanks!


override def sample(items: Iterator[T]): Iterator[T] = {
if (ub - lb <= 0.0) {
if (complement) items else Iterator.empty
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we don't need to optimize this corner case, it's fine to use the default sample

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok. Let me remove it.

@SparkQA
Copy link

SparkQA commented Mar 28, 2016

Test build #54305 has finished for PR 11578 at commit b8e031b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@davies
Copy link
Contributor

davies commented Mar 28, 2016

LGTM

@davies
Copy link
Contributor

davies commented Mar 28, 2016

Merging into master, thanks!

@asfgit asfgit closed this in 68c0c46 Mar 28, 2016
@viirya viirya deleted the random-sampler-no-iterator branch December 27, 2023 18:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants