Member

@viirya viirya commented Mar 4, 2016

JIRA: https://issues.apache.org/jira/browse/SPARK-13674

What changes were proposed in this pull request?

The Sample operator does not currently support whole-stage codegen. This PR adds that support.

How was this patch tested?

A benchmark is added to BenchmarkWholeStageCodegen. In addition, all existing tests should pass.

range/sample withRep.:                  Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
-------------------------------------------------------------------------------------------
range/sample withRep. codegen=false           149 /  238          0.3        2908.4       1.0X
range/sample withRep. codegen=true            192 /  206          0.3        3751.7       0.8X

Member Author

The performance regression is likely due to the current implementation of the two samplers used in the Sample operator. The samplers take an iterator and return an iterator. However, because Sample now consumes individual elements from its parent operator, this change has to create another iterator from those elements.

I will try to implement another version of the two samplers that does not use iterators. That should improve the performance here.
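To make the overhead described above concrete, here is a minimal sketch; it is not the code in this PR. `IterBernoulli`, `PerRowWrapping`, `consume`, and `run` are hypothetical names, and the sampler is a simplified stand-in for an iterator-in, iterator-out sampler. The point is only that a per-row consume path has to wrap every row in a fresh iterator before it can call such a sampler.

```scala
import scala.util.Random

// Hypothetical stand-in for an iterator-based sampler; not Spark's actual class.
final class IterBernoulli[T](fraction: Double, seed: Long) {
  private val rng = new Random(seed)

  // Iterator in, iterator out: keeps each item with probability `fraction`.
  def sample(items: Iterator[T]): Iterator[T] =
    items.filter(_ => rng.nextDouble() < fraction)
}

object PerRowWrapping {
  // Called once per row, the way a generated consume path would be.
  // Every call allocates a one-element iterator plus the filtered iterator.
  def consume[T](sampler: IterBernoulli[T], row: T)(emit: T => Unit): Unit =
    sampler.sample(Iterator.single(row)).foreach(emit)

  // Pushes n rows through the sampler one at a time and counts the survivors.
  def run(n: Int, fraction: Double, seed: Long): Int = {
    val sampler = new IterBernoulli[Int](fraction, seed)
    var kept = 0
    (0 until n).foreach(i => consume(sampler, i)(_ => kept += 1))
    kept
  }
}
```

The two allocations per row are the kind of cost a non-iterator sampler would avoid.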

SparkQA commented Mar 4, 2016

Test build #52460 has finished for PR 11517 at commit f14393a.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Mar 7, 2016

Test build #52546 has finished for PR 11517 at commit 7d50f5f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

* Return how many times the next item will be sampled. Return 0 if it is not sampled.
*/
def sample(): Int

Member Author

These changes to RandomSampler are submitted in #11578.

Member Author

If #11578 is merged, I will rebase this.

Member Author

viirya commented Mar 9, 2016

cc @davies @nongli

SparkQA commented Mar 9, 2016

Test build #52726 has finished for PR 11517 at commit 6144810.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class GapSampling(
    • trait PoissonGE
    • class GapSamplingReplacementIterator[T: ClassTag](
    • class GapSamplingReplacement(

SparkQA commented Mar 21, 2016

Test build #53665 has finished for PR 11517 at commit 0cb714b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


override def setSeed(seed: Long): Unit = rng.setSeed(seed)

override def sample(): Int = {

Contributor

Can we avoid the code duplication by writing the iterator version on top of this function?

Member Author

We can. As discussed in #11578, I am just not sure whether we want to remove the iterator-based implementation.
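For reference, one way the iterator version could be layered on the per-item method is sketched below. This is an assumption about the shape of such a refactoring, not code from this PR or from #11578. `CountSampler` and `FixedCounts` are hypothetical names, and the trait is much smaller than Spark's `RandomSampler`.

```scala
// Hypothetical, simplified trait; Spark's RandomSampler has more members.
trait CountSampler {
  /** How many times the next item is sampled; 0 means it is skipped. */
  def sample(): Int

  /** Iterator version derived from sample(), so the sampling logic lives in one place. */
  def sample[T](items: Iterator[T]): Iterator[T] =
    items.flatMap(item => Iterator.fill(sample())(item))
}

// Deterministic stand-in used only to make the behaviour easy to see.
final class FixedCounts(counts: Seq[Int]) extends CountSampler {
  private val remaining = counts.iterator
  def sample(): Int = remaining.next()
}
```

With this sketch, `new FixedCounts(Seq(1, 0, 2)).sample(Iterator("a", "b", "c")).toList` yields `List("a", "c", "c")`: the first item is kept once, the second is dropped, and the third is emitted twice.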

ghost pushed a commit to dbtsai/spark that referenced this pull request Mar 28, 2016
JIRA: https://issues.apache.org/jira/browse/SPARK-13742

## What changes were proposed in this pull request?

`RandomSampler.sample` currently accepts an iterator as input and outputs another iterator. This makes it unsuitable for whole-stage codegen of the `Sampler` operator (apache#11517). This change adds a non-iterator interface to `RandomSampler`.

This change adds a new method `def sample(): Int` to the trait `RandomSampler`. Because the sampler does not need to know the actual values of the items being sampled, the new method takes no arguments.

This method will decide whether to sample the next item or not. It returns how many times the next item will be sampled.

For `BernoulliSampler` and `BernoulliCellSampler`, the returned count can only be 0 or 1. It simply indicates whether the next item is sampled.

For `PoissonSampler`, the returned value can be more than 1, meaning the next item will be sampled multiple times.
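The contract described above can be sketched as follows. This is an illustration under stated assumptions, not the merged implementation: `ItemSampler`, `BernoulliItemSampler`, and `PoissonItemSampler` are hypothetical names, and the Poisson draw uses Knuth's multiplication method with `scala.util.Random` purely to keep the example self-contained.

```scala
import scala.util.Random

// Hypothetical trait mirroring only the no-argument method described above.
trait ItemSampler {
  /** How many times the next item is sampled; 0 means it is not sampled. */
  def sample(): Int
}

// Without replacement: each item is either kept once or dropped.
final class BernoulliItemSampler(fraction: Double, seed: Long) extends ItemSampler {
  private val rng = new Random(seed)
  def sample(): Int = if (rng.nextDouble() < fraction) 1 else 0
}

// With replacement: the count follows a Poisson distribution with the given mean.
// Knuth's method is an assumption made for this sketch.
final class PoissonItemSampler(mean: Double, seed: Long) extends ItemSampler {
  require(mean > 0.0, "mean must be positive")
  private val rng = new Random(seed)
  private val limit = math.exp(-mean)

  def sample(): Int = {
    // Multiply uniforms until the product drops to exp(-mean) or below;
    // the number of extra factors needed is the Poisson draw.
    var k = 0
    var p = rng.nextDouble()
    while (p > limit) {
      k += 1
      p *= rng.nextDouble()
    }
    k
  }
}
```

A caller that receives rows one at a time can then ask `sample()` once per row and emit the row that many times, without building an iterator.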

## How was this patch tested?

Tests are added into `RandomSamplerSuite`.

Author: Liang-Chi Hsieh <simonh@tw.ibm.com>
Author: Liang-Chi Hsieh <viirya@appier.com>
Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes apache#11578 from viirya/random-sampler-no-iterator.
…mple

Conflicts:
	core/src/main/scala/org/apache/spark/util/random/RandomSampler.scala

SparkQA commented Mar 29, 2016

Test build #54409 has finished for PR 11517 at commit 28073d5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member Author

viirya commented Mar 29, 2016

@davies This is rebased. Please take a look. Thanks.

| $initTerm = true;
| if ($input.hasNext()) {
| initRange(((InternalRow) $input.next()).getInt(0));
| if (partitionIndex != -1) {

Contributor

?

Contributor

davies commented Mar 31, 2016

@viirya Could you also add numOutputRow for Sample?

ctx.addMutableState(s"$samplerClass<UnsafeRow>", sampler,
s"$initSampler();")

val random = ctx.freshName("random")

Contributor

Since these are used in a separate function, we don't need to generate fresh names for them.

SparkQA commented Mar 31, 2016

Test build #54614 has finished for PR 11517 at commit fa51f62.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class PoissonSampler[T](

SparkQA commented Mar 31, 2016

Test build #54618 has finished for PR 11517 at commit 76be6cf.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member Author

viirya commented Mar 31, 2016

retest this please.

SparkQA commented Mar 31, 2016

Test build #54620 has finished for PR 11517 at commit 76be6cf.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member Author

viirya commented Mar 31, 2016

Not sure why the test failed.

Member Author

viirya commented Mar 31, 2016

retest this please.

SparkQA commented Mar 31, 2016

Test build #54629 has finished for PR 11517 at commit 76be6cf.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Apr 1, 2016

Test build #54686 has finished for PR 11517 at commit ef588db.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

*/
}

ignore("sort merge join/sample") {

Contributor

We may not want this.

Member Author

OK. Let me remove it.

SparkQA commented Apr 1, 2016

Test build #54693 has finished for PR 11517 at commit 12e1b37.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member Author

viirya commented Apr 1, 2016

retest this please.

SparkQA commented Apr 1, 2016

Test build #54690 has finished for PR 11517 at commit 6dfecf1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class BatchPythonEvaluation(udfs: Seq[PythonUDF], output: Seq[Attribute], child: SparkPlan)
    • // enable memo iff we serialize the row with schema (schema and class should be memorized)

SparkQA commented Apr 1, 2016

Test build #54696 has finished for PR 11517 at commit 12e1b37.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

davies commented Apr 1, 2016

Chatted with @mengxr; it's OK to remove the class tag.

LGTM, merging this into master, thanks!

@asfgit asfgit closed this in 3e991db Apr 1, 2016
@viirya viirya deleted the add-wholestage-sample branch December 27, 2023 18:33