-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-13674][SQL] Add wholestage codegen support to Sample #11517
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| range/sample withRep.: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative | ||
| ------------------------------------------------------------------------------------------- | ||
| range/sample withRep. codegen=false 149 / 238 0.3 2908.4 1.0X | ||
| range/sample withRep. codegen=true 192 / 206 0.3 3751.7 0.8X |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The performance regression should be due to the current implementation of two samplers used in Sample operator. The samplers take iterator and return iterator. However, since we consume individual elements from parent operator in Sample, now this change needs to create another iterator from these elements.
I will try to implement another version of the two samplers without iterator. It should improve the performance here.
|
Test build #52460 has finished for PR 11517 at commit
|
|
Test build #52546 has finished for PR 11517 at commit
|
| * Return how many times the next item will be sampled. Return 0 if it is not sampled. | ||
| */ | ||
| def sample(): Int | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These changes on RandomSampler is submitted in #11578.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If #11578 is merged, I will rebase this.
|
Test build #52726 has finished for PR 11517 at commit
|
|
Test build #53665 has finished for PR 11517 at commit
|
|
|
||
| override def setSeed(seed: Long): Unit = rng.setSeed(seed) | ||
|
|
||
| override def sample(): Int = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we not have the code duplication and write the iterator version based on this function?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can. As the discussion in #11578, I am just not sure if we want to remove the iterator-based implementation.
JIRA: https://issues.apache.org/jira/browse/SPARK-13742 ## What changes were proposed in this pull request? `RandomSampler.sample` currently accepts iterator as input and output another iterator. This makes it inappropriate to use in wholestage codegen of `Sampler` operator apache#11517. This change is to add non-iterator interface to `RandomSampler`. This change adds a new method `def sample(): Int` to the trait `RandomSampler`. As we don't need to know the actual values of the sampling items, so this new method takes no arguments. This method will decide whether to sample the next item or not. It returns how many times the next item will be sampled. For `BernoulliSampler` and `BernoulliCellSampler`, the returned sampling times can only be 0 or 1. It simply means whether to sample the next item or not. For `PoissonSampler`, the returned value can be more than 1, meaning the next item will be sampled multiple times. ## How was this patch tested? Tests are added into `RandomSamplerSuite`. Author: Liang-Chi Hsieh <simonh@tw.ibm.com> Author: Liang-Chi Hsieh <viirya@appier.com> Author: Liang-Chi Hsieh <viirya@gmail.com> Closes apache#11578 from viirya/random-sampler-no-iterator.
…mple Conflicts: core/src/main/scala/org/apache/spark/util/random/RandomSampler.scala
|
Test build #54409 has finished for PR 11517 at commit
|
|
@davies This is rebased. Please take a look. Thanks. |
| | $initTerm = true; | ||
| | if ($input.hasNext()) { | ||
| | initRange(((InternalRow) $input.next()).getInt(0)); | ||
| | if (partitionIndex != -1) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
?
|
@viirya Could you also add |
| ctx.addMutableState(s"$samplerClass<UnsafeRow>", sampler, | ||
| s"$initSampler();") | ||
|
|
||
| val random = ctx.freshName("random") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since these are used in a separate function, we don't need to generate fresh name for them.
|
Test build #54614 has finished for PR 11517 at commit
|
…mple Conflicts: project/MimaExcludes.scala
|
Test build #54618 has finished for PR 11517 at commit
|
|
retest this please. |
|
Test build #54620 has finished for PR 11517 at commit
|
|
Not sure why the test failed.. |
|
retest this please. |
|
Test build #54629 has finished for PR 11517 at commit
|
…mple Conflicts: project/MimaExcludes.scala
|
Test build #54686 has finished for PR 11517 at commit
|
| */ | ||
| } | ||
|
|
||
| ignore("sort merge join/sample") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We may do not want this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok. Let me remove it.
|
Test build #54693 has finished for PR 11517 at commit
|
|
retest this please. |
|
Test build #54690 has finished for PR 11517 at commit
|
|
Test build #54696 has finished for PR 11517 at commit
|
|
Chatted with @mengxr , it's OK to remove the class tag. LGTM, merging this into master, thanks! |
JIRA: https://issues.apache.org/jira/browse/SPARK-13674
What changes were proposed in this pull request?
Sample operator doesn't support wholestage codegen now. This pr is to add support to it.
How was this patch tested?
A test is added into
BenchmarkWholeStageCodegen. Besides, all tests should be passed.