[SPARK-13742][Core] Add non-iterator interface to RandomSampler #11578
Conversation
Test build #52669 has finished for PR 11578 at commit

Test build #52710 has finished for PR 11578 at commit
    if (ub - lb <= 0.0) {
      if (complement) 1 else 0
    } else {
      if (complement) {
This could be simplified as:
val x = rng.nextDouble()
val n = if ((x >= lb) && (x < ub)) 1 else 0
if (complement) 1 - n else n
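Combined with the guard visible in the diff above, the whole method would presumably end up looking something like this (a sketch only, not the merged code; the fields lb, ub, complement, and rng are taken from the surrounding BernoulliCellSampler context):

```scala
// Sketch: return 1 if the next item should be kept, 0 otherwise,
// honoring the `complement` flag.
def sample(): Int = {
  if (ub - lb <= 0.0) {
    // Degenerate range: nothing falls inside [lb, ub).
    if (complement) 1 else 0
  } else {
    val x = rng.nextDouble()
    val n = if ((x >= lb) && (x < ub)) 1 else 0
    if (complement) 1 - n else n
  }
}
```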
Yea. Will change it.
@viirya I'm wondering whether we should generate the code for these
@davies Sure. I will benchmark that today.
Note: this benchmark is wrong. Updated below.

@davies I just benchmarked that using generated code for sample() with #11517. When

Without generated code for sample():

With generated code for sample():

Looks like it is faster with generated code. I will benchmark for
When

Without generated code for sample():

With generated code for sample():

So when
@davies What do you think? As the performance difference is insignificant, do we want to use generated code for sample() in the whole-stage codegen Sample operator?
Test build #52812 has finished for PR 11578 at commit
I made a mistake in the previous benchmark for

Updated benchmark here.

Without generated code for sample():

With generated code for sample():

The difference is insignificant too.
@viirya Thanks for the benchmark. It seems there is not much difference between using the Java version and generated Java code, right? Then we should go with the simpler approach (the current one).
cc @mengxr
    private val lnq = math.log1p(-f)

    /** Return true if the next item should be sampled. Otherwise, return false. */
    def sample(): Boolean = {
Should this return int (to be consistent with others)?
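For comparison, here is a rough sketch of what an Int-returning, gap-sampling sample() could look like, reusing the lnq = math.log1p(-f) field from the diff above. The names countForDropping, advance, and epsilon are illustrative assumptions, not necessarily the PR's identifiers, and f and rng are assumed to exist in the enclosing class:

```scala
// Sketch: gap sampling with a per-call interface. Instead of drawing one
// random number per input item, draw one geometric "gap" per accepted item
// and skip that many subsequent calls.
private val epsilon = 1e-10          // guard against log(0); assumed constant
private var countForDropping: Int = 0
advance()                            // initialize the first gap

def sample(): Int = {
  if (countForDropping > 0) {
    countForDropping -= 1            // still inside the current gap: skip
    0
  } else {
    advance()                        // draw the size of the next gap
    1                                // keep the current item
  }
}

private def advance(): Unit = {
  val u = math.max(rng.nextDouble(), epsilon)
  countForDropping = (math.log(u) / lnq).toInt
}
```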
Test build #53280 has finished for PR 11578 at commit
What is the sampling probability in your benchmark? Gap sampling is only useful when the sampling probability is small. Otherwise, we still need to generate many random numbers, which is probably more expensive than an iterator call.
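To make the trade-off concrete (a rough cost model for intuition, not a measurement from this thread): per-item Bernoulli sampling draws roughly one random number per input row, while gap sampling draws roughly one per accepted row, so the savings factor is about 1/f.

```scala
// Approximate number of random draws for n input rows at fraction f.
// At f = 0.8 gap sampling saves only ~20% of the draws; at f = 0.01 it
// saves ~100x, which is where it pays off.
def approxRandomDraws(n: Long, f: Double): (Double, Double) =
  (n.toDouble, f * n)   // (per-item draws, gap-sampling draws)
```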
Both are 0.8.
Do we generate more random numbers than with the iterator call? I think it should be the same. Besides, the non-iterator API should only be used in Sampler codegen #11517. It is expensive to create a new iterator from the values passed from the Sampler node's child operator.
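For illustration of why a per-call interface helps here (a sketch under assumed names, not the actual generated code): with sample(): Int, the per-row path can test each incoming row directly instead of wrapping the child operator's output in a fresh Iterator.

```scala
import scala.reflect.ClassTag
import org.apache.spark.util.random.BernoulliSampler

// Hypothetical per-row loop; `consume` stands in for whatever the downstream
// operator does with a kept row. No wrapper iterator is allocated around the
// child's output.
def sampleRows[T: ClassTag](childRows: Iterator[T], fraction: Double)
                           (consume: T => Unit): Unit = {
  val sampler = new BernoulliSampler[T](fraction)
  sampler.setSeed(42L)
  while (childRows.hasNext) {
    val row = childRows.next()
    if (sampler.sample() > 0) {
      consume(row)
    }
  }
}
```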
retest this please.

Test build #54014 has finished for PR 11578 at commit

Test build #54023 has finished for PR 11578 at commit

retest this please.

Test build #54047 has finished for PR 11578 at commit

retest this please.

1 similar comment

retest this please.

Test build #54129 has finished for PR 11578 at commit

Test build #54141 has finished for PR 11578 at commit

retest this please.

Test build #54153 has finished for PR 11578 at commit

this time it failed at pyspark GaussianMixtureModel...

retest this please.

1 similar comment

retest this please.

Test build #54163 has finished for PR 11578 at commit

Test build #54170 has finished for PR 11578 at commit

Test build #54179 has finished for PR 11578 at commit

Test build #54278 has finished for PR 11578 at commit
    override def sample(items: Iterator[T]): Iterator[T] = {
      if (ub - lb <= 0.0) {
        if (complement) items else Iterator.empty
I think we don't need to optimize this corner case; it's fine to use the default sample.
ok. Let me remove it.
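If the corner case is dropped, a default iterator-based sample built on the per-item interface is enough. A minimal sketch of that kind of default (assumed, not quoted from the PR):

```scala
// Sketch: express the iterator-based sample in terms of the per-item
// sample() call; each incoming item is emitted sample() times, which also
// covers PoissonSampler counts greater than 1.
def sample(items: Iterator[T]): Iterator[T] =
  items.flatMap { item =>
    val count = sample()
    if (count == 0) Iterator.empty else Iterator.fill(count)(item)
  }
```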
Test build #54305 has finished for PR 11578 at commit
LGTM

Merging into master, thanks!
JIRA: https://issues.apache.org/jira/browse/SPARK-13742
What changes were proposed in this pull request?
`RandomSampler.sample` currently accepts an iterator as input and outputs another iterator. This makes it inappropriate to use in whole-stage codegen of the `Sample` operator #11517. This change adds a non-iterator interface to `RandomSampler`.

It adds a new method `def sample(): Int` to the trait `RandomSampler`. Since we don't need to know the actual values of the sampled items, this new method takes no arguments. It decides whether to sample the next item or not, and returns how many times the next item will be sampled.
For `BernoulliSampler` and `BernoulliCellSampler`, the returned sampling count can only be 0 or 1; it simply indicates whether to sample the next item or not. For `PoissonSampler`, the returned value can be more than 1, meaning the next item will be sampled multiple times.

How was this patch tested?
Tests are added to `RandomSamplerSuite`.
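For illustration only (not part of the PR description), a small sketch of how a caller might drive the new per-item interface. The sampler type and setSeed come from org.apache.spark.util.random; the input data and the println stand-in for emitting rows are made up:

```scala
import org.apache.spark.util.random.PoissonSampler

object SampleSketch {
  def main(args: Array[String]): Unit = {
    val input = Iterator("a", "b", "c", "d", "e")

    // PoissonSampler can return counts greater than 1 (sampling with replacement).
    val sampler = new PoissonSampler[String](0.5)
    sampler.setSeed(7L)

    input.foreach { item =>
      var k = sampler.sample()   // 0, 1, or more copies of this item
      while (k > 0) {
        println(item)            // stand-in for emitting the row downstream
        k -= 1
      }
    }
  }
}
```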