2 changes: 1 addition & 1 deletion docs/mllib-guide.md
@@ -9,7 +9,7 @@ filtering, dimensionality reduction, as well as underlying optimization primitiv

* [Data types](mllib-basics.html)
* [Basic statistics](mllib-stats.html)
* data generators
* random data generation
* stratified sampling
* summary statistics
* hypothesis testing
74 changes: 73 additions & 1 deletion docs/mllib-stats.md
@@ -25,7 +25,79 @@ displayTitle: <a href="mllib-guide.html">MLlib</a> - Statistics Functionality
\newcommand{\zero}{\mathbf{0}}
\]`

## Data Generators
## Random data generation

Random data generation is useful for randomized algorithms, prototyping, and performance testing.
MLlib supports generating random RDDs with i.i.d. values drawn from a given distribution:
uniform, standard normal, or Poisson.

<div class="codetabs">
<div data-lang="scala" markdown="1">
[`RandomRDDs`](api/scala/index.html#org.apache.spark.mllib.random.RandomRDDs) provides factory
methods to generate random double RDDs or vector RDDs.
Contributor:

"methods to generate random double RDDs or vector RDDs": should we mention that a user can extend RandomDataGenerator and generate a random RDD of whatever custom object they want?

Contributor Author:

I marked RandomDataGenerator as a developer API and didn't mention it in the guide.
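For reference, extending `RandomDataGenerator` only requires implementing the `nextValue`, `setSeed`, and `copy` methods from the trait changed in this PR. A minimal hypothetical sketch (the `LetterGenerator` name and the `java.util.Random` backing are illustrative, not part of this PR):

```scala
import java.util.Random

import org.apache.spark.mllib.random.RandomDataGenerator

// Hypothetical custom generator producing i.i.d. random lowercase letters.
class LetterGenerator extends RandomDataGenerator[Char] {

  private val random = new Random()

  // Draw the next i.i.d. value.
  override def nextValue(): Char = ('a' + random.nextInt(26)).toChar

  // Reseed the underlying RNG (from the Pseudorandom trait).
  override def setSeed(seed: Long): Unit = random.setSeed(seed)

  // Return a fresh generator of the same type, one per partition.
  override def copy(): LetterGenerator = new LetterGenerator()
}
```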

The following example generates a random double RDD, whose values follow the standard normal
distribution `N(0, 1)`, and then maps it to `N(1, 4)`.

{% highlight scala %}
import org.apache.spark.SparkContext
import org.apache.spark.mllib.random.RandomRDDs._

val sc: SparkContext = ...

// Generate a random double RDD that contains 1 million i.i.d. values drawn from the
// standard normal distribution `N(0, 1)`, evenly distributed in 10 partitions.
val u = normalRDD(sc, 1000000L, 10)
// Apply a transform to get a random double RDD following `N(1, 4)`.
val v = u.map(x => 1.0 + 2.0 * x)
{% endhighlight %}
</div>

<div data-lang="java" markdown="1">
[`RandomRDDs`](api/java/index.html#org.apache.spark.mllib.random.RandomRDDs) provides factory
methods to generate random double RDDs or vector RDDs.
The following example generates a random double RDD, whose values follow the standard normal
distribution `N(0, 1)`, and then maps it to `N(1, 4)`.

{% highlight java %}
import org.apache.spark.api.java.JavaDoubleRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import static org.apache.spark.mllib.random.RandomRDDs.*;

JavaSparkContext jsc = ...

// Generate a random double RDD that contains 1 million i.i.d. values drawn from the
// standard normal distribution `N(0, 1)`, evenly distributed in 10 partitions.
JavaDoubleRDD u = normalJavaRDD(jsc, 1000000L, 10);
// Apply a transform to get a random double RDD following `N(1, 4)`.
// Note: `map` on a JavaDoubleRDD returns a JavaRDD<Double>.
JavaRDD<Double> v = u.map(
  new Function<Double, Double>() {
    public Double call(Double x) {
      return 1.0 + 2.0 * x;
    }
  });
{% endhighlight %}
</div>

<div data-lang="python" markdown="1">
[`RandomRDDs`](api/python/pyspark.mllib.random.RandomRDDs-class.html) provides factory
methods to generate random double RDDs or vector RDDs.
The following example generates a random double RDD, whose values follow the standard normal
distribution `N(0, 1)`, and then maps it to `N(1, 4)`.

{% highlight python %}
from pyspark.mllib.random import RandomRDDs

sc = ... # SparkContext

# Generate a random double RDD that contains 1 million i.i.d. values drawn from the
# standard normal distribution `N(0, 1)`, evenly distributed in 10 partitions.
u = RandomRDDs.normalRDD(sc, 1000000L, 10)
# Apply a transform to get a random double RDD following `N(1, 4)`.
v = u.map(lambda x: 1.0 + 2.0 * x)
{% endhighlight %}
</div>

</div>
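The examples above cover only the standard normal case; `RandomRDDs` also exposes factory methods for the other supported distributions and for vector RDDs. A sketch in the same style (method names `poissonRDD` and `normalVectorRDD` are taken from the `RandomRDDs` API; the specific parameter values are illustrative):

{% highlight scala %}
import org.apache.spark.SparkContext
import org.apache.spark.mllib.random.RandomRDDs._

val sc: SparkContext = ...

// Generate 1 million i.i.d. values drawn from a Poisson distribution with
// mean 2.0, evenly distributed in 10 partitions.
val p = poissonRDD(sc, 2.0, 1000000L, 10)
// Generate a vector RDD of 1 million rows, each a vector of length 3 whose
// entries are i.i.d. samples from `N(0, 1)`.
val m = normalVectorRDD(sc, 1000000L, 3, 10)
{% endhighlight %}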

## Stratified Sampling

@@ -20,14 +20,14 @@ package org.apache.spark.mllib.random
import cern.jet.random.Poisson
import cern.jet.random.engine.DRand

import org.apache.spark.annotation.Experimental
import org.apache.spark.annotation.DeveloperApi
import org.apache.spark.util.random.{XORShiftRandom, Pseudorandom}

/**
* :: Experimental ::
* :: DeveloperApi ::
* Trait for random data generators that generate i.i.d. data.
*/
@Experimental
@DeveloperApi
trait RandomDataGenerator[T] extends Pseudorandom with Serializable {

/**
@@ -43,10 +43,10 @@ trait RandomDataGenerator[T] extends Pseudorandom with Serializable {
}

/**
* :: Experimental ::
* :: DeveloperApi ::
* Generates i.i.d. samples from U[0.0, 1.0]
*/
@Experimental
@DeveloperApi
class UniformGenerator extends RandomDataGenerator[Double] {

// XORShiftRandom for better performance. Thread safety isn't necessary here.
@@ -62,10 +62,10 @@ class UniformGenerator extends RandomDataGenerator[Double] {
}

/**
* :: Experimental ::
* :: DeveloperApi ::
* Generates i.i.d. samples from the standard normal distribution.
*/
@Experimental
@DeveloperApi
class StandardNormalGenerator extends RandomDataGenerator[Double] {

// XORShiftRandom for better performance. Thread safety isn't necessary here.
@@ -81,12 +81,12 @@ class StandardNormalGenerator extends RandomDataGenerator[Double] {
}

/**
* :: Experimental ::
* :: DeveloperApi ::
* Generates i.i.d. samples from the Poisson distribution with the given mean.
*
* @param mean mean for the Poisson distribution.
*/
@Experimental
@DeveloperApi
class PoissonGenerator(val mean: Double) extends RandomDataGenerator[Double] {

private var rng = new Poisson(mean, new DRand)