##
###Microsoft.Spark.CSharp.Core.Accumulator ####Summary A shared variable that can be accumulated, i.e., has a commutative and associative "add"
operation. Worker tasks on a Spark cluster can add values to an Accumulator with the +=
operator, but only the driver program is allowed to access its value, using Value.
Updates from the workers get propagated automatically to the driver program.
While supports accumulators for primitive data types like int and
float, users can also define accumulators for custom types by providing a custom
object. Refer to the doctest of this module for an example.
See python implementation in accumulators.py, worker.py, PythonRDD.scala
####Methods
Name | Description |
---|---|
Adds a term to this accumulator's value | |
The += operator; adds a term to this accumulator's value | |
Creates and returns a string representation of the current accumulator | |
Provide a "zero value" for the type | |
Add two values of the accumulator's data type, returning a new value; |
###Microsoft.Spark.CSharp.Core.Accumulator`1 ####Summary
A generic version of where the element type is specified by the driver program.
The type of element in the accumulator.
####Methods
Name | Description |
---|---|
Add | Adds a term to this accumulator's value |
op_Addition | The += operator; adds a term to this accumulator's value |
ToString | Creates and returns a string representation of the current accumulator |
###Microsoft.Spark.CSharp.Core.AccumulatorParam`1 ####Summary
An AccumulatorParam that uses the + operators to add values. Designed for simple types
such as integers, floats, and lists. Requires the zero value for the underlying type
as a parameter.
####Methods
Name | Description |
---|---|
Zero | Provide a "zero value" for the type |
AddInPlace | Add two values of the accumulator's data type, returning a new value; |
###Microsoft.Spark.CSharp.Core.AccumulatorServer ####Summary
A simple TCP server that intercepts shutdown() in order to interrupt
our continuous polling on the handler.
###Microsoft.Spark.CSharp.Core.Broadcast ####Summary
A broadcast variable created with SparkContext.Broadcast().
Access its value through Value.
var b = sc.Broadcast(new int[] {1, 2, 3, 4, 5})
b.Value
[1, 2, 3, 4, 5]
sc.Parallelize(new in[] {0, 0}).FlatMap(x: b.Value).Collect()
[1, 2, 3, 4, 5, 1, 2, 3, 4, 5]
b.Unpersist()
See python implementation in broadcast.py, worker.py, PythonRDD.scala
####Methods
Name | Description |
---|---|
Delete cached copies of this broadcast on the executors. |
###Microsoft.Spark.CSharp.Core.Broadcast`1 ####Summary
A generic version of where the element can be specified.
The type of element in Broadcast
####Methods
Name | Description |
---|---|
Unpersist | Delete cached copies of this broadcast on the executors. |
###Microsoft.Spark.CSharp.Core.CSharpWorkerFunc ####Summary
Function that will be executed in CSharpWorker
####Methods
Name | Description |
---|---|
Chain | Used to chain functions |
###Microsoft.Spark.CSharp.Core.Option`1 ####Summary
Container for an optional value of type T. If the value of type T is present, the Option.IsDefined is TRUE and GetValue() return the value.
If the value is absent, the Option.IsDefined is FALSE, exception will be thrown when calling GetValue().
####Methods
Name | Description |
---|---|
GetValue | Returns the value of the option if Option.IsDefined is TRUE; otherwise, throws an . |
###Microsoft.Spark.CSharp.Core.Partitioner ####Summary
An object that defines how the elements in a key-value pair RDD are partitioned by key.
Maps each key to a partition ID, from 0 to "numPartitions - 1".
####Methods
Name | Description |
---|---|
Equals | Determines whether the specified object is equal to the current object. |
GetHashCode | Serves as the default hash function. |
###Microsoft.Spark.CSharp.Core.RDDCollector ####Summary
Used for collect operation on RDD
###Microsoft.Spark.CSharp.Core.DoubleRDDFunctions ####Summary
Extra functions available on RDDs of Doubles through an implicit conversion.
####Methods
Name | Description |
---|---|
Sum | Add up the elements in this RDD. sc.Parallelize(new double[] {1.0, 2.0, 3.0}).Sum() 6.0 |
Stats | Return a object that captures the mean, variance and count of the RDD's elements in one operation. |
Histogram | Compute a histogram using the provided buckets. The buckets are all open to the right except for the last which is closed. e.g. [1,10,20,50] means the buckets are [1,10) [10,20) [20,50], which means 1<=x<10, 10<=x<20, 20<=x<=50. And on the input of 1 and 50 we would have a histogram of 1,0,1. If your histogram is evenly spaced (e.g. [0, 10, 20, 30]), this can be switched from an O(log n) inseration to O(1) per element(where n = # buckets). Buckets must be sorted and not contain any duplicates, must be at least two elements. If `buckets` is a number, it will generates buckets which are evenly spaced between the minimum and maximum of the RDD. For example, if the min value is 0 and the max is 100, given buckets as 2, the resulting buckets will be [0,50) [50,100]. buckets must be at least 1 If the RDD contains infinity, NaN throws an exception If the elements in RDD do not vary (max == min) always returns a single bucket. It will return an tuple of buckets and histogram. >>> rdd = sc.parallelize(range(51)) >>> rdd.histogram(2) ([0, 25, 50], [25, 26]) >>> rdd.histogram([0, 5, 25, 50]) ([0, 5, 25, 50], [5, 20, 26]) >>> rdd.histogram([0, 15, 30, 45, 60]) # evenly spaced buckets ([0, 15, 30, 45, 60], [15, 15, 15, 6]) >>> rdd = sc.parallelize(["ab", "ac", "b", "bd", "ef"]) >>> rdd.histogram(("a", "b", "c")) (('a', 'b', 'c'), [2, 2]) |
Mean | Compute the mean of this RDD's elements. sc.Parallelize(new double[]{1, 2, 3}).Mean() 2.0 |
Variance | Compute the variance of this RDD's elements. sc.Parallelize(new double[]{1, 2, 3}).Variance() 0.666... |
Stdev | Compute the standard deviation of this RDD's elements. sc.Parallelize(new double[]{1, 2, 3}).Stdev() 0.816... |
SampleStdev | Compute the sample standard deviation of this RDD's elements (which corrects for bias in estimating the standard deviation by dividing by N-1 instead of N). sc.Parallelize(new double[]{1, 2, 3}).SampleStdev() 1.0 |
SampleVariance | Compute the sample variance of this RDD's elements (which corrects for bias in estimating the variance by dividing by N-1 instead of N). sc.Parallelize(new double[]{1, 2, 3}).SampleVariance() 1.0 |
###Microsoft.Spark.CSharp.Core.IRDDCollector ####Summary
Interface for collect operation on RDD
###Microsoft.Spark.CSharp.Core.OrderedRDDFunctions ####Summary
Extra functions available on RDDs of (key, value) pairs where the key is sortable through
a function to sort the key.
####Methods
Name | Description |
---|---|
SortByKey``2 | Sorts this RDD, which is assumed to consist of KeyValuePair pairs. |
SortByKey``3 | Sorts this RDD, which is assumed to consist of KeyValuePairs. If key is type of string, case is sensitive. |
repartitionAndSortWithinPartitions``2 | Repartition the RDD according to the given partitioner and, within each resulting partition, sort records by their keys. This is more efficient than calling `repartition` and then sorting within each partition because it can push the sorting down into the shuffle machinery. |
###Microsoft.Spark.CSharp.Core.PairRDDFunctions ####Summary
operations only available to KeyValuePair RDD
See also http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.PairRDDFunctions
####Methods
Name | Description |
---|---|
CollectAsMap``2 | Return the key-value pairs in this RDD to the master as a dictionary. var m = sc.Parallelize(new[] { new KeyValuePair<int, int>(1, 2), new KeyValuePair<int, int>(3, 4) }, 1).CollectAsMap() m[1] 2 m[3] 4 |
Keys``2 | Return an RDD with the keys of each tuple. >>> m = sc.Parallelize(new[] { new KeyValuePair<int, int>(1, 2), new KeyValuePair<int, int>(3, 4) }, 1).Keys().Collect() [1, 3] |
Values``2 | Return an RDD with the values of each tuple. >>> m = sc.Parallelize(new[] { new KeyValuePair<int, int>(1, 2), new KeyValuePair<int, int>(3, 4) }, 1).Values().Collect() [2, 4] |
ReduceByKey``2 | Merge the values for each key using an associative reduce function. This will also perform the merging locally on each mapper before sending results to a reducer, similarly to a "combiner" in MapReduce. Output will be hash-partitioned with partitions, or the default parallelism level if is not specified. sc.Parallelize(new[] { new KeyValuePair<string, int>("a", 1), new KeyValuePair<string, int>("b", 1), new KeyValuePair<string, int>("a", 1) }, 2) .ReduceByKey((x, y) => x + y).Collect() [('a', 2), ('b', 1)] |
ReduceByKeyLocally``2 | Merge the values for each key using an associative reduce function, but return the results immediately to the master as a dictionary. This will also perform the merging locally on each mapper before sending results to a reducer, similarly to a "combiner" in MapReduce. sc.Parallelize(new[] { new KeyValuePair<string, int>("a", 1), new KeyValuePair<string, int>("b", 1), new KeyValuePair<string, int>("a", 1) }, 2) .ReduceByKeyLocally((x, y) => x + y).Collect() [('a', 2), ('b', 1)] |
CountByKey``2 | Count the number of elements for each key, and return the result to the master as a dictionary. sc.Parallelize(new[] { new KeyValuePair<string, int>("a", 1), new KeyValuePair<string, int>("b", 1), new KeyValuePair<string, int>("a", 1) }, 2) .CountByKey((x, y) => x + y).Collect() [('a', 2), ('b', 1)] |
Join``3 | Return an RDD containing all pairs of elements with matching keys in this RDD and . Each pair of elements will be returned as a (k, (v1, v2)) tuple, where (k, v1) is in this RDD and (k, v2) is in . Performs a hash join across the cluster. var l = sc.Parallelize( new[] { new KeyValuePair<string, int>("a", 1), new KeyValuePair<string, int>("b", 4) }, 1); var r = sc.Parallelize( new[] { new KeyValuePair<string, int>("a", 2), new KeyValuePair<string, int>("a", 3) }, 1); var m = l.Join(r, 2).Collect(); [('a', (1, 2)), ('a', (1, 3))] |
LeftOuterJoin``3 | Perform a left outer join of this RDD and . For each element (k, v) in this RDD, the resulting RDD will either contain all pairs (k, (v, Option)) for w in , where Option.IsDefined is TRUE, or the pair (k, (v, Option)) if no elements in have key k, where Option.IsDefined is FALSE. Hash-partitions the resulting RDD into the given number of partitions. var l = sc.Parallelize( new[] { new KeyValuePair<string, int>("a", 1), new KeyValuePair<string, int>("b", 4) }, 1); var r = sc.Parallelize( new[] { new KeyValuePair<string, int>("a", 2) }, 1); var m = l.LeftOuterJoin(r).Collect(); [('a', (1, 2)), ('b', (4, Option))] * Option.IsDefined = FALSE |
RightOuterJoin``3 | Perform a right outer join of this RDD and . For each element (k, w) in , the resulting RDD will either contain all pairs (k, (Option, w)) for v in this, where Option.IsDefined is TRUE, or the pair (k, (Option, w)) if no elements in this RDD have key k, where Option.IsDefined is FALSE. Hash-partitions the resulting RDD into the given number of partitions. var l = sc.Parallelize( new[] { new KeyValuePair<string, int>("a", 2) }, 1); var r = sc.Parallelize( new[] { new KeyValuePair<string, int>("a", 1), new KeyValuePair<string, int>("b", 4) }, 1); var m = l.RightOuterJoin(r).Collect(); [('a', (2, 1)), ('b', (Option, 4))] * Option.IsDefined = FALSE |
FullOuterJoin``3 | Perform a full outer join of this RDD and . For each element (k, v) in this RDD, the resulting RDD will either contain all pairs (k, (v, w)) for w in , or the pair (k, (v, None)) if no elements in have key k. Similarly, for each element (k, w) in , the resulting RDD will either contain all pairs (k, (v, w)) for v in this RDD, or the pair (k, (None, w)) if no elements in this RDD have key k. Hash-partitions the resulting RDD into the given number of partitions. var l = sc.Parallelize( new[] { new KeyValuePair<string, int>("a", 1), KeyValuePair<string, int>("b", 4) }, 1); var r = sc.Parallelize( new[] { new KeyValuePair<string, int>("a", 2), new KeyValuePair<string, int>("c", 8) }, 1); var m = l.FullOuterJoin(r).Collect(); [('a', (1, 2)), ('b', (4, None)), ('c', (None, 8))] |
PartitionBy``2 | Return a copy of the RDD partitioned using the specified partitioner. sc.Parallelize(new[] { 1, 2, 3, 4, 2, 4, 1 }, 1).Map(x => new KeyValuePair<int, int>(x, x)).PartitionBy(3).Glom().Collect() |
CombineByKey``3 | # TODO: add control over map-side aggregation Generic function to combine the elements for each key using a custom set of aggregation functions. Turns an RDD[(K, V)] into a result of type RDD[(K, C)], for a "combined type" C. Note that V and C can be different -- for example, one might group an RDD of type (Int, Int) into an RDD of type (Int, List[Int]). Users provide three functions: - , which turns a V into a C (e.g., creates a one-element list) - , to merge a V into a C (e.g., adds it to the end of a list) - , to combine two C's into a single one. In addition, users can control the partitioning of the output RDD. sc.Parallelize( new[] { new KeyValuePair<string, int>("a", 1), new KeyValuePair<string, int>("b", 1), new KeyValuePair<string, int>("a", 1) }, 2) .CombineByKey(() => string.Empty, (x, y) => x + y.ToString(), (x, y) => x + y).Collect() [('a', '11'), ('b', '1')] |
AggregateByKey``3 | Aggregate the values of each key, using given combine functions and a neutral "zero value". This function can return a different result type, U, than the type of the values in this RDD, V. Thus, we need one operation for merging a V into a U and one operation for merging two U's, The former operation is used for merging values within a partition, and the latter is used for merging values between partitions. To avoid memory allocation, both of these functions are allowed to modify and return their first argument instead of creating a new U. sc.Parallelize( new[] { new KeyValuePair<string, int>("a", 1), new KeyValuePair<string, int>("b", 1), new KeyValuePair<string, int>("a", 1) }, 2) .CombineByKey(() => string.Empty, (x, y) => x + y.ToString(), (x, y) => x + y).Collect() [('a', 2), ('b', 1)] |
FoldByKey``2 | Merge the values for each key using an associative function "func" and a neutral "zeroValue" which may be added to the result an arbitrary number of times, and must not change the result (e.g., 0 for addition, or 1 for multiplication.). sc.Parallelize( new[] { new KeyValuePair<string, int>("a", 1), new KeyValuePair<string, int>("b", 1), new KeyValuePair<string, int>("a", 1) }, 2) .CombineByKey(() => string.Empty, (x, y) => x + y.ToString(), (x, y) => x + y).Collect() [('a', 2), ('b', 1)] |
GroupByKey``2 | Group the values for each key in the RDD into a single sequence. Hash-partitions the resulting RDD with numPartitions partitions. Note: If you are grouping in order to perform an aggregation (such as a sum or average) over each key, using reduceByKey or aggregateByKey will provide much better performance. sc.Parallelize( new[] { new KeyValuePair<string, int>("a", 1), new KeyValuePair<string, int>("b", 1), new KeyValuePair<string, int>("a", 1) }, 2) .GroupByKey().MapValues(l => string.Join(" ", l)).Collect() [('a', [1, 1]), ('b', [1])] |
MapValues``3 | Pass each value in the key-value pair RDD through a map function without changing the keys; this also retains the original RDD's partitioning. sc.Parallelize( new[] { new KeyValuePair<string, string[]>("a", new[]{"apple", "banana", "lemon"}), new KeyValuePair<string, string[]>("b", new[]{"grapes"}) }, 2) .MapValues(x => x.Length).Collect() [('a', 3), ('b', 1)] |
FlatMapValues``3 | Pass each value in the key-value pair RDD through a flatMap function without changing the keys; this also retains the original RDD's partitioning. x = sc.Parallelize( new[] { new KeyValuePair<string, string[]>("a", new[]{"x", "y", "z"}), new KeyValuePair<string, string[]>("b", new[]{"p", "r"}) }, 2) .FlatMapValues(x => x).Collect() [('a', 'x'), ('a', 'y'), ('a', 'z'), ('b', 'p'), ('b', 'r')] |
MapPartitionsWithIndex``5 | explicitly convert KeyValuePair<K, V> to KeyValuePair<K, dynamic> since they are incompatibles types unlike V to dynamic |
GroupWith``3 | For each key k in this RDD or , return a resulting RDD that contains a tuple with the list of values for that key in this RDD as well as . var x = sc.Parallelize(new[] { new KeyValuePair<string, int>("a", 1), new KeyValuePair<string, int>("b", 4) }, 2); var y = sc.Parallelize(new[] { new KeyValuePair<string, int>("a", 2) }, 1); x.GroupWith(y).Collect(); [('a', ([1], [2])), ('b', ([4], []))] |
GroupWith``4 | var x = sc.Parallelize(new[] { new KeyValuePair<string, int>("a", 5), new KeyValuePair<string, int>("b", 6) }, 2); var y = sc.Parallelize(new[] { new KeyValuePair<string, int>("a", 1), new KeyValuePair<string, int>("b", 4) }, 2); var z = sc.Parallelize(new[] { new KeyValuePair<string, int>("a", 2) }, 1); x.GroupWith(y, z).Collect(); |
GroupWith``5 | var x = sc.Parallelize(new[] { new KeyValuePair<string, int>("a", 5), new KeyValuePair<string, int>("b", 6) }, 2); var y = sc.Parallelize(new[] { new KeyValuePair<string, int>("a", 1), new KeyValuePair<string, int>("b", 4) }, 2); var z = sc.Parallelize(new[] { new KeyValuePair<string, int>("a", 2) }, 1); var w = sc.Parallelize(new[] { new KeyValuePair<string, int>("b", 42) }, 1); var m = x.GroupWith(y, z, w).MapValues(l => string.Join(" ", l.Item1) + " : " + string.Join(" ", l.Item2) + " : " + string.Join(" ", l.Item3) + " : " + string.Join(" ", l.Item4)).Collect(); |
SubtractByKey``3 | Return each (key, value) pair in this RDD that has no pair with matching key in . var x = sc.Parallelize(new[] { new KeyValuePair<string, int?>("a", 1), new KeyValuePair<string, int?>("b", 4), new KeyValuePair<string, int?>("b", 5), new KeyValuePair<string, int?>("a", 2) }, 2); var y = sc.Parallelize(new[] { new KeyValuePair<string, int?>("a", 3), new KeyValuePair<string, int?>("c", null) }, 2); x.SubtractByKey(y).Collect(); [('b', 4), ('b', 5)] |
Lookup``2 | Return the list of values in the RDD for key `key`. This operation is done efficiently if the RDD has a known partitioner by only searching the partition that the key maps to. >>> l = range(1000) >>> rdd = sc.Parallelize(Enumerable.Range(0, 1000).Zip(Enumerable.Range(0, 1000), (x, y) => new KeyValuePair<int, int>(x, y)), 10) >>> rdd.lookup(42) [42] |
SaveAsNewAPIHadoopDataset``2 | Output a Python RDD of key-value pairs (of form RDD[(K, V)]) to any Hadoop file system, using the new Hadoop OutputFormat API (mapreduce package). Keys/values are converted for output using either user specified converters or, by default, org.apache.spark.api.python.JavaToWritableConverter. |
SaveAsNewAPIHadoopFile``2 | |
SaveAsHadoopDataset``2 | Output a Python RDD of key-value pairs (of form RDD[(K, V)]) to any Hadoop file system, using the old Hadoop OutputFormat API (mapred package). Keys/values are converted for output using either user specified converters or, by default, org.apache.spark.api.python.JavaToWritableConverter. |
SaveAsHadoopFile``2 | Output a Python RDD of key-value pairs (of form RDD[(K, V)]) to any Hadoop file system, using the old Hadoop OutputFormat API (mapred package). Key and value types will be inferred if not specified. Keys and values are converted for output using either user specified converters or org.apache.spark.api.python.JavaToWritableConverter. The is applied on top of the base Hadoop conf associated with the SparkContext of this RDD to create a merged Hadoop MapReduce job configuration for saving the data. |
SaveAsSequenceFile``2 | Output a Python RDD of key-value pairs (of form RDD[(K, V)]) to any Hadoop file system, using the org.apache.hadoop.io.Writable types that we convert from the RDD's key and value types. The mechanism is as follows: 1. Pyrolite is used to convert pickled Python RDD into RDD of Java objects. 2. Keys and values of this Java RDD are converted to Writables and written out. |
NullIfEmpty``1 | Converts a collection to a list where the element type is Option(T) type. If the collection is empty, just returns the empty list. |
###Microsoft.Spark.CSharp.Core.PipelinedRDD`1 ####Summary
Wraps C#-based transformations that can be executed within a stage. It helps avoid unnecessary Ser/De of data between
JVM and CLR to execute C# transformations and pipelines them
####Methods
Name | Description |
---|---|
MapPartitionsWithIndex``1 | Return a new RDD by applying a function to each partition of this RDD, while tracking the index of the original partition. |
###Microsoft.Spark.CSharp.Core.PriorityQueue`1 ####Summary
A bounded priority queue implemented with max binary heap.
Construction steps:
1. Build a Max Heap of the first k elements.
2. For each element after the kth element, compare it with the root of the max heap,
a. If the element is less than the root, replace root with this element, heapify.
b. Else ignore it.
####Methods
Name | Description |
---|---|
Offer | Inserts the specified element into this priority queue. |
###Microsoft.Spark.CSharp.Core.Profiler ####Summary
A class represents a profiler
###Microsoft.Spark.CSharp.Core.RDD`1 ####Summary
Represents a Resilient Distributed Dataset (RDD), the basic abstraction in Spark. Represents an immutable,
partitioned collection of elements that can be operated on in parallel
See also http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.RDD
Type of the RDD
####Methods
Name | Description |
---|---|
Cache | Persist this RDD with the default storage level . |
Persist | Set this RDD's storage level to persist its values across operations after the first time it is computed. This can only be used to assign a new storage level if the RDD does not have a storage level set yet. If no storage level is specified defaults to . sc.Parallelize(new string[] {"b", "a", "c").Persist().isCached True |
Unpersist | Mark the RDD as non-persistent, and remove all blocks for it from memory and disk. |
Checkpoint | Mark this RDD for checkpointing. It will be saved to a file inside the checkpoint directory set with ) and all references to its parent RDDs will be removed. This function must be called before any job has been executed on this RDD. It is strongly recommended that this RDD is persisted in memory, otherwise saving it on a file will require recomputation. |
GetNumPartitions | Returns the number of partitions of this RDD. |
Map``1 | Return a new RDD by applying a function to each element of this RDD. sc.Parallelize(new string[]{"b", "a", "c"}, 1).Map(x => new KeyValuePair<string, int>(x, 1)).Collect() [('a', 1), ('b', 1), ('c', 1)] |
FlatMap``1 | Return a new RDD by first applying a function to all elements of this RDD, and then flattening the results. sc.Parallelize(new int[] {2, 3, 4}, 1).FlatMap(x => Enumerable.Range(1, x - 1)).Collect() [1, 1, 1, 2, 2, 3] |
MapPartitions``1 | Return a new RDD by applying a function to each partition of this RDD. sc.Parallelize(new int[] {1, 2, 3, 4}, 2).MapPartitions(iter => new[]{iter.Sum(x => (x as decimal?))}).Collect() [3, 7] |
MapPartitionsWithIndex``1 | Return a new RDD by applying a function to each partition of this RDD, while tracking the index of the original partition. sc.Parallelize(new int[]{1, 2, 3, 4}, 4).MapPartitionsWithIndex<double>((pid, iter) => (double)pid).Sum() 6 |
Filter | Return a new RDD containing only the elements that satisfy a predicate. sc.Parallelize(new int[]{1, 2, 3, 4, 5}, 1).Filter(x => x % 2 == 0).Collect() [2, 4] |
Distinct | Return a new RDD containing the distinct elements in this RDD. >>> sc.Parallelize(new int[] {1, 1, 2, 3}, 1).Distinct().Collect() [1, 2, 3] |
Sample | Return a sampled subset of this RDD. var rdd = sc.Parallelize(Enumerable.Range(0, 100), 4) 6 <= rdd.Sample(False, 0.1, 81).count() <= 14 true |
RandomSplit | Randomly splits this RDD with the provided weights. var rdd = sc.Parallelize(Enumerable.Range(0, 500), 1) var rdds = rdd.RandomSplit(new double[] {2, 3}, 17) 150 < rdds[0].Count() < 250 250 < rdds[1].Count() < 350 |
TakeSample | Return a fixed-size sampled subset of this RDD. var rdd = sc.Parallelize(Enumerable.Range(0, 10), 2) rdd.TakeSample(true, 20, 1).Length 20 rdd.TakeSample(false, 5, 2).Length 5 rdd.TakeSample(false, 15, 3).Length 10 |
ComputeFractionForSampleSize | Returns a sampling rate that guarantees a sample of size >= sampleSizeLowerBound 99.99% of the time. How the sampling rate is determined: Let p = num / total, where num is the sample size and total is the total number of data points in the RDD. We're trying to compute q > p such that - when sampling with replacement, we're drawing each data point with prob_i ~ Pois(q), where we want to guarantee Pr[s < num] < 0.0001 for s = sum(prob_i for i from 0 to total), i.e. the failure rate of not having a sufficiently large sample < 0.0001. Setting q = p + 5 * sqrt(p/total) is sufficient to guarantee 0.9999 success rate for num > 12, but we need a slightly larger q (9 empirically determined). - when sampling without replacement, we're drawing each data point with prob_i ~ Binomial(total, fraction) and our choice of q guarantees 1-delta, or 0.9999 success rate, where success rate is defined the same as in sampling with replacement. |
Union | Return the union of this RDD and another one. var rdd = sc.Parallelize(new int[] { 1, 1, 2, 3 }, 1) rdd.union(rdd).collect() [1, 1, 2, 3, 1, 1, 2, 3] |
Intersection | Return the intersection of this RDD and another one. The output will not contain any duplicate elements, even if the input RDDs did. Note that this method performs a shuffle internally. var rdd1 = sc.Parallelize(new int[] { 1, 10, 2, 3, 4, 5 }, 1) var rdd2 = sc.Parallelize(new int[] { 1, 6, 2, 3, 7, 8 }, 1) var rdd1.Intersection(rdd2).Collect() [1, 2, 3] |
Glom | Return an RDD created by coalescing all elements within each partition into a list. var rdd = sc.Parallelize(new int[] { 1, 2, 3, 4 }, 2) rdd.Glom().Collect() [[1, 2], [3, 4]] |
Cartesian``1 | Return the Cartesian product of this RDD and another one, that is, the RDD of all pairs of elements (a, b) where a is in self and b is in other. rdd = sc.Parallelize(new int[] { 1, 2 }, 1) rdd.Cartesian(rdd).Collect() [(1, 1), (1, 2), (2, 1), (2, 2)] |
GroupBy``1 | Return an RDD of grouped items. Each group consists of a key and a sequence of elements mapping to that key. The ordering of elements within each group is not guaranteed, and may even differ each time the resulting RDD is evaluated. Note: This operation may be very expensive. If you are grouping in order to perform an aggregation (such as a sum or average) over each key, using [[PairRDDFunctions.aggregateByKey]] or [[PairRDDFunctions.reduceByKey]] will provide much better performance. >>> rdd = sc.Parallelize(new int[] { 1, 1, 2, 3, 5, 8 }, 1) >>> result = rdd.GroupBy(lambda x: x % 2).Collect() [(0, [2, 8]), (1, [1, 1, 3, 5])] |
Pipe | Return an RDD created by piping elements to a forked external process. >>> sc.Parallelize(new char[] { '1', '2', '3', '4' }, 1).Pipe("cat").Collect() [u'1', u'2', u'3', u'4'] |
Foreach | Applies a function to all elements of this RDD. sc.Parallelize(new int[] { 1, 2, 3, 4, 5 }, 1).Foreach(x => Console.Write(x)) |
ForeachPartition | Applies a function to each partition of this RDD. sc.parallelize(new int[] { 1, 2, 3, 4, 5 }, 1).ForeachPartition(iter => { foreach (var x in iter) Console.Write(x + " "); }) |
Collect | Return a list that contains all of the elements in this RDD. |
Reduce | Reduces the elements of this RDD using the specified commutative and associative binary operator. sc.Parallelize(new int[] { 1, 2, 3, 4, 5 }, 1).Reduce((x, y) => x + y) 15 |
TreeReduce | Reduces the elements of this RDD in a multi-level tree pattern. >>> add = lambda x, y: x + y >>> rdd = sc.Parallelize(new int[] { -5, -4, -3, -2, -1, 1, 2, 3, 4 }, 10).TreeReduce((x, y) => x + y)) >>> rdd.TreeReduce(add) -5 >>> rdd.TreeReduce(add, 1) -5 >>> rdd.TreeReduce(add, 2) -5 >>> rdd.TreeReduce(add, 5) -5 >>> rdd.TreeReduce(add, 10) -5 |
Fold | Aggregate the elements of each partition, and then the results for all the partitions, using a given associative and commutative function and a neutral "zero value." The function op(t1, t2) is allowed to modify t1 and return it as its result value to avoid object allocation; however, it should not modify t2. This behaves somewhat differently from fold operations implemented for non-distributed collections in functional languages like Scala. This fold operation may be applied to partitions individually, and then fold those results into the final result, rather than apply the fold to each element sequentially in some defined ordering. For functions that are not commutative, the result may differ from that of a fold applied to a non-distributed collection. >>> from operator import add >>> sc.parallelize([1, 2, 3, 4, 5]).fold(0, add) 15 |
Aggregate``1 | Aggregate the elements of each partition, and then the results for all the partitions, using a given combine functions and a neutral "zero value." The functions op(t1, t2) is allowed to modify t1 and return it as its result value to avoid object allocation; however, it should not modify t2. The first function (seqOp) can return a different result type, U, than the type of this RDD. Thus, we need one operation for merging a T into an U and one operation for merging two U >>> sc.parallelize(new int[] { 1, 2, 3, 4 }, 1).Aggregate(0, (x, y) => x + y, (x, y) => x + y)) 10 |
TreeAggregate``1 | Aggregates the elements of this RDD in a multi-level tree pattern. rdd = sc.Parallelize(new int[] { 1, 2, 3, 4 }, 1).TreeAggregate(0, (x, y) => x + y, (x, y) => x + y)) 10 |
Count | Return the number of elements in this RDD. |
CountByValue | Return the count of each unique value in this RDD as a dictionary of (value, count) pairs. sc.Parallelize(new int[] { 1, 2, 1, 2, 2 }, 2).CountByValue()) [(1, 2), (2, 3)] |
Take | Take the first num elements of the RDD. It works by first scanning one partition, and use the results from that partition to estimate the number of additional partitions needed to satisfy the limit. Translated from the Scala implementation in RDD#take(). sc.Parallelize(new int[] { 2, 3, 4, 5, 6 }, 2).Cache().Take(2))) [2, 3] sc.Parallelize(new int[] { 2, 3, 4, 5, 6 }, 2).Take(10) [2, 3, 4, 5, 6] sc.Parallelize(Enumerable.Range(0, 100), 100).Filter(x => x > 90).Take(3) [91, 92, 93] |
First | Return the first element in this RDD. >>> sc.Parallelize(new int[] { 2, 3, 4 }, 2).First() 2 |
IsEmpty | Returns true if and only if the RDD contains no elements at all. Note that an RDD may be empty even when it has at least 1 partition. sc.Parallelize(new int[0], 1).isEmpty() true sc.Parallelize(new int[] {1}).isEmpty() false |
Subtract | Return each value in this RDD that is not contained in . var x = sc.Parallelize(new int[] { 1, 2, 3, 4 }, 1) var y = sc.Parallelize(new int[] { 3 }, 1) x.Subtract(y).Collect()) [1, 2, 4] |
KeyBy``1 | Creates tuples of the elements in this RDD by applying . sc.Parallelize(new int[] { 1, 2, 3, 4 }, 1).KeyBy(x => x * x).Collect()) (1, 1), (4, 2), (9, 3), (16, 4) |
Repartition | Return a new RDD that has exactly numPartitions partitions. Can increase or decrease the level of parallelism in this RDD. Internally, this uses a shuffle to redistribute data. If you are decreasing the number of partitions in this RDD, consider using `Coalesce`, which can avoid performing a shuffle. var rdd = sc.Parallelize(new int[] { 1, 2, 3, 4, 5, 6, 7 }, 4) rdd.Glom().Collect().Length 4 rdd.Repartition(2).Glom().Collect().Length 2 |
Coalesce | Return a new RDD that is reduced into `numPartitions` partitions. sc.Parallelize(new int[] { 1, 2, 3, 4, 5 }, 3).Glom().Collect().Length 3 >>> sc.Parallelize(new int[] { 1, 2, 3, 4, 5 }, 3).Coalesce(1).Glom().Collect().Length 1 |
Zip``1 | Zips this RDD with another one, returning key-value pairs with the first element in each RDD second element in each RDD, etc. Assumes that the two RDDs have the same number of partitions and the same number of elements in each partition (e.g. one was made through a map on the other). var x = sc.parallelize(range(0,5)) var y = sc.parallelize(range(1000, 1005)) x.Zip(y).Collect() [(0, 1000), (1, 1001), (2, 1002), (3, 1003), (4, 1004)] |
ZipWithIndex | Zips this RDD with its element indices. The ordering is first based on the partition index and then the ordering of items within each partition. So the first item in the first partition gets index 0, and the last item in the last partition receives the largest index. This method needs to trigger a spark job when this RDD contains more than one partitions. sc.Parallelize(new string[] { "a", "b", "c", "d" }, 3).ZipWithIndex().Collect() [('a', 0), ('b', 1), ('c', 2), ('d', 3)] |
ZipWithUniqueId | Zips this RDD with generated unique Long ids. Items in the kth partition will get ids k, n+k, 2*n+k, ..., where n is the number of partitions. So there may exist gaps, but this method won't trigger a spark job, which is different from >>> sc.Parallelize(new string[] { "a", "b", "c", "d" }, 1).ZipWithIndex().Collect() [('a', 0), ('b', 1), ('c', 4), ('d', 2), ('e', 5)] |
SetName | Assign a name to this RDD. >>> rdd1 = sc.parallelize([1, 2]) >>> rdd1.setName('RDD1').name() u'RDD1' |
ToDebugString | A description of this RDD and its recursive dependencies for debugging. |
GetStorageLevel | Get the RDD's current storage level. >>> rdd1 = sc.parallelize([1,2]) >>> rdd1.getStorageLevel() StorageLevel(False, False, False, False, 1) >>> print(rdd1.getStorageLevel()) Serialized 1x Replicated |
ToLocalIterator | Return an iterator that contains all of the elements in this RDD. The iterator will consume as much memory as the largest partition in this RDD. sc.Parallelize(Enumerable.Range(0, 10), 1).ToLocalIterator() [0, 1, 2, 3, 4, 5, 6, 7, 8, 9] |
RandomSampleWithRange | Internal method exposed for Random Splits in DataFrames. Samples an RDD given a probability range. |
###Microsoft.Spark.CSharp.Core.StringRDDFunctions ####Summary
Some useful utility functions for RDD{string}
####Methods
Name | Description |
---|---|
SaveAsTextFile | Save this RDD as a text file, using string representations of elements. |
###Microsoft.Spark.CSharp.Core.ComparableRDDFunctions ####Summary
Some useful utility functions for RDD's containing IComparable values.
####Methods
Name | Description |
---|---|
Max``1 | Find the maximum item in this RDD. sc.Parallelize(new double[] { 1.0, 5.0, 43.0, 10.0 }, 2).Max() 43.0 |
Min``1 | Find the minimum item in this RDD. sc.Parallelize(new double[] { 2.0, 5.0, 43.0, 10.0 }, 2).Min() >>> rdd.min() 2.0 |
TakeOrdered``1 | Get the N elements from a RDD ordered in ascending order or as specified by the optional key function. sc.Parallelize(new int[] { 10, 1, 2, 9, 3, 4, 5, 6, 7 }, 2).TakeOrdered(6) [1, 2, 3, 4, 5, 6] |
Top``1 | Get the top N elements from a RDD. Note: It returns the list sorted in descending order. sc.Parallelize(new int[] { 2, 3, 4, 5, 6 }, 2).Top(3) [6, 5, 4] |
###Microsoft.Spark.CSharp.Core.SparkConf ####Summary
Configuration for a Spark application. Used to set various Spark parameters as key-value pairs.
Note that once a SparkConf object is passed to Spark, it is cloned and can no longer be modified
by the user. Spark does not support modifying the configuration at runtime.
See also http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.SparkConf
####Methods
Name | Description |
---|---|
SetMaster | The master URL to connect to, such as "local" to run locally with one thread, "local[4]" to run locally with 4 cores, or "spark://master:7077" to run on a Spark standalone cluster. |
SetAppName | Set a name for your application. Shown in the Spark web UI. |
SetSparkHome | Set the location where Spark is installed on worker nodes. |
Set | Set the value of a string config |
GetInt | Get a int parameter value, falling back to a default if not set |
Get | Get a string parameter value, falling back to a default if not set |
GetAll | Get all parameters as a list of pairs |
###Microsoft.Spark.CSharp.Core.SparkContext ####Summary
Main entry point for Spark functionality. A SparkContext represents the
connection to a Spark cluster, and can be used to create RDDs, accumulators
and broadcast variables on that cluster.
####Methods
Name | Description |
---|---|
GetActiveSparkContext | Get existing SparkContext |
GetConf | Return a copy of this JavaSparkContext's configuration. The configuration ''cannot'' be changed at runtime. |
GetOrCreate | This function may be used to get or instantiate a SparkContext and register it as a singleton object. Because we can only have one active SparkContext per JVM, this is useful when applications may wish to share a SparkContext. Note: This function cannot be used to create multiple SparkContext instances even if multiple contexts are allowed. |
TextFile | Read a text file from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI, and return it as an RDD of Strings. |
Parallelize``1 | Distribute a local collection to form an RDD. sc.Parallelize(new int[] {0, 2, 3, 4, 6}, 5).Glom().Collect() [[0], [2], [3], [4], [6]] |
EmptyRDD | Create an RDD that has no partitions or elements. |
WholeTextFiles | Read a directory of text files from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI. Each file is read as a single record and returned in a key-value pair, where the key is the path of each file, the value is the content of each file. For example, if you have the following files: {{{ hdfs://a-hdfs-path/part-00000 hdfs://a-hdfs-path/part-00001 ... hdfs://a-hdfs-path/part-nnnnn }}} Do {{{ RDD<KeyValuePair<string, string>> rdd = sparkContext.WholeTextFiles("hdfs://a-hdfs-path") }}} then `rdd` contains {{{ (a-hdfs-path/part-00000, its content) (a-hdfs-path/part-00001, its content) ... (a-hdfs-path/part-nnnnn, its content) }}} Small files are preferred, large file is also allowable, but may cause bad performance. minPartitions A suggestion value of the minimal splitting number for input data. |
BinaryFiles | Read a directory of binary files from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI as a byte array. Each file is read as a single record and returned in a key-value pair, where the key is the path of each file, the value is the content of each file. For example, if you have the following files: {{{ hdfs://a-hdfs-path/part-00000 hdfs://a-hdfs-path/part-00001 ... hdfs://a-hdfs-path/part-nnnnn }}} Do RDD<KeyValuePair<string, byte[]>>"/> rdd = sparkContext.dataStreamFiles("hdfs://a-hdfs-path")`, then `rdd` contains {{{ (a-hdfs-path/part-00000, its content) (a-hdfs-path/part-00001, its content) ... (a-hdfs-path/part-nnnnn, its content) }}} @note Small files are preferred; very large files but may cause bad performance. @param minPartitions A suggestion value of the minimal splitting number for input data. |
SequenceFile | Read a Hadoop SequenceFile with arbitrary key and value Writable class from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI. The mechanism is as follows: 1. A Java RDD is created from the SequenceFile or other InputFormat, and the key and value Writable classes 2. Serialization is attempted via Pyrolite pickling 3. If this fails, the fallback is to call 'toString' on each key and value 4. PickleSerializer is used to deserialize pickled objects on the Python side |
NewAPIHadoopFile | Read a 'new API' Hadoop InputFormat with arbitrary key and value class from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI. The mechanism is the same as for sc.sequenceFile. A Hadoop configuration can be passed in as a Python dict. This will be converted into a Configuration in Java |
NewAPIHadoopRDD | Read a 'new API' Hadoop InputFormat with arbitrary key and value class, from an arbitrary Hadoop configuration, which is passed in as a Python dict. This will be converted into a Configuration in Java. The mechanism is the same as for sc.sequenceFile. |
HadoopFile | Read an 'old' Hadoop InputFormat with arbitrary key and value class from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI. The mechanism is the same as for sc.sequenceFile. A Hadoop configuration can be passed in as a Python dict. This will be converted into a Configuration in Java. |
HadoopRDD | Read an 'old' Hadoop InputFormat with arbitrary key and value class, from an arbitrary Hadoop configuration, which is passed in as a Python dict. This will be converted into a Configuration in Java. The mechanism is the same as for sc.sequenceFile. |
Union``1 | Build the union of a list of RDDs. This supports unions() of RDDs with different serialized formats, although this forces them to be reserialized using the default serializer: >>> path = os.path.join(tempdir, "union-text.txt") >>> with open(path, "w") as testFile: ... _ = testFile.write("Hello") >>> textFile = sc.textFile(path) >>> textFile.collect() [u'Hello'] >>> parallelized = sc.parallelize(["World!"]) >>> sorted(sc.union([textFile, parallelized]).collect()) [u'Hello', 'World!'] |
Broadcast``1 | Broadcast a read-only variable to the cluster, returning a Broadcast object for reading it in distributed functions. The variable will be sent to each cluster only once. |
Accumulator``1 | Create an with the given initial value, using a given helper object to define how to add values of the data type if provided. Default AccumulatorParams are used for integers and floating-point numbers if you do not provide one. For other types, a custom AccumulatorParam can be used. |
Stop | Shut down the SparkContext. |
AddFile | Add a file to be downloaded with this Spark job on every node. The `path` passed can be either a local file, a file in HDFS (or other Hadoop-supported filesystems), or an HTTP, HTTPS or FTP URI. To access the file in Spark jobs, use `SparkFiles.get(fileName)` to find its download location. |
SetCheckpointDir | Set the directory under which RDDs are going to be checkpointed. The directory must be a HDFS path if running on a cluster. |
SetJobGroup | Assigns a group ID to all the jobs started by this thread until the group ID is set to a different value or cleared. Often, a unit of execution in an application consists of multiple Spark actions or jobs. Application programmers can use this method to group all those jobs together and give a group description. Once set, the Spark web UI will associate such jobs with this group. The application can also use [[org.apache.spark.api.java.JavaSparkContext.cancelJobGroup]] to cancel all running jobs in this group. For example, {{{ // In the main thread: sc.setJobGroup("some_job_to_cancel", "some job description"); rdd.map(...).count(); // In a separate thread: sc.cancelJobGroup("some_job_to_cancel"); }}} If interruptOnCancel is set to true for the job group, then job cancellation will result in Thread.interrupt() being called on the job's executor threads. This is useful to help ensure that the tasks are actually stopped in a timely manner, but is off by default due to HDFS-1208, where HDFS may respond to Thread.interrupt() by marking nodes as dead. |
SetLocalProperty | Set a local property that affects jobs submitted from this thread, such as the Spark fair scheduler pool. |
GetLocalProperty | Get a local property set in this thread, or null if it is missing. See [[org.apache.spark.api.java.JavaSparkContext.setLocalProperty]]. |
SetLogLevel | Control our logLevel. This overrides any user-defined log settings. @param logLevel The desired log level as a string. Valid log levels include: ALL, DEBUG, ERROR, FATAL, INFO, OFF, TRACE, WARN |
CancelJobGroup | Cancel active jobs for the specified group. See for more information. |
CancelAllJobs | Cancel all jobs that have been scheduled or are running. |
###Microsoft.Spark.CSharp.Core.StatCounter ####Summary
A class for tracking the statistics of a set of numbers (count, mean and variance) in a numerically
robust way. Includes support for merging two StatCounters. Based on Welford and Chan's algorithms
for running variance.
####Methods
Name | Description |
---|---|
Merge | Add a value into this StatCounter, updating the internal statistics. |
Merge | Add multiple values into this StatCounter, updating the internal statistics. |
Merge | Merge another StatCounter into this one, adding up the internal statistics. |
copy | Clone this StatCounter |
ToString | Returns a string that represents this StatCounter. |
###Microsoft.Spark.CSharp.Core.StatusTracker ####Summary
Low-level status reporting APIs for monitoring job and stage progress.
####Methods
Name | Description |
---|---|
GetJobIdsForGroup | Return a list of all known jobs in a particular job group. If `jobGroup` is None, then returns all known jobs that are not associated with a job group. The returned list may contain running, failed, and completed jobs, and may vary across invocations of this method. This method does not guarantee the order of the elements in its result. |
GetActiveStageIds | Returns an array containing the ids of all active stages. |
GetActiveJobsIds | Returns an array containing the ids of all active jobs. |
GetJobInfo | Returns a :class:`SparkJobInfo` object, or None if the job info could not be found or was garbage collected. |
GetStageInfo | Returns a :class:`SparkStageInfo` object, or None if the stage info could not be found or was garbage collected. |
###Microsoft.Spark.CSharp.Core.SparkJobInfo ####Summary
SparkJobInfo represents a job information of Spark
####Methods
Name | Description |
---|
###Microsoft.Spark.CSharp.Core.SparkStageInfo ####Summary
SparkJobInfo represents a stage information of Spark
####Methods
Name | Description |
---|
###Microsoft.Spark.CSharp.Core.StorageLevelType ####Summary
Defines the type of storage levels
###Microsoft.Spark.CSharp.Core.StorageLevel ####Summary
Flags for controlling the storage of an RDD. Each StorageLevel records whether to use
memory, whether to drop the RDD to disk if it falls out of memory, whether to keep the
data in memory in a serialized format, and whether to replicate the RDD partitions
on multiple nodes.
####Methods
Name | Description |
---|---|
ToString | Returns a readable string that represents the type |
###Microsoft.Spark.CSharp.Network.ByteBuf ####Summary
ByteBuf delimits a section of a ByteBufChunk.
It is the smallest unit to be allocated.
####Methods
Name | Description |
---|---|
Clear | Sets the readerIndex and writerIndex of this buffer to 0. |
IsReadable | Is this ByteSegment readable if and only if the buffer contains equal or more than the specified number of elements |
IsWritable | Returns true if and only if the buffer has enough Capacity to accommodate size additional bytes. |
ReadByte | Gets a byte at the current readerIndex and increases the readerIndex by 1 in this buffer. |
ReadBytes | Reads a block of bytes from the ByteBuf and writes the data to a buffer. |
Release | Release the ByteBuf back to the ByteBufPool |
WriteBytes | Writes a block of bytes to the ByteBuf using data read from a buffer. |
GetInputRioBuf | Returns a RioBuf object for input (receive) |
GetOutputRioBuf | Returns a RioBuf object for output (send). |
NewErrorStatusByteBuf | Creates an empty ByteBuf with error status. |
Finalizer. | |
Allocates a ByteBuf from this ByteChunk. | |
Release all resources | |
Releases the ByteBuf back to this ByteChunk | |
Returns a readable string for the ByteBufChunk | |
Static method to create a new ByteBufChunk with given segment and chunk size. If isUnsafe is true, it allocates memory from the process's heap. | |
Wraps HeapFree to process heap. | |
Implementation of the Dispose pattern. | |
Add the ByteBufChunk to this ByteBufChunkList linked-list based on ByteBufChunk's usage. So it will be moved to the right ByteBufChunkList that has the correct minUsage/maxUsage. | |
Allocates a ByteBuf from this ByteBufChunkList if it is not empty. | |
Releases the segment back to its ByteBufChunk. | |
Adds the ByteBufChunk to this ByteBufChunkList | |
Moves the ByteBufChunk down the ByteBufChunkList linked-list so it will end up in the right ByteBufChunkList that has the correct minUsage/maxUsage in respect to ByteBufChunk.Usage. | |
Remove the ByteBufChunk from this ByteBufChunkList | |
Returns a readable string for this ByteBufChunkList | |
Allocates a ByteBuf from this ByteBufPool to use. | |
Deallocates a ByteBuf back to this ByteBufPool. | |
Gets a readable string for this ByteBufPool | |
Returns the chunk numbers in each queue. |
###Microsoft.Spark.CSharp.Network.ByteBufChunk ####Summary
ByteBufChunk represents a memory blocks that can be allocated from
.Net heap (managed code) or process heap(unsafe code)
####Methods
Name | Description |
---|---|
Finalize | Finalizer. |
Allocate | Allocates a ByteBuf from this ByteChunk. |
Dispose | Release all resources |
Free | Releases the ByteBuf back to this ByteChunk |
ToString | Returns a readable string for the ByteBufChunk |
NewChunk | Static method to create a new ByteBufChunk with given segment and chunk size. If isUnsafe is true, it allocates memory from the process's heap. |
FreeToProcessHeap | Wraps HeapFree to process heap. |
Dispose | Implementation of the Dispose pattern. |
Add the ByteBufChunk to this ByteBufChunkList linked-list based on ByteBufChunk's usage. So it will be moved to the right ByteBufChunkList that has the correct minUsage/maxUsage. | |
Allocates a ByteBuf from this ByteBufChunkList if it is not empty. | |
Releases the segment back to its ByteBufChunk. | |
Adds the ByteBufChunk to this ByteBufChunkList | |
Moves the ByteBufChunk down the ByteBufChunkList linked-list so it will end up in the right ByteBufChunkList that has the correct minUsage/maxUsage in respect to ByteBufChunk.Usage. | |
Remove the ByteBufChunk from this ByteBufChunkList | |
Returns a readable string for this ByteBufChunkList |
###Microsoft.Spark.CSharp.Network.ByteBufChunk.Segment ####Summary
Segment struct delimits a section of a byte chunk.
###Microsoft.Spark.CSharp.Network.ByteBufChunkList ####Summary
ByteBufChunkList class represents a simple linked like list used to store ByteBufChunk objects
based on its usage.
####Methods
Name | Description |
---|---|
Add | Add the ByteBufChunk to this ByteBufChunkList linked-list based on ByteBufChunk's usage. So it will be moved to the right ByteBufChunkList that has the correct minUsage/maxUsage. |
Allocate | Allocates a ByteBuf from this ByteBufChunkList if it is not empty. |
Free | Releases the segment back to its ByteBufChunk. |
AddInternal | Adds the ByteBufChunk to this ByteBufChunkList |
MoveInternal | Moves the ByteBufChunk down the ByteBufChunkList linked-list so it will end up in the right ByteBufChunkList that has the correct minUsage/maxUsage in respect to ByteBufChunk.Usage. |
Remove | Remove the ByteBufChunk from this ByteBufChunkList |
ToString | Returns a readable string for this ByteBufChunkList |
###Microsoft.Spark.CSharp.Network.ByteBufPool ####Summary
ByteBufPool class is used to manage the ByteBuf pool that allocate and free pooled memory buffer.
We borrows some ideas from Netty buffer memory management.
####Methods
Name | Description |
---|---|
Allocate | Allocates a ByteBuf from this ByteBufPool to use. |
Free | Deallocates a ByteBuf back to this ByteBufPool. |
ToString | Gets a readable string for this ByteBufPool |
GetUsages | Returns the chunk numbers in each queue. |
###Microsoft.Spark.CSharp.Network.RioNative ####Summary
RioNative class imports and initializes RIOSock.dll for use with RIO socket APIs.
It also provided a simple thread pool that retrieves the results from IO completion port.
####Methods
Name | Description |
---|---|
Finalize | Finalizer |
Dispose | Release all resources. |
SetUseThreadPool | Sets whether use thread pool to query RIO socket results, it must be called before calling EnsureRioLoaded() |
EnsureRioLoaded | Ensures that the native dll of RIO socket is loaded and initialized. |
UnloadRio | Explicitly unload the native dll of RIO socket, and release resources. |
Init | Initializes RIOSock native library. |
###Microsoft.Spark.CSharp.Network.RioResult ####Summary
The RioResult structure contains data used to indicate request completion results used with RIO socket
###Microsoft.Spark.CSharp.Network.SocketStream ####Summary
Provides the underlying stream of data for network access.
Just like a NetworkStream.
####Methods
Name | Description |
---|---|
Flush | Flushes data in send cache to the stream. |
Seek | Seeks a specific position in the stream. This method is not supported by the SocketDataStream class. |
SetLength | Sets the length of the stream. This method is not supported by the SocketDataStream class. |
ReadByte | Reads a byte from the stream and advances the position within the stream by one byte, or returns -1 if at the end of the stream. |
Read | Reads data from the stream. |
Write | Writes data to the stream. |
###Microsoft.Spark.CSharp.Network.SockDataToken ####Summary
SockDataToken class is used to associate with the SocketAsyncEventArgs object.
Primarily, it is a way to pass state to the event handler.
####Methods
Name | Description |
---|---|
Reset | Reset this token |
DetachData | Detach the data ownership. |
###Microsoft.Spark.CSharp.Network.SocketFactory ####Summary
SocketFactory is used to create ISocketWrapper instance based on a configuration and OS version.
The ISocket instance can be RioSocket object, if the configuration is set to RioSocket and
only the application is running on a Windows OS that supports Registered IO socket.
####Methods
Name | Description |
---|---|
CreateSocket | Creates a ISocket instance based on the configuration and OS version. |
IsRioSockSupported | Indicates whether current OS supports RIO socket. |
###Microsoft.Spark.CSharp.Sql.Builder ####Summary
The entry point to programming Spark with the Dataset and DataFrame API.
####Methods
Name | Description |
---|---|
Master | Sets the Spark master URL to connect to, such as "local" to run locally, "local[4]" to run locally with 4 cores, or "spark://master:7077" to run on a Spark standalone cluster. |
AppName | Sets a name for the application, which will be shown in the Spark web UI. If no application name is set, a randomly generated name will be used. |
Config | Sets a config option. Options set using this method are automatically propagated to both SparkConf and SparkSession's own configuration. |
Config | Sets a config option. Options set using this method are automatically propagated to both SparkConf and SparkSession's own configuration. |
Config | Sets a config option. Options set using this method are automatically propagated to both SparkConf and SparkSession's own configuration. |
Config | Sets a config option. Options set using this method are automatically propagated to both SparkConf and SparkSession's own configuration. |
Config | Sets a list of config options based on the given SparkConf |
EnableHiveSupport | Enables Hive support, including connectivity to a persistent Hive metastore, support for Hive serdes, and Hive user-defined functions. |
GetOrCreate | Gets an existing [[SparkSession]] or, if there is no existing one, creates a new one based on the options set in this builder. |
###Microsoft.Spark.CSharp.Sql.Catalog.Catalog ####Summary
Catalog interface for Spark.
####Methods
Name | Description |
---|---|
ListDatabases | Returns a list of databases available across all sessions. |
ListTables | Returns a list of tables in the current database or given database This includes all temporary tables. |
ListColumns | Returns a list of columns for the given table in the current database or the given temporary table. |
ListFunctions | Returns a list of functions registered in the specified database. This includes all temporary functions |
SetCurrentDatabase | Sets the current default database in this session. |
DropTempView | Drops the temporary view with the given view name in the catalog. If the view has been cached before, then it will also be uncached. |
IsCached | Returns true if the table is currently cached in-memory. |
CacheTable | Caches the specified table in-memory. |
UnCacheTable | Removes the specified table from the in-memory cache. |
RefreshTable | Invalidate and refresh all the cached metadata of the given table. For performance reasons, Spark SQL or the external data source library it uses might cache certain metadata about a table, such as the location of blocks.When those change outside of Spark SQL, users should call this function to invalidate the cache. If this table is cached as an InMemoryRelation, drop the original cached version and make the new version cached lazily. |
ClearCache | Removes all cached tables from the in-memory cache. |
CreateExternalTable | Creates an external table from the given path and returns the corresponding DataFrame. It will use the default data source configured by spark.sql.sources.default. |
CreateExternalTable | Creates an external table from the given path on a data source and returns DataFrame |
CreateExternalTable | Creates an external table from the given path based on a data source and a set of options. Then, returns the corresponding DataFrame. |
CreateExternalTable | Create an external table from the given path based on a data source, a schema and a set of options.Then, returns the corresponding DataFrame. |
###Microsoft.Spark.CSharp.Sql.Catalog.Database ####Summary
A database in Spark
###Microsoft.Spark.CSharp.Sql.Catalog.Table ####Summary
A table in Spark
###Microsoft.Spark.CSharp.Sql.Catalog.Column ####Summary
A column in Spark
###Microsoft.Spark.CSharp.Sql.Catalog.Function ####Summary
A user-defined function in Spark
###Microsoft.Spark.CSharp.Sql.Column ####Summary
A column that will be computed based on the data in a DataFrame.
####Methods
Name | Description |
---|---|
op_LogicalNot | The logical negation operator that negates its operand. |
op_UnaryNegation | Negation of itself. |
op_Addition | Sum of this expression and another expression. |
op_Subtraction | Subtraction of this expression and another expression. |
op_Multiply | Multiplication of this expression and another expression. |
op_Division | Division this expression by another expression. |
op_Modulus | Modulo (a.k.a. remainder) expression. |
op_Equality | The equality operator returns true if the values of its operands are equal, false otherwise. |
op_Inequality | The inequality operator returns false if its operands are equal, true otherwise. |
op_LessThan | The "less than" relational operator that returns true if the first operand is less than the second, false otherwise. |
op_LessThanOrEqual | The "less than or equal" relational operator that returns true if the first operand is less than or equal to the second, false otherwise. |
op_GreaterThanOrEqual | The "greater than or equal" relational operator that returns true if the first operand is greater than or equal to the second, false otherwise. |
op_GreaterThan | The "greater than" relational operator that returns true if the first operand is greater than the second, false otherwise. |
op_BitwiseOr | Compute bitwise OR of this expression with another expression. |
op_BitwiseAnd | Compute bitwise AND of this expression with another expression. |
op_ExclusiveOr | Compute bitwise XOR of this expression with another expression. |
GetHashCode | Required when operator == or operator != is defined |
Equals | Required when operator == or operator != is defined |
Like | SQL like expression. |
RLike | SQL RLIKE expression (LIKE with Regex). |
StartsWith | String starts with another string literal. |
EndsWith | String ends with another string literal. |
Asc | Returns a sort expression based on the ascending order. |
Desc | Returns a sort expression based on the descending order. |
Alias | Returns this column aliased with a new name. |
Alias | Returns this column aliased with new names |
Cast | Casts the column to a different data type, using the canonical string representation of the type. The supported types are: `string`, `boolean`, `byte`, `short`, `int`, `long`, `float`, `double`, `decimal`, `date`, `timestamp`. E.g. // Casts colA to integer. df.select(df("colA").cast("int")) |
###Microsoft.Spark.CSharp.Sql.DataFrame ####Summary
A distributed collection of data organized into named columns.
See also http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrame
####Methods
Name | Description |
---|---|
RegisterTempTable | Registers this DataFrame as a temporary table using the given name. The lifetime of this temporary table is tied to the SqlContext that was used to create this DataFrame. |
Count | Number of rows in the DataFrame |
Show | Displays rows of the DataFrame in tabular form |
ShowSchema | Prints the schema information of the DataFrame |
Collect | Returns all of Rows in this DataFrame |
ToRDD | Converts the DataFrame to RDD of Row |
ToJSON | Returns the content of the DataFrame as RDD of JSON strings |
Explain | Prints the plans (logical and physical) to the console for debugging purposes |
Select | Selects a set of columns specified by column name or Column. df.Select("colA", df["colB"]) df.Select("*", df["colB"] + 10) |
Select | Selects a set of columns. This is a variant of `select` that can only select existing columns using column names (i.e. cannot construct expressions). df.Select("colA", "colB") |
SelectExpr | Selects a set of SQL expressions. This is a variant of `select` that accepts SQL expressions. df.SelectExpr("colA", "colB as newName", "abs(colC)") |
Where | Filters rows using the given condition |
Filter | Filters rows using the given condition |
GroupBy | Groups the DataFrame using the specified columns, so we can run aggregation on them. |
Rollup | Create a multi-dimensional rollup for the current DataFrame using the specified columns, so we can run aggregation on them. |
Cube | Create a multi-dimensional cube for the current DataFrame using the specified columns, so we can run aggregation on them. |
Agg | Aggregates on the DataFrame for the given column-aggregate function mapping |
Join | Join with another DataFrame - Cartesian join |
Join | Join with another DataFrame - Inner equi-join using given column name |
Join | Join with another DataFrame - Inner equi-join using given column name |
Join | Join with another DataFrame, using the specified JoinType |
Intersect | Intersect with another DataFrame. This is equivalent to `INTERSECT` in SQL. Reference to https://github.com/apache/spark/blob/branch-1.4/python/pyspark/sql/dataframe.py, intersect(self, other) |
UnionAll | Union with another DataFrame WITHOUT removing duplicated rows. This is equivalent to `UNION ALL` in SQL. Reference to https://github.com/apache/spark/blob/branch-1.4/python/pyspark/sql/dataframe.py, unionAll(self, other) |
Subtract | Returns a new DataFrame containing rows in this frame but not in another frame. This is equivalent to `EXCEPT` in SQL. Reference to https://github.com/apache/spark/blob/branch-1.4/python/pyspark/sql/dataframe.py, subtract(self, other) |
Drop | Returns a new DataFrame with a column dropped. Reference to https://github.com/apache/spark/blob/branch-1.4/python/pyspark/sql/dataframe.py, drop(self, col) |
DropNa | Returns a new DataFrame omitting rows with null values. Reference to https://github.com/apache/spark/blob/branch-1.4/python/pyspark/sql/dataframe.py, dropna(self, how='any', thresh=None, subset=None) |
Na | Returns a DataFrameNaFunctions for working with missing data. |
FillNa | Replace null values, alias for ``na.fill()` |
DropDuplicates | Returns a new DataFrame with duplicate rows removed, considering only the subset of columns. Reference to https://github.com/apache/spark/blob/branch-1.4/python/pyspark/sql/dataframe.py, dropDuplicates(self, subset=None) |
Replace``1 | Returns a new DataFrame replacing a value with another value. Reference to https://github.com/apache/spark/blob/branch-1.4/python/pyspark/sql/dataframe.py, replace(self, to_replace, value, subset=None) |
ReplaceAll``1 | Returns a new DataFrame replacing values with other values. Reference to https://github.com/apache/spark/blob/branch-1.4/python/pyspark/sql/dataframe.py, replace(self, to_replace, value, subset=None) |
ReplaceAll``1 | Returns a new DataFrame replacing values with another value. Reference to https://github.com/apache/spark/blob/branch-1.4/python/pyspark/sql/dataframe.py, replace(self, to_replace, value, subset=None) |
RandomSplit | Randomly splits this DataFrame with the provided weights. Reference to https://github.com/apache/spark/blob/branch-1.4/python/pyspark/sql/dataframe.py, randomSplit(self, weights, seed=None) |
Columns | Returns all column names as a list. Reference to https://github.com/apache/spark/blob/branch-1.4/python/pyspark/sql/dataframe.py, columns(self) |
DTypes | Returns all column names and their data types. Reference to https://github.com/apache/spark/blob/branch-1.4/python/pyspark/sql/dataframe.py, dtypes(self) |
Sort | Returns a new DataFrame sorted by the specified column(s). Reference to https://github.com/apache/spark/blob/branch-1.4/python/pyspark/sql/dataframe.py, sort(self, *cols, **kwargs) |
Sort | Returns a new DataFrame sorted by the specified column(s). Reference to https://github.com/apache/spark/blob/branch-1.4/python/pyspark/sql/dataframe.py, sort(self, *cols, **kwargs) |
SortWithinPartitions | Returns a new DataFrame sorted by the specified column(s). Reference to https://github.com/apache/spark/blob/branch-1.6/python/pyspark/sql/dataframe.py, sortWithinPartitions(self, *cols, **kwargs) |
SortWithinPartition | Returns a new DataFrame sorted by the specified column(s). Reference to https://github.com/apache/spark/blob/branch-1.6/python/pyspark/sql/dataframe.py, sortWithinPartitions(self, *cols, **kwargs) |
Alias | Returns a new DataFrame with an alias set. Reference to https://github.com/apache/spark/blob/branch-1.4/python/pyspark/sql/dataframe.py, alias(self, alias) |
WithColumn | Returns a new DataFrame by adding a column. Reference to https://github.com/apache/spark/blob/branch-1.4/python/pyspark/sql/dataframe.py, withColumn(self, colName, col) |
WithColumnRenamed | Returns a new DataFrame by renaming an existing column. Reference to https://github.com/apache/spark/blob/branch-1.4/python/pyspark/sql/dataframe.py, withColumnRenamed(self, existing, new) |
Corr | Calculates the correlation of two columns of a DataFrame as a double value. Currently only supports the Pearson Correlation Coefficient. Reference to https://github.com/apache/spark/blob/branch-1.4/python/pyspark/sql/dataframe.py, corr(self, col1, col2, method=None) |
Cov | Calculate the sample covariance of two columns as a double value. Reference to https://github.com/apache/spark/blob/branch-1.4/python/pyspark/sql/dataframe.py, cov(self, col1, col2) |
FreqItems | Finding frequent items for columns, possibly with false positives. Using the frequent element count algorithm described in "http://dx.doi.org/10.1145/762471.762473, proposed by Karp, Schenker, and Papadimitriou". Reference to https://github.com/apache/spark/blob/branch-1.4/python/pyspark/sql/dataframe.py, freqItems(self, cols, support=None) Note: This function is meant for exploratory data analysis, as we make no guarantee about the backward compatibility of the schema of the resulting DataFrame. |
Crosstab | Computes a pair-wise frequency table of the given columns. Also known as a contingency table. The number of distinct values for each column should be less than 1e4. At most 1e6 non-zero pair frequencies will be returned. Reference to https://github.com/apache/spark/blob/branch-1.4/python/pyspark/sql/dataframe.py, crosstab(self, col1, col2) |
Describe | Computes statistics for numeric columns. This include count, mean, stddev, min, and max. If no columns are given, this function computes statistics for all numerical columns. |
Limit | Returns a new DataFrame by taking the first `n` rows. The difference between this function and `head` is that `head` returns an array while `limit` returns a new DataFrame. |
Head | Returns the first `n` rows. |
First | Returns the first row. |
Take | Returns the first `n` rows in the DataFrame. |
Distinct | Returns a new DataFrame that contains only the unique rows from this DataFrame. |
Coalesce | Returns a new DataFrame that has exactly `numPartitions` partitions. Similar to coalesce defined on an RDD, this operation results in a narrow dependency, e.g. if you go from 1000 partitions to 100 partitions, there will not be a shuffle, instead each of the 100 new partitions will claim 10 of the current partitions. |
Persist | Persist this DataFrame with the default storage level (`MEMORY_AND_DISK`) |
Unpersist | Mark the DataFrame as non-persistent, and remove all blocks for it from memory and disk. |
Cache | Persist this DataFrame with the default storage level (`MEMORY_AND_DISK`) |
Repartition | Returns a new DataFrame that has exactly `numPartitions` partitions. |
Repartition | Returns a new [[DataFrame]] partitioned by the given partitioning columns into . The resulting DataFrame is hash partitioned. optional. If not specified, keep current partitions. |
Repartition | Returns a new [[DataFrame]] partitioned by the given partitioning columns into . The resulting DataFrame is hash partitioned. optional. If not specified, keep current partitions. |
Sample | Returns a new DataFrame by sampling a fraction of rows. |
FlatMap``1 | Returns a new RDD by first applying a function to all rows of this DataFrame, and then flattening the results. |
Map``1 | Returns a new RDD by applying a function to all rows of this DataFrame. |
MapPartitions``1 | Returns a new RDD by applying a function to each partition of this DataFrame. |
ForeachPartition | Applies a function f to each partition of this DataFrame. |
Foreach | Applies a function f to all rows. |
Write | Interface for saving the content of the DataFrame out into external storage. |
SaveAsParquetFile | Saves the contents of this DataFrame as a parquet file, preserving the schema. Files that are written out using this method can be read back in as a DataFrame using the `parquetFile` function in SQLContext. |
InsertInto | Adds the rows from this RDD to the specified table, optionally overwriting the existing data. |
SaveAsTable | Creates a table from the the contents of this DataFrame based on a given data source, SaveMode specified by mode, and a set of options. Note that this currently only works with DataFrames that are created from a HiveContext as there is no notion of a persisted catalog in a standard SQL context. Instead you can write an RDD out to a parquet file, and then register that file as a table. This "table" can then be the target of an `insertInto`. Also note that while this function can persist the table metadata into Hive's metastore, the table will NOT be accessible from Hive, until SPARK-7550 is resolved. |
Save | Saves the contents of this DataFrame based on the given data source, SaveMode specified by mode, and a set of options. |
Returns a new DataFrame that drops rows containing any null values. | |
Returns a new DataFrame that drops rows containing null values. If `how` is "any", then drop rows containing any null values. If `how` is "all", then drop rows only if every column is null for that row. | |
Returns a new [[DataFrame]] that drops rows containing null values in the specified columns. If `how` is "any", then drop rows containing any null values in the specified columns. If `how` is "all", then drop rows only if every specified column is null for that row. | |
Returns a new DataFrame that drops rows containing any null values in the specified columns. | |
Returns a new DataFrame that drops rows containing less than `minNonNulls` non-null values. | |
Returns a new DataFrame that drops rows containing less than `minNonNulls` non-null values values in the specified columns. | |
Returns a new DataFrame that replaces null values in numeric columns with `value`. | |
Returns a new DataFrame that replaces null values in string columns with `value`. | |
Returns a new DataFrame that replaces null values in specified numeric columns. If a specified column is not a numeric column, it is ignored. | |
Returns a new DataFrame that replaces null values in specified string columns. If a specified column is not a numeric column, it is ignored. | |
Replaces values matching keys in `replacement` map with the corresponding values. Key and value of `replacement` map must have the same type, and can only be doubles or strings. The value must be of the following type: `Integer`, `Long`, `Float`, `Double`, `String`. For example, the following replaces null values in column "A" with string "unknown", and null values in column "B" with numeric value 1.0. import com.google.common.collect.ImmutableMap; df.na.fill(ImmutableMap.of("A", "unknown", "B", 1.0)); | |
Replaces values matching keys in `replacement` map with the corresponding values. Key and value of `replacement` map must have the same type, and can only be doubles or strings. If `col` is "*", then the replacement is applied on all string columns or numeric columns. Example: import com.google.common.collect.ImmutableMap; // Replaces all occurrences of 1.0 with 2.0 in column "height". df.replace("height", ImmutableMap.of(1.0, 2.0)); // Replaces all occurrences of "UNKNOWN" with "unnamed" in column "name". df.replace("name", ImmutableMap.of("UNKNOWN", "unnamed")); // Replaces all occurrences of "UNKNOWN" with "unnamed" in all string columns. df.replace("*", ImmutableMap.of("UNKNOWN", "unnamed")); | |
Replaces values matching keys in `replacement` map with the corresponding values. Key and value of `replacement` map must have the same type, and can only be doubles or strings. If `col` is "*", then the replacement is applied on all string columns or numeric columns. Example: import com.google.common.collect.ImmutableMap; // Replaces all occurrences of 1.0 with 2.0 in column "height" and "weight". df.replace(new String[] {"height", "weight"}, ImmutableMap.of(1.0, 2.0)); // Replaces all occurrences of "UNKNOWN" with "unnamed" in column "firstname" and "lastname". df.replace(new String[] {"firstname", "lastname"}, ImmutableMap.of("UNKNOWN", "unnamed")); | |
Specifies the input data source format. | |
Specifies the input schema. Some data sources (e.g. JSON) can infer the input schema automatically from data. By specifying the schema here, the underlying data source can skip the schema inference step, and thus speed up data loading. | |
Adds an input option for the underlying data source. | |
Adds input options for the underlying data source. | |
Loads input in as a [[DataFrame]], for data sources that require a path (e.g. data backed by a local or distributed file system). | |
Loads input in as a DataFrame, for data sources that don't require a path (e.g. external key-value stores). | |
Construct a [[DataFrame]] representing the database table accessible via JDBC URL, url named table and connection properties. | |
Construct a DataFrame representing the database table accessible via JDBC URL url named table. Partitions of the table will be retrieved in parallel based on the parameters passed to this function. Don't create too many partitions in parallel on a large cluster; otherwise Spark might crash your external database systems. | |
Construct a DataFrame representing the database table accessible via JDBC URL url named table using connection properties. The `predicates` parameter gives a list expressions suitable for inclusion in WHERE clauses; each one defines one partition of the DataFrame. Don't create too many partitions in parallel on a large cluster; otherwise Spark might crash your external database systems. | |
Loads a JSON file (one object per line) and returns the result as a DataFrame. This function goes through the input once to determine the input schema. If you know the schema in advance, use the version that specifies the schema to avoid the extra scan. | |
Loads a Parquet file, returning the result as a [[DataFrame]]. This function returns an empty DataFrame if no paths are passed in. | |
Specifies the behavior when data or table already exists. Options include: - `SaveMode.Overwrite`: overwrite the existing data. - `SaveMode.Append`: append the data. - `SaveMode.Ignore`: ignore the operation (i.e. no-op). - `SaveMode.ErrorIfExists`: default option, throw an exception at runtime. | |
Specifies the behavior when data or table already exists. Options include: - `SaveMode.Overwrite`: overwrite the existing data. - `SaveMode.Append`: append the data. - `SaveMode.Ignore`: ignore the operation (i.e. no-op). - `SaveMode.ErrorIfExists`: default option, throw an exception at runtime. | |
Specifies the underlying output data source. Built-in options include "parquet", "json", etc. | |
Adds an output option for the underlying data source. | |
Adds output options for the underlying data source. | |
Partitions the output by the given columns on the file system. If specified, the output is laid out on the file system similar to Hive's partitioning scheme. This is only applicable for Parquet at the moment. | |
Saves the content of the DataFrame at the specified path. | |
Saves the content of the DataFrame as the specified table. | |
Inserts the content of the DataFrame to the specified table. It requires that the schema of the DataFrame is the same as the schema of the table. Because it inserts data to an existing table, format or options will be ignored. | |
Saves the content of the DataFrame as the specified table. In the case the table already exists, behavior of this function depends on the save mode, specified by the `mode` function (default to throwing an exception). When `mode` is `Overwrite`, the schema of the DataFrame does not need to be the same as that of the existing table. When `mode` is `Append`, the schema of the DataFrame need to be the same as that of the existing table, and format or options will be ignored. | |
Saves the content of the DataFrame to a external database table via JDBC. In the case the table already exists in the external database, behavior of this function depends on the save mode, specified by the `mode` function (default to throwing an exception). Don't create too many partitions in parallel on a large cluster; otherwise Spark might crash your external database systems. | |
Saves the content of the DataFrame in JSON format at the specified path. This is equivalent to: Format("json").Save(path) | |
Saves the content of the DataFrame in JSON format at the specified path. This is equivalent to: Format("parquet").Save(path) |
###Microsoft.Spark.CSharp.Sql.JoinType ####Summary
The type of join operation for DataFrame
###Microsoft.Spark.CSharp.Sql.GroupedData ####Summary
A set of methods for aggregations on a DataFrame, created by DataFrame.groupBy.
####Methods
Name | Description |
---|---|
Agg | Compute aggregates by specifying a dictionary from column name to aggregate methods. The available aggregate methods are avg, max, min, sum, count. |
Count | Count the number of rows for each group. |
Mean | Compute the average value for each numeric columns for each group. This is an alias for avg. When specified columns are given, only compute the average values for them. |
Max | Compute the max value for each numeric columns for each group. When specified columns are given, only compute the max values for them. |
Min | Compute the min value for each numeric column for each group. |
Avg | Compute the mean value for each numeric columns for each group. When specified columns are given, only compute the mean values for them. |
Sum | Compute the sum for each numeric columns for each group. When specified columns are given, only compute the sum for them. |
###Microsoft.Spark.CSharp.Sql.DataFrameNaFunctions ####Summary
Functionality for working with missing data in DataFrames.
####Methods
Name | Description |
---|---|
Drop | Returns a new DataFrame that drops rows containing any null values. |
Drop | Returns a new DataFrame that drops rows containing null values. If `how` is "any", then drop rows containing any null values. If `how` is "all", then drop rows only if every column is null for that row. |
Drop | Returns a new [[DataFrame]] that drops rows containing null values in the specified columns. If `how` is "any", then drop rows containing any null values in the specified columns. If `how` is "all", then drop rows only if every specified column is null for that row. |
Drop | Returns a new DataFrame that drops rows containing any null values in the specified columns. |
Drop | Returns a new DataFrame that drops rows containing less than `minNonNulls` non-null values. |
Drop | Returns a new DataFrame that drops rows containing less than `minNonNulls` non-null values values in the specified columns. |
Fill | Returns a new DataFrame that replaces null values in numeric columns with `value`. |
Fill | Returns a new DataFrame that replaces null values in string columns with `value`. |
Fill | Returns a new DataFrame that replaces null values in specified numeric columns. If a specified column is not a numeric column, it is ignored. |
Fill | Returns a new DataFrame that replaces null values in specified string columns. If a specified column is not a numeric column, it is ignored. |
Fill | Replaces values matching keys in `replacement` map with the corresponding values. Key and value of `replacement` map must have the same type, and can only be doubles or strings. The value must be of the following type: `Integer`, `Long`, `Float`, `Double`, `String`. For example, the following replaces null values in column "A" with string "unknown", and null values in column "B" with numeric value 1.0. import com.google.common.collect.ImmutableMap; df.na.fill(ImmutableMap.of("A", "unknown", "B", 1.0)); |
Replace``1 | Replaces values matching keys in `replacement` map with the corresponding values. Key and value of `replacement` map must have the same type, and can only be doubles or strings. If `col` is "*", then the replacement is applied on all string columns or numeric columns. Example: import com.google.common.collect.ImmutableMap; // Replaces all occurrences of 1.0 with 2.0 in column "height". df.replace("height", ImmutableMap.of(1.0, 2.0)); // Replaces all occurrences of "UNKNOWN" with "unnamed" in column "name". df.replace("name", ImmutableMap.of("UNKNOWN", "unnamed")); // Replaces all occurrences of "UNKNOWN" with "unnamed" in all string columns. df.replace("*", ImmutableMap.of("UNKNOWN", "unnamed")); |
Replace``1 | Replaces values matching keys in `replacement` map with the corresponding values. Key and value of `replacement` map must have the same type, and can only be doubles or strings. If `col` is "*", then the replacement is applied on all string columns or numeric columns. Example: import com.google.common.collect.ImmutableMap; // Replaces all occurrences of 1.0 with 2.0 in column "height" and "weight". df.replace(new String[] {"height", "weight"}, ImmutableMap.of(1.0, 2.0)); // Replaces all occurrences of "UNKNOWN" with "unnamed" in column "firstname" and "lastname". df.replace(new String[] {"firstname", "lastname"}, ImmutableMap.of("UNKNOWN", "unnamed")); |
###Microsoft.Spark.CSharp.Sql.DataFrameReader ####Summary
Interface used to load a DataFrame from external storage systems (e.g. file systems,
key-value stores, etc). Use SQLContext.read() to access this.
####Methods
Name | Description |
---|---|
Format | Specifies the input data source format. |
Schema | Specifies the input schema. Some data sources (e.g. JSON) can infer the input schema automatically from data. By specifying the schema here, the underlying data source can skip the schema inference step, and thus speed up data loading. |
Option | Adds an input option for the underlying data source. |
Options | Adds input options for the underlying data source. |
Load | Loads input in as a [[DataFrame]], for data sources that require a path (e.g. data backed by a local or distributed file system). |
Load | Loads input in as a DataFrame, for data sources that don't require a path (e.g. external key-value stores). |
Jdbc | Construct a [[DataFrame]] representing the database table accessible via JDBC URL, url named table and connection properties. |
Jdbc | Construct a DataFrame representing the database table accessible via JDBC URL url named table. Partitions of the table will be retrieved in parallel based on the parameters passed to this function. Don't create too many partitions in parallel on a large cluster; otherwise Spark might crash your external database systems. |
Jdbc | Construct a DataFrame representing the database table accessible via JDBC URL url named table using connection properties. The `predicates` parameter gives a list expressions suitable for inclusion in WHERE clauses; each one defines one partition of the DataFrame. Don't create too many partitions in parallel on a large cluster; otherwise Spark might crash your external database systems. |
Json | Loads a JSON file (one object per line) and returns the result as a DataFrame. This function goes through the input once to determine the input schema. If you know the schema in advance, use the version that specifies the schema to avoid the extra scan. |
Parquet | Loads a Parquet file, returning the result as a [[DataFrame]]. This function returns an empty DataFrame if no paths are passed in. |
###Microsoft.Spark.CSharp.Sql.DataFrameWriter ####Summary
Interface used to write a DataFrame to external storage systems (e.g. file systems,
key-value stores, etc). Use DataFrame.Write to access this.
See also http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrameWriter
####Methods
Name | Description |
---|---|
Mode | Specifies the behavior when data or table already exists. Options include: - `SaveMode.Overwrite`: overwrite the existing data. - `SaveMode.Append`: append the data. - `SaveMode.Ignore`: ignore the operation (i.e. no-op). - `SaveMode.ErrorIfExists`: default option, throw an exception at runtime. |
Mode | Specifies the behavior when data or table already exists. Options include: - `SaveMode.Overwrite`: overwrite the existing data. - `SaveMode.Append`: append the data. - `SaveMode.Ignore`: ignore the operation (i.e. no-op). - `SaveMode.ErrorIfExists`: default option, throw an exception at runtime. |
Format | Specifies the underlying output data source. Built-in options include "parquet", "json", etc. |
Option | Adds an output option for the underlying data source. |
Options | Adds output options for the underlying data source. |
PartitionBy | Partitions the output by the given columns on the file system. If specified, the output is laid out on the file system similar to Hive's partitioning scheme. This is only applicable for Parquet at the moment. |
Save | Saves the content of the DataFrame at the specified path. |
Save | Saves the content of the DataFrame as the specified table. |
InsertInto | Inserts the content of the DataFrame to the specified table. It requires that the schema of the DataFrame is the same as the schema of the table. Because it inserts data to an existing table, format or options will be ignored. |
SaveAsTable | Saves the content of the DataFrame as the specified table. In the case the table already exists, behavior of this function depends on the save mode, specified by the `mode` function (default to throwing an exception). When `mode` is `Overwrite`, the schema of the DataFrame does not need to be the same as that of the existing table. When `mode` is `Append`, the schema of the DataFrame need to be the same as that of the existing table, and format or options will be ignored. |
Jdbc | Saves the content of the DataFrame to a external database table via JDBC. In the case the table already exists in the external database, behavior of this function depends on the save mode, specified by the `mode` function (default to throwing an exception). Don't create too many partitions in parallel on a large cluster; otherwise Spark might crash your external database systems. |
Json | Saves the content of the DataFrame in JSON format at the specified path. This is equivalent to: Format("json").Save(path) |
Parquet | Saves the content of the DataFrame in JSON format at the specified path. This is equivalent to: Format("parquet").Save(path) |
###Microsoft.Spark.CSharp.Sql.Dataset ####Summary
Dataset is a strongly typed collection of domain-specific objects that can be transformed
in parallel using functional or relational operations.Each Dataset also has an untyped view
called a DataFrame, which is a Dataset of Row.
####Methods
Name | Description |
---|---|
ToDF | Converts this strongly typed collection of data to generic Dataframe. In contrast to the strongly typed objects that Dataset operations work on, a Dataframe returns generic[[Row]] objects that allow fields to be accessed by ordinal or name. |
PrintSchema | Prints the schema to the console in a nice tree format. |
Explain | Prints the plans (logical and physical) to the console for debugging purposes. |
Explain | Prints the physical plan to the console for debugging purposes. |
DTypes | Returns all column names and their data types as an array. |
Columns | Returns all column names as an array. |
Show | Displays the top 20 rows of Dataset in a tabular form. Strings more than 20 characters will be truncated, and all cells will be aligned right. |
ShowSchema | Prints schema |
###Microsoft.Spark.CSharp.Sql.Dataset`1 ####Summary
Dataset of specific types
Type parameter
###Microsoft.Spark.CSharp.Sql.HiveContext ####Summary
HiveContext is deprecated. Use SparkSession.Builder().EnableHiveSupport()
HiveContext is a variant of Spark SQL that integrates with data stored in Hive.
Configuration for Hive is read from hive-site.xml on the classpath.
It supports running both SQL and HiveQL commands.
####Methods
Name | Description |
---|---|
Sql | Executes a SQL query using Spark, returning the result as a DataFrame. The dialect that is used for SQL parsing can be configured with 'spark.sql.dialect' |
RefreshTable | Invalidate and refresh all the cached the metadata of the given table. For performance reasons, Spark SQL or the external data source library it uses might cache certain metadata about a table, such as the location of blocks. When those change outside of Spark SQL, users should call this function to invalidate the cache. |
###Microsoft.Spark.CSharp.Sql.PythonSerDe ####Summary
Used for SerDe of Python objects
####Methods
Name | Description |
---|---|
GetUnpickledObjects | Unpickles objects from byte[] |
###Microsoft.Spark.CSharp.Sql.RowConstructor ####Summary
Used by Unpickler to unpickle pickled objects. It is also used to construct a Row (C# representation of pickled objects).
####Methods
Name | Description |
---|---|
ToString | Returns a string that represents the current object. |
construct | Used by Unpickler - do not use to construct Row. Use GetRow() method |
GetRow | Used to construct a Row |
###Microsoft.Spark.CSharp.Sql.Row ####Summary
Represents one row of output from a relational operator.
####Methods
Name | Description |
---|---|
Returns a string that represents the current object. | |
Used by Unpickler - do not use to construct Row. Use GetRow() method | |
Used to construct a Row | |
Size | Number of elements in the Row. |
GetSchema | Schema for the row. |
Get | Returns the value at position i. |
Get | Returns the value of a given columnName. |
GetAs``1 | Returns the value at position i, the return value will be cast to type T. |
GetAs``1 | Returns the value of a given columnName, the return value will be cast to type T. |
###Microsoft.Spark.CSharp.Sql.Functions ####Summary
DataFrame Built-in functions
####Methods
Name | Description |
---|---|
Lit | Creates a Column of any literal value. |
Col | Returns a Column based on the given column name. |
Column | Returns a Column based on the given column name. |
Asc | Returns a sort expression based on ascending order of the column. |
Desc | Returns a sort expression based on the descending order of the column. |
Upper | Converts a string column to upper case. |
Lower | Converts a string column to lower case. |
Sqrt | Computes the square root of the specified float column. |
Abs | Computes the absolute value. |
Max | Returns the maximum value of the expression in a group. |
Min | Returns the minimum value of the expression in a group. |
First | Returns the first value in a group. |
Last | Returns the last value in a group. |
Count | Returns the number of items in a group. |
Sum | Returns the sum of all values in the expression. |
Avg | Returns the average of the values in a group. |
Mean | Returns the average of the values in a group. |
SumDistinct | Returns the sum of distinct values in the expression. |
Array | Creates a new array column. The input columns must all have the same data type. |
Coalesce | Returns the first column that is not null, or null if all inputs are null. |
CountDistinct | Returns the number of distinct items in a group. |
Struct | Creates a new struct column. |
ApproxCountDistinct | Returns the approximate number of distinct items in a group |
Explode | Creates a new row for each element in the given array or map column. |
Rand | Generate a random column with i.i.d. samples from U[0.0, 1.0]. |
Randn | Generate a column with i.i.d. samples from the standard normal distribution. |
Ntile | Returns the ntile group id (from 1 to n inclusive) in an ordered window partition. This is equivalent to the NTILE function in SQL. |
Acos | Computes the cosine inverse of the given column; the returned angle is in the range 0. |
Asin | Computes the sine inverse of the given column; the returned angle is in the range -pi/2 through pi/2. |
Atan | Computes the tangent inverse of the given column. |
Cbrt | Computes the cube-root of the given column. |
Ceil | Computes the ceiling of the given column. |
Cos | Computes the cosine of the given column. |
Cosh | Computes the hyperbolic cosine of the given column. |
Exp | Computes the exponential of the given column. |
Expm1 | Computes the exponential of the given value minus column. |
Floor | Computes the floor of the given column. |
Log | Computes the natural logarithm of the given column. |
Log10 | Computes the logarithm of the given column in base 10. |
Log1p | Computes the natural logarithm of the given column plus one. |
Rint | Returns the double value that is closest in value to the argument and is equal to a mathematical integer. |
Signum | Computes the signum of the given column. |
Sin | Computes the sine of the given column. |
Sinh | Computes the hyperbolic sine of the given column. |
Tan | Computes the tangent of the given column. |
Tanh | Computes the hyperbolic tangent of the given column. |
ToDegrees | Converts an angle measured in radians to an approximately equivalent angle measured in degrees. |
ToRadians | Converts an angle measured in degrees to an approximately equivalent angle measured in radians. |
BitwiseNOT | Computes bitwise NOT. |
Atan2 | Returns the angle theta from the conversion of rectangular coordinates (x, y) to polar coordinates (r, theta). |
Hypot | Computes sqrt(a2 + b2) without intermediate overflow or underflow. |
Hypot | Computes sqrt(a2 + b2) without intermediate overflow or underflow. |
Hypot | Computes sqrt(a2 + b2) without intermediate overflow or underflow. |
Pow | Returns the value of the first argument raised to the power of the second argument. |
Pow | Returns the value of the first argument raised to the power of the second argument. |
Pow | Returns the value of the first argument raised to the power of the second argument. |
ApproxCountDistinct | Returns the approximate number of distinct items in a group. |
When | Evaluates a list of conditions and returns one of multiple possible result expressions. |
Lag | Returns the value that is offset rows before the current row, and null if there is less than offset rows before the current row. |
Lead | Returns the value that is offset rows after the current row, and null if there is less than offset rows after the current row. |
RowNumber | Returns a sequential number starting at 1 within a window partition. |
DenseRank | Returns the rank of rows within a window partition, without any gaps. |
Rank | Returns the rank of rows within a window partition. |
CumeDist | Returns the cumulative distribution of values within a window partition |
PercentRank | Returns the relative rank (i.e. percentile) of rows within a window partition. |
MonotonicallyIncreasingId | A column expression that generates monotonically increasing 64-bit integers. |
SparkPartitionId | Partition ID of the Spark task. Note that this is indeterministic because it depends on data partitioning and task scheduling. |
Rand | Generate a random column with i.i.d. samples from U[0.0, 1.0]. |
Randn | Generate a column with i.i.d. samples from the standard normal distribution. |
Udf``1 | Defines a user-defined function of 0 arguments as user-defined function (UDF). The data types are automatically inferred based on the function's signature. |
Udf``2 | Defines a user-defined function of 1 arguments as user-defined function (UDF). The data types are automatically inferred based on the function's signature. |
Udf``3 | Defines a user-defined function of 2 arguments as user-defined function (UDF). The data types are automatically inferred based on the function's signature. |
Udf``4 | Defines a user-defined function of 3 arguments as user-defined function (UDF). The data types are automatically inferred based on the function's signature. |
Udf``5 | Defines a user-defined function of 4 arguments as user-defined function (UDF). The data types are automatically inferred based on the function's signature. |
Udf``6 | Defines a user-defined function of 5 arguments as user-defined function (UDF). The data types are automatically inferred based on the function's signature. |
Udf``7 | Defines a user-defined function of 6 arguments as user-defined function (UDF). The data types are automatically inferred based on the function's signature. |
Udf``8 | Defines a user-defined function of 7 arguments as user-defined function (UDF). The data types are automatically inferred based on the function's signature. |
Udf``9 | Defines a user-defined function of 8 arguments as user-defined function (UDF). The data types are automatically inferred based on the function's signature. |
Udf``10 | Defines a user-defined function of 9 arguments as user-defined function (UDF). The data types are automatically inferred based on the function's signature. |
Udf``11 | Defines a user-defined function of 10 arguments as user-defined function (UDF). The data types are automatically inferred based on the function's signature. |
###Microsoft.Spark.CSharp.Sql.SaveMode ####Summary
SaveMode is used to specify the expected behavior of saving a DataFrame to a data source.
####Methods
Name | Description |
---|---|
Gets the string for the value of SaveMode |
###Microsoft.Spark.CSharp.Sql.SaveModeExtensions ####Summary
For SaveMode.ErrorIfExists, the corresponding literal string in spark is "error" or "default".
####Methods
Name | Description |
---|---|
GetStringValue | Gets the string for the value of SaveMode |
###Microsoft.Spark.CSharp.Sql.SparkSession ####Summary
The entry point to programming Spark with the Dataset and DataFrame API.
####Methods
Name | Description |
---|---|
Builder | Builder for SparkSession |
NewSession | Start a new session with isolated SQL configurations, temporary tables, registered functions are isolated, but sharing the underlying [[SparkContext]] and cached data. Note: Other than the [[SparkContext]], all shared state is initialized lazily. This method will force the initialization of the shared state to ensure that parent and child sessions are set up with the same shared state. If the underlying catalog implementation is Hive, this will initialize the metastore, which may take some time. |
Stop | Stop underlying SparkContext |
Read | Returns a DataFrameReader that can be used to read non-streaming data in as a DataFrame |
CreateDataFrame | Creates a from a RDD containing array of object using the given schema. |
Table | Returns the specified table as a |
Sql | Executes a SQL query using Spark, returning the result as a DataFrame. The dialect that is used for SQL parsing can be configured with 'spark.sql.dialect' |
###Microsoft.Spark.CSharp.Sql.SqlContext ####Summary
The entry point for working with structured data (rows and columns) in Spark.
Allows the creation of [[DataFrame]] objects as well as the execution of SQL queries.
####Methods
Name | Description |
---|---|
GetOrCreate | Get the existing SQLContext or create a new one with given SparkContext. |
NewSession | Returns a new SQLContext as new session, that has separate SQLConf, registered temporary tables and UDFs, but shared SparkContext and table cache. |
GetConf | Returns the value of Spark SQL configuration property for the given key. If the key is not set, returns defaultValue. |
SetConf | Sets the given Spark SQL configuration property. |
Read | Returns a DataFrameReader that can be used to read data in as a DataFrame. |
ReadDataFrame | Loads a dataframe the source path using the given schema and options |
CreateDataFrame | Creates a from a RDD containing array of object using the given schema. |
RegisterDataFrameAsTable | Registers the given as a temporary table in the catalog. Temporary tables exist only during the lifetime of this instance of SqlContext. |
DropTempTable | Remove the temp table from catalog. |
Table | Returns the specified table as a |
Tables | Returns a containing names of tables in the given database. If is not specified, the current database will be used. The returned DataFrame has two columns: 'tableName' and 'isTemporary' (a column with bool type indicating if a table is a temporary one or not). |
TableNames | Returns a list of names of tables in the database |
CacheTable | Caches the specified table in-memory. |
UncacheTable | Removes the specified table from the in-memory cache. |
ClearCache | Removes all cached tables from the in-memory cache. |
IsCached | Returns true if the table is currently cached in-memory. |
Sql | Executes a SQL query using Spark, returning the result as a DataFrame. The dialect that is used for SQL parsing can be configured with 'spark.sql.dialect' |
JsonFile | Loads a JSON file (one object per line), returning the result as a DataFrame It goes through the entire dataset once to determine the schema. |
JsonFile | Loads a JSON file (one object per line) and applies the given schema |
TextFile | Loads text file with the specific column delimited using the given schema |
TextFile | Loads a text file (one object per line), returning the result as a DataFrame |
RegisterFunction``1 | Register UDF with no input argument, e.g: SqlContext.RegisterFunction<bool>("MyFilter", () => true); sqlContext.Sql("SELECT * FROM MyTable where MyFilter()"); |
RegisterFunction``2 | Register UDF with 1 input argument, e.g: SqlContext.RegisterFunction<bool, string>("MyFilter", (arg1) => arg1 != null); sqlContext.Sql("SELECT * FROM MyTable where MyFilter(columnName1)"); |
RegisterFunction``3 | Register UDF with 2 input arguments, e.g: SqlContext.RegisterFunction<bool, string, string>("MyFilter", (arg1, arg2) => arg1 != null && arg2 != null); sqlContext.Sql("SELECT * FROM MyTable where MyFilter(columnName1, columnName2)"); |
RegisterFunction``4 | Register UDF with 3 input arguments, e.g: SqlContext.RegisterFunction<bool, string, string, string>("MyFilter", (arg1, arg2, arg3) => arg1 != null && arg2 != null && arg3 != null); sqlContext.Sql("SELECT * FROM MyTable where MyFilter(columnName1, columnName2, columnName3)"); |
RegisterFunction``5 | Register UDF with 4 input arguments, e.g: SqlContext.RegisterFunction<bool, string, string, ..., string>("MyFilter", (arg1, arg2, ..., arg4) => arg1 != null && arg2 != null && ... && arg3 != null); sqlContext.Sql("SELECT * FROM MyTable where MyFilter(columnName1, columnName2, ..., columnName4)"); |
RegisterFunction``6 | Register UDF with 5 input arguments, e.g: SqlContext.RegisterFunction<bool, string, string, ..., string>("MyFilter", (arg1, arg2, ..., arg5) => arg1 != null && arg2 != null && ... && arg5 != null); sqlContext.Sql("SELECT * FROM MyTable where MyFilter(columnName1, columnName2, ..., columnName5)"); |
RegisterFunction``7 | Register UDF with 6 input arguments, e.g: SqlContext.RegisterFunction<bool, string, string, ..., string>("MyFilter", (arg1, arg2, ..., arg6) => arg1 != null && arg2 != null && ... && arg6 != null); sqlContext.Sql("SELECT * FROM MyTable where MyFilter(columnName1, columnName2, ..., columnName6)"); |
RegisterFunction``8 | Register UDF with 7 input arguments, e.g: SqlContext.RegisterFunction<bool, string, string, ..., string>("MyFilter", (arg1, arg2, ..., arg7) => arg1 != null && arg2 != null && ... && arg7 != null); sqlContext.Sql("SELECT * FROM MyTable where MyFilter(columnName1, columnName2, ..., columnName7)"); |
RegisterFunction``9 | Register UDF with 8 input arguments, e.g: SqlContext.RegisterFunction<bool, string, string, ..., string>("MyFilter", (arg1, arg2, ..., arg8) => arg1 != null && arg2 != null && ... && arg8 != null); sqlContext.Sql("SELECT * FROM MyTable where MyFilter(columnName1, columnName2, ..., columnName8)"); |
RegisterFunction``10 | Register UDF with 9 input arguments, e.g: SqlContext.RegisterFunction<bool, string, string, ..., string>("MyFilter", (arg1, arg2, ..., arg9) => arg1 != null && arg2 != null && ... && arg9 != null); sqlContext.Sql("SELECT * FROM MyTable where MyFilter(columnName1, columnName2, ..., columnName9)"); |
RegisterFunction``11 | Register UDF with 10 input arguments, e.g: SqlContext.RegisterFunction<bool, string, string, ..., string>("MyFilter", (arg1, arg2, ..., arg10) => arg1 != null && arg2 != null && ... && arg10 != null); sqlContext.Sql("SELECT * FROM MyTable where MyFilter(columnName1, columnName2, ..., columnName10)"); |
###Microsoft.Spark.CSharp.Sql.DataType ####Summary
The base type of all Spark SQL data types.
####Methods
Name | Description |
---|---|
ParseDataTypeFromJson | Parses a Json string to construct a DataType. |
ParseDataTypeFromJson | Parse a JToken object to construct a DataType. |
###Microsoft.Spark.CSharp.Sql.AtomicType ####Summary
An internal type used to represent a simple type.
###Microsoft.Spark.CSharp.Sql.ComplexType ####Summary
An internal type used to represent a complex type (such as arrays, structs, and maps).
####Methods
Name | Description |
---|---|
FromJson | Abstract method that constructs a complex type from a Json object |
FromJson | Constructs a complex type from a Json string |
###Microsoft.Spark.CSharp.Sql.NullType ####Summary
The data type representing NULL values.
###Microsoft.Spark.CSharp.Sql.StringType ####Summary
The data type representing String values.
###Microsoft.Spark.CSharp.Sql.BinaryType ####Summary
The data type representing binary values.
###Microsoft.Spark.CSharp.Sql.BooleanType ####Summary
The data type representing Boolean values.
###Microsoft.Spark.CSharp.Sql.DateType ####Summary
The data type representing Date values.
###Microsoft.Spark.CSharp.Sql.TimestampType ####Summary
The data type representing Timestamp values.
###Microsoft.Spark.CSharp.Sql.DoubleType ####Summary
The data type representing Double values.
###Microsoft.Spark.CSharp.Sql.FloatType ####Summary
###Microsoft.Spark.CSharp.Sql.ByteType ####Summary
The data type representing Float values.
###Microsoft.Spark.CSharp.Sql.IntegerType ####Summary
###Microsoft.Spark.CSharp.Sql.LongType ####Summary
The data type representing Int values.
###Microsoft.Spark.CSharp.Sql.ShortType ####Summary
The data type representing Short values.
###Microsoft.Spark.CSharp.Sql.DecimalType ####Summary
The data type representing Decimal values.
####Methods
Name | Description |
---|---|
FromJson | Constructs a DecimalType from a Json object |
###Microsoft.Spark.CSharp.Sql.ArrayType ####Summary
The data type for collections of multiple values.
####Methods
Name | Description |
---|---|
FromJson | Constructs a ArrayType from a Json object |
###Microsoft.Spark.CSharp.Sql.MapType ####Summary
The data type for Maps. Not implemented yet.
####Methods
Name | Description |
---|---|
FromJson | Constructs a StructField from a Json object. Not implemented yet. |
###Microsoft.Spark.CSharp.Sql.StructField ####Summary
A field inside a StructType.
####Methods
Name | Description |
---|---|
FromJson | Constructs a StructField from a Json object |
###Microsoft.Spark.CSharp.Sql.StructType ####Summary
Struct type, consisting of a list of StructField
This is the data type representing a Row
####Methods
Name | Description |
---|---|
FromJson | Constructs a StructType from a Json object |
###Microsoft.Spark.CSharp.Streaming.ConstantInputDStream`1 ####Summary
An input stream that always returns the same RDD on each timestep. Useful for testing.
####Methods
Name | Description |
---|
###Microsoft.Spark.CSharp.Streaming.DStream`1 ####Summary
A Discretized Stream (DStream), the basic abstraction in Spark Streaming,
is a continuous sequence of RDDs (of the same type) representing a
continuous stream of data (see ) in the Spark core documentation
for more details on RDDs).
DStreams can either be created from live data (such as, data from TCP
sockets, Kafka, Flume, etc.) using a or it can be
generated by transforming existing DStreams using operations such as
`Map`, `Window` and `ReduceByKeyAndWindow`. While a Spark Streaming
program is running, each DStream periodically generates a RDD, either
from live data or by transforming the RDD generated by a parent DStream.
DStreams internally is characterized by a few basic properties:
- A list of other DStreams that the DStream depends on
- A time interval at which the DStream generates an RDD
- A function that is used to generate an RDD after each time interval
####Methods
Name | Description |
---|---|
Count | Return a new DStream in which each RDD has a single element generated by counting each RDD of this DStream. |
Filter | Return a new DStream containing only the elements that satisfy predicate. |
FlatMap``1 | Return a new DStream by applying a function to all elements of this DStream, and then flattening the results |
Map``1 | Return a new DStream by applying a function to each element of DStream. |
MapPartitions``1 | Return a new DStream in which each RDD is generated by applying mapPartitions() to each RDDs of this DStream. |
MapPartitionsWithIndex``1 | Return a new DStream in which each RDD is generated by applying mapPartitionsWithIndex() to each RDDs of this DStream. |
Reduce | Return a new DStream in which each RDD has a single element generated by reducing each RDD of this DStream. |
ForeachRDD | Apply a function to each RDD in this DStream. |
ForeachRDD | Apply a function to each RDD in this DStream. |
Print the first num elements of each RDD generated in this DStream. @param num: the number of elements from the first will be printed. | |
Glom | Return a new DStream in which RDD is generated by applying glom() to RDD of this DStream. |
Cache | Persist the RDDs of this DStream with the default storage level . |
Persist | Persist the RDDs of this DStream with the given storage level |
Checkpoint | Enable periodic checkpointing of RDDs of this DStream |
CountByValue | Return a new DStream in which each RDD contains the counts of each distinct value in each RDD of this DStream. |
SaveAsTextFiles | Save each RDD in this DStream as text file, using string representation of elements. |
Transform``1 | Return a new DStream in which each RDD is generated by applying a function on each RDD of this DStream. `func` can have one argument of `rdd`, or have two arguments of (`time`, `rdd`) |
Transform``1 | Return a new DStream in which each RDD is generated by applying a function on each RDD of this DStream. `func` can have one argument of `rdd`, or have two arguments of (`time`, `rdd`) |
TransformWith``2 | Return a new DStream in which each RDD is generated by applying a function on each RDD of this DStream and 'other' DStream. `func` can have two arguments of (`rdd_a`, `rdd_b`) or have three arguments of (`time`, `rdd_a`, `rdd_b`) |
TransformWith``2 | Return a new DStream in which each RDD is generated by applying a function on each RDD of this DStream and 'other' DStream. `func` can have two arguments of (`rdd_a`, `rdd_b`) or have three arguments of (`time`, `rdd_a`, `rdd_b`) |
Repartition | Return a new DStream with an increased or decreased level of parallelism. |
Union | Return a new DStream by unifying data of another DStream with this DStream. @param other: Another DStream having the same interval (i.e., slideDuration) as this DStream. |
Slice | Return all the RDDs between 'fromTime' to 'toTime' (both included) |
Window | Return a new DStream in which each RDD contains all the elements in seen in a sliding window of time over this DStream. @param windowDuration: width of the window; must be a multiple of this DStream's batching interval @param slideDuration: sliding interval of the window (i.e., the interval after which the new DStream will generate RDDs); must be a multiple of this DStream's batching interval |
ReduceByWindow | Return a new DStream in which each RDD has a single element generated by reducing all elements in a sliding window over this DStream. if `invReduceFunc` is not None, the reduction is done incrementally using the old window's reduced value : 1. reduce the new values that entered the window (e.g., adding new counts) 2. "inverse reduce" the old values that left the window (e.g., subtracting old counts) This is more efficient than `invReduceFunc` is None. |
CountByWindow | Return a new DStream in which each RDD has a single element generated by counting the number of elements in a window over this DStream. windowDuration and slideDuration are as defined in the window() operation. This is equivalent to window(windowDuration, slideDuration).count(), but will be more efficient if window is large. |
CountByValueAndWindow | Return a new DStream in which each RDD contains the count of distinct elements in RDDs in a sliding window over this DStream. |
###Microsoft.Spark.CSharp.Streaming.EventHubsUtils ####Summary
Utility for creating streams from
####Methods
Name | Description |
---|---|
CreateUnionStream | Create a unioned EventHubs stream that receives data from Microsoft Azure Eventhubs The unioned stream will receive message from all partitions of the EventHubs |
###Microsoft.Spark.CSharp.Streaming.KafkaUtils ####Summary
Utils for Kafka input stream.
####Methods
Name | Description |
---|---|
CreateStream | Create an input stream that pulls messages from a Kafka Broker. |
CreateStream | Create an input stream that pulls messages from a Kafka Broker. |
CreateDirectStream | Create an input stream that directly pulls messages from a Kafka Broker and specific offset. This is not a receiver based Kafka input stream, it directly pulls the message from Kafka in each batch duration and processed without storing. This does not use Zookeeper to store offsets. The consumed offsets are tracked by the stream itself. For interoperability with Kafka monitoring tools that depend on Zookeeper, you have to update Kafka/Zookeeper yourself from the streaming application. You can access the offsets used in each batch from the generated RDDs (see [[org.apache.spark.streaming.kafka.HasOffsetRanges]]). To recover from driver failures, you have to enable checkpointing in the StreamingContext. The information on consumed offset can be recovered from the checkpoint. See the programming guide for details (constraints, etc.). |
CreateDirectStream``1 | Create an input stream that directly pulls messages from a Kafka Broker and specific offset. This is not a receiver based Kafka input stream, it directly pulls the message from Kafka in each batch duration and processed without storing. This does not use Zookeeper to store offsets. The consumed offsets are tracked by the stream itself. For interoperability with Kafka monitoring tools that depend on Zookeeper, you have to update Kafka/Zookeeper yourself from the streaming application. You can access the offsets used in each batch from the generated RDDs (see [[org.apache.spark.streaming.kafka.HasOffsetRanges]]). To recover from driver failures, you have to enable checkpointing in the StreamingContext. The information on consumed offset can be recovered from the checkpoint. See the programming guide for details (constraints, etc.). |
GetOffsetRange | create offset range from kafka messages when CSharpReader is enabled |
GetNumPartitionsFromConfig | topics should contain only one topic if choose to repartitions to a configured numPartitions TODO: move to scala and merge into DynamicPartitionKafkaRDD.getPartitions to remove above limitation |
###Microsoft.Spark.CSharp.Streaming.OffsetRange ####Summary
Kafka offset range
####Methods
Name | Description |
---|---|
ToString | OffsetRange string format |
###Microsoft.Spark.CSharp.Streaming.MapWithStateDStream`4 ####Summary
DStream representing the stream of data generated by `mapWithState` operation on a pair DStream.
Additionally, it also gives access to the stream of state snapshots, that is, the state data of all keys after a batch has updated them.
Type of the key
Type of the value
Type of the state data
Type of the mapped data
####Methods
Name | Description |
---|---|
StateSnapshots | Return a pair DStream where each RDD is the snapshot of the state of all the keys. |
###Microsoft.Spark.CSharp.Streaming.KeyedState`1 ####Summary
Class to hold a state instance and the timestamp when the state is updated or created.
No need to explicitly make this class clonable, since the serialization and deserialization in Worker is already a kind of clone mechanism.
Type of the state data
###Microsoft.Spark.CSharp.Streaming.MapWithStateRDDRecord`3 ####Summary
Record storing the keyed-state MapWithStateRDD.
Each record contains a stateMap and a sequence of records returned by the mapping function of MapWithState.
Note: don't need to explicitly make this class clonable, since the serialization and deserialization in Worker is already a kind of clone.
Type of the key
Type of the state data
Type of the mapped data
###Microsoft.Spark.CSharp.Streaming.StateSpec`4 ####Summary
Representing all the specifications of the DStream transformation `mapWithState` operation.
Type of the key
Type of the value
Type of the state data
Type of the mapped data
####Methods
Name | Description |
---|---|
NumPartitions | Set the number of partitions by which the state RDDs generated by `mapWithState` will be partitioned. Hash partitioning will be used. |
Timeout | Set the duration after which the state of an idle key will be removed. A key and its state is considered idle if it has not received any data for at least the given duration. The mapping function will be called one final time on the idle states that are going to be removed; [[org.apache.spark.streaming.State State.isTimingOut()]] set to `true` in that call. |
InitialState | Set the RDD containing the initial states that will be used by mapWithState |
###Microsoft.Spark.CSharp.Streaming.State`1 ####Summary
class for getting and updating the state in mapping function used in the `mapWithState` operation
Type of the state
####Methods
Name | Description |
---|---|
Exists | Returns whether the state already exists |
Get | Gets the state if it exists, otherwise it will throw ArgumentException. |
Update | Updates the state with a new value. |
Remove | Removes the state if it exists. |
IsTimingOut | Returns whether the state is timing out and going to be removed by the system after the current batch. |
###Microsoft.Spark.CSharp.Streaming.PairDStreamFunctions ####Summary
operations only available to KeyValuePair RDD
####Methods
Name | Description |
---|---|
ReduceByKey``2 | Return a new DStream by applying ReduceByKey to each RDD. |
CombineByKey``3 | Return a new DStream by applying combineByKey to each RDD. |
PartitionBy``2 | Return a new DStream in which each RDD are partitioned by numPartitions. |
MapValues``3 | Return a new DStream by applying a map function to the value of each key-value pairs in this DStream without changing the key. |
FlatMapValues``3 | Return a new DStream by applying a flatmap function to the value of each key-value pairs in this DStream without changing the key. |
GroupByKey``2 | Return a new DStream by applying groupByKey on each RDD. |
GroupWith``3 | Return a new DStream by applying 'cogroup' between RDDs of this DStream and `other` DStream. Hash partitioning is used to generate the RDDs with `numPartitions` partitions. |
Join``3 | Return a new DStream by applying 'join' between RDDs of this DStream and `other` DStream. Hash partitioning is used to generate the RDDs with `numPartitions` partitions. |
LeftOuterJoin``3 | Return a new DStream by applying 'left outer join' between RDDs of this DStream and `other` DStream. Hash partitioning is used to generate the RDDs with `numPartitions` partitions. |
RightOuterJoin``3 | Return a new DStream by applying 'right outer join' between RDDs of this DStream and `other` DStream. Hash partitioning is used to generate the RDDs with `numPartitions` partitions. |
FullOuterJoin``3 | Return a new DStream by applying 'full outer join' between RDDs of this DStream and `other` DStream. Hash partitioning is used to generate the RDDs with `numPartitions` partitions. |
GroupByKeyAndWindow``2 | Return a new DStream by applying `GroupByKey` over a sliding window. Similar to `DStream.GroupByKey()`, but applies it over a sliding window. |
ReduceByKeyAndWindow``2 | Return a new DStream by applying incremental `reduceByKey` over a sliding window. The reduced value of over a new window is calculated using the old window's reduce value : 1. reduce the new values that entered the window (e.g., adding new counts) 2. "inverse reduce" the old values that left the window (e.g., subtracting old counts) `invFunc` can be None, then it will reduce all the RDDs in window, could be slower than having `invFunc`. |
UpdateStateByKey``3 | Return a new "state" DStream where the state for each key is updated by applying the given function on the previous state of the key and the new values of the key. |
UpdateStateByKey``3 | Return a new "state" DStream where the state for each key is updated by applying the given function on the previous state of the key and the new values of the key. |
UpdateStateByKey``3 | Return a new "state" DStream where the state for each key is updated by applying the given function on the previous state of the key and the new values of the key. |
MapWithState``4 | Return a new "state" DStream where the state for each key is updated by applying the given function on the previous state of the key and the new values of the key. |
###Microsoft.Spark.CSharp.Streaming.CSharpInputDStreamUtils ####Summary
Utils for csharp input stream.
####Methods
Name | Description |
---|---|
CreateStream``1 | Create an input stream that user can control the data injection by C# code |
CreateStream``1 | Create an input stream that user can control the data injection by C# code |
###Microsoft.Spark.CSharp.Streaming.StreamingContext ####Summary
Main entry point for Spark Streaming functionality. It provides methods used to create
[[org.apache.spark.streaming.dstream.DStream]]s from various input sources. It can be either
created by providing a Spark master URL and an appName, or from a org.apache.spark.SparkConf
configuration (see core Spark documentation), or from an existing org.apache.spark.SparkContext.
The associated SparkContext can be accessed using `context.sparkContext`. After
creating and transforming DStreams, the streaming computation can be started and stopped
using `context.start()` and `context.stop()`, respectively.
`context.awaitTermination()` allows the current thread to wait for the termination
of the context by `stop()` or by an exception.
####Methods
Name | Description |
---|---|
GetOrCreate | Either recreate a StreamingContext from checkpoint data or create a new StreamingContext. If checkpoint data exists in the provided `checkpointPath`, then StreamingContext will be recreated from the checkpoint data. If the data does not exist, then the provided setupFunc will be used to create a JavaStreamingContext. |
Start | Start the execution of the streams. |
Stop | Stop the execution of the streams. |
Remember | Set each DStreams in this context to remember RDDs it generated in the last given duration. DStreams remember RDDs only for a limited duration of time and releases them for garbage collection. This method allows the developer to specify how long to remember the RDDs ( if the developer wishes to query old data outside the DStream computation). |
Checkpoint | Set the context to periodically checkpoint the DStream operations for driver fault-tolerance. |
SocketTextStream | Create an input from TCP source hostname:port. Data is received using a TCP socket and receive byte is interpreted as UTF8 encoded ``\\n`` delimited lines. |
TextFileStream | Create an input stream that monitors a Hadoop-compatible file system for new files and reads them as text files. Files must be wrriten to the monitored directory by "moving" them from another location within the same file system. File names starting with . are ignored. |
AwaitTermination | Wait for the execution to stop. |
AwaitTerminationOrTimeout | Wait for the execution to stop. |
Transform``1 | Create a new DStream in which each RDD is generated by applying a function on RDDs of the DStreams. The order of the JavaRDDs in the transform function parameter will be the same as the order of corresponding DStreams in the list. |
Union``1 | Create a unified DStream from multiple DStreams of the same type and same slide duration. |
###Microsoft.Spark.CSharp.Streaming.TransformedDStream`1 ####Summary
TransformedDStream is an DStream generated by an C# function
transforming each RDD of an DStream to another RDDs.
Multiple continuous transformations of DStream can be combined into
one transformation.