
Commit 181f217

migration guide, remove old language guides
1 parent e11a0da commit 181f217

4 files changed: +53 -383 lines changed

docs/java-programming-guide.md

Lines changed: 2 additions & 213 deletions
@@ -1,218 +1,7 @@
 ---
 layout: global
 title: Java Programming Guide
+redirect: programming-guide.html
 ---
 
-The Spark Java API exposes all the Spark features available in the Scala version to Java.
-To learn the basics of Spark, we recommend reading through the
-[Scala programming guide](programming-guide.html) first; it should be
-easy to follow even if you don't know Scala.
-This guide will show how to use the Spark features described there in Java.
-
-The Spark Java API is defined in the
-[`org.apache.spark.api.java`](api/java/index.html?org/apache/spark/api/java/package-summary.html) package, and includes
-a [`JavaSparkContext`](api/java/index.html?org/apache/spark/api/java/JavaSparkContext.html) for
-initializing Spark and [`JavaRDD`](api/java/index.html?org/apache/spark/api/java/JavaRDD.html) classes,
-which support the same methods as their Scala counterparts but take Java functions and return
-Java data and collection types. The main differences have to do with passing functions to RDD
-operations (e.g. map) and handling RDDs of different types, as discussed next.
-
-# Key Differences in the Java API
-
-There are a few key differences between the Java and Scala APIs:
-
-* Java does not support anonymous or first-class functions, so functions are passed
-  using anonymous classes that implement the
-  [`org.apache.spark.api.java.function.Function`](api/java/index.html?org/apache/spark/api/java/function/Function.html),
-  [`Function2`](api/java/index.html?org/apache/spark/api/java/function/Function2.html), etc.
-  interfaces.
-* To maintain type safety, the Java API defines specialized Function and RDD
-  classes for key-value pairs and doubles. For example,
-  [`JavaPairRDD`](api/java/index.html?org/apache/spark/api/java/JavaPairRDD.html)
-  stores key-value pairs.
-* Some methods are defined on the basis of the passed function's return type.
-  For example `mapToPair()` returns
-  [`JavaPairRDD`](api/java/index.html?org/apache/spark/api/java/JavaPairRDD.html),
-  and `mapToDouble()` returns
-  [`JavaDoubleRDD`](api/java/index.html?org/apache/spark/api/java/JavaDoubleRDD.html).
-* RDD methods like `collect()` and `countByKey()` return Java collections types,
-  such as `java.util.List` and `java.util.Map`.
-* Key-value pairs, which are simply written as `(key, value)` in Scala, are represented
-  by the `scala.Tuple2` class, and need to be created using `new Tuple2<K, V>(key, value)`.
-
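As a quick illustration of these differences (a minimal sketch, assuming an existing `JavaRDD<String>` named `lines`; the variable names are only for illustration):

{% highlight java %}
import java.util.List;
import scala.Tuple2;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.PairFunction;

// Functions are passed as anonymous classes implementing the function interfaces.
JavaRDD<Integer> lengths = lines.map(new Function<String, Integer>() {
  @Override public Integer call(String s) { return s.length(); }
});

// Key-value pairs are scala.Tuple2 objects, created explicitly.
JavaPairRDD<String, Integer> pairs = lines.mapToPair(new PairFunction<String, String, Integer>() {
  @Override public Tuple2<String, Integer> call(String s) { return new Tuple2<String, Integer>(s, 1); }
});

// collect() returns a plain java.util.List.
List<Integer> collected = lengths.collect();
{% endhighlight %}
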
-## RDD Classes
-
-Spark defines additional operations on RDDs of key-value pairs and doubles, such
-as `reduceByKey`, `join`, and `stdev`.
-
-In the Scala API, these methods are automatically added using Scala's
-[implicit conversions](http://www.scala-lang.org/node/130) mechanism.
-
-In the Java API, the extra methods are defined in the
-[`JavaPairRDD`](api/java/index.html?org/apache/spark/api/java/JavaPairRDD.html)
-and [`JavaDoubleRDD`](api/java/index.html?org/apache/spark/api/java/JavaDoubleRDD.html)
-classes. RDD methods like `map` are overloaded by specialized `PairFunction`
-and `DoubleFunction` classes, allowing them to return RDDs of the appropriate
-types. Common methods like `filter` and `sample` are implemented by
-each specialized RDD class, so filtering a `PairRDD` returns a new `PairRDD`,
-etc (this achieves the "same-result-type" principle used by the [Scala collections
-framework](http://docs.scala-lang.org/overviews/core/architecture-of-scala-collections.html)).
-
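A short sketch of what the specialized classes add (again assuming an existing `JavaRDD<String> lines` and a `JavaPairRDD<String, Integer> pairs`):

{% highlight java %}
import scala.Tuple2;
import org.apache.spark.api.java.JavaDoubleRDD;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.function.DoubleFunction;
import org.apache.spark.api.java.function.Function;

// mapToDouble produces a JavaDoubleRDD, which adds numeric operations such as stdev().
JavaDoubleRDD lineLengths = lines.mapToDouble(new DoubleFunction<String>() {
  @Override public double call(String s) { return s.length(); }
});
double stdev = lineLengths.stdev();

// filter on a JavaPairRDD returns another JavaPairRDD (the "same-result-type" principle).
JavaPairRDD<String, Integer> nonZero = pairs.filter(new Function<Tuple2<String, Integer>, Boolean>() {
  @Override public Boolean call(Tuple2<String, Integer> kv) { return kv._2() > 0; }
});
{% endhighlight %}
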
-## Function Interfaces
-
-The following table lists the function interfaces used by the Java API, located in the
-[`org.apache.spark.api.java.function`](api/java/index.html?org/apache/spark/api/java/function/package-summary.html)
-package. Each interface has a single abstract method, `call()`.
-
-<table class="table">
-<tr><th>Class</th><th>Function Type</th></tr>
-
-<tr><td>Function&lt;T, R&gt;</td><td>T =&gt; R </td></tr>
-<tr><td>DoubleFunction&lt;T&gt;</td><td>T =&gt; Double </td></tr>
-<tr><td>PairFunction&lt;T, K, V&gt;</td><td>T =&gt; Tuple2&lt;K, V&gt; </td></tr>
-
-<tr><td>FlatMapFunction&lt;T, R&gt;</td><td>T =&gt; Iterable&lt;R&gt; </td></tr>
-<tr><td>DoubleFlatMapFunction&lt;T&gt;</td><td>T =&gt; Iterable&lt;Double&gt; </td></tr>
-<tr><td>PairFlatMapFunction&lt;T, K, V&gt;</td><td>T =&gt; Iterable&lt;Tuple2&lt;K, V&gt;&gt; </td></tr>
-
-<tr><td>Function2&lt;T1, T2, R&gt;</td><td>T1, T2 =&gt; R (function of two arguments)</td></tr>
-</table>
-
-## Storage Levels
-
-RDD [storage level](programming-guide.html#rdd-persistence) constants, such as `MEMORY_AND_DISK`, are
-declared in the [org.apache.spark.api.java.StorageLevels](api/java/index.html?org/apache/spark/api/java/StorageLevels.html) class. To
-define your own storage level, you can use StorageLevels.create(...).
-
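For instance, a custom level that uses both disk and memory, keeps data deserialized, and replicates it twice could be sketched as follows (assuming the four-argument `create` overload and an existing `JavaRDD` named `rdd`):

{% highlight java %}
import org.apache.spark.api.java.StorageLevels;
import org.apache.spark.storage.StorageLevel;

// Arguments: useDisk, useMemory, deserialized, replication -- an illustrative custom level.
StorageLevel myLevel = StorageLevels.create(true, true, true, 2);
rdd.persist(myLevel);
{% endhighlight %}
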
-# Other Features
-
-The Java API supports other Spark features, including
-[accumulators](programming-guide.html#accumulators),
-[broadcast variables](programming-guide.html#broadcast-variables), and
-[caching](programming-guide.html#rdd-persistence).
-
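A minimal sketch of the first two of these (assuming an existing `JavaSparkContext` named `jsc`):

{% highlight java %}
import org.apache.spark.Accumulator;
import org.apache.spark.broadcast.Broadcast;

// Broadcast a read-only lookup structure to all nodes; tasks read it with lookup.value().
Broadcast<int[]> lookup = jsc.broadcast(new int[] {1, 2, 3});

// Accumulators are write-only on the workers: tasks call counter.add(1),
// and only the driver reads the total with counter.value().
Accumulator<Integer> counter = jsc.accumulator(0);
{% endhighlight %}
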
-# Upgrading From Pre-1.0 Versions of Spark
-
-In version 1.0 of Spark the Java API was refactored to better support Java 8
-lambda expressions. Users upgrading from older versions of Spark should note
-the following changes:
-
-* All `org.apache.spark.api.java.function.*` have been changed from abstract
-  classes to interfaces. This means that concrete implementations of these
-  `Function` classes will need to use `implements` rather than `extends`.
-* Certain transformation functions now have multiple versions depending
-  on the return type. In Spark core, the map functions (`map`, `flatMap`, and
-  `mapPartitions`) have type-specific versions, e.g.
-  [`mapToPair`](api/java/org/apache/spark/api/java/JavaRDDLike.html#mapToPair(org.apache.spark.api.java.function.PairFunction))
-  and [`mapToDouble`](api/java/org/apache/spark/api/java/JavaRDDLike.html#mapToDouble(org.apache.spark.api.java.function.DoubleFunction)).
-  Spark Streaming also uses the same approach, e.g. [`transformToPair`](api/java/org/apache/spark/streaming/api/java/JavaDStreamLike.html#transformToPair(org.apache.spark.api.java.function.Function)).
-
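For the first of these changes, the upgrade is usually a one-word edit, roughly as sketched here (`lines` is assumed to be an existing `JavaRDD<String>`):

{% highlight java %}
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;

// Pre-1.0, when Function was an abstract class:
//   class GetLength extends Function<String, Integer> { ... }

// 1.0 and later, with Function as an interface:
class GetLength implements Function<String, Integer> {
  @Override public Integer call(String s) { return s.length(); }
}

JavaRDD<Integer> lengths = lines.map(new GetLength());
{% endhighlight %}
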
-# Example
-
-As an example, we will implement word count using the Java API.
-
-{% highlight java %}
-import org.apache.spark.api.java.*;
-import org.apache.spark.api.java.function.*;
-
-JavaSparkContext jsc = new JavaSparkContext(...);
-JavaRDD<String> lines = jsc.textFile("hdfs://...");
-JavaRDD<String> words = lines.flatMap(
-  new FlatMapFunction<String, String>() {
-    @Override public Iterable<String> call(String s) {
-      return Arrays.asList(s.split(" "));
-    }
-  }
-);
-{% endhighlight %}
-
-The word count program starts by creating a `JavaSparkContext`, which accepts
-the same parameters as its Scala counterpart. `JavaSparkContext` supports the
-same data loading methods as the regular `SparkContext`; here, `textFile`
-loads lines from text files stored in HDFS.
-
-To split the lines into words, we use `flatMap` to split each line on
-whitespace. `flatMap` is passed a `FlatMapFunction` that accepts a string and
-returns an `java.lang.Iterable` of strings.
-
-Here, the `FlatMapFunction` was created inline; another option is to subclass
-`FlatMapFunction` and pass an instance to `flatMap`:
-
-{% highlight java %}
-class Split extends FlatMapFunction<String, String> {
-  @Override public Iterable<String> call(String s) {
-    return Arrays.asList(s.split(" "));
-  }
-}
-JavaRDD<String> words = lines.flatMap(new Split());
-{% endhighlight %}
-
-Java 8+ users can also write the above `FlatMapFunction` in a more concise way using
-a lambda expression:
-
-{% highlight java %}
-JavaRDD<String> words = lines.flatMap(s -> Arrays.asList(s.split(" ")));
-{% endhighlight %}
-
-This lambda syntax can be applied to all anonymous classes in Java 8.
-
-Continuing with the word count example, we map each word to a `(word, 1)` pair:
-
-{% highlight java %}
-import scala.Tuple2;
-JavaPairRDD<String, Integer> ones = words.mapToPair(
-  new PairFunction<String, String, Integer>() {
-    @Override public Tuple2<String, Integer> call(String s) {
-      return new Tuple2<String, Integer>(s, 1);
-    }
-  }
-);
-{% endhighlight %}
-
-Note that `mapToPair` was passed a `PairFunction<String, String, Integer>` and
-returned a `JavaPairRDD<String, Integer>`.
-
-To finish the word count program, we will use `reduceByKey` to count the
-occurrences of each word:
-
-{% highlight java %}
-JavaPairRDD<String, Integer> counts = ones.reduceByKey(
-  new Function2<Integer, Integer, Integer>() {
-    @Override public Integer call(Integer i1, Integer i2) {
-      return i1 + i2;
-    }
-  }
-);
-{% endhighlight %}
-
-Here, `reduceByKey` is passed a `Function2`, which implements a function with
-two arguments. The resulting `JavaPairRDD` contains `(word, count)` pairs.
-
-In this example, we explicitly showed each intermediate RDD. It is also
-possible to chain the RDD transformations, so the word count example could also
-be written as:
-
-{% highlight java %}
-JavaPairRDD<String, Integer> counts = lines.flatMapToPair(
-    ...
-  ).map(
-    ...
-  ).reduceByKey(
-    ...
-  );
-{% endhighlight %}
-
-There is no performance difference between these approaches; the choice is
-just a matter of style.
-
-# API Docs
-
-[API documentation](api/java/index.html) for Spark in Java is available in Javadoc format.
-
-# Where to Go from Here
-
-Spark includes several sample programs using the Java API in
-[`examples/src/main/java`](https://github.com/apache/spark/tree/master/examples/src/main/java/org/apache/spark/examples). You can run them by passing the class name to the
-`bin/run-example` script included in Spark; for example:
-
-    ./bin/run-example JavaWordCount README.md
+This document has been merged into the [Spark programming guide](programming-guide.html).

docs/mllib-guide.md

Lines changed: 3 additions & 3 deletions
@@ -31,7 +31,7 @@ MLlib is a new component under active development.
 The APIs marked `Experimental`/`DeveloperApi` may change in future releases,
 and we will provide migration guide between releases.
 
-## Dependencies
+# Dependencies
 
 MLlib uses linear algebra packages [Breeze](http://www.scalanlp.org/), which depends on
 [netlib-java](https://github.com/fommil/netlib-java), and
@@ -50,9 +50,9 @@ To use MLlib in Python, you will need [NumPy](http://www.numpy.org) version 1.4
 
 ---
 
-## Migration guide
+# Migration Guide
 
-### From 0.9 to 1.0
+## From 0.9 to 1.0
 
 In MLlib v1.0, we support both dense and sparse input in a unified way, which introduces a few
 breaking changes. If your data is sparse, please store it in a sparse format instead of dense to

docs/programming-guide.md

Lines changed: 46 additions & 1 deletion
@@ -1249,14 +1249,56 @@ cluster mode. The cluster location will be found based on HADOOP_CONF_DIR.
 </table>
 
 
+# Migrating from pre-1.0 Versions of Spark
+
+<div class="codetabs">
+
+<div data-lang="scala" markdown="1">
+
+Spark 1.0 freezes the API of Spark Core for the 1.X series, in that any API available today that is
+not marked "experimental" or "developer API" will be supported in future versions.
+The only change for Scala users is that the grouping operations, e.g. `groupByKey`, `cogroup` and `join`,
+have changed from returning `(Key, Seq[Value])` pairs to `(Key, Iterable[Value])`.
+
+</div>
+
+<div data-lang="java" markdown="1">
+
+Spark 1.0 freezes the API of Spark Core for the 1.X series, in that any API available today that is
+not marked "experimental" or "developer API" will be supported in future versions.
+Several changes were made to the Java API:
+
+* The Function classes in `org.apache.spark.api.java.function` became interfaces in 1.0, meaning that old
+  code that `extends Function` should `implement Function` instead.
+* New variants of the `map` transformations, like `mapToPair` and `mapToDouble`, were added to create RDDs
+  of special data types.
+* Grouping operations like `groupByKey`, `cogroup` and `join` have changed from returning
+  `(Key, List<Value>)` pairs to `(Key, Iterable<Value>)`.
+
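As a rough sketch of what the last two changes look like in user code (assuming an existing `JavaRDD<String>` named `words`):

{% highlight java %}
import scala.Tuple2;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.function.PairFunction;

// Pair RDDs are now built with mapToPair rather than an overloaded map.
JavaPairRDD<String, Integer> pairs = words.mapToPair(new PairFunction<String, String, Integer>() {
  @Override public Tuple2<String, Integer> call(String w) { return new Tuple2<String, Integer>(w, 1); }
});

// groupByKey now yields an Iterable<Integer> per key instead of a List<Integer>.
JavaPairRDD<String, Iterable<Integer>> grouped = pairs.groupByKey();
for (Tuple2<String, Iterable<Integer>> entry : grouped.collect()) {
  for (Integer count : entry._2()) {
    // iterate rather than index into a List
  }
}
{% endhighlight %}
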
+</div>
+
+<div data-lang="python" markdown="1">
+
+Spark 1.0 freezes the API of Spark Core for the 1.X series, in that any API available today that is
+not marked "experimental" or "developer API" will be supported in future versions.
+The only change for Python users is that the grouping operations, e.g. `groupByKey`, `cogroup` and `join`,
+have changed from returning (key, list of values) pairs to (key, iterable of values).
+
+</div>
+
+</div>
+
+Migration guides are also available for [Spark Streaming](streaming-programming-guide.html#migration-guide-from-091-or-below-to-1x)
+and [MLlib](mllib-guide.html#migration-guide).
+
+
 # Where to Go from Here
 
 You can see some [example Spark programs](http://spark.apache.org/examples.html) on the Spark website.
 In addition, Spark includes several samples in the `examples` directory
 ([Scala]({{site.SPARK_GITHUB_URL}}/tree/master/examples/src/main/scala/org/apache/spark/examples),
 [Java]({{site.SPARK_GITHUB_URL}}/tree/master/examples/src/main/java/org/apache/spark/examples),
 [Python]({{site.SPARK_GITHUB_URL}}/tree/master/examples/src/main/python)).
-Some of them have both Spark versions and local (non-parallel) versions, allowing you to see what was changed to make the program run on a cluster.
 You can run Java and Scala examples by passing the class name to Spark's `bin/run-example` script; for instance:
 
     ./bin/run-example SparkPi
@@ -1270,3 +1312,6 @@ For help on optimizing your programs, the [configuration](configuration.html) an
 making sure that your data is stored in memory in an efficient format.
 For help on deploying, the [cluster mode overview](cluster-overview.html) describes the components involved
 in distributed operation and supported cluster managers.
+
+Finally, full API documentation is available in
+[Scala](api/scala/#org.apache.spark.package), [Java](api/java/) and [Python](api/python/).
