
Commit cc00e99

[SPARK-21434][PYTHON][DOCS] Add pyspark pip documentation.
## What changes were proposed in this pull request?

Update the Quickstart and RDD programming guides to mention pip.

## How was this patch tested?

Built docs locally.

Author: Holden Karau <holden@us.ibm.com>

Closes #18698 from holdenk/SPARK-21434-add-pyspark-pip-documentation.
1 parent 113399b commit cc00e99

2 files changed: +38 −2 lines changed

docs/quick-start.md

Lines changed: 26 additions & 1 deletion
@@ -66,6 +66,11 @@ res3: Long = 15
 
     ./bin/pyspark
 
+
+Or if PySpark is installed with pip in your current environment:
+
+    pyspark
+
 Spark's primary abstraction is a distributed collection of items called a Dataset. Datasets can be created from Hadoop InputFormats (such as HDFS files) or by transforming other Datasets. Due to Python's dynamic nature, we don't need the Dataset to be strongly-typed in Python. As a result, all Datasets in Python are Dataset[Row], and we call it `DataFrame` to be consistent with the data frame concept in Pandas and R. Let's make a new DataFrame from the text of the README file in the Spark source directory:
 
 {% highlight python %}
@@ -206,7 +211,7 @@ a cluster, as described in the [RDD programming guide](rdd-programming-guide.htm
 
 # Self-Contained Applications
 Suppose we wish to write a self-contained application using the Spark API. We will walk through a
-simple application in Scala (with sbt), Java (with Maven), and Python.
+simple application in Scala (with sbt), Java (with Maven), and Python (pip).
 
 <div class="codetabs">
 <div data-lang="scala" markdown="1">
@@ -367,6 +372,16 @@ Lines with a: 46, Lines with b: 23
 
 Now we will show how to write an application using the Python API (PySpark).
 
+
+If you are building a packaged PySpark application or library, you can add it to your setup.py file as:
+
+{% highlight python %}
+install_requires=[
+    'pyspark=={{site.SPARK_VERSION}}'
+]
+{% endhighlight %}
+
+
 As an example, we'll create a simple Spark application, `SimpleApp.py`:
 
 {% highlight python %}
@@ -406,6 +421,16 @@ $ YOUR_SPARK_HOME/bin/spark-submit \
 Lines with a: 46, Lines with b: 23
 {% endhighlight %}
 
+If you have PySpark pip installed into your environment (e.g. `pip install pyspark`), you can run your application with the regular Python interpreter or use the provided spark-submit as you prefer.
+
+{% highlight bash %}
+# Use the Python interpreter to run your application
+$ python SimpleApp.py
+...
+Lines with a: 46, Lines with b: 23
+{% endhighlight %}
+
+
 </div>
 </div>
 
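For context on the `install_requires` snippet added above, a minimal `setup.py` for a packaged PySpark application might look like the following sketch; the package name, version, and the concrete pyspark pin are illustrative assumptions and are not part of this commit:

{% highlight python %}
# Hypothetical minimal setup.py for a packaged PySpark application.
# Only the pyspark entry in install_requires mirrors the docs snippet;
# the name, version, and concrete version pin below are illustrative.
from setuptools import setup, find_packages

setup(
    name="my-pyspark-app",   # illustrative package name
    version="0.1.0",
    packages=find_packages(),
    install_requires=[
        # The docs template this as 'pyspark=={{site.SPARK_VERSION}}';
        # pin the Spark release you actually target, for example:
        "pyspark==2.2.0",
    ],
)
{% endhighlight %}

With a setup.py like this, running `pip install -e .` in the project directory pulls in PySpark alongside the application code.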

docs/rdd-programming-guide.md

Lines changed: 12 additions & 1 deletion
@@ -89,7 +89,18 @@ import org.apache.spark.SparkConf;
 Spark {{site.SPARK_VERSION}} works with Python 2.7+ or Python 3.4+. It can use the standard CPython interpreter,
 so C libraries like NumPy can be used. It also works with PyPy 2.3+.
 
-To run Spark applications in Python, use the `bin/spark-submit` script located in the Spark directory.
+Python 2.6 support was removed in Spark 2.2.0.
+
+Spark applications in Python can either be run with the `bin/spark-submit` script, which includes Spark at runtime, or by including it in your setup.py as:
+
+{% highlight python %}
+install_requires=[
+    'pyspark=={{site.SPARK_VERSION}}'
+]
+{% endhighlight %}
+
+
+To run Spark applications in Python without pip installing PySpark, use the `bin/spark-submit` script located in the Spark directory.
 This script will load Spark's Java/Scala libraries and allow you to submit applications to a cluster.
 You can also use `bin/pyspark` to launch an interactive Python shell.
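As a usage sketch of the pip-based workflow both guides now describe (assuming `pip install pyspark` has already been run; the file name and the tiny job below are illustrative and not taken from this patch), a self-contained script like this can be launched with the plain `python` interpreter instead of `bin/spark-submit`:

{% highlight python %}
# example_app.py -- illustrative only; run with `python example_app.py`
# once the pyspark package is pip installed in the current environment.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("PipInstalledExample").getOrCreate()

# A trivial job: count the lines of a local text file.
lines = spark.read.text("README.md")
print("Number of lines: %d" % lines.count())

spark.stop()
{% endhighlight %}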

0 commit comments
