## What changes were proposed in this pull request?
Update the Quickstart and RDD programming guides to mention pip.
## How was this patch tested?
Built docs locally.
Author: Holden Karau <holden@us.ibm.com>
Closes #18698 from holdenk/SPARK-21434-add-pyspark-pip-documentation.
docs/quick-start.md (26 additions, 1 deletion)
@@ -66,6 +66,11 @@ res3: Long = 15

    ./bin/pyspark

+Or if PySpark is installed with pip in your current environment:
+
+    pyspark
+
Spark's primary abstraction is a distributed collection of items called a Dataset. Datasets can be created from Hadoop InputFormats (such as HDFS files) or by transforming other Datasets. Due to Python's dynamic nature, we don't need the Dataset to be strongly-typed in Python. As a result, all Datasets in Python are Dataset[Row], and we call it `DataFrame` to be consistent with the data frame concept in Pandas and R. Let's make a new DataFrame from the text of the README file in the Spark source directory:

{% highlight python %}
@@ -206,7 +211,7 @@ a cluster, as described in the [RDD programming guide](rdd-programming-guide.htm

# Self-Contained Applications

Suppose we wish to write a self-contained application using the Spark API. We will walk through a
-simple application in Scala (with sbt), Java (with Maven), and Python.
+simple application in Scala (with sbt), Java (with Maven), and Python (pip).

<div class="codetabs">
<div data-lang="scala" markdown="1">
@@ -367,6 +372,16 @@ Lines with a: 46, Lines with b: 23

Now we will show how to write an application using the Python API (PySpark).

+If you are building a packaged PySpark application or library you can add it to your setup.py file as:
+
+{% highlight python %}
+install_requires=[
+    'pyspark=={site.SPARK_VERSION}'
+]
+{% endhighlight %}
+
As an example, we'll create a simple Spark application, `SimpleApp.py`:
+If you have PySpark pip installed into your environment (e.g. `pip install pyspark`), you can run your application with the regular Python interpreter or use the provided `spark-submit` as you prefer.
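Not part of the patch, but for context: a minimal sketch of how the `install_requires` snippet above might sit in a complete `setup.py`. The package name `simpleapp` and the pinned version `2.2.0` are placeholders, not anything specified by this PR.

```python
# Hypothetical setup.py for a pip-packaged PySpark application.
# Only the install_requires entry mirrors the documentation change above;
# the name, version, and package layout are illustrative placeholders.
from setuptools import setup, find_packages

setup(
    name="simpleapp",
    version="0.1.0",
    packages=find_packages(),
    install_requires=[
        # Pin PySpark to the Spark version you are targeting,
        # e.g. 2.2.0 for a Spark 2.2.0 cluster.
        "pyspark==2.2.0",
    ],
)
```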
docs/rdd-programming-guide.md

Spark {{site.SPARK_VERSION}} works with Python 2.7+ or Python 3.4+. It can use the standard CPython interpreter,
so C libraries like NumPy can be used. It also works with PyPy 2.3+.

-To run Spark applications in Python, use the `bin/spark-submit` script located in the Spark directory.
+Python 2.6 support was removed in Spark 2.2.0.
+
+Spark applications in Python can either be run with the `bin/spark-submit` script, which includes Spark at runtime, or by including it in your setup.py as:
+
+{% highlight python %}
+install_requires=[
+    'pyspark=={site.SPARK_VERSION}'
+]
+{% endhighlight %}
+
+To run Spark applications in Python without pip installing PySpark, use the `bin/spark-submit` script located in the Spark directory.
This script will load Spark's Java/Scala libraries and allow you to submit applications to a cluster.
You can also use `bin/pyspark` to launch an interactive Python shell.
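For illustration only (not part of this patch): with PySpark pip-installed, an application in the style of the quick start's `SimpleApp.py` can be launched either with `bin/spark-submit SimpleApp.py` or with the plain `python SimpleApp.py`. The input path `README.md` and the app name below are placeholders.

```python
# SimpleApp.py -- minimal sketch of a self-contained PySpark application.
# Runnable with `python SimpleApp.py` when pyspark is pip-installed,
# or with `bin/spark-submit SimpleApp.py` from a Spark distribution.
from pyspark.sql import SparkSession

if __name__ == "__main__":
    spark = SparkSession.builder.appName("SimpleApp").getOrCreate()

    # Count lines containing 'a' and 'b' in a text file (path is a placeholder).
    log_data = spark.read.text("README.md").cache()
    num_as = log_data.filter(log_data.value.contains("a")).count()
    num_bs = log_data.filter(log_data.value.contains("b")).count()

    print("Lines with a: %i, lines with b: %i" % (num_as, num_bs))

    spark.stop()
```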