All the labs have been tested on PySpark 2.1.0; the reference labs are based on Spark 1.6.1.
-
'DataFrame' object has no attribute 'map'
Fix: DataFrame.map() was removed in Spark 2.x; call map() on the underlying RDD instead, i.e. mydf.map() --> mydf.rdd.map()
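For example, a minimal sketch of the fix (mydf is assumed to be an existing DataFrame; the lambda is only illustrative):
# Spark 2.x: map() lives on the RDD, not on the DataFrame.
row_lengths = mydf.rdd.map(lambda row: len(row))  # was: mydf.map(...)
print(row_lengths.take(5))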
-
When using classifiers from pyspark.ml.classification such as LogisticRegression, the features column of the input data must be of type org.apache.spark.ml.linalg.VectorUDT instead of org.apache.spark.mllib.linalg.VectorUDT.
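For example, a minimal sketch (assuming an existing SparkSession named spark) that builds the features column with the new ml vector type:
from pyspark.ml.linalg import Vectors  # ml, not mllib
from pyspark.ml.classification import LogisticRegression
# Toy data for illustration; Vectors.dense() produces the ml VectorUDT that the estimator expects.
train = spark.createDataFrame(
    [(1.0, Vectors.dense([0.0, 1.1, 0.1])),
     (0.0, Vectors.dense([2.0, 1.0, -1.0]))],
    ["label", "features"])
model = LogisticRegression(maxIter=10).fit(train)
If a DataFrame already holds the old mllib vectors, pyspark.mllib.util.MLUtils.convertVectorColumnsToML (Spark 2.x) can convert the column instead.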
-
Using Spark on yarn sometimes throws an error that numpy is required, while local mode has no such problem. E.g. in CS120/lab2, running
df.rdd.map(lambda x: LabeledPoint(x[0].split(",")[0], x[0].split(",")[1:])).toDF()
on Spark with yarn fails with an error that numpy is required; pyspark.mllib types such as LabeledPoint depend on numpy, so it presumably has to be installed on every executor node, not only on the driver.
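A quick way to check whether the executors themselves can import numpy (a hypothetical diagnostic snippet, assuming a running SparkSession named spark):
# Try importing numpy inside each partition, i.e. on the executors.
def check_numpy(_):
    try:
        import numpy  # noqa: F401
        return ["numpy OK"]
    except ImportError:
        return ["numpy MISSING"]
print(spark.sparkContext.parallelize(range(4), 4).mapPartitions(check_numpy).collect())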