Skip to content

Latest commit

 

History

History
11 lines (7 loc) · 717 Bytes

diff.md

File metadata and controls

11 lines (7 loc) · 717 Bytes

All the labs have been tested on PySpark 2.1.0 Reference labs are based on Spark 1.6.1

  1. 'DataFrame' object has no attribute 'map' mydf.map() --> mydf.rdd.map()

  2. When using regression functions from pyspark.ml.classification such as LogisticRegression, column features of input data must be of type org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7 instead of org.apache.spark.mllib.linalg.VectorUDT@f71b0bce.

  3. It sometimes throws an error that numpy is required using Spark with yarn,, but using with local mode has no such problem. E.g. CS120/lab2 running df.rdd.map(lambda x: LabeledPoint(x[0].split(",")[0], x[0].split(",")[1:])).toDF() on Spark yarn will get an error that numpy is required.