-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-11497] [MLlib] [Python] PySpark RowMatrix Constructor Has Type Erasure Issue #9458
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Test build #44992 has finished for PR 9458 at commit
|
|
@mengxr This PR is ready for review and discussion. |
|
Note: The following reproduces this issue on the latest Git head, 1.5.1, and 1.5.0: from pyspark.mllib.linalg.distributed import RowMatrix
rows = sc.parallelize([[3, -6], [4, -8], [0, 1]])
mat = RowMatrix(rows)
mat._java_matrix_wrapper.call("tallSkinnyQR", True)Should result in the following exception: |
|
Sorry for the long delay! I looked through the discussions, and this LGTM. I'll run tests once before merging it. |
|
Test build #2207 has finished for PR 9458 at commit
|
|
Merged with master, branch-1.6 and branch-1.5 |
…rasure Issue As noted in PR #9441, implementing `tallSkinnyQR` uncovered a bug with our PySpark `RowMatrix` constructor. As discussed on the dev list [here](http://apache-spark-developers-list.1001551.n3.nabble.com/K-Means-And-Class-Tags-td10038.html), there appears to be an issue with type erasure with RDDs coming from Java, and by extension from PySpark. Although we are attempting to construct a `RowMatrix` from an `RDD[Vector]` in [PythonMLlibAPI](https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala#L1115), the `Vector` type is erased, resulting in an `RDD[Object]`. Thus, when calling Scala's `tallSkinnyQR` from PySpark, we get a Java `ClassCastException` in which an `Object` cannot be cast to a Spark `Vector`. As noted in the aforementioned dev list thread, this issue was also encountered with `DecisionTrees`, and the fix involved an explicit `retag` of the RDD with a `Vector` type. `IndexedRowMatrix` and `CoordinateMatrix` do not appear to have this issue likely due to their related helper functions in `PythonMLlibAPI` creating the RDDs explicitly from DataFrames with pattern matching, thus preserving the types. This PR currently contains that retagging fix applied to the `createRowMatrix` helper function in `PythonMLlibAPI`. This PR blocks #9441, so once this is merged, the other can be rebased. cc holdenk Author: Mike Dusenberry <mwdusenb@us.ibm.com> Closes #9458 from dusenberrymw/SPARK-11497_PySpark_RowMatrix_Constructor_Has_Type_Erasure_Issue. (cherry picked from commit 1b82203) Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
…rasure Issue As noted in PR #9441, implementing `tallSkinnyQR` uncovered a bug with our PySpark `RowMatrix` constructor. As discussed on the dev list [here](http://apache-spark-developers-list.1001551.n3.nabble.com/K-Means-And-Class-Tags-td10038.html), there appears to be an issue with type erasure with RDDs coming from Java, and by extension from PySpark. Although we are attempting to construct a `RowMatrix` from an `RDD[Vector]` in [PythonMLlibAPI](https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala#L1115), the `Vector` type is erased, resulting in an `RDD[Object]`. Thus, when calling Scala's `tallSkinnyQR` from PySpark, we get a Java `ClassCastException` in which an `Object` cannot be cast to a Spark `Vector`. As noted in the aforementioned dev list thread, this issue was also encountered with `DecisionTrees`, and the fix involved an explicit `retag` of the RDD with a `Vector` type. `IndexedRowMatrix` and `CoordinateMatrix` do not appear to have this issue likely due to their related helper functions in `PythonMLlibAPI` creating the RDDs explicitly from DataFrames with pattern matching, thus preserving the types. This PR currently contains that retagging fix applied to the `createRowMatrix` helper function in `PythonMLlibAPI`. This PR blocks #9441, so once this is merged, the other can be rebased. cc holdenk Author: Mike Dusenberry <mwdusenb@us.ibm.com> Closes #9458 from dusenberrymw/SPARK-11497_PySpark_RowMatrix_Constructor_Has_Type_Erasure_Issue. (cherry picked from commit 1b82203) Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
|
Great, thanks @jkbradley! |
As noted in PR #9441, implementing
tallSkinnyQRuncovered a bug with our PySparkRowMatrixconstructor. As discussed on the dev list here, there appears to be an issue with type erasure with RDDs coming from Java, and by extension from PySpark. Although we are attempting to construct aRowMatrixfrom anRDD[Vector]in PythonMLlibAPI, theVectortype is erased, resulting in anRDD[Object]. Thus, when calling Scala'stallSkinnyQRfrom PySpark, we get a JavaClassCastExceptionin which anObjectcannot be cast to a SparkVector. As noted in the aforementioned dev list thread, this issue was also encountered withDecisionTrees, and the fix involved an explicitretagof the RDD with aVectortype.IndexedRowMatrixandCoordinateMatrixdo not appear to have this issue likely due to their related helper functions inPythonMLlibAPIcreating the RDDs explicitly from DataFrames with pattern matching, thus preserving the types.This PR currently contains that retagging fix applied to the
createRowMatrixhelper function inPythonMLlibAPI. This PR blocks #9441, so once this is merged, the other can be rebased.cc @holdenk