[WIP] Converts dataframe to/from named numpy arrays #4
base: master
@@ -161,3 +163,84 @@ def toScipy(self, X):
        else:
            raise TypeError("Converter.toScipy expected numpy.ndarray of"
                            " scipy.sparse.csr.csr_matrix instances, but found: %s" % type(X))

    @staticmethod
    def _analyze_element(x):
I assume this will be very slow for larger data? That's OK for now.
Yes it will; we can always improve it later.
        return arr

    def pack_DataFrame(self, **kwargs):
        """ Converts a set of numpy arrays into a single dataframe.
The docs should list the supported input types and how they are handled: lists of common types, or lists of vector or numerical array types.
Done. I also added that we support a subset of numpy types (there are so many) and sql types.
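To make the "subset of numpy types" concrete, here is a minimal sketch of what such a lookup could look like. The table `_NUMPY_TO_SQL` and the helper `sql_type_for` are hypothetical names for illustration, not code from this PR, and whether the PR represents vector columns as SQL array types or as MLlib vector types is not shown in this conversation.

```python
import numpy as np

# Hypothetical mapping, for illustration only: a small subset of numpy
# dtypes and the Spark SQL type names they could correspond to.
_NUMPY_TO_SQL = {
    np.dtype("float64"): "double",
    np.dtype("float32"): "float",
    np.dtype("int64"): "bigint",
    np.dtype("int32"): "int",
    np.dtype("bool"): "boolean",
}


def sql_type_for(arr):
    """Return a SQL type name for a column backed by `arr` (hypothetical helper).

    The first axis indexes rows, so a 1-D array is a scalar column and each
    extra dimension wraps the element type in one more array<...>.
    """
    arr = np.asarray(arr)
    try:
        base = _NUMPY_TO_SQL[arr.dtype]
    except KeyError:
        # Anything outside the supported subset is rejected explicitly.
        raise TypeError("unsupported numpy dtype: %s" % arr.dtype)
    for _ in range(arr.ndim - 1):
        base = "array<%s>" % base
    return base
```

Rejecting unknown dtypes with a `TypeError` mirrors how `toScipy` reports unexpected input above.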
Just had a couple more comments.
@jkbradley comments addressed
This PR should unskip the following test: test_cv_lasso_with_mllib_featurization (spark_sklearn.tests.test_grid_search_2.CVTests) ... SKIP: disable this test until we have numpy <-> dataframe conversion
I'm starting to look through the open PRs to see if we can merge them or whether they're stale -- @thunterdb is this one too old to resurrect?
I found this incredibly convenient for creating small dataframes. Here is how you can use it:
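As a rough illustration of the packing idea without needing a Spark context, the sketch below turns named numpy arrays into a list of row dicts. `pack_rows` is a hypothetical stand-in, not the `pack_DataFrame` method from this PR; it only assumes that each keyword names a column and that the first axis of each array indexes rows.

```python
import numpy as np


def pack_rows(**kwargs):
    """Hypothetical stand-in for the packing step: named arrays -> row dicts."""
    columns = {name: np.asarray(values) for name, values in kwargs.items()}
    lengths = {arr.shape[0] for arr in columns.values()}
    if len(lengths) != 1:
        raise ValueError("all arrays must have the same number of rows, got %s"
                         % sorted(lengths))
    (n_rows,) = lengths
    # .tolist() turns numpy scalars and sub-arrays into plain Python values.
    return [{name: arr[i].tolist() for name, arr in columns.items()}
            for i in range(n_rows)]


rows = pack_rows(x=np.array([1.0, 2.0]), v=np.array([[1, 2], [3, 4]]))
# rows == [{'x': 1.0, 'v': [1, 2]}, {'x': 2.0, 'v': [3, 4]}]
```

Rows in this plain-Python form are the kind of input that Spark's `createDataFrame` can take, which is presumably why the real method is handy for small test dataframes.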
And here is the other conversion. It extracts the correct shape for vectors, matrices, etc.
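The shape recovery described here can be sketched as follows, again without Spark. `unpack_rows` is a hypothetical helper that assumes the rows have already been collected as dicts of plain Python values; stacking each column lets numpy infer the shape `(n_rows, ...)` from the nesting of the cells.

```python
import numpy as np


def unpack_rows(rows):
    """Hypothetical inverse step: collected row dicts -> dict of numpy arrays."""
    if not rows:
        return {}
    # Stack each column; numpy infers (n_rows, *cell_shape) from the nesting.
    return {name: np.array([row[name] for row in rows]) for name in rows[0]}


arrays = unpack_rows([
    {"x": 1.0, "m": [[1, 2], [3, 4]]},
    {"x": 2.0, "m": [[5, 6], [7, 8]]},
])
# arrays["x"].shape == (2,) and arrays["m"].shape == (2, 2, 2)
```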
Still missing: more tests, better names, and support for sparse vectors. I'm not sure how easy sparse vectors are to support, because their shape is irregular from row to row. It is probably easier to disallow them and force users to use the CSC conversion that you already wrote.
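A minimal sketch of the "disallow it" option, assuming the cells of a column arrive as nested Python lists: check that every cell has the same shape before stacking, and fail loudly otherwise. `stack_regular` is a hypothetical name, and the error message only points at the existing sparse conversion in general terms.

```python
import numpy as np


def stack_regular(cells):
    """Stack per-row cells into one array, rejecting irregular shapes.

    Hypothetical guard: the stored entries of sparse vectors differ in
    length from row to row, so they cannot form a rectangular array.
    """
    shapes = {np.shape(cell) for cell in cells}
    if len(shapes) > 1:
        raise ValueError("irregular cell shapes %s; use the sparse conversion "
                         "instead" % sorted(shapes))
    return np.array(cells)


dense = stack_regular([[1.0, 0.0], [0.0, 2.0]])
# dense.shape == (2, 2)
```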