[WIP] Converts dataframe to/from named numpy arrays #4

thunterdb · 2015-12-02T00:23:19Z

I found this incredibly convenient for creating small dataframes. Here is how you can use it:

n = 5
A = rd.rand(n,4)
C = rd.randint(10, size=n)
df = conv.pack_DataFrame(a=A, c=C)

DataFrame[a: vector, c: bigint]

The reverse conversion works too. It recovers the correct shape for vectors, matrices, etc.

Z = Converter.df_to_numpy(df)
# Each column is strictly equal to the original.
Z['a'] == A
Z['c'] == C
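
Note that `==` on numpy arrays compares elementwise and returns a boolean array, so a round-trip check is usually written with `np.array_equal`. A minimal sketch, assuming a plain dict stands in for the result of `Converter.df_to_numpy` so that no Spark session is needed; `round_trip_ok` is a hypothetical helper name, not part of this PR:

```python
import numpy as np


def round_trip_ok(original, recovered):
    """Return True if every named array came back with the same shape and values."""
    if set(original) != set(recovered):
        return False
    # np.array_equal is False when shapes differ, so no separate shape check is needed.
    return all(np.array_equal(original[name], recovered[name]) for name in original)


n = 5
rng = np.random.RandomState(0)
A = rng.rand(n, 4)
C = rng.randint(10, size=n)

# Stand-in for Converter.df_to_numpy(df): a dict of column name -> array (assumption).
Z = {"a": A.copy(), "c": C.copy()}

elementwise = Z["a"] == A            # boolean array with the same shape as A
print(elementwise.shape)
print(round_trip_ok({"a": A, "c": C}, Z))
```

The helper also reports a failure when a column comes back with a different shape, which is the property the reverse conversion is meant to preserve.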

Still missing: more tests, better names, and support for sparse vectors. I am not sure how easy sparse vectors are to support, because their shape is irregular from row to row. It is probably easier to reject them and force users to use the CSC conversion that you already wrote.
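
To make the irregular-shape problem concrete, here is a small numpy-only sketch. The three rows are made-up data, and the layout shown is the row-compressed (CSR-style) one for brevity; CSC is the column-oriented analogue. None of the names below come from the PR:

```python
import numpy as np

# Three hypothetical sparse rows of logical length 6, stored as (indices, values).
rows = [
    (np.array([0, 3]), np.array([1.0, 2.0])),
    (np.array([1]), np.array([5.0])),
    (np.array([0, 2, 5]), np.array([7.0, 8.0, 9.0])),
]

# The per-row storage is ragged: each row holds a different number of entries,
# so the rows cannot be stacked into one rectangular (n_rows, k) array.
nnz_per_row = [len(idx) for idx, _ in rows]

# A compressed layout flattens everything into three regular 1-D arrays.
data = np.concatenate([vals for _, vals in rows])
indices = np.concatenate([idx for idx, _ in rows])
indptr = np.concatenate([[0], np.cumsum(nnz_per_row)])


def dense_row(i, size=6):
    """Rebuild row i as a dense vector by slicing the flat arrays with indptr."""
    out = np.zeros(size)
    lo, hi = indptr[i], indptr[i + 1]
    out[indices[lo:hi]] = data[lo:hi]
    return out


print(nnz_per_row)
print(indptr.tolist())
print(dense_row(2).tolist())
```

A dense column has one fixed per-row shape that can be recorded once, whereas here the only regular structure is the flat triple, which is why routing sparse data through a dedicated sparse-matrix conversion is the simpler option.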

jkbradley · 2015-12-02T00:49:18Z

python/pdspark/converter.py

@@ -161,3 +163,84 @@ def toScipy(self, X):
        else:
            raise TypeError("Converter.toScipy expected numpy.ndarray of"
                            " scipy.sparse.csr.csr_matrix instances, but found: %s" % type(X))
+
+    @staticmethod
+    def _analyze_element(x):


I assume this will be very slow for larger data? That's OK for now.

Yes it will; we can always improve it later.
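
For context on the performance concern, a sketch of the general pattern. The body of `_analyze_element` is not shown in this thread, so `analyze_element`, `infer_by_scanning`, and `infer_from_array` below are hypothetical stand-ins rather than the PR's code. The idea is that inspecting every element costs one Python-level call per row, while a regular ndarray already carries its dtype and per-row shape:

```python
import numpy as np


def analyze_element(x):
    """Hypothetical: describe one element as (dtype name, shape)."""
    arr = np.asarray(x)
    return arr.dtype.name, arr.shape


def infer_by_scanning(column):
    """Inspect every element in a Python loop: simple, but one call per row."""
    descriptions = {analyze_element(x) for x in column}
    if len(descriptions) != 1:
        raise ValueError("elements differ in type or shape: %r" % sorted(descriptions))
    return descriptions.pop()


def infer_from_array(column):
    """For a regular ndarray, read the dtype and per-row shape once, without a loop."""
    return column.dtype.name, column.shape[1:]


A = np.random.RandomState(0).rand(5, 4)
print(infer_by_scanning(A))
print(infer_from_array(A))
```

The scanning version is still useful as a fallback for inputs such as lists of arrays, where the shapes are not guaranteed to agree.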

jkbradley · 2015-12-16T18:00:16Z

python/pdspark/converter.py

+      return arr
+
+    def pack_DataFrame(self, **kwargs):
+      """ Converts a set of numpy arrays into a single dataframe.


The docs should list the supported input types and how they are handled: lists of common types, or lists of vector or numerical array types.

Done. I also noted in the docs that we support only a subset of numpy types (there are so many) and SQL types.
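
As an illustration of what "a subset of numpy types" could look like, here is a sketch of a dtype lookup. The table and the `describe_column` helper are assumptions for illustration, not the mapping implemented in this PR; the only pairing confirmed above is that an integer array became `bigint` and a 2-D float array became `vector`. The SQL names are the usual Spark short names for these types:

```python
import numpy as np

# Hypothetical mapping from numpy dtype names to Spark SQL type names.
_SCALAR_SQL_TYPES = {
    "bool": "boolean",
    "int32": "int",
    "int64": "bigint",
    "float32": "float",
    "float64": "double",
}


def describe_column(arr):
    """Return the SQL type name a column would get under this sketch's rules."""
    arr = np.asarray(arr)
    if arr.dtype.name not in _SCALAR_SQL_TYPES:
        raise TypeError("unsupported numpy dtype: %s" % arr.dtype.name)
    if arr.ndim == 1:
        # One scalar per row.
        return _SCALAR_SQL_TYPES[arr.dtype.name]
    if arr.ndim == 2:
        # One fixed-length numeric vector per row.
        return "vector"
    # Higher ranks (matrices per row, etc.) are left out of this sketch.
    raise TypeError("unsupported array rank: %d" % arr.ndim)


print(describe_column(np.zeros((5, 4))))
print(describe_column(np.arange(5, dtype=np.int64)))
```

Rejecting unknown dtypes with an explicit error keeps the documented list of supported types and the actual behavior in sync.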

jkbradley · 2015-12-16T18:00:47Z

Just had a couple more comments.

thunterdb · 2015-12-21T21:54:20Z

@jkbradley comments addressed

vlad17 · 2016-06-28T00:22:02Z

This PR should unskip the following: test_cv_lasso_with_mllib_featurization (spark_sklearn.tests.test_grid_search_2.CVTests) ... SKIP: disable this test until we have numpy <-> dataframe conversion

srowen · 2018-12-07T21:08:03Z

I'm starting to look through the open PRs to see if we can merge them or whether they're stale -- @thunterdb is this one too old to resurrect?

5babe88 work

jkbradley reviewed Dec 2, 2015

41ed092 comments

jkbradley reviewed Dec 16, 2015

thunterdb added 3 commits December 21, 2015 13:32

042f54d changes

ac0a0b6 comments

38ae150 removing old code

srowen added the enhancement label Dec 9, 2018
