This repository has been archived by the owner on Dec 4, 2019. It is now read-only.

[WIP] Converts dataframe to/from named numpy arrays #4

Open
wants to merge 5 commits into
base: master
from


thunterdb
Contributor

I found this incredibly convenient for creating small dataframes. Here is how you can use it:

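import numpy.random as rd  # assumed alias for the `rd` used below

n = 5
A = rd.rand(n, 4)            # n rows, each a length-4 vector
C = rd.randint(10, size=n)   # n integers in [0, 10)
# `conv` is assumed to be a Converter instance
df = conv.pack_DataFrame(a=A, c=C)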

DataFrame[a: vector, c: bigint]

The reverse conversion is also provided. It recovers the correct shape for vectors, matrices, etc.

Z = Converter.df_to_numpy(df)
# Each column is strictly equal to the original.
Z['a'] == A
Z['c'] == C
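A note on checking the round trip: on numpy arrays, `==` compares element by element and returns a boolean array, so a single yes/no answer needs something like `np.array_equal`. The sketch below uses a plain dict of arrays as a stand-in for `Z` (an assumption, so that it runs without Spark); it is not the converter's actual output.

```python
import numpy as np

n = 5
rng = np.random.RandomState(0)
A = rng.rand(n, 4)
C = rng.randint(10, size=n)
originals = {"a": A, "c": C}

# Hypothetical stand-in for the converter's output: column name -> array.
Z = {name: arr.copy() for name, arr in originals.items()}

# `==` yields one boolean per element, not a single truth value.
elementwise = Z["a"] == A

# np.array_equal checks both shape and values and returns a plain bool.
columns_match = all(np.array_equal(Z[name], arr) for name, arr in originals.items())
```

In a test, `columns_match` is the value to assert on; `elementwise` would need `.all()` first.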

Still missing: more tests, better names, and sparse vectors. I am not sure how easy it is to support sparse vectors, because their shape can vary from row to row. It is probably easier to disallow them and force users to use the CSC conversion that you already wrote.
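To make the shape issue concrete, here is a minimal sketch of the row-wise packing idea using only numpy. The helper names `pack_rows` and `unpack_rows` are hypothetical and are not the functions in this PR; rows are plain dicts rather than Spark rows. Unpacking can only stack cells that share one shape, which is why columns with irregular per-row shapes are rejected here.

```python
import numpy as np

def pack_rows(**columns):
    """Turn named arrays with a common leading dimension into a list of row dicts."""
    arrays = {name: np.asarray(arr) for name, arr in columns.items()}
    lengths = {name: arr.shape[0] for name, arr in arrays.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError("columns differ in length: %r" % lengths)
    n = next(iter(lengths.values()), 0)
    return [{name: arr[i] for name, arr in arrays.items()} for i in range(n)]

def unpack_rows(rows):
    """Rebuild one array per column; every cell in a column must share a shape."""
    if not rows:
        return {}
    out = {}
    for name in rows[0]:
        cells = [np.asarray(row[name]) for row in rows]
        shapes = {cell.shape for cell in cells}
        if len(shapes) != 1:
            raise ValueError("column %r has irregular shapes: %r" % (name, shapes))
        # Stacking adds the row axis back in front of the per-cell shape.
        out[name] = np.stack(cells)
    return out

rng = np.random.RandomState(0)
A = rng.rand(5, 4)       # one vector per row
C = rng.randint(10, size=5)  # one scalar per row
M = rng.rand(5, 2, 3)    # one matrix per row
restored = unpack_rows(pack_rows(a=A, c=C, m=M))
```

A column whose cells have different lengths, as a naive sparse encoding would, raises `ValueError` in `unpack_rows` instead of producing a ragged result.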

@@ -161,3 +163,84 @@ def toScipy(self, X):
        else:
            raise TypeError("Converter.toScipy expected numpy.ndarray of"
                            " scipy.sparse.csr.csr_matrix instances, but found: %s" % type(X))

    @staticmethod
    def _analyze_element(x):
Contributor

I assume this will be very slow for larger data? That's OK for now.

Contributor Author

Yes it will; we can always improve it later.
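For context on the performance concern, the sketch below contrasts inspecting every element in a Python loop with reading the type and per-row shape once from an array that is already a regular `ndarray`. The function names are hypothetical, and this is only one possible later improvement, not what the PR implements.

```python
import numpy as np

def analyze_element(x):
    """Return (dtype, shape) for a single cell."""
    arr = np.asarray(x)
    return arr.dtype, arr.shape

def analyze_column_slow(cells):
    """Inspect every cell: one Python-level call per row."""
    seen = {analyze_element(x) for x in cells}
    if len(seen) != 1:
        raise ValueError("cells do not share one dtype and shape: %r" % seen)
    return seen.pop()

def analyze_column_fast(column):
    """For a regular ndarray, dtype and per-row shape are known without a loop."""
    if isinstance(column, np.ndarray) and column.dtype != object:
        return column.dtype, column.shape[1:]
    # Lists and object arrays still need the per-element pass.
    return analyze_column_slow(column)

col = np.zeros((1000, 4))
slow_result = analyze_column_slow(col)
fast_result = analyze_column_fast(col)
```

Both paths agree on regular input; the fast path simply avoids the per-row work when the array's own metadata already answers the question.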

        return arr

    def pack_DataFrame(self, **kwargs):
        """ Converts a set of numpy arrays into a single dataframe.
Contributor

The docs should list the supported input types and how they are handled: lists of common types, or lists of vector or numerical array types.

Contributor Author

Done. I also added that we support only a subset of numpy types (there are so many) and of SQL types.
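As an illustration of what "a subset of numpy types" could look like when documented, here is a hypothetical mapping from numpy dtypes to SQL type names. The table and the `describe_column` helper are assumptions made for this sketch, not the list in the actual docstring; the `bigint` and `vector` names follow the `DataFrame[a: vector, c: bigint]` output in the description above, and higher-dimensional input is left out for brevity.

```python
import numpy as np

# Hypothetical supported subset: numpy dtype -> SQL type name.
SUPPORTED = {
    np.dtype(np.bool_): "boolean",
    np.dtype(np.int32): "int",
    np.dtype(np.int64): "bigint",
    np.dtype(np.float32): "float",
    np.dtype(np.float64): "double",
}

def describe_column(arr):
    """Name the column type for a 1-D scalar column or a 2-D float column."""
    arr = np.asarray(arr)
    if arr.dtype not in SUPPORTED:
        raise TypeError("unsupported numpy dtype: %s" % arr.dtype)
    if arr.ndim == 1:
        return SUPPORTED[arr.dtype]
    if arr.ndim == 2 and arr.dtype.kind == "f":
        return "vector"
    raise TypeError("unsupported shape %r for dtype %s" % (arr.shape, arr.dtype))

scalar_type = describe_column(np.zeros(3, dtype=np.int64))
vector_type = describe_column(np.zeros((5, 4)))
```

Failing loudly on anything outside the table keeps the documented list and the behaviour in step.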

@jkbradley
Contributor

Just had a couple more comments.

@thunterdb
Contributor Author

@jkbradley comments addressed

@vlad17
Contributor

vlad17 commented Jun 28, 2016

This PR should unskip the following test: test_cv_lasso_with_mllib_featurization (spark_sklearn.tests.test_grid_search_2.CVTests) ... SKIP: disable this test until we have numpy <-> dataframe conversion

@srowen
Collaborator

srowen commented Dec 7, 2018

I'm starting to look through the open PRs to see if we can merge them or whether they're stale -- @thunterdb is this one too old to resurrect?


4 participants