DataFrames: what's inside? #1119
Comments
Thanks for writing this summary. I'd be inclined to take option 2, but option 1 has the advantage of simplicity. Regarding the column interface, …
I like option 2. Better support of …
I like option 2 as well. I definitely think it is good for users to be able to make columns either … By the way, at the moment it really isn't at all apparent how to make a column a …
What, specifically, are the objectives against which these approaches are being evaluated?
@davidagold, the motivation here is the inconsistency between certain DataFrames constructors allowing a user to input anything they want as a "column" vs. other, more opinionated constructors that force conversion to NullableArray/CategoricalArray. And finally, the fact that many current DataFrame operations rely on certain methods being defined on a "column", yet this implicit interface is not formally acknowledged or documented anywhere. The goal with this issue would be more explicit documentation of what's expected for a DataFrame "column".
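Roughly what that inconsistency looked like in practice — a hedged reconstruction against a NullableArrays-era DataFrames release; the exact constructors and output the comment has in mind may differ:

```julia
using DataFrames  # assumes a NullableArrays-era version (~0.8/0.9)

v = [1, 2, 3]            # a plain Vector{Int}

# One of the "opinionated" constructors: the column is run through the
# upgrade_vector! code path and stored as a NullableArray, not as given.
df = DataFrame(a = v)
typeof(df[:a])           # NullableArray{Int64,1}, not Vector{Int64}
```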
I should have been clearer. I think I meant to ask:
People are voicing support for option 2, which suggests that an implicit objective is to support … It seems the end goals of this design decision extend beyond documentation, since otherwise we could just, well, write documentation =p I'm just trying to figure out how to weigh these different options.
I would like to be able to assign a distributed or out-of-core vector to a df column. Edit: I would assume that would also mean minimizing use of the iteration interface, since element-by-element iteration would likely be slow for non-in-memory data structures. Though, I think @shashi and @MikeInnes were working on an interface for writing storage-location-generic iterative code? Maybe this is a good motivating case.
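Under an interface-based design, such a column could be any `AbstractVector` subtype. A minimal sketch — `FileBackedVector` and its storage details are invented for illustration, only the standard `AbstractVector` machinery is real:

```julia
# Hypothetical: a column whose data lives outside memory (file, database,
# distributed store). Names are invented; size/getindex/IndexStyle are the
# standard Julia AbstractVector hooks.
struct FileBackedVector{T} <: AbstractVector{T}
    path::String   # where the data would actually live
    len::Int
end

Base.size(v::FileBackedVector) = (v.len,)
Base.IndexStyle(::Type{<:FileBackedVector}) = IndexLinear()

function Base.getindex(v::FileBackedVector{T}, i::Int) where {T}
    @boundscheck checkbounds(v, i)
    # A real implementation would seek into v.path and deserialize element i.
    # Hitting storage once per element is exactly why generic, element-wise
    # iteration would be slow for this kind of column.
    return zero(T)
end
```

Anything DataFrames did through a documented column interface would then work on such a type, while a specialized package could still add faster bulk operations.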
I'm not sure DataFrame is the best structure to work with distributed data. Since it does not include any specific support for this kind of vector, it won't be able to optimize operations like a dedicated structure would. Or the package would need to be improved to handle this use case specifically. @quinnj What did you have in mind when you mentioned this?
Right, but I was thinking it could define fallback abstract methods that a "BigDataframe" package could overload.
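The pattern being suggested is ordinary multiple dispatch: a generic fallback defined against `AbstractVector`, which an external package overloads for its own column type. A sketch with invented names (`column_summary` and `ChunkedColumn` are not real DataFrames APIs):

```julia
# Generic fallback that any in-memory column supports.
column_summary(col::AbstractVector) = (n = length(col), eltype = eltype(col))

# Hypothetical column type from a "BigDataframe"-style package.
struct ChunkedColumn{T} <: AbstractVector{T}
    chunks::Vector{Vector{T}}
end

Base.size(c::ChunkedColumn) = (sum(length, c.chunks),)

function Base.getindex(c::ChunkedColumn, i::Int)
    for chunk in c.chunks
        i <= length(chunk) && return chunk[i]
        i -= length(chunk)
    end
    throw(BoundsError(c, i))
end

# The external package overloads the fallback with a chunk-aware method,
# avoiding element-by-element access.
column_summary(c::ChunkedColumn) = (n = length(c), eltype = eltype(c), nchunks = length(c.chunks))
```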
Then that's a completely different issue. This one is really about the …
From the perspective of a user of … The benefit of … Then, I would consider changing the definition of … to enforce this restriction.
I wanted to ask what the current state of affairs is on this issue. I've just tried out DataFrames master and I haven't been able to convert columns to regular arrays (i.e. columns get automatically converted to NullableArrays and I'm not able to convert them back). Does this mean that option 1 has been chosen, or are things still being decided/implemented?
I think most people agree that it should be possible to use any array type for columns. Somebody just needs to make the relevant changes. For now, you can use this constructor to preserve the array type if you want.
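The link to that constructor isn't preserved above; to my understanding, the lower-level columns-plus-names constructor inserted columns as-is at the time. Treat the following as an assumption about that era, not as the linked method:

```julia
using DataFrames  # same NullableArrays-era version assumed as above

v = [1, 2, 3]

# Passing the columns and names explicitly bypassed upgrade_vector!,
# so the column stays a Vector{Int64} instead of a NullableArray.
df = DataFrame(Any[v], [:a])
typeof(df[:a])    # Vector{Int64}
```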
Closing since DataFrames no longer auto-promotes column types. |
Original issue description:

Up till this point, and even since the internal overhaul to NullableArrays/CategoricalArrays, it's been a little ambiguous and "hard-coded" what a user could put in a DataFrame, i.e. how the columns of a DataFrame are represented. In some constructors, whatever the user provides is simply inserted and used, while other constructors go through the `upgrade_vector!` code path, which translates `AbstractArray` => `NullableArray`.

There have been various discussions that have hinted at some of these issues, but I wanted to formally address them and propose three potential ways forward:

- …
- An `AbstractColumn` interface, i.e. a set of methods required of any type a user wishes to use as DataFrame column storage. All other DataFrames functionality would then be based solely on these official interface methods. DataFrames would still REQUIRE NullableArrays/CategoricalArrays to provide reasonable defaults, but `upgrade_vector!` would be eliminated. Benefits: users can use any type they want, probably most often regular `Vector{T}`s. It also provides looser coupling with the NA/CA packages, allowing for a future in which potentially more efficient "column" types are developed, or ones that provide custom functionality (out-of-core, distributed columns, database-backed columns, etc.).
- …

Anyway, I've started documenting the actual methods that are relied upon in DataFrames for the "column" types we put in, for future reference:
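That method list is not reproduced here. Purely as an illustration of the kind of contract being described (not the author's actual list; `MyColumn` and everything below is hypothetical), a column type would essentially need to expose the standard vector surface that DataFrames operations lean on:

```julia
# Hypothetical sketch only — not DataFrames' actual column interface.
# MyColumn wraps a Vector to show the kind of methods a "column" type
# would be expected to support.
struct MyColumn{T} <: AbstractVector{T}
    data::Vector{T}
end

Base.size(c::MyColumn)                 = size(c.data)
Base.getindex(c::MyColumn, i::Int)     = c.data[i]
Base.setindex!(c::MyColumn, x, i::Int) = (c.data[i] = x; c)
Base.copy(c::MyColumn)                 = MyColumn(copy(c.data))
Base.push!(c::MyColumn, x)             = (push!(c.data, x); c)
Base.similar(c::MyColumn, ::Type{S}, dims::Dims) where {S} =
    MyColumn(Vector{S}(undef, dims[1]))   # 1-D only, enough for a sketch
```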