DataFrames: what's inside? #1119

Closed
quinnj opened this issue Nov 2, 2016 · 14 comments
Comments

@quinnj
Member

quinnj commented Nov 2, 2016

Up to this point, and even since the internal overhaul to NullableArrays/CategoricalArrays, it has been somewhat ambiguous and "hard-coded" what a user can put in a DataFrame, i.e. how the columns of a DataFrame are represented. Some constructors simply insert and use whatever the user provides, while others go through the upgrade_vector! code path, which converts AbstractArray => NullableArray.

There have been various discussions that have hinted at some of these issues, but I wanted to formally address them and propose three potential ways forward:

  1. Commit fully and solely to NullableArray/CategoricalArray internal storage; no other AbstractVector-type types would be allowed and errors would be thrown if attempted. Benefits: Consistent internal storage, opportunities to hand-optimize things by tightly coupling the DF/NA/CA packages.
  2. Develop an official AbstractColumn interface, i.e. a set of methods required of any type a user wishes to use as DataFrame column storage. All other DataFrames functionality would then be based solely on these official interface methods. DataFrames would still REQUIRE NullableArrays/CategoricalArrays to provide reasonable defaults, but upgrade_vector! would be eliminated. Benefits: Users can use any type they want, probably most often regular Vector{T}s. It also provides looser coupling with the NA/CA packages, allowing for a future where more efficient "column" types, or types providing custom functionality (out-of-core, distributed, database-backed columns, etc.), could be developed.
  3. There also exists a more radical third option, which would be to disallow any kind of "column" operation on DataFrames entirely. E.g. you wouldn't be able to extract or set individual columns. New "interaction" methods could be provided that rely solely on row iteration (providing strongly typed tuples) or on "apply" functions that transform DataFrame columns in type-stable ways. With the maturation of the StructuredQueries.jl and Query.jl packages, there also exist other options for DataFrame manipulation which make this option even more viable. Perhaps this would be too breaking, but I think it's worth considering.

Anyway, I've started documenting the actual methods that are relied upon in DataFrames for the "column" types we put in, for future reference:

length(_) => Int
getindex(_, x::Real) => T
getindex(_, row_inds::Union{AbstractVector{T}, AbstractVector{Nullable{T}}}) => view
setindex!(_, row_ind::Real, v::Any)
setindex!(_, row_inds::AbstractVector{T<:Real}, v::Any)
deleteat!(_, ind::Union{Integer, UnitRange{Int}})
hcat!(df::DataFrame, _)
append!(_, _)
eltype(_)
push!(_, v::T)
pop!(_)
_nonnull!(res, _)
countnull(_)
_isnullable(_)  # not sure on this one
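As a concrete illustration of option 2, here is a minimal sketch of a custom column type implementing a few of the methods listed above. This is hypothetical (the type name and behavior are illustrative, not part of any package) and is written in current Julia syntax; subtyping AbstractVector supplies sensible fallbacks for much of the rest (eltype, iteration, etc.).

```julia
# Hypothetical column type in which every element is the same constant
# value; defining size and scalar getindex is enough for AbstractVector
# fallbacks to supply length, eltype, iteration, and more.
struct ConstColumn{T} <: AbstractVector{T}
    value::T
    len::Int
end

Base.size(c::ConstColumn) = (c.len,)
Base.getindex(c::ConstColumn, i::Int) = c.value

c = ConstColumn("x", 3)
```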
@nalimilan
Member

Thanks for writing this summary. I'd be inclined to take option 2, but option 1 has the advantage of simplicity.

Regarding the column interface, _isnullable is not needed AFAIK: it's just an internal convenience method, defined as _isnullable{T}(::AbstractArray{T}) = T <: Nullable. So as long as we require columns to be AbstractArray (which I think we should do), we're fine. _nonnull! also has a reasonable fallback, so it only needs to be defined for performance. Same for countnull, which shouldn't even live in DataFrames. Finally, the hcat! methods are only there to promote columns to NullableArray.
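The point about _isnullable can be made self-contained. A sketch in current Julia syntax (Nullable no longer exists in Base, so a stand-in struct is defined here purely for illustration): the helper only inspects the element type, so any AbstractArray column works with no extra method definitions.

```julia
# Stand-in for the Base.Nullable of the Julia 0.5/0.6 era, so this sketch
# runs on current Julia (where Nullable has been removed from Base).
struct Nullable{T}
    hasvalue::Bool
    value::T
end

# The fallback described above, in current syntax: it dispatches on any
# AbstractArray and only checks whether the element type is a Nullable.
_isnullable(::AbstractArray{T}) where {T} = T <: Nullable
```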

@tshort
Contributor

tshort commented Nov 2, 2016

I like option 2. Better support of Vectors and other column types would be useful.

@ExpandingMan
Contributor

I like option 2 as well. I definitely think it is good for users to be able to make columns either NullableVector or Vector depending on what the case calls for, though obviously that presents some interface challenges. As a usage example: when I get to the point of actually doing something with my data, I usually have to do something about the nulls anyway, so at that point I'd much prefer to have all my columns behave just like ordinary Julia Vectors. I think it is crucial that dataframes are kept as "array-like" as possible, i.e. columns should be as interchangeable with Vectors as possible, and slices of dataframes should be easy to convert to and from rank-2 arrays. Julia has the nicest built-in (or add-in, for that matter) arrays of any language by a wide margin, so it would be nice if DataFrames could reflect that as much as possible.

By the way, at the moment it really isn't at all apparent how to make a column a Vector as opposed to a NullableVector. Certain constructors do it, but none of it is obvious.

@davidagold

What, specifically, are the objectives against which these approaches are being evaluated?

@quinnj
Member Author

quinnj commented Nov 2, 2016

@davidagold, the motivation here is the inconsistency between certain DataFrames constructors that allow a user to input anything they want as a "column" and other, more opinionated constructors that force conversion to NullableArray/CategoricalArray. There is also the fact that many current DataFrame operations rely on certain methods being defined on a "column", yet this implicit interface is not formally acknowledged or documented anywhere.

The goal with this issue would be more explicit documentation of what's expected for a DataFrame "column".

@davidagold

I should have been clearer. I think I meant to ask:

  • Which "user" patterns do we want to support?
  • Which "developer" patterns do we want to support?

People are voicing support for option 2, which suggests that an implicit objective is to support df[:A] = blah where blah can be any "column-like" thing. That's fine by me. I just want to make it explicit somewhere that it is our objective to support this kind of use case. Are there other use cases we're interested in?

It seems the end goals of this design decision extend beyond documentation, since otherwise we could just, well, write documentation =p I'm just trying to figure out how to weigh these different options.

@datnamer

datnamer commented Nov 2, 2016

I would like to be able to assign a distributed or out of core vector to a df column.

Edit: I would assume that would also mean minimizing use of the iteration interface, since iterating would likely be slow for non-in-memory data structures. Though, I think @shashi and @MikeInnes were working on an interface for writing storage-location-generic iterative code? Maybe this is a good motivating case.

@nalimilan
Member

I'm not sure DataFrame is the best structure to work with distributed data. Since it does not include any specific support for this kind of vector, it won't be able to optimize operations like a dedicated structure would. Or the package would need to be improved to handle this use case specifically.

@quinnj What did you have in mind when you mentioned this?

@datnamer

datnamer commented Nov 2, 2016

Right, but I was thinking it could define fallback abstract methods that a "BigDataframe" package could overload.

@nalimilan
Member

Then that's a completely different issue. This one is really about the DataFrame type. See https://github.com/davidagold/AbstractTables.jl for a generic interface.

@bkamins
Member

bkamins commented Jan 2, 2017

From the perspective of a user of DataFrames, I would expect that any AbstractVector-type column is accepted. Personally, I mostly work with simulated data, where I know I will not have any nulls, so I do not need to pay the overhead of Nullable.

The benefit of AbstractVector is that its required interface is already defined in Base Julia.
Not all AbstractVectors support all the operations that current DataFrames functionality relies on, but I would accept that and throw an error in such cases (or provide a reasonable fallback).
E.g. if someone makes 1:10 a column of a DataFrame, then setindex! will not work, but this is something the user would know and accept, in my opinion (the same trade-off is made in Base Julia, which treats UnitRange as an AbstractArray even though it does not conform 100% to the interface).
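The UnitRange trade-off described above can be checked directly in plain Julia, independent of DataFrames: reading works like any AbstractVector, but mutation throws.

```julia
# A UnitRange is a valid AbstractVector for reading, but it is immutable,
# so attempting setindex! throws an error, as described above.
r = 1:10
mutated = try
    r[1] = 99   # no setindex! support for UnitRange
    true
catch
    false
end
```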

Then, I would consider changing the definition of DataFrame to:

type DataFrame <: AbstractDataFrame
    columns::Vector{AbstractVector}
    colindex::Index
end

to enforce this restriction.
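A minimal sketch of what such a Vector{AbstractVector} field would accept (plain Julia, no DataFrames dependency): heterogeneous column types are stored side by side with no promotion.

```julia
# Heterogeneous column types stored in a single Vector{AbstractVector},
# as the proposed definition would allow: a Vector{Float64}, a UnitRange,
# and a Vector{String}, each kept as its own concrete type.
columns = AbstractVector[[1.0, 2.0, 3.0], 1:3, ["a", "b", "c"]]
```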

@piever

piever commented Jan 14, 2017

I wanted to ask what the current state of affairs is on this issue. I've just tried out DataFrames master and I haven't been able to convert columns to regular arrays (i.e. columns get automatically converted to NullableArrays and I'm not able to convert them back). Does this mean that option 1 has been chosen, or are things still being decided/implemented?
An important factor which hasn't been mentioned so far in this thread is that, while more experienced users may find NullableArrays reasonably easy to learn, I don't think the same is true for newcomers. Personally, I'd find it much easier to teach new users how to work with DataFrames backed by regular Arrays.

@nalimilan
Member

I think most people agree that it should be possible to use any array type as columns. Somebody just needs to make the relevant changes. For now, you can use this constructor to preserve the array type if you want.

@quinnj
Member Author

quinnj commented Sep 7, 2017

Closing since DataFrames no longer auto-promotes column types.

@quinnj quinnj closed this as completed Sep 7, 2017
8 participants