DataFrames: what's inside? #1119

Closed
quinnj opened this issue Nov 2, 2016 · 14 comments
Comments

@quinnj
Member

quinnj commented Nov 2, 2016

Up to this point, and even since the internal overhaul to NullableArrays/CategoricalArrays, it has been somewhat ambiguous and "hard-coded" what a user can put in a DataFrame, i.e. how the columns of a DataFrame are represented. Some constructors simply insert and use whatever the user provides, while others go through the upgrade_vector! code path, which converts AbstractArray => NullableArray.

There have been various discussions that have hinted at some of these issues, but I wanted to formally address them and propose three potential ways forward:

  1. Commit fully and solely to NullableArray/CategoricalArray internal storage; no other AbstractVector-type types would be allowed and errors would be thrown if attempted. Benefits: Consistent internal storage, opportunities to hand-optimize things by tightly coupling the DF/NA/CA packages.
  2. Develop an official AbstractColumn interface, i.e. a set of methods required of any type a user wishes to use as DataFrame column storage. All other DataFrames functionality would then be based solely on these official interface methods. DataFrames would still REQUIRE NullableArrays/CategoricalArrays to provide reasonable defaults, but upgrade_vector! would be eliminated. Benefits: Users can use any type they want, probably most often regular Vector{T}s. It also provides looser coupling with the NA/CA packages, allowing for a future where more efficient "column" types, or types providing custom functionality (out-of-core, distributed, database-backed columns, etc.), could be developed.
  3. There also exists a more radical third option, which would be to disallow any kind of "column" operation on DataFrames entirely. E.g. you wouldn't be able to extract or set individual columns. New "interaction" methods could be provided that rely solely on row iteration (providing strongly typed tuples) or on "apply" functions that transform DataFrame columns in type-stable ways. With the maturation of the StructuredQueries.jl and Query.jl packages, there also exist other options for DataFrame manipulation which make this option even more viable. Perhaps this would be too breaking, but I think it's worth considering.

Anyway, I've started documenting the actual methods that are relied upon in DataFrames for the "column" types we put in, for future reference:

length(_) => Int
getindex(_, x::Real) => T
getindex(_, row_inds::Union{AbstractVector{T}, AbstractVector{Nullable{T}}}) => view
setindex!(_, row_ind::Real, v::Any)
setindex!(_, row_inds::AbstractVector{T<:Real}, v::Any)
deleteat!(_, ind::Union{Integer, UnitRange{Int}})
hcat!(df::DataFrame, _)
append!(_, _)
eltype(_)
push!(_, v::T)
pop!(_)
_nonnull!(res, _)
countnull(_)
_isnullable(_)  # not sure on this one
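As a concrete illustration of option 2, here is a minimal sketch of a custom column type implementing a few of the methods listed above. This is hypothetical (the type name and behavior are illustrative, not part of any package) and is written in current Julia syntax; subtyping AbstractVector supplies sensible fallbacks for much of the rest (eltype, iteration, etc.).

```julia
# Hypothetical column type in which every element is the same constant
# value; defining size and scalar getindex is enough for AbstractVector
# fallbacks to supply length, eltype, iteration, and more.
struct ConstColumn{T} <: AbstractVector{T}
    value::T
    len::Int
end

Base.size(c::ConstColumn) = (c.len,)
Base.getindex(c::ConstColumn, i::Int) = c.value

c = ConstColumn("x", 3)
```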
@nalimilan
Member

Thanks for writing this summary. I'd be inclined to take option 2, but option 1 has the advantage of simplicity.

Regarding the column interface, _isnullable is not needed AFAIK: it's just an internal convenience method, defined as _isnullable{T}(::AbstractArray{T}) = T <: Nullable. So as long as we require columns to be AbstractArray (which I think we should do), we're fine. _nonnull! also has a reasonable fallback, so it only needs to be defined for performance. Same for countnull, which shouldn't even live in DataFrames. Finally, the hcat! methods are only there to promote columns to NullableArray.
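The point about _isnullable can be made self-contained. A sketch in current Julia syntax (Nullable no longer exists in Base, so a stand-in struct is defined here purely for illustration): the helper only inspects the element type, so any AbstractArray column works with no extra method definitions.

```julia
# Stand-in for the Base.Nullable of the Julia 0.5/0.6 era, so this sketch
# runs on current Julia (where Nullable has been removed from Base).
struct Nullable{T}
    hasvalue::Bool
    value::T
end

# The fallback described above, in current syntax: it dispatches on any
# AbstractArray and only checks whether the element type is a Nullable.
_isnullable(::AbstractArray{T}) where {T} = T <: Nullable
```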

@tshort
Contributor

tshort commented Nov 2, 2016

I like option 2. Better support of Vectors and other column types would be useful.

@ExpandingMan
Contributor

I like option 2 as well. I definitely think it is good for users to be able to make columns either NullableVector or Vector depending on what the case calls for, though obviously that presents some interface challenges. As a usage example: when I get to the point of actually doing something with my data, I usually have to do something about the nulls anyway, so at that point I'd much prefer to have all my columns behave just like ordinary Julia Vectors. I think it is crucial that dataframes are kept as "array-like" as possible, i.e. columns should be as interchangeable with Vectors as possible, and slices of dataframes should be easy to convert to and from rank-2 arrays. Julia has the nicest built-in (or add-in, for that matter) arrays of any language by a wide margin, so it would be nice if DataFrames could reflect that as much as possible.

By the way, at the moment it really isn't at all apparent how to make a column a Vector as opposed to a NullableVector. Certain constructors do it, but none of it is obvious.

@davidagold

What, specifically, are the objectives against which these approaches are being evaluated?

@quinnj
Member Author

quinnj commented Nov 2, 2016

@davidagold, the motivation here is the inconsistency between certain DataFrames constructors that allow a user to input anything they want as a "column" and other, more opinionated constructors that force conversion to NullableArray/CategoricalArray. There is also the fact that many current DataFrame operations rely on certain methods being defined on a "column", yet this implicit interface is not formally acknowledged or documented anywhere.

The goal with this issue would be more explicit documentation of what's expected for a DataFrame "column".

@davidagold

I should have been clearer. I think I meant to ask:

  • Which "user" patterns do we want to support?
  • Which "developer" patterns do we want to support?

People are voicing support for option 2, which suggests that an implicit objective is to support df[:A] = blah where blah can be any "column-like" thing. That's fine by me. I just want to make it explicit somewhere that it is our objective to support this kind of use case. Are there other use cases we're interested in?

It seems the end goals of this design decision extend beyond documentation, since otherwise we could just, well, write documentation =p I'm just trying to figure out how to weigh these different options.

@datnamer

datnamer commented Nov 2, 2016

I would like to be able to assign a distributed or out of core vector to a df column.

Edit: I would assume that would also mean minimizing use of the iteration interface, since iterating would likely be slow for non-in-memory data structures. Though, I think @shashi and @MikeInnes were working on an interface for writing storage-location-generic iterative code? Maybe this is a good motivating case.

@nalimilan
Member

I'm not sure DataFrame is the best structure to work with distributed data. Since it does not include any specific support for this kind of vector, it won't be able to optimize operations like a dedicated structure would. Or the package would need to be improved to handle this use case specifically.

@quinnj What did you have in mind when you mentioned this?

@datnamer

datnamer commented Nov 2, 2016

Right, but I was thinking it could define fallback abstract methods that a "BigDataframe" package could overload.

@nalimilan
Member

Then that's a completely different issue. This one is really about the DataFrame type. See https://github.com/davidagold/AbstractTables.jl for a generic interface.

@bkamins
Member

bkamins commented Jan 2, 2017

From the perspective of a user of DataFrames, I would expect that any AbstractVector-type column is accepted. Personally, I mostly work with simulated data, where I know I will not have any nulls, so I do not need to pay the overhead of Nullable.

The benefit of AbstractVector is that its required interface is already defined in Base Julia.
Not all AbstractVectors support all the operations that current DataFrames functionality relies on, but I would accept that and throw an error in such cases (or provide a reasonable fallback).
E.g. if someone makes 1:10 a column of a DataFrame, then setindex! will not work, but this is something the user would know and accept, in my opinion (the same trade-off is made in Base Julia, which treats UnitRange as an AbstractArray even though it does not conform 100% to the interface).
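The UnitRange trade-off described above can be checked directly in plain Julia, independent of DataFrames: reading works like any AbstractVector, but mutation throws.

```julia
# A UnitRange is a valid AbstractVector for reading, but it is immutable,
# so attempting setindex! throws an error, as described above.
r = 1:10
mutated = try
    r[1] = 99   # no setindex! support for UnitRange
    true
catch
    false
end
```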

Then, I would consider changing the definition of DataFrame to:

type DataFrame <: AbstractDataFrame
    columns::Vector{AbstractVector}
    colindex::Index
end

to enforce this restriction.
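A minimal sketch of what such a Vector{AbstractVector} field would accept (plain Julia, no DataFrames dependency): heterogeneous column types are stored side by side with no promotion.

```julia
# Heterogeneous column types stored in a single Vector{AbstractVector},
# as the proposed definition would allow: a Vector{Float64}, a UnitRange,
# and a Vector{String}, each kept as its own concrete type.
columns = AbstractVector[[1.0, 2.0, 3.0], 1:3, ["a", "b", "c"]]
```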

@piever

piever commented Jan 14, 2017

I wanted to ask what the current state of affairs is on this issue. I've just tried out DataFrames master and I haven't been able to convert columns to regular arrays (i.e. columns get automatically converted to NullableArrays and I'm not able to convert them back). Does this mean that option 1 has been chosen, or are things still being decided/implemented?
An important factor which hasn't been mentioned so far in this thread is that, while more experienced users may find NullableArrays reasonably easy to learn, I don't think the same is true for newcomers. Personally, I'd find it much easier to teach new users how to work with DataFrames backed by regular Arrays.

@nalimilan
Member

I think most people agree that it should be possible to use any array type as columns. Somebody just needs to make the relevant changes. For now, you can use this constructor to preserve the array type if you want.

@quinnj
Member Author

quinnj commented Sep 7, 2017

Closing since DataFrames no longer auto-promotes column types.

@quinnj quinnj closed this as completed Sep 7, 2017
8 participants