
column types depend on how dataframes are declared #1091

Closed

ExpandingMan opened this issue Oct 4, 2016 · 23 comments

Comments

@ExpandingMan
Contributor

If this is a duplicate issue I apologize; it seems likely this has been discussed before, but I couldn't find it.

If a DataFrame is declared with an array as its argument, `DataFrame(A)`, its columns are of type `Vector`.

If a DataFrame is declared with separate arrays for each column, `DataFrame(A=A, B=B)` or `df = DataFrame(); df[:A] = A; df[:B] = B`, then its columns are of type `DataArray`.

This seems like bad behavior. Shouldn't columns always be of type `DataArray` regardless of how the DataFrame was constructed?
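
For reference, a minimal sketch of the reported behavior (assuming the 0.8.x release with DataArrays; the default column names and exact output types are assumptions and may vary by version):

```julia
using DataFrames

A = [1, 2, 3]

df1 = DataFrame(hcat(A, A))    # constructed from a single matrix
df2 = DataFrame(A = A, B = A)  # constructed column-by-column

typeof(df1[:x1])  # Vector{Int64}      -- plain Array column, per the report
typeof(df2[:A])   # DataArray{Int64,1} -- promoted column, per the report
```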

@ararslan
Member

ararslan commented Oct 4, 2016

Note that DataArrays are no longer used by DataFrames. Do you still see this behavior with the current master?

@ExpandingMan
Contributor Author

ExpandingMan commented Oct 4, 2016

Sorry, I should have specified: this was with the latest release, 0.8.3. It didn't occur to me that this was something that had changed (I didn't realize DataArray is no longer used).

@ararslan
Member

ararslan commented Oct 4, 2016

No problem. Yeah, the switch away from DataArrays is a pretty major change, so we're waiting a bit to tag it as a release.

@ExpandingMan
Contributor Author

I'm curious, could you point me to the issue or thread where this change was discussed? I'm assuming this is for efficiency purposes, and I'm wondering what the exact motivations are. Thanks.

@nalimilan
Member

I've preserved this behavior when porting to NullableArray, but I think we should change it. There have been requests for that in other threads.

@ExpandingMan It's not done only for efficiency purposes, but mainly to allow you to store missing values in any column. Without this, you need to convert the column first to be able to add an NA/NULL. But that's a minor annoyance in real-world cases.
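
To illustrate the point about missing values, a hypothetical DataArrays-era session (the exact error message depends on the version):

```julia
using DataFrames, DataArrays

v = [1, 2, 3]       # a plain Vector{Int} column candidate
# v[2] = NA         # would throw: a Vector{Int} cannot hold NA

dv = DataArray(v)   # convert the column first...
dv[2] = NA          # ...and now a missing value can be stored
```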

@ExpandingMan
Contributor Author

Hmm... I'm probably not understanding this, because that seems like a really huge change to fix what seems to me like a very small problem. In most real cases all of the columns are `DataArray` anyway. It seems like it would be very simple to just ensure that all the columns of the dataframe are `DataArray`; after all, wouldn't you have to do the same thing with `NullableArray`? I must be missing something...

@nalimilan
Member

In most real cases all of the columns are DataArray anyway.

Not everyone agrees with that assessment. If your data doesn't contain any missing values (not my case, unfortunately), it's pointless to create a `NullableArray` with all its overhead/complexity. Anyway, in the real world, you read data from external sources which will use the appropriate type depending on whether there are missing values or not. Then you'll apply transformations to this data, and the type will be propagated as needed.

@johnmyleswhite
Contributor

There are two distinct issues:

  • DataFrames is moving towards NullableArrays. That has nothing to do with this issue.
  • Automatic conversion of Arrays to something else (either DataArray or NullableArray) was always controversial. I am close to the only person who thinks that we should prevent users from inserting an Array into a DataFrame. For me, consistency arguments make clear that we should never have Arrays as columns. For others, efficiency matters much more and consistency is not that important.

@ExpandingMan
Contributor Author

ExpandingMan commented Oct 4, 2016

It definitely seems to me that it would be rather difficult to accommodate dataframes that store columns as one of several different types in the general case. Now every program that uses dataframes needs different functions for handling dataframes depending on the types of the columns. This wouldn't be especially clean, because the supertype of `NullableArray` isn't just `AbstractArray`; it is `AbstractArray{Nullable{T}, N}` (on second thought, I guess the possible problems have more to do with what happens when individual elements are accessed). That seems like a big obstacle for the usefulness of DataFrames in external programs. Besides, isn't part of the goal of NullableArrays (according to their readme) to ultimately help alleviate the "black box" optimization efficiency issues?
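
The supertype relationship in question can be checked directly (NullableArrays.jl as of late 2016; illustrative):

```julia
using NullableArrays

X = NullableArray([1, 2, 3])
supertype(typeof(X))  # AbstractArray{Nullable{Int64},1}, not AbstractArray{Int64,1}
X[1]                  # Nullable(1): element access wraps values in Nullable
```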

@nalimilan
Member

Besides, isn't part of the goal of NullableArrays (according to their readme) to ultimately help alleviate the "black box" optimization efficiency issues?

It alleviates the unnecessary efficiency issues, but there will always be an overhead to allowing for missing values.

@davidagold

davidagold commented Oct 4, 2016

Besides, isn't part of the goal of NullableArrays (according to their readme) to ultimately help alleviate the "black box" optimization efficiency issues?

This has to do with type-inferability of getindex(X::NullableArray, i). It has nothing to do with what lives inside a DataFrame.

EDIT: Well, nothing directly, anyway.

@ExpandingMan
Contributor Author

ExpandingMan commented Oct 10, 2016

So do you guys have any timeline in mind for the release of the NullableArrays version? I've started re-writing some code to work with master and I was wondering how far out the official release might be.

(And sorry, the issue thread probably isn't the appropriate place to be asking this sort of question.)

@nalimilan
Member

See #1092 and linked discussions.

@ExpandingMan
Contributor Author

ExpandingMan commented Oct 10, 2016

Has there been any thought to adding a new subtype of `AbstractDataFrame` that doesn't allow null values at all (and has columns which have to be of type `Vector`)? I envision having `CompleteDataFrame <: AbstractDataFrame` for `Vector` columns and `DataFrame <: AbstractDataFrame` for `NullableArray` columns, with the appropriate conversions. Maybe also a `MixedDataFrame` if there is any benefit.

This way data could be stored in DataFrames, get its values filled in or dropped or whatever, then be converted to a `CompleteDataFrame` for other tasks (e.g. machine learning). This would also remove the column-type ambiguity, so it could be checked much more simply.
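
A hypothetical sketch of what such a hierarchy might look like (type names follow the comment above but are purely illustrative and would clash with the real DataFrames types; Julia 0.5-era syntax):

```julia
abstract AbstractDataFrame

# Columns guaranteed free of nulls: plain Vectors only.
type CompleteDataFrame <: AbstractDataFrame
    columns::Vector{Any}   # each entry a Vector{T}
    names::Vector{Symbol}
end

# Columns that may contain nulls.
type NullableDataFrame <: AbstractDataFrame
    columns::Vector{Any}   # each entry a NullableVector{T}
    names::Vector{Symbol}
end

# Hypothetical conversion, valid once nulls have been filled in or dropped
# (converting a column that still contains nulls would fail).
function CompleteDataFrame(df::NullableDataFrame)
    cols = Any[convert(Vector, col) for col in df.columns]
    CompleteDataFrame(cols, df.names)
end
```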

@nalimilan
Member

For now we have enough work to do to get DataFrames working well in the new Nullable framework without adding a separate type. But feel free to experiment with it if you find it useful. Though FWIW, it won't fix any type stability issues, as the element type of the vectors can still differ from one column to another (otherwise, just use a matrix or a typed vector of vectors).

@ararslan
Member

I think @ExpandingMan is onto something, though I'm wondering if in practice it could get messy. Note that SQL (or at least some variants?) has a concept of NOT NULL for columns that do not allow null values. An analog in this scenario would be the `MixedDataFrame`, as you called it.

@nalimilan
Member

The specific problem here is that Feather doesn't support non-nullable columns. So there's no way of storing that type information which exists in Julia (and e.g. in SQL). It's just like what happens when saving to CSV and reloading data (though less severe).

@ExpandingMan
Contributor Author

Though FWIW it won't fix any type stability issues, as the element type of the vectors can still be different from one column to another (else, just use a matrix or a typed vector of vectors).

Yes, but with the interface in its current state it actually does make a difference: if you get values from a `NullableArray` you need to check for `Nullable()` and use `get`, whereas with `Array` you don't, even if this is not a "type stability" issue per se. I suppose with enough really clever design of the `AbstractDataFrame` interface most of the current issues could be alleviated, but it might be tough. Finally, I expect that in some special cases having only `Array` would be more performant, not necessarily in the DataFrames code itself but in the code that uses it (i.e. no checking for `Nullable()`). Or at least it would be for users who may not know exactly how to use `NullableArray` in the most efficient way...
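
A small illustration of the access difference being described (Julia 0.5-era Nullable API):

```julia
using NullableArrays

v = [1, 2, 3]
x = v[1] + 1                    # plain Array: values are used directly

nv = NullableArray([1, 2, 3])
y = nv[1]                       # Nullable{Int64}: must be checked and unwrapped
z = isnull(y) ? 0 : get(y) + 1
```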

@davidagold

I'm not opposed to some sort of CompleteTable type, but I think it is a bit of a red herring. I suspect we will reap greater long term dividends from instead making sure that all DataFrame interfaces are generic over whatever comes out of getindex(df::DataFrame, attribute::Symbol), be it an Array or NullableArray. With sufficient engineering, I think most null-checking can be optimized away for non-Nullable columns, since that information is reflected in the eltype.
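
One way dispatch on the eltype could make such interfaces generic, as a rough sketch (the function name is hypothetical; Julia 0.5 parametric-method syntax):

```julia
# For plain columns, no null handling is needed.
function colsum{T<:Number}(col::AbstractVector{T})
    s = zero(T)
    for x in col
        s += x
    end
    s
end

# For Nullable columns the check is explicit here, but because the eltype
# carries the nullability information, a sufficiently smart implementation
# could optimize it away for columns known to contain no nulls.
function colsum{T<:Number}(col::AbstractVector{Nullable{T}})
    s = zero(T)
    for x in col
        isnull(x) || (s += get(x))
    end
    s
end
```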

@ExpandingMan
Contributor Author

I suspect we will reap greater long term dividends from instead making sure that all DataFrame interfaces are generic

That would certainly be the best possible outcome. I think it would be imperative to ensure that it be very rare for code to have to check the types of the columns, and in order to do this, it might require making some changes to the interface that may not be popular.

For example, for `map` to work without having to check the column type, the default behavior would have to be what is currently set by the option `lift=true`, because otherwise there is no guarantee that it would behave similarly to a regular `Array`. Presumably there would be a lot of this sort of change; I suspect this "hidden" handling of nulls may be unpopular, but I'm not opposed to it.
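
For context, the lifting behavior referred to above looks roughly like this (NullableArrays.jl circa 2016; the exact keyword placement and signature are assumptions and may differ by version):

```julia
using NullableArrays

X = NullableArray([1, 2, 3], [false, true, false])  # second element is null

# With lifting, f is applied only to non-null entries, so the call behaves
# much like map over a plain Array:
map(x -> x + 1, X; lift = true)
```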

@nalimilan
Member

See also previous discussion at #1008 (comment) and following comments.

@nalimilan
Member

And new discussion at #1119.

@quinnj
Member

quinnj commented Sep 7, 2017

Closing, as DataFrames (current master) doesn't auto-promote columns anymore: whatever you put in stays as-is and is what you'll get back out.
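
In other words, a sketch of the resolved behavior (on master at the time):

```julia
using DataFrames

df = DataFrame(A = [1, 2, 3])
typeof(df[:A])  # Vector{Int64}: the column keeps exactly the type you put in
```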

@quinnj quinnj closed this as completed Sep 7, 2017