
column types depend on how dataframes are declared #1091

Closed

ExpandingMan opened this issue Oct 4, 2016 · 23 comments

Comments

@ExpandingMan
Contributor

If this is a duplicate issue I apologize; it seems likely this has been discussed before, but I couldn't find it.

If a DataFrame is declared with an array as its argument, `DataFrame(A)`, its columns are of type `Vector`.

If a DataFrame is declared with separate arrays for each column, `DataFrame(A=A, B=B)` or `df = DataFrame(); df[:A] = A; df[:B] = B`, then its columns are of type `DataArray`.

This seems like bad behavior. Shouldn't columns always be of type `DataArray` regardless of how the DataFrame was constructed?
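
For reference, a minimal sketch of the reported behavior (assuming the 0.8.x release with DataArrays; the default column names and exact output types are assumptions and may vary by version):

```julia
using DataFrames

A = [1, 2, 3]

df1 = DataFrame(hcat(A, A))    # constructed from a single matrix
df2 = DataFrame(A = A, B = A)  # constructed column-by-column

typeof(df1[:x1])  # Vector{Int64}      -- plain Array column, per the report
typeof(df2[:A])   # DataArray{Int64,1} -- promoted column, per the report
```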

@ararslan
Member

ararslan commented Oct 4, 2016

Note that DataArrays are no longer used by DataFrames. Do you still see this behavior with the current master?

@ExpandingMan
Contributor Author

ExpandingMan commented Oct 4, 2016

Sorry, I should have specified: this was with the latest release, 0.8.3. It didn't occur to me that this was something that had changed (I didn't realize DataArray is no longer used).

@ararslan
Member

ararslan commented Oct 4, 2016

No problem. Yeah, the switch away from DataArrays is a pretty major change, so we're waiting a bit to tag it as a release.

@ExpandingMan
Contributor Author

I'm curious, could you point me to the issue or thread where this change was discussed? I'm assuming this is for efficiency purposes, and I'm wondering what the exact motivations are. Thanks.

@nalimilan
Member

I've preserved this behavior when porting to NullableArray, but I think we should change it. There have been requests for that in other threads.

@ExpandingMan It's not done only for efficiency purposes, but mainly to allow you to store missing values in any column. Without this, you need to convert the column first to be able to add an NA/NULL. But that's a minor annoyance in real-world cases.
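
To illustrate the point about missing values, a hypothetical DataArrays-era session (the exact error message depends on the version):

```julia
using DataFrames, DataArrays

v = [1, 2, 3]       # a plain Vector{Int} column candidate
# v[2] = NA         # would throw: a Vector{Int} cannot hold NA

dv = DataArray(v)   # convert the column first...
dv[2] = NA          # ...and now a missing value can be stored
```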

@ExpandingMan
Contributor Author

Hmm... I'm probably not understanding this, because that seems like a really huge change to fix what seems to me like a very small problem. In most real cases all of the columns are `DataArray` anyway. It seems like it would be very simple to just ensure that all the columns of the dataframe are `DataArray`; after all, wouldn't you have to do the same thing with `NullableArray`? I must be missing something...

@nalimilan
Member

In most real cases all of the columns are DataArray anyway.

Not everyone agrees with that assessment. If your data doesn't contain any missing values (not my case, unfortunately), it's pointless to create a `NullableArray` with all its overhead/complexity. Anyway, in the real world, you read data from external sources which will use the appropriate type depending on whether there are missing values or not. Then you'll apply transformations to this data, and the type will be propagated as needed.

@johnmyleswhite
Contributor

There are two distinct issues:

  • DataFrames is moving towards NullableArrays. That has nothing to do with this issue.
  • Automatic conversion of Arrays to something else (either DataArray or NullableArray) was always controversial. I am close to the only person who thinks that we should prevent users from inserting an Array into a DataFrame. For me, consistency arguments make clear that we should never have Arrays as columns. For others, efficiency matters much more and consistency is not that important.

@ExpandingMan
Contributor Author

ExpandingMan commented Oct 4, 2016

It definitely seems to me that it would be rather difficult to accommodate dataframes that store columns as one of several different types in the general case. Now every program that uses dataframes needs different functions for handling dataframes depending on the types of the columns. This wouldn't be especially clean, because the supertype of `NullableArray` isn't just `AbstractArray`; it is `AbstractArray{Nullable{T}, N}` (on second thought, I guess the possible problems have more to do with what happens when individual elements are accessed). That seems like a big obstacle for the usefulness of DataFrames in external programs. Besides, isn't part of the goal of NullableArrays (according to their readme) to ultimately help alleviate the "black box" optimization efficiency issues?
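
The supertype relationship in question can be checked directly (NullableArrays.jl as of late 2016; illustrative):

```julia
using NullableArrays

X = NullableArray([1, 2, 3])
supertype(typeof(X))  # AbstractArray{Nullable{Int64},1}, not AbstractArray{Int64,1}
X[1]                  # Nullable(1): element access wraps values in Nullable
```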

@nalimilan
Member

Besides, isn't part of the goal of NullableArrays (according to their readme) to ultimately help alleviate the "black box" optimization efficiency issues?

It alleviates the unnecessary efficiency issues, but there will always be an overhead to allowing for missing values.

@davidagold

davidagold commented Oct 4, 2016

Besides, isn't part of the goal of NullableArrays (according to their readme) to ultimately help alleviate the "black box" optimization efficiency issues?

This has to do with type-inferability of getindex(X::NullableArray, i). It has nothing to do with what lives inside a DataFrame.

EDIT: Well, nothing directly, anyway.

@ExpandingMan
Contributor Author

ExpandingMan commented Oct 10, 2016

So do you guys have any timeline in mind for the release of the NullableArrays version? I've started re-writing some code to work with master and I was wondering how far out the official release might be.

(And sorry, the issue thread probably isn't the appropriate place to be asking this sort of question.)

@nalimilan
Member

See #1092 and linked discussions.

@ExpandingMan
Contributor Author

ExpandingMan commented Oct 10, 2016

Has there been any thought to adding a new subtype of `AbstractDataFrame` that doesn't allow null values at all (and has columns which have to be of type `Vector`)? I envision having `CompleteDataFrame <: AbstractDataFrame` for `Vector` columns and `DataFrame <: AbstractDataFrame` for `NullableArray` columns, with the appropriate conversions. Maybe also a `MixedDataFrame` if there is any benefit.

This way data could be stored in DataFrames, get its values filled in or dropped or whatever, then be converted to a `CompleteDataFrame` for other tasks (e.g. machine learning). This would also remove the column-type ambiguity, so it could be checked much more simply.
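
A hypothetical sketch of what such a hierarchy might look like (type names follow the comment above but are purely illustrative and would clash with the real DataFrames types; Julia 0.5-era syntax):

```julia
abstract AbstractDataFrame

# Columns guaranteed free of nulls: plain Vectors only.
type CompleteDataFrame <: AbstractDataFrame
    columns::Vector{Any}   # each entry a Vector{T}
    names::Vector{Symbol}
end

# Columns that may contain nulls.
type NullableDataFrame <: AbstractDataFrame
    columns::Vector{Any}   # each entry a NullableVector{T}
    names::Vector{Symbol}
end

# Hypothetical conversion, valid once nulls have been filled in or dropped
# (converting a column that still contains nulls would fail).
function CompleteDataFrame(df::NullableDataFrame)
    cols = Any[convert(Vector, col) for col in df.columns]
    CompleteDataFrame(cols, df.names)
end
```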

@nalimilan
Member

For now we have enough work to do to get DataFrames working well in the new Nullable framework without adding a separate type. But feel free to experiment with it if you find it useful. Though FWIW, it won't fix any type stability issues, as the element type of the vectors can still differ from one column to another (otherwise, just use a matrix or a typed vector of vectors).

@ararslan
Member

I think @ExpandingMan is onto something, though I'm wondering if in practice it could get messy. Note that SQL (or at least some variants?) has a concept of NOT NULL for columns that do not allow null values. An analog in this scenario would be the `MixedDataFrame`, as you called it.

@nalimilan
Member

The specific problem here is that Feather doesn't support non-nullable columns. So there's no way of storing that type information which exists in Julia (and e.g. in SQL). It's just like what happens when saving to CSV and reloading data (though less severe).

@ExpandingMan
Contributor Author

Though FWIW it won't fix any type stability issues, as the element type of the vectors can still be different from one column to another (else, just use a matrix or a typed vector of vectors).

Yes, but with the interface in its current state it actually does make a difference: if you get values from a `NullableArray` you need to check for `Nullable()` and use `get`, whereas with `Array` you don't, even if this is not a "type stability" issue per se. I suppose with enough really clever design of the `AbstractDataFrame` interface most of the current issues could be alleviated, but it might be tough. Finally, I expect that in some special cases having only `Array` would be more performant, not necessarily in the DataFrames code itself but in the code that uses it (i.e. no checking for `Nullable()`). Or at least it would be for users who may not know exactly how to use `NullableArray` in the most efficient way...
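
A small illustration of the access difference being described (Julia 0.5-era Nullable API):

```julia
using NullableArrays

v = [1, 2, 3]
x = v[1] + 1                    # plain Array: values are used directly

nv = NullableArray([1, 2, 3])
y = nv[1]                       # Nullable{Int64}: must be checked and unwrapped
z = isnull(y) ? 0 : get(y) + 1
```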

@davidagold

I'm not opposed to some sort of CompleteTable type, but I think it is a bit of a red herring. I suspect we will reap greater long term dividends from instead making sure that all DataFrame interfaces are generic over whatever comes out of getindex(df::DataFrame, attribute::Symbol), be it an Array or NullableArray. With sufficient engineering, I think most null-checking can be optimized away for non-Nullable columns, since that information is reflected in the eltype.
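
One way dispatch on the eltype could make such interfaces generic, as a rough sketch (the function name is hypothetical; Julia 0.5 parametric-method syntax):

```julia
# For plain columns, no null handling is needed.
function colsum{T<:Number}(col::AbstractVector{T})
    s = zero(T)
    for x in col
        s += x
    end
    s
end

# For Nullable columns the check is explicit here, but because the eltype
# carries the nullability information, a sufficiently smart implementation
# could optimize it away for columns known to contain no nulls.
function colsum{T<:Number}(col::AbstractVector{Nullable{T}})
    s = zero(T)
    for x in col
        isnull(x) || (s += get(x))
    end
    s
end
```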

@ExpandingMan
Contributor Author

I suspect we will reap greater long term dividends from instead making sure that all DataFrame interfaces are generic

That would certainly be the best possible outcome. I think it would be imperative to ensure that it be very rare for code to have to check the types of the columns, and in order to do this, it might require making some changes to the interface that may not be popular.

For example, for `map` to work without having to check the column type, the default behavior would have to be what is currently set by the option `lift=true`, because otherwise there is no guarantee that it would behave similarly to a regular `Array`. Presumably there would be a lot of this sort of change; I suspect this "hidden" handling of nulls may be unpopular, but I'm not opposed to it.
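
For context, the lifting behavior referred to above looks roughly like this (NullableArrays.jl circa 2016; the exact keyword placement and signature are assumptions and may differ by version):

```julia
using NullableArrays

X = NullableArray([1, 2, 3], [false, true, false])  # second element is null

# With lifting, f is applied only to non-null entries, so the call behaves
# much like map over a plain Array:
map(x -> x + 1, X; lift = true)
```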

@nalimilan
Member

See also previous discussion at #1008 (comment) and following comments.

@nalimilan
Member

And new discussion at #1119.

@quinnj
Member

quinnj commented Sep 7, 2017

Closing, as DataFrames (current master) doesn't auto-promote columns anymore: whatever you put in stays as-is and is what you'll get back out.
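
In other words, a sketch of the resolved behavior (on master at the time):

```julia
using DataFrames

df = DataFrame(A = [1, 2, 3])
typeof(df[:A])  # Vector{Int64}: the column keeps exactly the type you put in
```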

@quinnj quinnj closed this as completed Sep 7, 2017