Working with Nullable DataFrames #1148

madeleineudell · 2017-01-16T04:51:21Z

I'm maintaining a package that depends on DataFrames, and I'm a bit at a loss as to how to maintain the package at this point with the addition of the Nullable type.

My code uses plenty of arithmetic, indexing, type checking, etc. Now none of this works, because most functions on Nullables don't propagate to the underlying value even if that instance of the Nullable is not null. So addition doesn't work; subtraction doesn't work; type checking for integers or reals doesn't work; etc.

I could rewrite all these functions but that's a huge amount of work, and it's not clear to me from reading these issues whether the API is stable yet anyway. And I certainly won't be able to maintain backward compatibility, which saddens me.

What's your recommendation for package maintainers who depend on DataFrames?

nalimilan · 2017-01-16T08:59:16Z

See also #1092. It really depends on what the package does. Where possible, the plan is to use high-level APIs provided by StructuredQueries (not ready yet). Then, packages working on numeric data should just convert NullableArray to Array, or similarly call get on Nullable, since they cannot accept missing values anyway. Other cases will need special handling but it's hard to say anything without more details.
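As a rough illustration of the "convert to `Array`, or call `get`" advice above, here is a minimal sketch of a strict conversion that refuses missing values. `Maybe`, `unwrap`, and `to_plain` are hypothetical names introduced only for this example; `Maybe` is a stand-in for `Nullable` so that the snippet runs on current Julia versions, and it is not the Base or NullableArrays API.

```julia
# Hypothetical stand-in for Nullable{T}; not the Base or NullableArrays type.
struct Maybe{T}
    hasvalue::Bool
    value::T
    Maybe{T}() where {T} = new{T}(false)      # null: `value` is left unset
    Maybe{T}(v) where {T} = new{T}(true, v)   # wraps (and converts) `v`
end

# Analogue of calling `get`: return the wrapped value, or throw on null.
function unwrap(x::Maybe)
    x.hasvalue || throw(ArgumentError("cannot unwrap a null value"))
    return x.value
end

# Convert a vector of wrappers into a plain Vector{T}.
# Throws if any element is null, since the numeric code downstream
# is assumed not to accept missing values.
to_plain(xs::AbstractVector{Maybe{T}}) where {T} = T[unwrap(x) for x in xs]

col = [Maybe{Float64}(1.0), Maybe{Float64}(2.5)]
plain = to_plain(col)   # a Vector{Float64} that ordinary numeric code accepts
```

The conversion happens once at the package boundary, so the numeric code behind it does not need to know about the wrapper type at all.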

datnamer · 2017-01-19T16:09:21Z

Doesn't @DavidGold have an autolifting scheme in the works?

nalimilan · 2017-01-19T16:32:17Z

Automatic lifting will only be enabled inside query macros. OTOH element-wise operators are now lifting in Julia 0.6.

datnamer · 2017-01-19T17:36:05Z

What about broadcasting of elwise?

nalimilan · 2017-01-19T18:48:17Z

What do you mean?

nalimilan · 2017-01-25T08:57:01Z

@madeleineudell Can you tell us more about the issues you experience?

madeleineudell · 2017-01-25T19:43:02Z

Sure; when I go to index elements of the DataFrame `x = df[3,4]`, the element x is of type `Nullable{whatever}`. In the previous incarnation of DataFrames, x was of type `whatever`, and so methods designed for `whatever`s worked on x. For example, I have a bunch of code in which `whatever` is an Int or a Float64 or a Bool, and so I merrily take x and multiply, divide, exponentiate, etc. Now, I'd need to have my code use `x.value` rather than `x`.

This wouldn't be a problem (other than a coding pain, because I can't just use a simple find-replace), except that my code is also designed to work with Matrices as well as with DataFrames. So that means that everywhere in my code I'd need to sprinkle `if isa(x, Nullable) ...`, or I'd need to define a new method for every function (`+`, `-`, `exp`, ...) whose input might be nullable, etc. In other words, it's pretty annoying if DataFrames and Matrices are no longer interoperable.

If you overloaded all the functions on integers so that `+(x::Nullable{Int}, y::Int) = +(x.value, y)`, for every operation, and ditto for other argument types (Float64, Bool, etc.), then that would go some way towards fixing this. But I don't think every package maintainer who uses DataFrames should have to write all those macros. (And in my case, it would take me quite some time to figure out how to do so.)
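To make the overloading idea in this comment concrete, here is a minimal sketch of "lifting" a few arithmetic operators over a wrapper type. `Maybe`, `isnull`, and `lift` are hypothetical names defined in the snippet itself; `Maybe` only imitates `Nullable` so that the code runs on current Julia versions. The sketch assumes that a null operand should produce a null result, and it covers only `+`, `-`, and `*`, where the promoted element type can hold the result.

```julia
# Hypothetical stand-in for Nullable{T}; not the Base or NullableArrays type.
struct Maybe{T}
    hasvalue::Bool
    value::T
    Maybe{T}() where {T} = new{T}(false)      # null: `value` is left unset
    Maybe{T}(v) where {T} = new{T}(true, v)   # wraps (and converts) `v`
end
Maybe(v::T) where {T} = Maybe{T}(v)

isnull(x::Maybe) = !x.hasvalue

# Apply `f` to the wrapped values; any null argument yields a null result.
function lift(f, x::Maybe{T}, y::Maybe{S}) where {T,S}
    R = promote_type(T, S)
    (isnull(x) || isnull(y)) && return Maybe{R}()
    return Maybe{R}(f(x.value, y.value))
end

# Generate the wrapped/wrapped and mixed wrapped/plain methods in one loop,
# instead of writing each combination by hand.
for op in (:+, :-, :*)
    @eval begin
        Base.$op(x::Maybe, y::Maybe) = lift($op, x, y)
        Base.$op(x::Maybe, y::Number) = lift($op, x, Maybe(y))
        Base.$op(x::Number, y::Maybe) = lift($op, Maybe(x), y)
    end
end

a = Maybe(3) + 4        # wrapped result holding 7
b = Maybe{Int}() * 2    # stays null
c = 2.5 - Maybe(1)      # element type is promoted to Float64
```

Results still come back wrapped, so code shared between matrices and data frames would need to unwrap at some boundary; the loop only removes the need to write each method by hand.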


nalimilan · 2017-01-25T21:48:17Z

Honestly, I don't think data frames were ever considered as equivalent to matrices (@johnmyleswhite may want to comment); they are more like databases. That said, NullableArray currently defines standard operators on Nullable, and Julia 0.6 supports element-wise versions for lifting, even when mixing nullable and non-nullable arguments, e.g. `Nullable(1) .+ 1 -> Nullable(2)`. Would that suit your needs?

nalimilan · 2017-01-25T21:49:04Z

By the way, what package are we talking about? That would certainly help me to understand your requirements.

madeleineudell · 2017-01-25T21:55:13Z

I'm talking about LowRankModels in particular. We do PCA, sparse PCA, nonnegative matrix factorization, one-bit PCA etc on both fully observed matrices and partially observed tables of data. I think that element-wise (automatic or accessible by using a single additional package) lifting would indeed suit my needs, but I'd need to check...!


nalimilan · 2017-01-25T22:01:17Z

OK. I would have thought you wouldn't actually have to work with data frames: using StatsModels to transform a dataframe+formula to a matrix, and then work only with standard matrices. Is that possible?

madeleineudell · 2017-01-26T14:59:33Z

I think it should be possible to use StatsModels, but it requires a complete rethinking of the way modeling works in LowRankModels.


kleinschmidt · 2017-01-26T15:22:51Z

It might be as simple as adding separate constructors for a DataFrame that convert it to a matrix via `ModelMatrix(ModelFrame(d, f))` (with an appropriately constructed formula).

Alternatively, is it possible to `convert(Matrix, d)`?

kleinschmidt · 2017-01-26T15:25:16Z

Or, use the StatisticalModel type from StatsBase (which will be moved to StatsModels soon, once someone finds the time), and then you get methods for dataframes + formulas for free (via the DataFrameStatisticalModel type, https://github.com/JuliaStats/StatsModels.jl/blob/master/src/statsmodel.jl). The only pain there is constructing an appropriate formula.

kleinschmidt · 2017-01-26T16:11:48Z

Sorry, I see now that you're doing something more interesting than just converting the dataframe with numeric columns to a matrix. I still think that it makes sense to use ModelMatrix. In fact, the goal of StatsModels.jl is to provide tools for converting tables with a mix of categorical/numerical/etc. data into matrices for modeling, so yours is exactly the kind of use case we're aiming for.

datnamer · 2017-01-26T17:27:14Z

What is the cost of converting a very large dataframe or database? Won't this be prohibitive sometimes vs operating in place?

kleinschmidt · 2017-01-26T17:37:28Z

Sure, but as far as I can tell LowRankModels already assumes things will fit in memory (but I could be wrong). One of the goals is to generalize the current implementation to work with a more general interface that doesn't assume things are in-memory but can work with, say, chunks of a table or an iterator of row tuples.

ExpandingMan · 2017-01-31T17:03:12Z

I just wanted to throw my comments in here as someone else who has been using the current master and NullableArrays extensively, but hasn't written a single line of code for DataFrames.jl itself.

From my experience so far, DataFrames with the NullableArrays back-end needs 3 major quality-of-life improvements before developers of other packages can be reasonably expected to use it:

1. Easy querying. This means operations like masking `df[df[:A] .> 0.0, :]` have to be made to work easily again. I've set up some of my own macros for doing things like `@constrain(df, :A > 0.0)`. I'm a big fan of DataFramesMeta.jl, but unfortunately there's no branch for NullableArrays yet, and it isn't really being maintained. Work on something like DataFramesMeta would allow for easy querying in the presence of NullableArray.
2. Allowing Vector column types (as opposed to just NullableVector) and relatively easy conversions between them. Usually at the end of the day missing values have to be dealt with one way or the other anyway. There should be an easy way of filling the missing values in individual columns and converting the columns in the dataframe to regular Vectors. Mixed column types should work whenever it is reasonable for them to do so.
3. A decision needs to be made on whether Nullable is the appropriate type. If it isn't, there needs to be a new equivalent. In most cases we expect our missing values to behave like NaN and be propagated. That's not really what Nullable was designed for. This has been discussed extensively, but a decision needs to be made.
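The "fill the missing values and convert to a regular Vector" step in the second point could look roughly like the sketch below. It uses `nothing` and Base's `something` (available in Julia 0.7 and later) purely as a stand-in for a missing-value representation; this is an assumption made for illustration and not the Nullable-based API discussed in this thread. `fill_missing` is a hypothetical helper name.

```julia
# A column in which `nothing` marks a missing entry (stand-in representation).
col = Union{Float64,Nothing}[1.0, nothing, 3.0]

# Replace every missing entry with `default` and return a plain Vector{T},
# where T is the type of `default`.
fill_missing(xs::AbstractVector, default::T) where {T} =
    T[something(x, default) for x in xs]

filled = fill_missing(col, 0.0)   # Vector{Float64} with the gap filled by 0.0
```

Which default is appropriate (zero, a column mean, NaN) depends on the analysis, so it is left as an explicit argument here.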

nalimilan · 2017-01-31T17:17:21Z

Please, let's not turn this issue into another discussion of the general roadmap. These have already happened and are happening in other places.

> Easy querying. This means operations like masking `df[df[:A] .> 0.0, :]` have to be made to work easily again. I've set up some of my own macros for doing things like `@constrain(df, :A > 0.0)`. I'm a big fan of DataFramesMeta.jl, but unfortunately there's no branch for NullableArrays yet, and it isn't really being maintained. Work on something like DataFramesMeta would allow for easy querying in the presence of NullableArray.

See https://discourse.julialang.org/t/announcement-dataframes-0-9-0-planned-for-february/266. Though we obviously won't respect the schedule, which may imply changes in strategy, cf. #1154.

> Allowing Vector column types (as opposed to just NullableVector) and relatively easy conversions between them. Usually at the end of the day missing values have to be dealt with one way or the other anyway. There should be an easy way of filling the missing values in individual columns and converting the columns in the dataframe to regular Vectors. Mixed column types should work whenever it is reasonable for them to do so.

#1119

> A decision needs to be made on whether Nullable is the appropriate type. If it isn't, there needs to be a new equivalent. In most cases we expect our missing values to behave like NaN and be propagated. That's not really what Nullable was designed for. This has been discussed extensively, but a decision needs to be made.

JuliaLang/Juleps#21

ararslan · 2017-03-13T20:22:54Z

@madeleineudell You may continue to use DataFrames with your package, as the Nullable-based code has been separated into the DataTables package.

madeleineudell · 2017-03-13T21:20:29Z

Great!

ararslan · 2017-03-14T21:18:26Z

I'm going to close this issue now, as any further concerns regarding Nullables in this context should be posted to the DataTables repo. Thank you for the initial report--I think it was instrumental in our decision to separate the projects.

ararslan mentioned this issue on Jan 26, 2017 in "Proposal to separate Nullable backend into DataFrames2" #1154 (closed).

ararslan closed this as completed Mar 14, 2017
