Working with Nullable DataFrames #1148

Closed
madeleineudell opened this issue Jan 16, 2017 · 22 comments

Comments

@madeleineudell

I'm maintaining a package that depends on DataFrames, and I'm a bit at a loss as to how to maintain the package at this point with the addition of the Nullable type.

My code uses plenty of arithmetic, indexing, type checking, etc etc. Now none of this works because most functions on Nullables don't propagate to the underlying value even if that instance of the nullable is not null. So addition doesn't work; subtraction doesn't work; type checking for integers or reals doesn't work; etc.

I could rewrite all these functions but that's a huge amount of work, and it's not clear to me from reading these issues whether the API is stable yet anyway. And I certainly won't be able to maintain backward compatibility, which saddens me.

What's your recommendation for package maintainers who depend on DataFrames?

@nalimilan
Member

See also #1092. It really depends on what the package does. Where possible, the plan is to use high-level APIs provided by StructuredQueries (not ready yet). Then, packages working on numeric data should just convert NullableArray to Array, or similarly call get on Nullable, since they cannot accept missing values anyway. Other cases will need special handling but it's hard to say anything without more details.

@datnamer

Doesn't @DavidGold have an autolifting scheme in the works?

@nalimilan
Member

Automatic lifting will only be enabled inside query macros. OTOH element-wise operators are now lifting in Julia 0.6.

@datnamer

What about broadcasting of elwise?

@nalimilan
Member

What do you mean?

@nalimilan
Member

@madeleineudell Can you tell us more about the issues you experience?

@madeleineudell
Author

madeleineudell commented Jan 25, 2017 via email

@nalimilan
Member

Honestly, I don't think data frames were ever considered as equivalent to matrices (@johnmyleswhite may want to comment); they are more like databases. That said, NullableArrays currently defines the standard operators on Nullable, and Julia 0.6 supports element-wise versions that lift, even when mixing nullable and non-nullable arguments, e.g. Nullable(1) .+ 1 -> Nullable(2). Would that suit your needs?
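
To make the lifting behaviour mentioned above concrete, here is a minimal sketch in current Julia. It does not use the Julia 0.6 Nullable type. Instead it assumes Base's Some/nothing as a stand-in for a present/absent value, and lift is a hypothetical helper name that does not come from any package discussed in this thread.

```julia
# Stand-in representation (an assumption for illustration only):
#   Some(v)  -> a present value v
#   nothing  -> an absent (null) value

# Hypothetical helper: apply `f` to the unwrapped arguments,
# unless any argument is absent, in which case the result is absent.
function lift(f, args...)
    any(a -> a === nothing, args) && return nothing
    # `something` unwraps Some(v) to v and returns plain values unchanged
    return Some(f(map(something, args)...))
end

a = lift(+, Some(1), 1)        # mixes a wrapped and a plain argument
b = lift(+, Some(1), nothing)  # one absent argument makes the result absent

println(something(a))   # prints 2
println(b === nothing)  # prints true
```

The point of the sketch is only the propagation rule: the operation runs on the underlying values when all of them are present, and an absent input yields an absent output.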

@nalimilan
Member

By the way, what package are we talking about? That would certainly help me to understand your requirements.

@madeleineudell
Author

madeleineudell commented Jan 25, 2017 via email

@nalimilan
Member

OK. I would have thought you wouldn't actually have to work with data frames: using StatsModels to transform a dataframe+formula to a matrix, and then work only with standard matrices. Is that possible?

@madeleineudell
Author

madeleineudell commented Jan 26, 2017 via email

@kleinschmidt
Contributor

It might be as simple as adding separate constructors for a DataFrame that converts it to a matrix via ModelMatrix(ModelFrame(d, f)) (with an appropriately constructed formula).

Alternatively, is it possible to convert(Matrix, d)?

@kleinschmidt
Contributor

Or, use the StatisticalModel type from StatsBase (which will be moved to StatsModels soon, once someone finds the time), and then you get methods for dataframes + formulas for free (via the DataFrameStatisticalModel type, https://github.com/JuliaStats/StatsModels.jl/blob/master/src/statsmodel.jl). The only pain there is constructing an appropriate formula.

@kleinschmidt
Contributor

Sorry, I see now that you're doing something more interesting than just converting the dataframe with numeric columns to a matrix. I still think that it makes sense to use ModelMatrix. In fact, the goal of StatsModels.jl is to provide tools for converting tables with a mix of categorical/numerical/etc. data into matrices for modeling, so yours is exactly the kind of use case we're aiming for.

@datnamer

What is the cost of converting a very large dataframe or database? Won't this be prohibitive sometimes vs operating in place?

@kleinschmidt
Contributor

kleinschmidt commented Jan 26, 2017

Sure, but as far as I can tell LowRankModels already assumes things will fit in memory (but I could be wrong). One of the goals is to generalize the current implementation to work with a more general interface that doesn't assume things are in-memory but can work with, say, chunks of a table or an iterator of row tuples.

@ExpandingMan
Contributor

I just wanted to throw my comments in here as someone else who has been using the current master and NullableArrays extensively, but hasn't written a single line of code for DataFrames.jl itself.

From my experience so far, DataFrames with the NullableArrays back-end needs 3 major quality-of-life improvements before developers of other packages can be reasonably expected to use it:

  • Easy querying. This means operations like masking df[df[:A] .> 0.0, :] have to be made to work easily again. I've set up some of my own macros for doing things like @constrain(df, :A > 0.0). I'm a big fan of DataFramesMeta.jl, but unfortunately there's no branch for NullableArrays yet, and it isn't really being maintained. Work on something like DataFramesMeta would allow for easy querying in the presence of NullableArray.

  • Allowing Vector column types (as opposed to just NullableVector) and relatively easy conversions between them. Usually at the end of the day missing values have to be dealt with one way or the other anyway. There should be an easy way of filling the missing values in individual columns and converting the columns in the dataframe to regular Vectors. Mixed column types should work whenever it is reasonable for them to do so.

  • A decision needs to be made on whether Nullable is the appropriate type. If it isn't, there needs to be a new equivalent. In most cases we expect our missing values to behave like NaN and be propagated. That's not really what Nullable was designed for. This has been discussed extensively, but a decision needs to be made.
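
As an illustration of the second point (filling missing values in a column and getting a regular Vector back, after which masking works as usual), here is a rough sketch in current Julia. It assumes Some/nothing from Base as a stand-in for present/absent entries rather than NullableVector, and fill_absent is a hypothetical helper name, not an existing DataFrames or NullableArrays function.

```julia
# Hypothetical helper: replace absent entries (`nothing`) with `default`
# and return a plain Vector{T} with no wrapper type left.
function fill_absent(col::AbstractVector, default::T) where {T}
    return T[x === nothing ? default : something(x) for x in col]
end

# A column with one absent entry (stand-in for a NullableVector{Float64})
col = Union{Some{Float64}, Nothing}[Some(1.5), nothing, Some(-2.0)]

dense = fill_absent(col, 0.0)  # Vector{Float64}: [1.5, 0.0, -2.0]
mask = dense .> 0.0            # ordinary element-wise comparison
kept = dense[mask]             # [1.5]

println(dense)
println(kept)
```

Once the column is a plain Vector, the df[df[:A] .> 0.0, :] style of masking from the first point needs no special handling.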

@nalimilan
Member

Please, let's not turn this issue into another discussion of the general roadmap. These have already happened and are happening in other places.

> Easy querying. This means operations like masking df[df[:A] .> 0.0, :] have to be made to work easily again. I've set up some of my own macros for doing things like @constrain(df, :A > 0.0). I'm a big fan of DataFramesMeta.jl, but unfortunately there's no branch for NullableArrays yet, and it isn't really being maintained. Work on something like DataFramesMeta would allow for easy querying in the presence of NullableArray.

See https://discourse.julialang.org/t/announcement-dataframes-0-9-0-planned-for-february/266. Though we obviously won't respect the schedule, which may imply changes in strategy, cf. #1154.

> Allowing Vector column types (as opposed to just NullableVector) and relatively easy conversions between them. Usually at the end of the day missing values have to be dealt with one way or the other anyway. There should be an easy way of filling the missing values in individual columns and converting the columns in the dataframe to regular Vectors. Mixed column types should work whenever it is reasonable for them to do so.

#1119

> A decision needs to be made on whether Nullable is the appropriate type. If it isn't, there needs to be a new equivalent. In most cases we expect our missing values to behave like NaN and be propagated. That's not really what Nullable was designed for. This has been discussed extensively, but a decision needs to be made.

JuliaLang/Juleps#21

@ararslan
Member

@madeleineudell You may continue to use DataFrames with your package, as the Nullable-based code has been separated into the DataTables package.

@madeleineudell
Author

madeleineudell commented Mar 13, 2017 via email

@ararslan
Member

I'm going to close this issue now, as any further concerns regarding Nullables in this context should be posted to the DataTables repo. Thank you for the initial report--I think it was instrumental in our decision to separate the projects.


6 participants