Working with Nullable DataFrames #1148
Comments
See also #1092. It really depends on what the package does. Where possible, the plan is to use high-level APIs provided by StructuredQueries (not ready yet). Then, packages working on numeric data should just convert |
Doesn't @DavidGold have an autolifting scheme in the works? |
Automatic lifting will only be enabled inside query macros. OTOH element-wise operators are now lifting in Julia 0.6. |
What about broadcasting of elwise? |
What do you mean? |
@madeleineudell Can you tell us more about the issues you experience? |
Sure; when I go to index elements of the DataFrame `x = df[3,4]`, the
element x is of type `Nullable{whatever}`. In the previous incarnation of
DataFrames, x was of type `whatever`, and so methods designed for
`whatever`s worked on x. For example, I have a bunch of code in which
`whatever` is an Int or a Float64 or a Bool, and so I merrily take x and
multiply, divide, exponentiate etc. Now, I'd need to have my code use
x.value rather than x.
This wouldn't be a problem (other than a coding pain, because I can't just
use a simple find-replace), except that my code is also designed to work
with Matrices as well as with DataFrames. So that means that everywhere in
my code I'd need to sprinkle `if isa(x, Nullable) ...`, or I'd need to
define a new method for every function (+,-,exp,...) whose input might be
nullable, etc. In other words, it's pretty annoying if DataFrames and
Matrices are no longer interoperable.
If you overloaded all the functions on integers so that +(x::Nullable{Int},
y::Int) = +(x.value, y), for every operation, and ditto for other argument
types (Float64, Bool, etc), then that would go some way towards fixing
this. But I don't think every package maintainer who uses DataFrames should
have to write all those macros. (And in my case, it would take me quite
some time to figure out how to do so.)
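
For illustration, the overloading idea described above could look roughly like the sketch below. It is only a sketch under assumptions: `Nullable` is no longer part of Base in current Julia releases, so the code defines a hypothetical stand-in called `Maybe`, and the helper `unwrap` and the example function `scale` are likewise made up for this example. It is not how DataFrames or any package implements lifting.

```julia
import Base: +, -, *, /

# Hypothetical stand-in for the Nullable{T} wrapper (not the Base type).
struct Maybe{T}
    hasvalue::Bool
    value::T
    Maybe{T}() where {T} = new{T}(false)       # null: value left unset
    Maybe{T}(v) where {T} = new{T}(true, v)    # wraps a value
end
Maybe(v::T) where {T} = Maybe{T}(v)

# Return the wrapped value, or pass plain values through unchanged.
unwrap(x::Maybe) = x.hasvalue ? x.value : error("cannot unwrap a null value")
unwrap(x) = x

# Define each arithmetic operator once for every wrapped/plain combination,
# so generic numeric code does not need `isa(x, Maybe)` checks.
for op in (:+, :-, :*, :/)
    @eval begin
        $op(x::Maybe, y::Number) = $op(unwrap(x), y)
        $op(x::Number, y::Maybe) = $op(x, unwrap(y))
        $op(x::Maybe, y::Maybe)  = $op(unwrap(x), unwrap(y))
    end
end

scale(v) = 2 * v + 1          # generic numeric code, written once

x = Maybe(3)                  # what indexing a nullable-backed table might return
m = [1.0 2.0; 3.0 4.0]        # a plain matrix

scale(x)                      # 7
scale(m[1, 2])                # 5.0
```

Note that this version returns a plain value and throws on a null, as in the `+(x.value, y)` suggestion; a lifting scheme that propagates nulls would return a wrapped result instead.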
|
Honestly, I don't think data frames were ever considered as equivalent to matrices (@johnmyleswhite may want to comment); they are more like databases. That said, |
By the way, what package are we talking about? That would certainly help me to understand your requirements. |
I'm talking about LowRankModels in particular. We do PCA, sparse PCA,
nonnegative matrix factorization, one-bit PCA etc on both fully observed
matrices and partially observed tables of data.
I think that element-wise (automatic or accessible by using a single
additional package) lifting would indeed suit my needs, but I'd need to
check...!
|
OK. I would have thought you wouldn't actually have to work with data frames: using StatsModels to transform a dataframe+formula to a matrix, and then work only with standard matrices. Is that possible? |
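
As a rough illustration of converting once at the boundary and then working only with standard matrices, here is a minimal sketch. It does not use the StatsModels API; the `table` layout, the use of `nothing` as the marker for an unobserved entry, and the helper `to_matrix` are all assumptions made up for this example.

```julia
# Hypothetical table: named columns whose entries may be unobserved (`nothing`).
table = (age = [31, nothing, 45], income = [52.0, 48.5, nothing])

# Convert once: a dense Float64 matrix plus the list of observed (row, column) indices.
function to_matrix(tbl)
    cols = collect(values(tbl))
    n, p = length(first(cols)), length(cols)
    A = zeros(Float64, n, p)
    observed = Tuple{Int,Int}[]
    for j in 1:p, i in 1:n
        v = cols[j][i]
        if v !== nothing
            A[i, j] = v
            push!(observed, (i, j))
        end
    end
    return A, observed
end

A, observed = to_matrix(table)

# From here on the numeric code sees only a Matrix{Float64} and index pairs.
observed_mean = sum(A[i, j] for (i, j) in observed) / length(observed)
```

The unobserved entries are stored as zeros in `A` and are only meaningful together with `observed`, so downstream code has to restrict itself to those index pairs.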
I think it should be possible to use StatsModels, but it requires a
complete rethinking of the way modeling works in LowRankModels.
|
It might be as simple as adding separate constructors for a Alternatively, is it possible to |
Or, use the |
Sorry, I see now that you're doing something more interesting than just converting the dataframe with numeric columns to a matrix. I still think that it makes sense to use |
What is the cost of converting a very large dataframe or database? Won't this be prohibitive sometimes vs operating in place? |
Sure, but as far as I can tell LowRankModels already assumes things will fit in memory (but I could be wrong). One of the goals is to generalize the current implementation to work with a more general interface that doesn't assume things are in-memory but can work with, say, chunks of a table or an iterator of row tuples. |
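
To make the "iterator of row tuples" idea concrete, here is a small sketch of code written against that kind of interface. The function `column_means` is hypothetical and only meant to show the shape of such an API: it makes one pass over any iterator that yields rows as tuples and never materializes the whole table.

```julia
# One-pass column means over any iterator of row tuples.
function column_means(rows)
    sums = Float64[]
    n = 0
    for row in rows
        # Size the accumulator from the first row seen.
        isempty(sums) && (sums = zeros(Float64, length(row)))
        for (j, v) in enumerate(row)
            sums[j] += v
        end
        n += 1
    end
    return sums ./ n
end

# Works on rows drawn from an in-memory matrix...
M = [1.0 2.0; 3.0 4.0; 5.0 6.0]
column_means(Tuple(M[i, :]) for i in 1:size(M, 1))   # [3.0, 4.0]

# ...and on a lazily generated stream of rows.
column_means((i, 2i) for i in 1:4)                   # [2.5, 5.0]
```

The same function could be fed chunks read from disk or a database cursor, as long as they are exposed as an iterator of tuples.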
I just wanted to throw my comments in here as someone else who has been using the current master and From my experience so far, DataFrames with the
|
Please, let's not turn this issue into another discussion of the general roadmap. These have already happened and are happening in other places.
See https://discourse.julialang.org/t/announcement-dataframes-0-9-0-planned-for-february/266. Though we obviously won't respect the schedule, which may imply changes in strategy, cf. #1154.
|
@madeleineudell You may continue to use DataFrames with your package, as the Nullable-based code has been separated into the DataTables package. |
Great!
|
I'm going to close this issue now, as any further concerns regarding |
I'm maintaining a package that depends on DataFrames, and I'm a bit at a loss as to how to maintain the package at this point with the addition of the Nullable type.
My code uses plenty of arithmetic, indexing, type checking, etc etc. Now none of this works because most functions on Nullables don't propagate to the underlying value even if that instance of the nullable is not null. So addition doesn't work; subtraction doesn't work; type checking for integers or reals doesn't work; etc.
I could rewrite all these functions but that's a huge amount of work, and it's not clear to me from reading these issues whether the API is stable yet anyway. And I certainly won't be able to maintain backward compatibility, which saddens me.
What's your recommendation for package maintainers who depend on DataFrames?