-
Notifications
You must be signed in to change notification settings - Fork 19
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Why move away from the Nullable approach? #5
Comments
This was the approach taken by DataArrays. We wanted missing values to be scalars. The only problem was performance due to type instability, which is why people started exploring the use of |
An important point is that while we took for granted for a long time that the |
And to be clear, I think there are certainly use-cases where But as @ararslan mentioned, I'm probably also a little at fault here for the confusion: a while ago now I actually called for an offline discussion on the future of DataFrames/Nullablility (probably 6-7 months ago now); I tried to invite the major community players that I knew personally & any core developers that would be willing to discuss. It was a time where things were very unclear with regards to a strong sense of purpose/direction for DataFrames/Nullability and I felt like having a real-time discussion to hash out the most outstanding issues would help re-invigorate and focus some of the key players/packages. Obviously, those kinds of discussions inevitably leave people out and you have emerged one of the most key people in this space. On the other hand, I felt like the discussion really served it's purpose since it helped clarify (as @nalimilan mentioned) some of the compiler possibilities that most of us had previously written off as "probably not going to be possible or any time soon". All of that said, I want to re-iterate that we're still all very much in experimentation phase; I'm by no means convinced that the |
Thanks @quinnj, that is helpful. And let me stress, I really want to sort this out with you guys as well, my goal for Query.jl has always been to not have any missing value specific code at all in it :) I think I have two worries with the Having said that, I think the approach here is obviously the right one: experiment with this and see where it leads us. At the same time I would very much like to see a similar attempt to sort out the remaining issues with the container based approach, so that at the end of the day we can compare both approaches and see which one works better. I'll write up one proposal for the container based approach in a few moments. If that is ok I'll do it here, maybe we can just centralize the whole discussion about this in this repository here? |
That's what John Myles White was describing in his Julep.
It's unlikely that automatic lifting will be at thing, as Jeff Bezanson is firmly opposed to it. It's intended that the functionality being explored in this package will be moved into Base eventually, at which time any function that can deal with null values in Base will be overloaded as such. Jeff's idea is then that not providing methods that handle missing values is a way of saying, "This function doesn't have a natural way to deal with missing values." It's expected that package authors would follow suit with their functions. I consider the container-based approach to be somewhat of a failed experiment. It was good that it was explored, but there's a reason why thinking has shifted back to the DataArrays-style approach. |
Here is one proposal for a general missing data value approach for the julia data science area. We use ScalarsFor basic arithmetic infix operators on DataValue(3) + DataValue(2) # returns DataValue(5)
DataValue(3) + 2 # returns DataValue(5)
DataValue() + 2 # returns DataValue{Int}() This works because we provide methods for arithmetic infix operators that propagate null values. For all comparison operators on DataValue(3) == DataValue(3) # returns true
DataValue(3) == 3 # returns true
DataValue(3) > 2 # returns true
DataValue(3) > DataValue() # returns false In general you can use the call-site operator log.(DataValue(3)) # returns DataValue(log(3))
DataValue(3) .+ DataValue(2) # returns DataValue(5)
DataValue(3) .== 3 # returns DataValue(true) ArraysLets assume we have an array Basic arithmetic infix operators would be used like this: x .+ x # returns Array{DataValue{Int},1} For comparisons one would use this: x .== x # returns Array{Bool, 1}
x ..== x # returns Array{DataValue{Bool},1} For any other function that one wants to apply element wise with the null propagating lifting semantics, one would write: log..(x) # returns Array{DataValue{Float64},1} For higher order functions, one could easily apply the null propagated lifted version of a scalar function like this: map(log., x) # returns Array{DataValue{Float64},1} QuestionsAre there any other use cases that I have missed? I'm sure I have, but this proposal seems to handle all the issues that I remember having seen in the various debates about this. I would think that Again, I'm sure I missed use cases, so please point them out! |
I think that was a great draft, but I never saw a clear and complete description of the lifting story spelled out. In my mind that is the part where things get complicated.
I don't understand. Is this going back to the whitelist approach to lifting?
Why? It solves pretty much every problem for Query.jl, and I think the proposal I made above could work for arrays of missing values. On the flip side, at least with the current idea, |
So even if the need to add functions to a white list isn't ideal with the |
Can you explain how that would show up? Doesn't the current
Why would white listing be easier with |
I can't explain it better than the Julep, really.
It wouldn't be easier, but if we have to whitelist, then why bother with the broadcasting approach? Why not use a standard |
I'm not sure I understand your point. Are you saying that the current
Well, for one we know that the container based approach works for Query.jl and the Plus, we could keep all the investment in the various packages that have made a move towards a container based solution and wouldn't have to start from scratch yet again? |
We aren't starting from scratch. This approach is basically optimized, cleaned up DataArrays, which is one of the oldest packages AFAIK. |
Sure, it works, but we still need to use hacks calling
How many packages? DataFrames, CategoricalArrays, DataStreams (and associated packages), and Query (which uses |
Are there three options on the table right now?
Purely in terms of user experience, are there any arguments that would favor 1 over 2, or the other way around? I don't see anything in the discussion so far that would favor 1, but at least one that favors 2 (Query.jl doesn't work with 1 unless we find a solution for #6 ), but happy to be corrected. 3 seems open to me, I'm just not sure. There is an argument to be made that all the dots get annoying, but there is also an argument that having the dots actually makes things more explicit, which could be beneficial because it might avoid bugs. I think to really say anything about the merits of 3 over 1 or 2 we would have to implement it, test it out a while with users and see what feedback we get. In terms of project management complexity and the likelihood that we will have something for julia 1.0 that works, 2 seems the clear winner. If we go with Am I missing something? There must be some more concrete argument in favor of Some more detailed responses on other points raised:
The same is true for
This looks like a trade-off to me: certainly
There is also ReadStat, TypedTable, StatsModels, TextParse, RCall and IterableTables. |
@quinnj Just out of curiosity, would there be a benefit purely for the DataStreams ecosystem from |
My initial impression is yes, and I'm actually starting to work on branches across DataStreams, CSV, and DataFrames to do so to experiment with. I was planning on writing up more thoughts tomorrow as I have a big presentation tomorrow morning, so I'll try to collect some more thoughts and share after that's over with. |
In terms of user experience, it's so much nicer that As you say, there's basically only one argument in favor of using a wrapper, it's #6. All other advantages of wrappers (lifting via broadcast...) can also be applied to
I'm not an expert either, but @vtjnash seems to consider this as doable, and he already has branches improving the performance of
I think you greatly underestimate the work required to get |
I looked at all three examples you linked to, and they all seem like cases where the current
Ah, good. If that is possible, I guess we can split the discussion into a) what data structure is better and b) is white listing or
This strikes me as a really risky plan for julia 1.0. What if the optimizations don't make it into julia 1.0 for one reason or another? I would much prefer a strategy where any work on introducing
I looked through all of these and I saw lots of examples that can't be handled if we use a container based approach and rule out white listing or nested My current thinking about all of this is that the container based approach is so unusable right now because the lifting approach taken in NullableArrays.jl really doesn't work. What we have right now is a half-way attempt at white listing functions, and then a (in my opinion failed) approach to try to automatically lift things in higher order functions like The three examplesI just created a new package DataValueOperations.jl that provides white listed lifting for the First exampleThe first example you linked to was this. Here is how that looks with the white listed using DataValueOperations
a, b = ?("14:00:00"), ?("15:15:00")
Dates.value(DateTime(a,"HH:MM:SS") - DateTime(b,"HH:MM:SS"))
c = DateTime(a,"HH:MM:SS") - DateTime(b,"HH:MM:SS") For this to work I needed to define white listed methods for Here is how it looks with the using DataValues
a, b = ?("14:00:00"), ?("15:15:00")
Dates.value.(DateTime.(a,"HH:MM:SS") .- DateTime.(b,"HH:MM:SS"))
c = DateTime.(a,"HH:MM:SS") .- DateTime.(b,"HH:MM:SS") Second exampleThe second example you linked to was this. Essentially madeleineudell at one point said that if someone provided either white listed functions or There was also the question about masking in that link. Here is how that looks with the white listed approach or the nested using DataValueOperations
a = [?(3.), ?(2.), ?(5.)]
a[a .> 2.] Third exampleThe third example you linked to was this. Again, this works just fine with either approach 2 or 3, here is how it would look: using DataValueOperations
A = [?(9), ?(8), ?(15)]
map(i->isnull(i) ? false : get(i) % 3 == 0, A)
f(i) = isnull(i) ? false : get(i) % 3 == 0
f.(A) |
All these arguments only prove that containers are not worse in these examples than Regarding the contingency plan, it's easy: continue working with DataFrames and DataArrays. Another advantage of Maybe we won't convince you, but these discussions have been going on for a long time and unless a big issue is discovered in the transition we're going to go ahead with this plan. |
The biggest reason that I see for going with I'm happy to help dig into DataValues/Query and better understand what kind of changes would be required; I've currently made progress in getting an end-to-end |
What about the following strategy: we try to pursue both approaches for a while. We move DataTables.jl over to use At the same time you guys pursue the This is a hedging strategy that should increase the likelihood that we have a performant story ready for julia 1.0. I really just don't want to stop all work on container based approaches at this point because of the
Would a fundamental incompatibility of the |
Not until it's 100% clear there is no way to get it working. Jacob proposed his help but you don't seem to care. |
There's certainly nothing wrong (and nothing stopping) trying both approaches and having a branch of DataTables that uses DataValue. I'm going to start working on a branch for DataFrames that uses Nulls.jl. I guess I'm wondering if you (David) have any other real concerns with the Anyway, I'm definitely in the spirit of wanting to collaborate and encourage healthy debates, but also want to avoid discouraging other's ideas or making things personal. I've tried to lay out why exactly I'm not a fan of the |
I think it is the symptom of a very fundamental problem pretty unrelated to Query.jl. I'll elaborate over in #6.
No, I'd be happy to investigate and switch Query.jl over to a different representation of missing values if we can find a solution that works. Also happy to help search for that solution :)
I think we should experiment with multiple approaches until we fully understand the pros and cons of all of them, but I am certainly game to try to converge eventually to one representation. I just don't think we are there yet. I did get the vibe from various comments here and over on the forums (not from @quinnj, though) that a decision has been made to make the switch, and I'm pushing back against that. Instead, I would like the official message to be "we are experimenting with a new approach (
I completely agree and I should say that I've found every single comment from you (@quinnj) a model example of that spirit. I read over my comments here, and I believe they are all about technical merits, never get personal and I also stressed over and over again that I'm fully in favor of experimenting with this |
Sorry if it sounded harsh, but I have taken the time to discuss arguments in depth, and when Jacob proposed help (after I did the same several times) you didn't reply to his offer and instead proposed another plan which would require quite some work on our part. I wouldn't qualify this as very professional either. These seemingly endless discussions have already prompted John to abandon his Nullable Julep, that may give you an idea of where I came from. I think we all need to talk less and experiment more, that's the most productive way forward. Nobody is going to release a version of DataFrames/DataTables without being certain that it can work with the whole ecosystem. |
@quinnj said:
Ahh! This is the piece of information I had been missing in previous discussions - I definitely sensed a shift in consensus here and couldn't find a matching discussion online. (Unfortunately, me trying to get to the bottom of the reasons here was taken the wrong way... sorry) What I wanted to add here is that one of the conceptual (and practical) difficulties for me with |
Closing as I think this discussion has run its course. The TL;DR is that |
I know I'm repeating myself, but I don't agree with this assessment. I've always said that if someone finds a solution for how Query.jl can work with Nulls.jl I'd be thrilled, but I haven't seen one and haven't been able to come up with one myself (not for a lack of trying). |
This is a pretty basic question, but I guess I'm not entirely clear why the goal is now to give up on something like
Nullable
? IsUnion{T,Null}
believed to eventually be faster? Or some other reason?I did read the julep that John wrote, but it actually never explained why there is this general shift towards
Union{T,Null}
.The text was updated successfully, but these errors were encountered: