This repository has been archived by the owner on May 4, 2019. It is now read-only.

RFC: Stop lying about eltype #280

Merged
merged 4 commits into from
Aug 20, 2017

Conversation

Member

@andreasnoack andreasnoack commented Jul 29, 2017

As discussed on Slack. It was actually not that bad. The core changes are in abstractdataarray.jl, which now defines

abstract type AbstractDataArray{T, N} <: AbstractArray{Union{T,NAtype}, N} end

Base.eltype{T, N}(d::AbstractDataArray{T, N}) = Union{T,NAtype}

So for subtypes of AbstractDataArray, T does not refer to the union, but in signatures written against AbstractArray, T does refer to the union. This requires a little care when we ping-pong between Base and DataArrays definitions such as broadcast and similar. For that, it was convenient to define an extractT function that returns the T part of a Union{T,NAtype}.
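
For illustration, here is a minimal sketch of what a helper like extractT could look like. The NAtype below is a stand-in singleton defined only for this example, and the method is an assumption written in current Julia syntax, not necessarily the definition used in this PR.

```julia
# Stand-in for DataArrays' NAtype, defined here only so the sketch is self-contained.
struct NAtype end

# Hypothetical helper: recover T from Union{T,NAtype} by dispatching on the union.
extractT(::Type{Union{T,NAtype}}) where {T} = T

extractT(Union{Int,NAtype})     # returns Int
extractT(Union{String,NAtype})  # returns String
```

The sketch only covers the plain Union{T,NAtype} case; inputs such as Any or a bare NAtype would need their own methods.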

The changes here also require some changes to DataFrames. Hopefully, I'll be able to open a PR with them tomorrow and then we can look at the whole picture.

cc: @iamed2, @JeffBezanson

Update: Realized that an eltype definition is not needed here anymore since the fallback for AbstractArrays would now be correct.
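
To see why the explicit eltype method becomes redundant, consider a toy array type. Everything below (NAtype, NA, ToyDataVector) is a hypothetical stand-in written in current Julia syntax for this sketch, not DataArrays code; only the supertype declaration mirrors the one in this PR.

```julia
# Stand-ins so the example runs on its own.
struct NAtype end
const NA = NAtype()

# Same shape as the declaration in this PR: the AbstractArray parameter is the union.
abstract type AbstractDataArray{T,N} <: AbstractArray{Union{T,NAtype},N} end

# Hypothetical minimal subtype: values plus a mask marking the NA positions.
struct ToyDataVector{T} <: AbstractDataArray{T,1}
    data::Vector{T}
    na::Vector{Bool}
end

Base.size(d::ToyDataVector) = size(d.data)
Base.IndexStyle(::Type{<:ToyDataVector}) = IndexLinear()
Base.getindex(d::ToyDataVector, i::Int) = d.na[i] ? NA : d.data[i]

d = ToyDataVector([1, 2, 3], [false, true, false])

# No eltype method was defined; the AbstractArray fallback reads it off the supertype.
eltype(d)  # Union{Int, NAtype}
```

Because getindex can return NA, the element type reported by the fallback now matches what indexing actually produces.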

Member

@nalimilan nalimilan left a comment

Thanks. That's definitely more correct, though I wonder how breaking this will be for users. Do you have any ideas about that?

FWIW, this change will make DataArray very similar to Array{Union{T, Nulls.Null}}. If we make this relatively deep change, maybe we should also replace NAtype with Nulls.Null?

@@ -4,7 +4,7 @@
An `N`-dimensional `AbstractArray` whose entries can take on values of type
`T` or the value `NA`.
"""
-abstract type AbstractDataArray{T, N} <: AbstractArray{T, N} end
+abstract type AbstractDataArray{T, N} <: AbstractArray{Union{T,NAtype}, N} end
Member

The spacing rules are getting too complex. Either use spaces after commas for all type parameters, or for none, but not just outside of unions.


# This is mainly to handle isna.(x) since isna is probably the only
# function that can guarantee that NAs will never propagate
@inline function broadcast_t(f, ::Type{Bool}, shape, A, Bs...)
Member

@nalimilan nalimilan Jul 30, 2017

This is type piracy. I would have expected Base to do this automatically already?

Member Author

No, because the Base method is not extended; I just use the same name. See the comment above broadcast_c for an explanation. Base's broadcast needs some adjustments before we can use it here.

@@ -244,8 +244,6 @@ end

dropna(dv::DataVector) = dv.data[.!dv.na] # -> Vector

Base.broadcast(::typeof(isna), da::DataArray) = copy(da.na)
Member

Isn't it worth keeping this specialization for performance? Remember that we have deprecated isna(da) in favor of isna.(da).

Member Author

I'd prefer to clean up the code. I'm not sure how big the performance difference is but even if there is a slight difference, I'm not sure if it really matters here. Having these kinds of definitions is misleading and I wasted some time because of it since isna.(x) and (!).(isna.(x)) were returning different types.

Member

Then at least measure the performance difference? isna is quite commonly used.

Member Author

Fair call. It is indeed much faster. I'll add it back but in the broadcast.jl file instead of here.

src/natype.jl Outdated
@@ -36,12 +36,26 @@ struct NAException <: Exception
end
NAException() = NAException("NA found")

Base.promote_rule(::Type{Union{T,NAtype}}, ::Type{Union{S,NAtype}}) where {T<:Number,S<:Number} =
Member

I suppose these are only to fix ambiguities? If that's the case, better add a comment. If not, then why not support more general promotions? Same remark for convert.

Member Author

Yes. The restriction to Number was necessary. Otherwise, it kept calling itself. I can add a comment.

Member Author

@andreasnoack andreasnoack Jul 30, 2017

The convert method is just there to define convert in that case, not to fix an ambiguity. You are probably thinking of the Number restriction again, so same reply as above.

Member

That explains why you wrote it that way, but doesn't justify why we'd want to support only Number. We should find a solution for other types. Nulls.jl only defines methods for Null (equivalent of NAtype): couldn't we do the same here?

pool::Vector{T},
-m::AbstractArray{Bool, N},
+m::Union{AbstractArray{Bool, N},
Member

Could be written more simply as <:Union{Bool, NAtype}. You could also adopt the same approach for d and use T = extractT(eltype(d)) inside the function body.

pna = 100nacount/length(X)
if pna != 100 # describe will fail if dropna returns an empty vector
describe(io, dropna(X))
else
println(io, "Summary Stats:")
-println(io, "Type: $(eltype(X))")
+println(io, "Type: $(T)")
Member

Why not use extractT(eltype(X)) just like below?

Member Author

The type parameter is already in the signature so I think it is easier and clearer just to use it.

Member

OK, I had misread AbstractDataArray as AbstractArray.

test/nas.jl Outdated
@testset "promotion" begin
@test promote_type(Int, Union{Float64,NAtype}) == Union{Float64,NAtype}
@test promote_type(Union{Int,NAtype}, Float64) == Union{Float64,NAtype}
@test promote_type(Union{Int,NAtype}, Union{Float64,NAtype}) == Union{Float64,NAtype}
Member

Should also test with non-number types, e.g. String.

Member Author

I'm not sure if there are any promotion methods for String.

Member

But why define methods for Number and not for other types? That was my main question.

Member

Ah, I guess you meant that it does not really make sense to promote String with anything else? Then you could try with e.g. Dates.Minute and Dates.Second.

Member Author

I wanted to define it for any type, but Union{Any,NAtype}==Any, so that doesn't work. Basically, I don't know how we can cover all relevant cases except by adding them one by one. Number is probably the most important category to cover and the only one covered by tests, but I think you are right that Dates should also have a definition.

Member

I hadn't noticed, but probably because I use my own custom promoting machinery in CSV.jl (the one place I rely on promoting, as far as I know). I need my own rules, because I want things like promote_type(::Type{Int}, ::Type{Date}) = String.

I did add the definition promote_type2{T, S}(::Type{Union{T, Null}}, ::Type{Union{S, Null}}) = Union{promote_type2(T, S), Null}; would that be enough to solve the case here?

Member Author

@andreasnoack andreasnoack Jul 31, 2017

Your last definition is what I tried first (with promote_rule), but I got an infinite recursion out of it.

Member

It seems to work here. Can you post exactly what you tried?

Member Author

Ah, so

promote_rule(::Type{Union{T,NAtype}}, ::Type{Union{S,NAtype}}) where {T<:Dates.AbstractTime,S<:Dates.AbstractTime} = Union{promote_type(T, S),NAtype}

seems fine but

promote_rule(::Type{Union{T,NAtype}}, ::Type{S}) where {T,S} = Union{promote_type(T, S),NAtype}

causes stack overflow.
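
The bounded variant can be tried in isolation. The sketch below uses a stand-in NAtype and the Number bound from the tests in this PR; it is an illustration in current Julia syntax, not the code that was merged, and it does not attempt to reproduce the stack overflow from the unbounded definition.

```julia
# Stand-in for DataArrays' NAtype, defined here only so the sketch is self-contained.
struct NAtype end

# Both sides may hold NA: promote the value types, then re-attach NAtype.
Base.promote_rule(::Type{Union{T,NAtype}}, ::Type{Union{S,NAtype}}) where {T<:Number,S<:Number} =
    Union{promote_type(T, S),NAtype}

# Only one side may hold NA. The Number bound on S means S can never itself be
# a Union with NAtype, so this method does not overlap with the one above.
Base.promote_rule(::Type{Union{T,NAtype}}, ::Type{S}) where {T<:Number,S<:Number} =
    Union{promote_type(T, S),NAtype}

promote_type(Union{Int,NAtype}, Union{Float64,NAtype})  # Union{Float64, NAtype}
promote_type(Union{Int,NAtype}, Float64)                # Union{Float64, NAtype}
promote_type(Int, Union{Float64,NAtype})                # Union{Float64, NAtype}
```

Since promote_type consults promote_rule with the arguments in both orders, one method per combination is enough to cover either argument order.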

Member

Looks like we need to define Base.promote_rule(::Type{Any}, ::Type) = Any? Maybe this should be added to Base, and shipped here temporarily with a VERSION check?

@@ -128,4 +128,9 @@
@test map!(abs, x, x) == @data([1, 2])
@test isequal(map!(+, DataArray(Float64, 3), @data([1, NA, 3]), @data([NA, 2, 3])), @data([NA, NA, 6]))
@test map!(isequal, DataArray(Float64, 3), @data([1, NA, NA]), @data([1, NA, 3])) == @data([true, true, false])

# isna doesn't propagate NAs so it should return BitArrays
x = (!).(isna.(@data [NA, 1, 2]))
Member

Maybe also test without (!).? The code paths used are quite different due to fusion.

@andreasnoack andreasnoack force-pushed the anj/eltype branch 2 times, most recently from bb2e425 to c70866e Compare July 30, 2017 18:32
@andreasnoack
Member Author

Thanks for the comments. I'm not sure what it will feel like in user code. We'll probably have to try it out a bit. I made the changes to DataFrames to get a feel for how much would have to be changed in library code. I don't think it was that bad, and it will hopefully open up for using DataArrays in functions not written explicitly for them.

My understanding is that Null is basically a reimplementation of NAtype with the main benefit that it lives in a separate package instead of being bundled with two array types. I can see that the semantics for comparisons are a bit different from NA. I could try making a commit with that change but I'm not sure it is worth bundling the two changes into the same PR.

@ararslan
Member

I could try making a commit with that change but I'm not sure it is worth bundling the two changes into the same PR.

👍 I think we should go ahead with this and try out the Nulls change in a separate branch. Luckily since the name is different (null vs NA), we should be able to put some deprecations in place that will ensure people make the right comparisons for null.

@nalimilan
Member

Sure, I didn't mean we should move to Nulls.jl in this PR, just that it could make sense to make both changes before tagging a release (if we decide to do both).

@ararslan
Member

Bump.

@@ -4,7 +4,7 @@
An `N`-dimensional `AbstractArray` whose entries can take on values of type
`T` or the value `NA`.
"""
-abstract type AbstractDataArray{T, N} <: AbstractArray{T, N} end
+abstract type AbstractDataArray{T, N} <: AbstractArray{Union{T, NAtype}, N} end
Member

@rofinn rofinn Aug 14, 2017

Should we maybe add an alias (e.g., const Data{T} = Union{T, NAtype}) and not export it? Reading code that operates on DataArrays.Data{T} seems more readable to me and it might help with deprecations later on.
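
A small sketch of how such an alias might read in practice. NAtype and NA are stand-ins and nacount is a hypothetical helper, all defined only for this example in current Julia syntax; only the alias itself comes from the suggestion above.

```julia
# Stand-ins so the example runs on its own.
struct NAtype end
const NA = NAtype()

# The suggested (unexported) alias.
const Data{T} = Union{T,NAtype}

# Hypothetical helper with a signature written against the alias.
nacount(xs::AbstractVector{Data{T}}) where {T} = count(x -> x isa NAtype, xs)

v = Data{Int}[1, NA, 3]   # a Vector{Union{Int, NAtype}}
nacount(v)                # returns 1
```

The alias is purely a spelling convenience: Data{Int} and Union{Int,NAtype} are the same type.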

@andreasnoack
Member Author

Updated. The test failure on master is most likely because of the macrocalypse going on on master right now. Hopefully, it'll either be fixed or we will be told how to fix DataArrays in a couple of days.

@andreasnoack andreasnoack changed the title WIP: Stop lying about eltype RFC: Stop lying about eltype Aug 19, 2017
@andreasnoack
Member Author

I think this is ready, but right now it causes a segfault on master.

@andreasnoack andreasnoack merged commit 8b9e896 into master Aug 20, 2017
@andreasnoack andreasnoack deleted the anj/eltype branch August 20, 2017 01:56
@rofinn
Member

rofinn commented Aug 20, 2017

Hooray, thanks @andreasnoack!


5 participants