-
Notifications
You must be signed in to change notification settings - Fork 50
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. That's definitely more correct, though I wonder how breaking this will be for users. Do you have any ideas about that?
FWIW, this change will make DataArray
very similar to Array{Union{T, Nulls.Null}
. If we make this relatively deep change, maybe we should also replace NAtype
with Nulls.Null
?
src/abstractdataarray.jl
Outdated
@@ -4,7 +4,7 @@ | |||
An `N`-dimensional `AbstractArray` whose entries can take on values of type | |||
`T` or the value `NA`. | |||
""" | |||
abstract type AbstractDataArray{T, N} <: AbstractArray{T, N} end | |||
abstract type AbstractDataArray{T, N} <: AbstractArray{Union{T,NAtype}, N} end |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The spacing rules are getting too complex. Either use spaces after commas for all type parameters, or for none, but not just outside of unions.
|
||
# This is mainly to handle isna.(x) since isna is probably the only | ||
# function that can guarantee that NAs will never propagate | ||
@inline function broadcast_t(f, ::Type{Bool}, shape, A, Bs...) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is type piracy. I would have expected Base to do this automatically already?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No because the Base method is not extended. I just use the same name. See the comment above broadcast_c
for an explanation. Base's broadcast needs some adjustments before we can use it here.
@@ -244,8 +244,6 @@ end | |||
|
|||
dropna(dv::DataVector) = dv.data[.!dv.na] # -> Vector | |||
|
|||
Base.broadcast(::typeof(isna), da::DataArray) = copy(da.na) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't it worth keeping this specialization for performance? Remember that we have deprecated isna(da)
in favor of isna.(da)
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd prefer to clean up the code. I'm not sure how big the performance difference is but even if there is a slight difference, I'm not sure if it really matters here. Having these kinds of definitions is misleading and I wasted some time because of it since isna.(x)
and (!).(isna.(x))
were returning different types.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Then at least measure the performance difference? isna
is quite commonly used.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fair call. It is indeed much faster. I'll add it back but in the broadcast.jl
file instead of here.
src/natype.jl
Outdated
@@ -36,12 +36,26 @@ struct NAException <: Exception | |||
end | |||
NAException() = NAException("NA found") | |||
|
|||
Base.promote_rule(::Type{Union{T,NAtype}}, ::Type{Union{S,NAtype}}) where {T<:Number,S<:Number} = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suppose these are only to fix ambiguities? If that's the case, better add a comment. If not, then why not support more general promotions? Same remark for convert
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. The restriction to Number
was necessary. Otherwise, it kept calling itself. I can add a comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The You are probably thinking of the convert
method is just to define convert in that case, not so fix an ambiguity.Number
again so same reply as above
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That explains why you wrote it that way, but doesn't justify why we'd want to support only Number
. We should find a solution for other types. Nulls.jl only defines methods for Null
(equivalent of NAtype
): couldn't we do the same here?
src/pooleddataarray.jl
Outdated
pool::Vector{T}, | ||
m::AbstractArray{Bool, N}, | ||
m::Union{AbstractArray{Bool, N}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could be written more simply as <:Union{Bool, NAtype}
. You could also adopt the same approach for d
and use T = extractT(eltype(d))
inside the function body.
pna = 100nacount/length(X) | ||
if pna != 100 # describe will fail if dropna returns an empty vector | ||
describe(io, dropna(X)) | ||
else | ||
println(io, "Summary Stats:") | ||
println(io, "Type: $(eltype(X))") | ||
println(io, "Type: $(T)") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not use extractT(eltype(X))
just like below?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The type parameter is already in the signature so I think it is easier and clearer just to use it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, I had misread AbstractDataArray
as AbstractArray
.
test/nas.jl
Outdated
@testset "promotion" begin | ||
@test promote_type(Int, Union{Float64,NAtype}) == Union{Float64,NAtype} | ||
@test promote_type(Union{Int,NAtype}, Float64) == Union{Float64,NAtype} | ||
@test promote_type(Union{Int,NAtype}, Union{Float64,NAtype}) == Union{Float64,NAtype} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should also tests with non-number types, e.g. String
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure if there are any promotion methods for String
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But why define methods for Number and not for other types? That was my main interrogation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I guess you meant that it does not really make sense to promote String
with anything else? Then you could try with e.g. Dates.Minute
and Dates.Second
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wanted to define if for any type but Union{Any,NAtype}==Any
so that doesn't work. Basically, I don't know how we can cover all relevant cases except by adding them one by one. Number
is probably the most category to cover and the only one covered by tests but I think you are right that Dates
should also have a definition.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I hadn't noticed, but probably because I use my own custom promoting machinery in CSV.jl (the one place I rely on promoting, as far as I know). I need my own rules, because I want things like promote_type(::Type{Int}, ::Type{Date}) = String
.
I did add the definition promote_type2{T, S}(::Type{Union{T, Null}}, ::Type{Union{S, Null}}) = Union{promote_type2(T, S), Null}
; would that be enough to solve the case here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Your last definition is what I tried first (with pomote_rule
) but I got an infinite recursion out of it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems to work here. Can you post exactly what you tried?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, so
promote_rule(::Type{Union{T,NAtype}}, ::Type{Union{S,NAtype}}) where {T<:Dates.AbstractTime,S<:Dates.AbstractTime} = Union{promote_type(T, S),NAtype}
seems fine but
promote_rule(::Type{Union{T,NAtype}}, ::Type{S}) where {T,S} = Union{promote_type(T, S),NAtype}
causes stack overflow.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like we need to define Base.promote_rule(::Type{Any}, ::Type) = Any
? Maybe this should be added to Base, and shipped here temporarily with a VERSION
check?
test/broadcast.jl
Outdated
@@ -128,4 +128,9 @@ | |||
@test map!(abs, x, x) == @data([1, 2]) | |||
@test isequal(map!(+, DataArray(Float64, 3), @data([1, NA, 3]), @data([NA, 2, 3])), @data([NA, NA, 6])) | |||
@test map!(isequal, DataArray(Float64, 3), @data([1, NA, NA]), @data([1, NA, 3])) == @data([true, true, false]) | |||
|
|||
# isna doesn't propagate NAs so it should return BitArrays | |||
x = (!).(isna.(@data [NA, 1, 2])) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe also test without (!).
? The code paths used are quite different due to fusion.
bb2e425
to
c70866e
Compare
Thanks for the comments. I'm not sure how it will feel like in user code. We'll probably have to try it out a bit. I made the changes to My understanding is that |
👍 I think we should go ahead with this and try out the Nulls change in a separate branch. Luckily since the name is different ( |
Sure, I didn't mean we should move to Nulls.jl in this PR, just that it could make sense to make both changes before tagging a release (if we decide to do both). |
c70866e
to
604385d
Compare
Bump. |
src/abstractdataarray.jl
Outdated
@@ -4,7 +4,7 @@ | |||
An `N`-dimensional `AbstractArray` whose entries can take on values of type | |||
`T` or the value `NA`. | |||
""" | |||
abstract type AbstractDataArray{T, N} <: AbstractArray{T, N} end | |||
abstract type AbstractDataArray{T, N} <: AbstractArray{Union{T, NAtype}, N} end |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we maybe add an alias (e.g., const Data{T} = Union{T, NAtype}
) and not export it? Reading code that operates on DataArrays.Data{T}
seems more readable to me and it might help with deprecations later on.
Updated. The test failure on master is most likely because of the macrocalypse going on on master right now. Hopefully, it'll either be fixed or will betold how to fix |
f31e879
to
d7897ef
Compare
I think this is ready but right now it causes a segfault on master |
Hooray, thanks @andreasnoack! |
As discussed on Slack. It was actually not that bad. The core changes are in
abstractdataarray.jl
and definesSo for subtypes of
AbstractDataArray
, theT
is not referring to union but for signatures usingAbstractArray
theT
refers to the union. This requires a little care when we ping-ping between Base and DataArrays definitions such asbroadcast
andsimilar
. For that, it was convenient to define anextractT
function to return theT
part of aUnion{T,NAtype}
.The changes here also require some changes to
DataFrames
. Hopefully, I'll be able to open a PR with them tomorrow and then we can look at the whole picture.cc: @iamed2, @JeffBezanson
Update: Realized that an
eltype
definition is not needed here anymore since the fallback forAbstractArray
s would now be correct.