Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add missing and Missing #24653

Merged
merged 6 commits into from
Dec 5, 2017
Merged

Add missing and Missing #24653

merged 6 commits into from
Dec 5, 2017

Conversation

nalimilan
Copy link
Member

Add basic support with special methods for operators and standard math functions. Adapt ==(::AbstractArray, ::AbstractArray), all() and any() to support three-valued logic. This requires defining missing and Missing early in the bootstrap process, but other missing-related code is included relatively late
to be able to add methods to Base functions defined in various places.

Hopefully the compiler will be able to optimize out the three-valued logic-related code for types which cannot contain missing. Benchmarks at JuliaCI/BaseBenchmarks.jl#138 should help checking that it's the case.

Subsequent PRs will be needed to add convenience functions like nonmissing/skipmissing and coalesce (see JuliaData/Missings.jl#53).

@@ -155,7 +155,7 @@ julia> "foo" ≠ "foo"
false
```
"""
!=(x, y) = !(x == y)::Bool
!=(x, y) = !(x == y)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I imagine there was a reason for the presence of this type assertion, so how problematic is it to remove it?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Best thing to do is try it and see what the impact on compiler performance is. Based on the motivation for #20792, it also might make sense just not to run inference on deprecated methods (that was already necessary for one method in #23273).

@test promote_type(Union{Int, Missing}, Union{Int, Missing}) == Union{Int, Missing}
@test promote_type(Union{Float64, Missing}, Union{String, Missing}) == Any
@test promote_type(Union{Float64, Missing}, Union{Int, Missing}) == Union{Float64, Missing}
@test_broken promote_type(Union{Void, Missing, Int}, Float64) == Any
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't really understand whether this bug is in this PR or reflects a deeper Julia bug. See JuliaData/Missings.jl#47.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@JeffBezanson , @vtjnash, it'd be great to have you take a look at this

base/missing.jl Outdated
showerror(io::IO, ex::MissingException) =
print(io, "MissingException: ", ex.msg)

ismissing(::Any) = false
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it required here given it's already in essentials.jl?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, I just realized that I had missed the warning.

end

@testset "any & all" begin
@test any([true, missing])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should the result of any(x)/all(x) be still inferred as Bool when x cannot contain missing values? (IMO it would be a nice thing to have)
If so, maybe it could be tested here as well?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea. I've added @inferred everywhere it makes sense to test/reduce.jl rather than to this file, since it's more logical to test the behavior where type stability is expected. Also, benchmarks should should be pretty good at catching this kind of issue.

@nalimilan nalimilan force-pushed the nl/missing branch 2 times, most recently from 20a8f13 to b207fc8 Compare November 19, 2017 22:02
@davidanthoff
Copy link
Contributor

I don't think this is ready to be merged into Base. In my mind we should hold off putting something into Base until we have found a solution that has no known serious problems with important packages in the data sphere. That is not the case for Union{T,Missing} at this point, in my opinion.

In particular, even after half a year of folks thinking about this, there is no strategy how Union{T,Missing} could be used in Query.jl. If Query.jl can't be made to use Union{T,Missing}, it implies that everything in the IterableTables.jl world also can't use it. That affects a lot of packages, if you take a look at the source and sink support here it would essentially mean that what is currently the most comprehensive interop story for tabular data in julia will have to use something other than the Base story for missing data.

I know there is a lot of enthusiasm for the Union{T,Missing} approach, but I think it is important to acknowledge that is has not emerged as a solution that can be used throughout the data ecosystem. It clearly works great for some packages, but it does not seem to work for other packages at all. I don't think something should be put into Base at that stage.

For those that aren't familiar with the details of the problem, take a look at JuliaData/Missings.jl#6.

@nalimilan and @quinnj What is your thinking about the Query.jl situation? Aren't we heading straight for a complete split data ecosystem with this strategy? Is that good?

@nalimilan
Copy link
Member Author

@davidanthoff Can we keep that discussion at JuliaData/Missings.jl#6, where it has happened from the beginning? Mixing the technical review with more general design issues regarding Query isn't going to give a good result.

@@ -112,7 +112,7 @@ function reducedim_init(f, op::typeof(*), A::AbstractArray, region)
end
function _reducedim_init(f, op, fv, fop, A, region)
T = promote_union(eltype(A))
if applicable(zero, T)
if T !== Any && applicable(zero, T)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one isn't totally satisfying, but I had to add it due to the zero(::Type{Any}) = throw(MethodError(zero, (Any,))) method, which is in turn needed to replace a StackOverflowError with a more explicit error. I wonder whether applicable should treat methods that throw unconditionally as if they didn't exist.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder whether applicable should treat methods that throw unconditionally as if they didn't exist.

That would be good, although probably just the ones which throw a MethodError (or some other special error, e.g. NotImplementedError).

@davidanthoff
Copy link
Contributor

@davidanthoff Can we keep that discussion at JuliaData/Missings.jl#6, where it has happened from the beginning? Mixing the technical review with more general design issues regarding Query isn't going to give a good result.

Sure. @ViralBShah had asked me to raise this issue here. Where do you want discussion about the overall merits of moving this into Base? Over at https://github.com/JuliaData/Roadmap.jl/issues/3?

@nalimilan
Copy link
Member Author

Where do you want discussion about the overall merits of moving this into Base? Over at JuliaData/Roadmap.jl#3?

Why not. Or file an issue in JuliaLang.

@nalimilan
Copy link
Member Author

@nanosoldier runbenchmarks(ALL, vs=":master")

@ararslan
Copy link
Member

Man, Nanosoldier is being weird. I just restarted the server, so @nanosoldier runbenchmarks(ALL, vs=":master")

@nanosoldier
Copy link
Collaborator

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here. cc @ararslan

@nalimilan
Copy link
Member Author

Nanosoldier results are mostly noise. Times appear to have improved (!) for a lot of benchmarks, which doesn't seem plausible, and to have regressed for three rand! benchmarks from Complex as well as for append!, which are unrelated.

However, the any(::Array{Float64,1}) benchmark shows a 16% regression which could be legitimate. What is surprising is that it does not happen for other types, including Float32 and integers, nor for all. Overall 16% only for one type isn't too bad, but let's try another time to confirm.

@nanosoldier runbenchmarks("array", vs = ":master")

@nanosoldier
Copy link
Collaborator

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here. cc @ararslan

@nalimilan
Copy link
Member Author

The new run is more credible. Apart from a few regressions which are different from the last run and which don't seem related, there are two regressions and one improvement for any/all benchmarks. Looking in detail at these 24 benchmarks locally, they are all very stable (variation below 2%) except for these three ones. Since they are different from the previous run, and since there is no reason for results to differ between similar types, I think we can conclude these are spurious.

Finally, the benchmarks for array equality haven't appeared in any of the two runs, so I think we can safely conclude there is no regression either.

Overall we should be good to go.

@StefanKarpinski
Copy link
Member

Who would be best to review this? Let's get it done soon so this doesn't get too conflicted.


promote_rule(::Type{Missing}, ::Type{T}) where {T} = Union{T, Missing}
promote_rule(::Type{Union{S,Missing}}, ::Type{T}) where {T,S} = Union{promote_type(T, S), Missing}
promote_rule(::Type{Any}, ::Type{T}) where {T} = Any
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does anyone have any qualms about this definition? Should it perhaps live in promotion.jl because of it's generalness?

base/missing.jl Outdated
@eval Math.$(f)(::Missing) = missing
end

zero(::Type{Union{T, Missing}}) where {T} = zero(T)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should look back at why this definition was needed; I think it was so zeroes(Union{Int, Missing}, dims...) worked, but I wonder if @Sacha0 's work on removing zeroes will avoid the need for this.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, it's needed so that things like zero(eltype(Array{Union{Int, Missing}})) work, which happens frequently when a function takes this kind of array as an argument.

return false
end
end
return true
return anymissing ? missing : true
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could use ifelse here, similarly in reduce.jl

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That doesn't make any difference for simple cases like this:

julia> using Missings

julia> f(x) = x ? missing : true
f (generic function with 1 method)

julia> g(x) = ifelse(x, missing, true)
g (generic function with 1 method)

julia> @code_llvm f(true)

; Function f
; Location: REPL[1]:1
define nonnull %jl_value_t addrspace(10)* @julia_f_63991(i8) {
top:
  %1 = and i8 %0, 1
  %2 = icmp eq i8 %1, 0
  %. = select i1 %2, %jl_value_t addrspace(10)* addrspacecast (%jl_value_t* inttoptr (i64 139748733701392 to %jl_value_t*) to %jl_value_t addrspace(10)*), %jl_value_t addrspace(10)* addrspacecast (%jl_value_t* inttoptr (i64 139748736339184 to %jl_value_t*) to %jl_value_t addrspace(10)*)
  ret %jl_value_t addrspace(10)* %.
}

julia> @code_llvm g(true)

; Function g
; Location: REPL[2]:1
define nonnull %jl_value_t addrspace(10)* @julia_g_63993(i8) {
top:
  %1 = and i8 %0, 1
  %2 = icmp eq i8 %1, 0
  %3 = select i1 %2, %jl_value_t addrspace(10)* addrspacecast (%jl_value_t* inttoptr (i64 139748733701392 to %jl_value_t*) to %jl_value_t addrspace(10)*), %jl_value_t addrspace(10)* addrspacecast (%jl_value_t* inttoptr (i64 139748736339184 to %jl_value_t*) to %jl_value_t addrspace(10)*)
  ret %jl_value_t addrspace(10)* %3
}

@test promote_type(Union{Int, Missing}, Union{Int, Missing}) == Union{Int, Missing}
@test promote_type(Union{Float64, Missing}, Union{String, Missing}) == Any
@test promote_type(Union{Float64, Missing}, Union{Int, Missing}) == Union{Float64, Missing}
@test_broken promote_type(Union{Void, Missing, Int}, Float64) == Any
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@JeffBezanson , @vtjnash, it'd be great to have you take a look at this

@ararslan ararslan added the missing data Base.missing and related functionality label Nov 28, 2017
value::Int
end
Base.zero(::Type{Unit}) = Unit(0)
Base.one(::Type{Unit}) = 1
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought it was best to not rely on the stdlib Dates to make these tests work. If others feel differently I can change this.

test/missing.jl Outdated
@test_throws MethodError zero(Any)
@test_throws MethodError zero(String)
@test_throws MethodError zero(Union{String, Missing})
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test would still be good to have AFAICT.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, will only apply to zero

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't it throw the same error for one and oneunit?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

one(String) == "" because "foo" * "" == "foo"

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I thought we had decided against it. Anyway, you can use Symbol instead of String.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You'll also have to add a special-case for the new methods in the ambiguous test (which explains the failures).

nalimilan and others added 2 commits December 2, 2017 15:04
Add basic support with special methods for operators and standard math functions.
Adapt ==(::AbstractArray, ::AbstractArray), all() and any() to support
three-valued logic. This requires defining missing and Missing early in the
bootstrap process, but other missing-related code is included relatively late
to be able to add methods to Base functions defined in various places.
@nalimilan
Copy link
Member Author

I've fixed the tests and added a section of the manual detailing how missing works, as well as a NEWS entry and noteworthy differences from R.

I'll merge in a few days barring any objections. Then I'll open another PR to discuss adding missing-related functions to the public API.

@StefanKarpinski StefanKarpinski changed the title WIP: Add missing and Missing Add missing and Missing Dec 3, 2017
@nalimilan nalimilan merged commit 9b52fc0 into master Dec 5, 2017
@nalimilan nalimilan deleted the nl/missing branch December 5, 2017 18:14
evetion pushed a commit to evetion/julia that referenced this pull request Dec 12, 2017
Add basic support with special methods for operators and standard math functions.
Adapt ==(::AbstractArray, ::AbstractArray), all() and any() to support
three-valued logic. This requires defining missing and Missing early in the
bootstrap process, but other missing-related code is included relatively late
to be able to add methods to Base functions defined in various places.
Add new manual section about missing values.
@nalimilan nalimilan restored the nl/missing branch June 30, 2018 10:02
@nalimilan nalimilan deleted the nl/missing branch June 30, 2018 10:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
missing data Base.missing and related functionality
Projects
None yet
Development

Successfully merging this pull request may close these issues.