Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Mean of an empty collection #28777

Open
bkamins opened this issue Aug 20, 2018 · 12 comments
Open

Mean of an empty collection #28777

bkamins opened this issue Aug 20, 2018 · 12 comments
Labels
fold sum, maximum, reduce, foldl, etc. missing data Base.missing and related functionality

Comments

@bkamins
Copy link
Member

bkamins commented Aug 20, 2018

It seems that this issue must have been discussed, but I could not find any reference to it so I open it.

mean behaves inconsistently between array and collection implementations in empty container case. This is especially relevant since missing was introduced into base. Here is a code that shows the problem:

julia> using Statistics

julia> x = Union{Float64,Missing}[missing]
1-element Array{Union{Missing, Float64},1}:
 missing

julia> mean(skipmissing(x))
ERROR: ArgumentError: mean of empty collection undefined: Base.SkipMissing{Array{Union{Missing, Float64},1}}(Union{Missing, Float64}[missing])

julia> mean(collect(skipmissing(x)))
NaN

Maybe a specialized implementation of mean in case when eltype of the collection is specified should be implemented? This is the case of skipmissing which knows that its eltype is Float64. For instance sum handles this case correctly (there is a special mapreduce for this case with skipmissing). Probably more functions in Statistics should be reviewed, but first I would want to understand what is the intention with mean.

@JeffreySarnoff
Copy link
Contributor

same with cor cov std stdm var varm

btw cov(collect(skipmissing(x))) === -0.0

@bkamins
Copy link
Member Author

bkamins commented Sep 3, 2018

I can fix it, but I am not sure what do we want. Should empty collections always throw an error, or they should produce NaN if element type of the collection is <:Number? It seems that the errors were consciously hard coded>

Examples of current inconsistencies (following what was noted by @JeffreySarnoff ):

  • cov(Int[]) === -0.0
  • var(Int[]) === NaN
  • var(skipmissing(Int[])) is an argument error
  • cov(skipmissing(Int[])) is a method error

CC @nalimilan

@nalimilan
Copy link
Member

I agree that reductions over skipmissing(::Array{T}) should return the same value as reductions over ::Array{Missings.T(T)}. Since we can't change the return value of functions which currently don't throw, I guess we should just copy whatever behavior is currently implemented. If it's inconsistent, let's track that in a separate issue for 2.0.

Relevant: #5234

@nalimilan nalimilan added the missing data Base.missing and related functionality label Sep 3, 2018
@gdkrmr
Copy link
Contributor

gdkrmr commented Sep 4, 2018

here are some more for completeness:

julia> y = Union{Missing, Float64}[missing, missing, missing]

julia> mean(skipmissing(y))
ERROR: ArgumentError: mean of empty collection undefined: Base.SkipMissing{Array{Union{Missing, Float64},1}}(Union{Missing, Float64}[missing, missing, missing])

julia> mean(Float64[])
NaN

julia> mean(Int64[])
NaN

julia> mean([])
ERROR: MethodError: no method matching zero(::Type{Any})

@bkamins
Copy link
Member Author

bkamins commented Sep 4, 2018

@gdkrmr: following what @nalimilan suggested currently we will fix mean(skipmissing(y)) case only because it can be considered a bug. Other scenarios will have to wait for Julia 2.0 for unification.

I will keep this issue open for Julia 2.0 changes discussion and open a separate PR for the short term fix.

@bkamins
Copy link
Member Author

bkamins commented Sep 4, 2018

In #29033 I have proposed the changes that are minimally breaking (because sometimes we will not throw an error now, but we did in the past) and consistent with the return value of collect(itr) if itr is an empty iterator.

@nolta
Copy link
Member

nolta commented Apr 19, 2019

After #29033, the mean of an empty tuple generates an inscrutable error:

julia> Statistics.mean(())
ERROR: MethodError: Base.reduce_empty(::typeof(Base.add_sum), ::Type{Union{}}) is ambiguous. Candidates:
  reduce_empty(::typeof(Base.add_sum), ::Type{T}) where T<:Union{UInt16, UInt32, UInt8} in Base at reduce.jl:236
  reduce_empty(::typeof(Base.add_sum), ::Type{T}) where T<:Union{Int16, Int32, Int8} in Base at reduce.jl:235
Possible fix, define
  reduce_empty(::typeof(Base.add_sum), ::Type{Union{}})

The previous error was better:

julia> Statistics.mean(())
ERROR: ArgumentError: mean of empty collection undefined: ()

@bkamins
Copy link
Member Author

bkamins commented Apr 19, 2019

Good point, but I think it should be fixed in Base. See:

julia> reduce(*, ())
""

which is surprising
and

julia> reduce(+, ())
ERROR: MethodError: zero(::Type{Union{}}) is ambiguous. Candidates:
  zero(::Union{Type{P}, P}) where P<:Dates.Period in Dates at C:\cygwin\home\Administrator\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.1\Dates\src\periods.jl:50
  zero(::Type{T}) where T<:Number in Base at number.jl:263
Possible fix, define
  zero(::Type{Union{}})

which is the same error.

The core of the problem is reduce_empty implementation, see:

julia> Base.reduce_empty(*, Union{})
""

julia> Base.reduce_empty(+, Union{})
ERROR: MethodError: zero(::Type{Union{}}) is ambiguous. Candidates:
  zero(::Union{Type{P}, P}) where P<:Dates.Period in Dates at C:\cygwin\home\Administrator\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.1\Dates\src\periods.jl:50
  zero(::Type{T}) where T<:Number in Base at number.jl:263
Possible fix, define
  zero(::Type{Union{}})

I think we should add reduce_empty methods for +, *, add_sum and mul_prod as a first argument and Union{} as a second argument to throw an error.

@originalsouth
Copy link
Contributor

Another one for the list:

julia> median(Float64[])
ERROR: ArgumentError: median of an empty array is undefined, Float64[]

@kongdd
Copy link

kongdd commented Apr 1, 2022

quantile still has this bug.

y = Union{Missing, Float64}[missing, missing, missing]
quantile(y)

julia> ERROR: MethodError: no method matching quantile(::Vector{Union{Missing, Float64}})
Closest candidates are:
  quantile(::AbstractVector{V}, ::StatsBase.AbstractWeights{W, T} where T<:Real, ::AbstractVector{T} where T<:Real) where {V, W<:Real} at ~/.julia/packages/StatsBase/pJqvO/src/weights.jl:687
  quantile(::AbstractVector, ::Any; sorted, alpha, beta) at /opt/julias/julia-1.7/share/julia/stdlib/v1.7/Statistics/src/Statistics.jl:1073
  quantile(::Any, ::Any; sorted, alpha, beta) at /opt/julias/julia-1.7/share/julia/stdlib/v1.7/Statistics/src/Statistics.jl:1070
  ...
Stacktrace:
 [1] top-level scope
   @ REPL[33]:1

@ignace-computing
Copy link

Hallo.
Is the folllowing behavior, in Julia 1.10.4, meant to be?

julia> mean(Float64[])
NaN

julia> mean([])
ERROR: MethodError: no method matching zero(::Type{Any})

@tpapp
Copy link
Contributor

tpapp commented Jul 31, 2024

@ignace-computing: probably not, but this is a corner case that is just not consistently handled at the moment.

IMO the reasonable solution would be throwing an error for all empty collections. Otherwise you just get insidious bugs (though NaN propagation is a pretty mild case as you can track it down relatively easily).

@mbauman mbauman added the fold sum, maximum, reduce, foldl, etc. label Aug 2, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
fold sum, maximum, reduce, foldl, etc. missing data Base.missing and related functionality
Projects
None yet
Development

No branches or pull requests

10 participants